Posted to commits@mahout.apache.org by bu...@apache.org on 2014/01/25 10:07:12 UTC
svn commit: r895366 - in /websites/staging/mahout/trunk/content: ./
general/faq.html
Author: buildbot
Date: Sat Jan 25 09:07:11 2014
New Revision: 895366
Log:
Staging update by buildbot for mahout
Modified:
websites/staging/mahout/trunk/content/ (props changed)
websites/staging/mahout/trunk/content/general/faq.html
Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sat Jan 25 09:07:11 2014
@@ -1 +1 @@
-1560371
+1561277
Modified: websites/staging/mahout/trunk/content/general/faq.html
==============================================================================
--- websites/staging/mahout/trunk/content/general/faq.html (original)
+++ websites/staging/mahout/trunk/content/general/faq.html Sat Jan 25 09:07:11 2014
@@ -299,6 +299,10 @@
<li><a href="todo">What algorithms are missing from Mahout?</a></li>
<li><a href="hadoop">Do I need Hadoop to run Mahout?</a></li>
</ol>
+<p><em>Hadoop specific questions</em></p>
+<ol>
+<li><a href="split">Mahout just won't run in parallel on my dataset. Why?</a></li>
+</ol>
<h1 id="answers"><em>Answers</em></h1>
<h2 id="general">General</h2>
<p><a name="whatIs"></a></p>
@@ -352,6 +356,11 @@ a couple of modules that require no Hado
vector and matrix serialisation). For recommendation those packages don't have Hadoop as
part of their namespace. For classification checkout the sgd package. For clustering checkout
the new kmeans++ stuff.</p>
+<h2 id="hadoop-specific-questions">Hadoop specific questions</h2>
+<p><a href="split"></a></p>
+<h3 id="mahout-just-wont-run-in-parallel-on-my-dataset-why">Mahout just won't run in parallel on my dataset. Why?</h3>
+<p>If you are running training on a Hadoop cluster, keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb,
+anything below 100MB in size won't be split. In addition, files compressed with a non-splittable codec such as gzip can't be split either. For a more detailed discussion of the topic, see also the <a href="https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4f21_43ee_b236_c42a1c875961/page/MapReduce%20-%20Tuning%20the%20number%20of%20map%20tasks">IBM InfoSphere Split Size article</a>.</p>
</div>
</div>
</div>
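
The relationship between input size, split size, and mapper count described in the new FAQ answer can be illustrated with a rough estimate. The sketch below is a simplified model, not Mahout or Hadoop code: the function name `estimate_map_tasks` is hypothetical, the 64MB default block size is an assumption (check your cluster's configuration), and the 1.1 slack factor mirrors what Hadoop's `FileInputFormat` is commonly understood to use when it decides whether to cut another split.

```python
MB = 1024 * 1024

# Assumed slack factor: the last split may be up to 10% larger than
# the split size before another split is cut.
SPLIT_SLOP = 1.1


def estimate_map_tasks(file_size, block_size=64 * MB,
                       min_split=1, max_split=2**63 - 1,
                       splittable=True):
    """Roughly estimate how many input splits (and so mappers) one
    non-empty input file produces. All parameters are in bytes.

    block_size defaults to 64MB, which is an assumption, not a value
    taken from the FAQ.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")

    # A file compressed with a non-splittable codec (gzip, for example)
    # is always handled by a single mapper, whatever its size.
    if not splittable:
        return 1

    # Effective split size: the block size, clamped to the configured
    # minimum and maximum split sizes.
    split_size = max(min_split, min(max_split, block_size))

    remaining = file_size
    splits = 0
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    if remaining > 0:
        splits += 1  # the final, possibly shorter (or slightly longer) split
    return splits
```

Under these assumptions, a 50MB file yields a single split and therefore a single mapper, while an uncompressed 1GB file yields 16. The same 1GB file compressed with gzip falls back to one mapper, which is the situation the FAQ entry warns about.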