Posted to commits@mahout.apache.org by bu...@apache.org on 2014/01/25 10:07:12 UTC

svn commit: r895366 - in /websites/staging/mahout/trunk/content: ./ general/faq.html

Author: buildbot
Date: Sat Jan 25 09:07:11 2014
New Revision: 895366

Log:
Staging update by buildbot for mahout

Modified:
    websites/staging/mahout/trunk/content/   (props changed)
    websites/staging/mahout/trunk/content/general/faq.html

Propchange: websites/staging/mahout/trunk/content/
------------------------------------------------------------------------------
--- cms:source-revision (original)
+++ cms:source-revision Sat Jan 25 09:07:11 2014
@@ -1 +1 @@
-1560371
+1561277

Modified: websites/staging/mahout/trunk/content/general/faq.html
==============================================================================
--- websites/staging/mahout/trunk/content/general/faq.html (original)
+++ websites/staging/mahout/trunk/content/general/faq.html Sat Jan 25 09:07:11 2014
@@ -299,6 +299,10 @@
 <li><a href="todo">What algorithms are missing from Mahout?</a></li>
 <li><a href="hadoop">Do I need Hadoop to run Mahout?</a></li>
 </ol>
+<p><em>Hadoop specific questions</em></p>
+<ol>
+<li><a href="split">Mahout just won't run in parallel on my dataset. Why?</a></li>
+</ol>
 <h1 id="answers"><em>Answers</em></h1>
 <h2 id="general">General</h2>
 <p><a name="whatIs"></a></p>
@@ -352,6 +356,11 @@ a couple of modules that require no Hado
 vector and matrix serialisation). For recommendation those packages don't have Hadoop as
 part of their namespace. For classification checkout the sgd package. For clustering checkout
 the new kmeans++ stuff.</p>
+<h2 id="hadoop-specific-questions">Hadoop specific questions</h2>
+<p><a name="split"></a></p>
+<h3 id="mahout-just-wont-run-in-parallel-on-my-dataset-why">Mahout just won't run in parallel on my dataset. Why?</h3>
+<p>If you are running training on a Hadoop cluster, keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb,
+anything below 100MB in size won't be split. In addition, files compressed with e.g. gzip are not splittable either. For a more detailed discussion of the topic, see also the <a href="https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4f21_43ee_b236_c42a1c875961/page/MapReduce%20-%20Tuning%20the%20number%20of%20map%20tasks">IBM InfoSphere Split Size article</a>.</p>
    </div>
   </div>     
 </div>