Posted to commits@mahout.apache.org by is...@apache.org on 2014/01/25 10:07:06 UTC

svn commit: r1561277 - /mahout/site/mahout_cms/trunk/content/general/faq.mdtext

Author: isabel
Date: Sat Jan 25 09:07:06 2014
New Revision: 1561277

URL: http://svn.apache.org/r1561277
Log:
Add FAQ entry concerning number of map tasks started.

Modified:
    mahout/site/mahout_cms/trunk/content/general/faq.mdtext

Modified: mahout/site/mahout_cms/trunk/content/general/faq.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/general/faq.mdtext?rev=1561277&r1=1561276&r2=1561277&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/general/faq.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/general/faq.mdtext Sat Jan 25 09:07:06 2014
@@ -17,6 +17,11 @@ Title: FAQ
 1. [What algorithms are missing from Mahout?](todo)
 1. [Do I need Hadoop to run Mahout?](hadoop)
 
+*Hadoop specific questions*
+
+1. [Mahout just won't run in parallel on my dataset. Why?](split)
+
+
 # *Answers*
 
 
@@ -92,4 +97,11 @@ Apart from the possibility of running Ha
 a couple of modules that require no Hadoop dependencies whatsoever (except maybe for
 vector and matrix serialisation). For recommendation those packages don't have Hadoop as
 part of their namespace. For classification checkout the sgd package. For clustering checkout
-the new kmeans++ stuff.
\ No newline at end of file
+the new kmeans++ stuff.
+
+## Hadoop specific questions
+<a href="split"></a>
+### Mahout just won't run in parallel on my dataset. Why?
+
+If you are running training on a Hadoop cluster, keep in mind that the number of mappers started is governed by the size of the input data and the configured split/block size of your cluster. As a rule of thumb,
+anything below 100MB in size won't be split. In addition, files compressed with e.g. gzip aren't splittable either. For a more detailed discussion of the topic, see also the [IBM InfoSphere Split Size article](https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/W265aa64a4f21_43ee_b236_c42a1c875961/page/MapReduce%20-%20Tuning%20the%20number%20of%20map%20tasks)
\ No newline at end of file
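
The rule of thumb in the new FAQ entry can be illustrated with a rough back-of-the-envelope estimate. The sketch below is not Hadoop or Mahout code: `estimate_map_tasks` is a hypothetical helper, the 128MB split size in the usage lines is an assumed cluster setting, and the plain ceiling division ignores the slack Hadoop applies to the final split.

```python
import math

MB = 1024 * 1024


def estimate_map_tasks(file_size_bytes, split_size_bytes, splittable=True):
    """Roughly estimate how many map tasks one input file produces.

    Hypothetical helper for illustration only; real split computation
    depends on the input format and the cluster configuration.
    """
    if file_size_bytes <= 0:
        return 0
    if not splittable:
        # A non-splittable file (e.g. gzip-compressed) goes to one mapper.
        return 1
    # Simplification: one mapper per full or partial split.
    return math.ceil(file_size_bytes / split_size_bytes)


# Assumed split size of 128MB; check your cluster's actual setting.
small = estimate_map_tasks(50 * MB, 128 * MB)
large = estimate_map_tasks(1024 * MB, 128 * MB)
gzipped = estimate_map_tasks(1024 * MB, 128 * MB, splittable=False)
print(small, large, gzipped)
```

Under these assumptions, a small input or a single gzip file yields one mapper regardless of cluster size, which is the "won't run in parallel" symptom the FAQ entry describes.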