Posted to user@mahout.apache.org by Chad Hinton <ch...@gmail.com> on 2010/01/11 23:00:46 UTC

LDA only executes a single map task per iteration when running in actual distributed mode?

I saw two comments related to an actual distributed run of the LDA example
but no answer to this question. A previous message in the list confirms that
at least one other person has experienced this issue. I am submitting a
MapReduce job to a 20-node Hadoop cluster as follows:

hadoop jar /root/mahout-core-0.2.job \
  org.apache.mahout.clustering.lda.LDADriver \
  -i hdfs://master/lda/input/vectors \
  -o hdfs://master/lda/output \
  -k 20 -v 10000 --maxIter 40

where lda/input/vectors is the vectors file generated from the standalone
build-reuters.sh example. I can only get a single map task to execute while
approx. 57 task slots are available. Has anyone actually run distributed LDA
successfully? This will help me figure out if I have a Hadoop config issue
or if there is an actual algorithm implementation problem. The Hadoop
examples run successfully in distributed mode utilizing all available map
tasks. I'm not sure if there is an issue with the InputSplit for the
SequenceFile or something else... Any help is appreciated.

Chad

Re: LDA only executes a single map task per iteration when running in actual distributed mode?

Posted by Ted Dunning <te...@gmail.com>.

Splitting should just happen if the file is large enough, the program is
configured for more than one map task, and the file type is splittable.

If you are reading an uncompressed sequence file, you should be set.

On Mon, Jan 11, 2010 at 9:53 PM, David Hall <dl...@cs.berkeley.edu> wrote:

> I can brush up on my Hadoop-fu to figure out how to have
> Hadoop split up a single file, if you want.
>

-- 
Ted Dunning, CTO
DeepDyve
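
To make the "large enough" condition above concrete, here is a rough Python sketch of how the number of input splits (and so map tasks) is commonly estimated for file-based Hadoop input. It is only an illustration under stated assumptions: the 64 MB block size, the min/max split bounds, and the 1.1 slop factor are assumed values, the function names are hypothetical, and the real behaviour depends on the Hadoop version and job configuration.

```python
MB = 1024 * 1024


def split_size(block_bytes, min_split=1, max_split=2**63 - 1):
    """Assumed rule: clamp the block size between the min and max split sizes."""
    return max(min_split, min(max_split, block_bytes))


def estimate_splits(file_bytes, split_bytes, slop=1.1):
    """Estimate how many splits a single splittable file would produce.

    Full-size splits are carved off while the remainder is more than
    `slop` times the split size; whatever is left becomes the last split.
    """
    if file_bytes <= 0 or split_bytes <= 0:
        raise ValueError("sizes must be positive")
    splits = 0
    remaining = file_bytes
    while remaining / split_bytes > slop:
        splits += 1
        remaining -= split_bytes
    # The leftover tail (always non-empty here) forms the final split.
    return splits + 1


if __name__ == "__main__":
    size = split_size(64 * MB)  # assumed 64 MB HDFS block size
    for file_mb in (10, 70, 1024):  # hypothetical file sizes
        print(file_mb, "MB ->", estimate_splits(file_mb * MB, size), "split(s)")
```

Under these assumptions a 10 MB file yields a single split and a 1 GB file yields 16, which is consistent with a small vectors file only ever getting one map task regardless of how many slots are free.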

Re: LDA only executes a single map task per iteration when running in actual distributed mode?

Posted by David Hall <dl...@cs.berkeley.edu>.

On Mon, Jan 11, 2010 at 2:00 PM, Chad Hinton <ch...@gmail.com> wrote:
> I saw two comments related to an actual distributed run of the LDA example
> but no answer to this question. A previous message in the list confirms that
> at least one other person has experienced this issue. I am submitting a
> MapReduce job to a 20-node Hadoop cluster as follows:
>
> hadoop jar /root/mahout-core-0.2.job
> org.apache.mahout.clustering.lda.LDADriver -i
> hdfs://master/lda/input/vectors -o hdfs://master/lda/output -k 20 -v 10000
> --maxIter 40
>
> where lda/input/vectors is the vectors file generated from the standalone
> build-reuters.sh example. I can only get a single map task to execute while
> approx. 57 task slots are available. Has anyone actually run distributed LDA
> successfully? This will help me figure out if I have a Hadoop config issue
> or if there is an actual algorithm implementation problem. The Hadoop
> examples run successfully in distributed mode utilizing all available map
> tasks. I'm not sure if there is an issue with the InputSplit for the
> SequenceFile or something else... Any help is appreciated.

I haven't actually run LDA in distributed mode myself (though I've spoken
with someone who has). The Reuters example is pretty simplistic: it doesn't
set any input splits for the single vectors file, so it's only going to run
on one machine. If you shard the vectors into several files, it should
just work. I can brush up on my Hadoop-fu to figure out how to have
Hadoop split up a single file, if you want.

-- David
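
The sharding idea mentioned above can be sketched as a simple round-robin partition of records into several output files, so that each file can feed its own map task. This is only an illustration of the partitioning logic, not Mahout or Hadoop code: the function names and the `part-` file naming are hypothetical, and actually re-writing a SequenceFile of vectors would have to go through Hadoop's own SequenceFile reader and writer.

```python
def shard_round_robin(records, num_shards):
    """Distribute records across num_shards lists in round-robin order."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    shards = [[] for _ in range(num_shards)]
    for index, record in enumerate(records):
        shards[index % num_shards].append(record)
    return shards


def shard_names(num_shards, prefix="part-"):
    """Hypothetical output file names, one per shard."""
    return [f"{prefix}{i:05d}" for i in range(num_shards)]


if __name__ == "__main__":
    # Stand-in for document vectors; real records would be key/value pairs.
    documents = [f"doc-{i}" for i in range(10)]
    for name, shard in zip(shard_names(3), shard_round_robin(documents, 3)):
        print(name, shard)
```

Round-robin keeps the shards within one record of each other in size, which matters here because each iteration is only as fast as its slowest map task.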