Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/05/19 21:49:39 UTC

Tuning LDA on Reuters

I ran the Reuters dataset against LDA yesterday on a 2-node cluster and 
it took a really long time to converge (100 iterations * 10 min each) 
extracting 20 topics. I was able to reduce the iteration time by 50% by 
using just TF weighting and SeqAccSparseVectors, but it was still using 
only a single mapper, and that was where most of the time was spent. 
Digging backwards, I found that both seqtosparse and seqdirectory 
produce only a single vector file, so that made sense.

I tried adding a '-chunk 5' param to seqdirectory, but internally that 
got boosted up to 64, so I removed the boost code and am now able to get 
3 part files in tokenized-documents.
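
For the record, the boost I removed amounted to a one-line clamp. This 
is a paraphrase from memory rather than the actual Mahout source (the 
real field and option names differ):

    // hypothetical stand-in for the clamp in SequenceFilesFromDirectory:
    private static int effectiveChunkSizeMB(int requestedMB) {
      // before: anything under 64MB was silently boosted, so '-chunk 5'
      // still yielded a single big chunk file
      return Math.max(64, requestedMB);
    }

Removing the Math.max() and returning requestedMB directly is what lets 
the smaller chunks, and hence the 3 part files, through.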

I've tried a similar trick with seqtosparse, but its chunk argument only 
affects the dictionary.file chunking. I also tried running it with 4 
reducers, but I still get only a single part file in vectors. (It does 
seem that seqtosparse would produce multiple partial vector files if the 
dictionary were chunked, but the code then recombines those vectors to 
produce a single file.)

I cannot imagine how one could ever get LDA to scale if it is always 
limited to a single input vector file. Is there a way to get multiple 
output vector files from seqtosparse?

Re: Tuning LDA on Reuters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, digging into the slow performance, I noticed that the submission 
that recently replaced build-reuters.sh was using tf-idf. Changing that 
to tf and using sequential access vectors gave me an immediate 2x 
speedup, even with the single mapper. The other stuff in the patch gave 
me about 3x more using two data nodes and would surely scale better than 
any single-mapper solution.

I think the problem lies more in the text preprocessing steps, which 
currently output only a single vector file. By increasing the number of 
reducers, one obtains more parallelism in producing the vectors and also 
more vector files to feed to the final processing steps, whether LDA, 
k-Means, or others.

Tweaking the input split sizes in the final steps is another way to 
address the single-vector issue without fixing the preprocessing to 
produce more files. The only thing I'm uncertain about is whether the 
patch introduces any unintended consequences if the dictionary gets big 
enough to be sharded.
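
If the final-step drivers pick up generic Hadoop options, something like 
this might shrink the splits per-run instead of hardwiring anything (an 
untested guess; it assumes GenericOptionsParser is in the loop, and the 
right knob depends on the API: mapred.map.tasks is the hint for old-API 
jobs, mapred.max.split.size for new-API ones):

./bin/mahout lda -Dmapred.max.split.size=4194304 <usual lda args>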

On 5/19/10 10:10 PM, Grant Ingersoll wrote:
> You might find http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28 informative.
>
> (BTW, LDA is only meant to run w/ TF)
>
> -Grant
>
> On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:
>
>    
>> I ran the Reuters dataset against LDA yesterday on a 2-node cluster and it took a really long time to converge (100 iterations * 10 min each) extracting 20 topics. I was able to reduce the iteration time by 50% by using just TF weighting and SeqAccSparseVectors, but it was still using only a single mapper, and that was where most of the time was spent. Digging backwards, I found that both seqtosparse and seqdirectory produce only a single vector file, so that made sense.
>>
>> I tried adding a '-chunk 5' param to seqdirectory, but internally that got boosted up to 64, so I removed the boost code and am now able to get 3 part files in tokenized-documents.
>>
>> I've tried a similar trick with seqtosparse, but its chunk argument only affects the dictionary.file chunking. I also tried running it with 4 reducers, but I still get only a single part file in vectors. (It does seem that seqtosparse would produce multiple partial vector files if the dictionary were chunked, but the code then recombines those vectors to produce a single file.)
>>
>> I cannot imagine how one could ever get LDA to scale if it is always limited to a single input vector file. Is there a way to get multiple output vector files from seqtosparse?
>>      
>
>    


Re: Tuning LDA on Reuters

Posted by Grant Ingersoll <gs...@apache.org>.
You might find http://www.lucidimagination.com/search/document/39b53fbf4b525f2f/lda_only_executes_a_single_map_task_per_iteration_when_running_in_actual_distributed_mode#311eb323a8208e28 informative.

(BTW, LDA is only meant to run w/ TF)

-Grant

On May 19, 2010, at 9:49 PM, Jeff Eastman wrote:

> I ran the Reuters dataset against LDA yesterday on a 2-node cluster and it took a really long time to converge (100 iterations * 10 min each) extracting 20 topics. I was able to reduce the iteration time by 50% by using just TF weighting and SeqAccSparseVectors, but it was still using only a single mapper, and that was where most of the time was spent. Digging backwards, I found that both seqtosparse and seqdirectory produce only a single vector file, so that made sense.
> 
> I tried adding a '-chunk 5' param to seqdirectory, but internally that got boosted up to 64, so I removed the boost code and am now able to get 3 part files in tokenized-documents.
> 
> I've tried a similar trick with seqtosparse, but its chunk argument only affects the dictionary.file chunking. I also tried running it with 4 reducers, but I still get only a single part file in vectors. (It does seem that seqtosparse would produce multiple partial vector files if the dictionary were chunked, but the code then recombines those vectors to produce a single file.)
> 
> I cannot imagine how one could ever get LDA to scale if it is always limited to a single input vector file. Is there a way to get multiple output vector files from seqtosparse?


Re: Tuning LDA on Reuters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Haven't tried that approach yet, but it may have the same effect. 
Seq2sparse already has an option to set the number of reducers (-nr); it 
was just not being propagated to the final vector-generation stage. 
Would the -D option override that option? Would it apply to all the 
Hadoop jobs spawned by seq2sparse? If so, then they are probably equivalent.
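
My understanding of the precedence, sketched as a toy driver 
(hypothetical code, nothing from Mahout): ToolRunner/GenericOptionsParser 
copies -D values into the Configuration before run() is invoked, but any 
explicit setNumReduceTasks() call later in the driver silently wins.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class ReducerCheck extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), ReducerCheck.class);
    // -Dmapred.reduce.tasks=2 has already been applied by ToolRunner here:
    System.out.println("reducers = " + conf.getNumReduceTasks());
    // an explicit call like the following would override the -D value:
    // conf.setNumReduceTasks(1);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new ReducerCheck(), args));
  }
}

So -D and -nr should be equivalent only as long as the driver doesn't 
make its own setNumReduceTasks() call afterwards.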

On 5/19/10 7:10 PM, Drew Farris wrote:
> Jeff,
>
> Just curious, have you tried:
>
> ./bin/mahout seq2sparse -Dmapred.reduce.tasks=2 -i
> ./examples/bin/work/reuters-out-seqdir/ -o
> ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq
>
> The mahout script (MahoutDriver) allows arbitrary Hadoop properties to be
> specified via -D arguments, which are handled (iirc) by the
> GenericOptionsParser. Admittedly, I'm divided between 'add an explicit
> argument/job parameter to handle it' vs. 'use Hadoop built-in properties'.
> The former is nice because it is an explicit acknowledgement of the ability
> to set the number of reducers, displayable via a -h method, but the latter
> results in less code to maintain. In this vein, -Dmapred.min.split.size can
> be tinkered with. Of course, if GenericOptionsParser isn't involved in the
> job setup, this is all a moot point.
>
> Also worth pointing out that either of these -D arguments could be specified
> in the lda.props (lda-reuters.props?) file too, via something like 'DmyProp =
> propvalue'.
>
> Of course this doesn't really address the root problem, however: why LDA on
> Reuters is slow. How long is it taking to run?
>
> Drew
>
>
>
> On Wed, May 19, 2010 at 7:08 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>    
>> On 5/19/10 3:19 PM, Jeff Eastman wrote:
>>
>>      
>>>   I tried propagating numReducers into its makePartialVectors driver;
>>> however, a single reducer is still all I get. I need to figure out how
>>> to tickle the elephant to give me more.
>>>
>>>        
>> Note to self: Use a real elephant. Running Hadoop in Eclipse is great for
>> debugging but it does not launch multiple mappers or reducers. Running on a
>> single-host Hadoop cluster, however, does, and the elephant is now dancing
>> nicely.
>>
>> ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o
>> ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq -nr 2
>>
>> now produces two input vector files for LDA to munch on. Now to try it on a
>> real cluster...
>>
>>      
>    


Re: Tuning LDA on Reuters

Posted by Drew Farris <dr...@gmail.com>.
On Wed, May 19, 2010 at 10:10 PM, Drew Farris <dr...@gmail.com> wrote:


> Of course this doesn't really address the root problem, however: why LDA
> on Reuters is slow. How long is it taking to run?
>
> Drew


Never mind, saw it in the JIRA issue (5.5 min vs. 1.5 min).

Re: Tuning LDA on Reuters

Posted by Drew Farris <dr...@gmail.com>.
Jeff,

Just curious, have you tried:

./bin/mahout seq2sparse -Dmapred.reduce.tasks=2 -i
./examples/bin/work/reuters-out-seqdir/ -o
./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq

The mahout script (MahoutDriver) allows arbitrary Hadoop properties to be
specified via -D arguments, which are handled (iirc) by the
GenericOptionsParser. Admittedly, I'm divided between 'add an explicit
argument/job parameter to handle it' vs. 'use Hadoop built-in properties'.
The former is nice because it is an explicit acknowledgement of the ability
to set the number of reducers, displayable via a -h method, but the latter
results in less code to maintain. In this vein, -Dmapred.min.split.size can
be tinkered with. Of course, if GenericOptionsParser isn't involved in the
job setup, this is all a moot point.

Also worth pointing out that either of these -D arguments could be specified
in the lda.props (lda-reuters.props?) file too, via something like 'DmyProp =
propvalue'.
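
For instance, my guess at what such entries would look like (I haven't
verified exactly how MahoutDriver parses these, so treat the syntax as
hypothetical):

Dmapred.reduce.tasks = 2
Dmapred.min.split.size = 8388608

which MahoutDriver would presumably expand to -Dmapred.reduce.tasks=2 and
-Dmapred.min.split.size=8388608 on the generated command line.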

Of course this doesn't really address the root problem, however: why LDA
on Reuters is slow. How long is it taking to run?

Drew



On Wed, May 19, 2010 at 7:08 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> On 5/19/10 3:19 PM, Jeff Eastman wrote:
>
>>  I tried propagating numReducers into its makePartialVectors driver;
>> however, a single reducer is still all I get. I need to figure out how
>> to tickle the elephant to give me more.
>>
> Note to self: Use a real elephant. Running Hadoop in Eclipse is great for
> debugging but it does not launch multiple mappers or reducers. Running on a
> single-host Hadoop cluster, however, does, and the elephant is now dancing
> nicely.
>
> ./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o
> ./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq -nr 2
>
> now produces two input vector files for LDA to munch on. Now to try it on a
> real cluster...
>

Re: Tuning LDA on Reuters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 5/19/10 3:19 PM, Jeff Eastman wrote:
>  I tried propagating numReducers into its makePartialVectors driver;
> however, a single reducer is still all I get. I need to figure out
> how to tickle the elephant to give me more.
Note to self: Use a real elephant. Running Hadoop in Eclipse is great 
for debugging but it does not launch multiple mappers or reducers. 
Running on a single-host Hadoop cluster, however, does, and the elephant 
is now dancing nicely.

./bin/mahout seq2sparse -i ./examples/bin/work/reuters-out-seqdir/ -o 
./examples/bin/work/reuters-out-seqdir-sparse -wt tf -seq -nr 2

now produces two input vector files for LDA to munch on. Now to try it 
on a real cluster...
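
(The reason local runs never showed this, as far as I can tell: Hadoop's
LocalJobRunner clamps the reduce count. Paraphrasing the 0.20 source from
memory, not an exact quote:

    // inside LocalJobRunner.Job.run(), approximately:
    int numReduceTasks = job.getNumReduceTasks();
    if (numReduceTasks > 1 || numReduceTasks < 0) {
      numReduceTasks = 1;   // local mode runs at most one reducer
      job.setNumReduceTasks(1);
    }

so -nr only takes effect against a real cluster.)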

Re: Tuning LDA on Reuters

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 5/19/10 1:49 PM, Drew Farris wrote:
> On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>
>    
>> I cannot imagine how one could ever get LDA to scale if it is always
>> limited to a single input vector file. Is there a way to get multiple output
>> vector files from seqtosparse?
>>
>>      
> I don't know offhand, but is the default input split size (mapred.min.split.size)
> too large for this particular use case? (If it is 0/unspecified, it
> defaults to the block size, which is 64MB.) I wonder if setting it smaller
> will allow more mappers to spawn.
>
>    
The single input file for Reuters is only 18MB, well below the 
default min.split.size, so yes, 64MB is too large for this use case. The 
preprocessing steps are producing only a single vector file for *any* 
set of input documents, however, and that seems inherently limiting. I 
was hoping to induce Hadoop to spawn more mappers by creating more input 
files, but lowering the split size would be another approach. I'd hate 
to hardwire a smaller size into the LDA job, though, because that would 
affect all applications of the algorithm. Another job parameter might 
resolve that problem, but it seems pretty low level for a user to 
specify.

I'd be happier if the existing numReducers parameter to the 
SparseVectorsFromSeq job affected the number of vector output files. 
It calls the DictionaryVectorizer to convert the tokenized input <docId, 
{token}> seqFiles into <docId, Vector> files, and I discovered that that 
driver currently only launches a single reducer (aha?). I tried 
propagating numReducers into its makePartialVectors driver; however, 
a single reducer is still all I get. I need to figure out how to tickle 
the elephant to give me more.
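
Roughly what I tried, paraphrased rather than quoted from the patch 
(method and variable names are approximate):

    // inside DictionaryVectorizer.makePartialVectors(), approximately:
    JobConf conf = new JobConf(DictionaryVectorizer.class);
    // ... set input/output paths, mapper, and the partial-vector reducer ...
    conf.setNumReduceTasks(numReducers);   // previously left at the default of 1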

Of course, another avenue is to look at why Reuters - a tiny dataset by 
Hadoop standards - doesn't give same-day service even with a single 
mapper. That's a profiling exercise that begs to be explored.

Kind of a long-winded reply to your suggestion, but it helps me to lay 
it all out on the table,
Jeff

Re: Tuning LDA on Reuters

Posted by Drew Farris <dr...@gmail.com>.
On Wed, May 19, 2010 at 3:49 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:


> I cannot imagine how one could ever get LDA to scale if it is always
> limited to a single input vector file. Is there a way to get multiple output
> vector files from seqtosparse?
>

I don't know offhand, but is the default input split size (mapred.min.split.size)
too large for this particular use case? (If it is 0/unspecified, it
defaults to the block size, which is 64MB.) I wonder if setting it smaller
will allow more mappers to spawn.
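
For reference, my recollection of the split computation in the old-API
FileInputFormat (from memory, so verify against the 0.20 source):

    goalSize  = totalInputBytes / max(1, mapred.map.tasks)
    splitSize = max(mapred.min.split.size, min(goalSize, blockSize))

so with an 18MB input, raising mapred.map.tasks (or shrinking the relevant
split properties) is what should coax out extra mappers.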