Posted to user@mahout.apache.org by 杨杰 <xt...@gmail.com> on 2010/05/22 14:40:53 UTC

Mahout LDA Parameter: maxIter

Hi, everyone

I'm trying out Mahout. When running LDA on the Reuters corpus
(http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/),
one parameter refuses to work: "maxIter". Without it, I cannot control
how many iterations to run.

My CMD is:
bin/mahout.hadoop lda --input mahout/seq-sparse-tf/vectors --output
mahout/seq-sparse-tf/lda-out5 --numWords 34000 --numTopics 20
--maxIter 1

But I got an exception:
10/05/22 20:32:11 ERROR lda.LDADriver: Exception
org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
	at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:100)
	at org.apache.mahout.clustering.lda.LDADriver.main(LDADriver.java:115)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
...

What's the problem? I'm using Mahout 0.3 and Hadoop 0.20.0.

Thank you!


-- 
Yang Jie(杨杰)
hi.baidu.com/thinkdifferent

Group of CLOUD, Xi'an Jiaotong University
Department of Computer Science and Technology, Xi’an Jiaotong University

PHONE: 86 1346888 3723
TEL: 86 29 82665263 EXT. 608
MSN: xtyangjie2004@yahoo.com.cn

once i didn't know software is not free, but found it days later; now
i realize that it's indeed free.

Re: Mahout LDA Parameter: maxIter

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, but it takes another 80 iterations to get there and the results, on 
Reuters at least, don't seem to improve that much.
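
Robin's rule of thumb, quoted below (iterate until the relative change in
log-likelihood is around 10^-4), can be sketched as follows. This is an
illustrative snippet only, not code from Mahout's LDADriver:

```java
// Illustrative convergence test (not Mahout code): stop iterating when the
// relative change in log-likelihood between iterations drops below a tolerance.
public class LLConvergence {
    static boolean converged(double prevLL, double currentLL, double tol) {
        // LL is negative; normalize the change by the previous magnitude.
        return Math.abs((currentLL - prevLL) / prevLL) < tol;
    }

    public static void main(String[] args) {
        // Relative change 1e-5: below the 1e-4 rule of thumb, so stop.
        System.out.println(converged(-100000.0, -99999.0, 1e-4));  // true
        // Relative change 1e-2: keep iterating.
        System.out.println(converged(-100000.0, -99000.0, 1e-4));  // false
    }
}
```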

On 5/22/10 5:01 PM, Robin Anil wrote:
> David's rule of thumb was to let the iterations go until relative change in
> LL becomes around 10^-4
>
> Robin
>
> On Sat, May 22, 2010 at 9:12 PM, Jeff Eastman<jd...@windwardsolutions.com>wrote:
>
>    
>> I suggest you try running with a trunk checkout and upgrading to Hadoop
>> 0.20.2. Mahout is still in motion and I've run LDA on Reuters on trunk in
>> the last few days. The maxIter parameter should not be an issue; you could
>> try removing it entirely and LDA will default to running to convergence
>> (about 100 iterations which can take some time). I've found the Reuters
>> results don't change too much after 20. Even with a clean trunk checkout
>> Reuters will only use a single node and the iterations should take about 5
>> mins each. If you want to run on a multi-node cluster, install the patch in
>> MAHOUT-397 (
>>
>>
>> https://issues.apache.org/jira/browse/MAHOUT-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel)
>> and use the same arguments as in examples/bin/build-reuters.sh. Even on a
>> 3-node cluster this brings the iteration time down to about a minute and a
>> half which is worth doing.
>>
>> Hope this helps,
>> Jeff
>>
>> http://www.windwardsolutions.com


Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Ted Dunning <te...@gmail.com>.
getQuick is a real problem and usually provides very, very little
performance benefit.

I think that keeping it is important, but using it is often not such a great
idea.  On reading new code, I would be quite suspicious of its use unless it
is obviously critical to performance.

Sean's suggestion that iterators be used instead is quite good.  Colt didn't
originally have good iterator patterns, which is why getQuick was so
important to have (and use) back then.  Now that we have a more modern Java,
we should strongly encourage the iterator style.

On Sun, May 23, 2010 at 3:47 PM, Sean Owen <sr...@gmail.com> wrote:

> I might hijack this to question the existence of getQuick(). It's for
> situations where the caller "knows" the dimension is in bounds, but
> perhaps it's too tempting to just call this even when that's not
> known. Here calling get() would have just resulted in a different and
> only slightly-better exception. In another implementation it might
> have silently failed by returning a zero or something though.
>
> One thought is to remove getQuick(). It has performance implications.
> Maybe many uses of getQuick() could be better constructed as
> iterations, I don't know.
>

Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Sean Owen <sr...@gmail.com>.
It might be worth thinking about, since catching this across a fair
bit of code might also catch other, unrelated conditions. Is the issue
that phiW may not have the same cardinality as nextGamma? If that's it,
it's an easy check, but maybe that's not it.


I might hijack this to question the existence of getQuick(). It's for
situations where the caller "knows" the dimension is in bounds, but
perhaps it's too tempting to just call this even when that's not
known. Here calling get() would have just resulted in a different and
only slightly-better exception. In another implementation it might
have silently failed by returning a zero or something though.

One thought is to remove getQuick(). It has performance implications.
Maybe many uses of getQuick() could be better constructed as
iterations, I don't know.

But another, smaller change would be to use get() and "assert" the
range checks. Unit tests would enable these; production code wouldn't.
They're off by default.

Is that compelling to anyone?
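
A minimal sketch of that get()-plus-assert idea (illustrative only; a toy
class, not Mahout's Vector API):

```java
// Sketch of get() with an asserted range check: with "java -ea" (as unit
// tests typically run) a bad index fails fast with a clear message; without
// -ea the assert is skipped entirely in the hot path.
public class CheckedStore {
    private final double[] values;

    CheckedStore(int size) {
        this.values = new double[size];
    }

    double get(int index) {
        assert index >= 0 && index < values.length
                : "index " + index + " out of bounds [0, " + values.length + ")";
        return values[index];
    }

    public static void main(String[] args) {
        CheckedStore s = new CheckedStore(3);
        System.out.println(s.get(2));  // 0.0 (default double value)
    }
}
```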


Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It's caused by a getQuick in the bowels of infer(). Adding the test to
every access inside the loop seems more expensive than just catching the
exception in the mapper.



Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Sean Owen <sr...@gmail.com>.
Switching to dev list.

I don't want to belabor a small point, but I'm wondering whether it's
just better to check whatever array access causes the problem before
it's accessed?

Meaning...

if (foo >= array.length) {
  throw new IllegalStateException();
}
... array[foo] ...

instead of

try {
... array[foo] ...
} catch (ArrayIndexOutOfBoundsException aioobe) {
  throw new IllegalStateException();
}


Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I added a try block in the mapper which catches the exception and outputs:

java.lang.IllegalStateException: This is probably because the --numWords 
argument is set too small.
     It needs to be >= than the number of words (terms actually) in the 
corpus and can be
     larger if some storage inefficiency can be tolerated.
     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
     at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
     at 
org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
     at 
org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
     at 
org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
     ... 5 more

I'll commit that for now while we explore a more elegant solution.
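
The shape of that mapper-level fix is roughly the following hedged sketch;
inferDocument is a hypothetical stand-in for the real call into
LDAInference.infer():

```java
// Sketch of catching the low-level AIOOBE once, at the mapper boundary, and
// rethrowing with a message that names the likely misconfiguration.
public class MapperGuard {
    // Hypothetical stand-in for LDAInference.infer(); here it just indexes
    // a word id into a model array sized by --numWords.
    static double inferDocument(double[] model, int wordId) {
        return model[wordId];
    }

    static double safeInfer(double[] model, int wordId) {
        try {
            return inferDocument(model, wordId);
        } catch (ArrayIndexOutOfBoundsException e) {
            throw new IllegalStateException(
                "This is probably because the --numWords argument is set too small", e);
        }
    }

    public static void main(String[] args) {
        double[] model = new double[100];          // numWords = 100
        System.out.println(safeInfer(model, 5));   // 0.0
        try {
            safeInfer(model, 3816);                // word id beyond numWords
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```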




Re: Mahout LDA Parameter: maxIter

Posted by Sean Owen <sr...@gmail.com>.
Even something as simple as checking that bound and throwing
IllegalStateException with a custom message -- yeah I imagine it's
hard to detect this anytime earlier. Just a thought.


Re: Mahout LDA Parameter: maxIter

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Changing it in the LDAState works fine (at least it runs Reuters with
--numWords=1), but numWords is also used to initialize the state data
in LDADriver.writeInitialState():

       double total = 0.0; // total number of pseudo counts we made
       for (int w = 0; w < numWords; ++w) {
         IntPairWritable kw = new IntPairWritable(k, w);
         // A small amount of random noise, minimized by having a floor.
         double pseudocount = random.nextDouble() + 1.0E-8;
         total += pseudocount;
         v.set(Math.log(pseudocount));
         writer.append(kw, v);
       }

I don't want to use Integer.MAX_VALUE here :)



Re: Mahout LDA Parameter: maxIter

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes it is a DenseMatrix. Providing a value that is too large just wastes 
some space. I'll try the random access approach and see what happens...




Re: Mahout LDA Parameter: maxIter

Posted by Ted Dunning <te...@gmail.com>.
What happens if the number is too large?  Is this a dense matrix we are
talking about?

Would it work to make it a random access sparse matrix with very, very large
bounds?
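
A generic illustration of that idea in plain JDK terms (a toy sketch, not
Mahout's actual matrix classes): back the topic-word matrix with a hash map,
so the declared bounds can be huge while memory is paid only for the cells
actually written.

```java
import java.util.HashMap;
import java.util.Map;

// Toy random-access sparse matrix: O(1) expected get/set, memory
// proportional to the number of nonzero cells rather than rows * cols.
public class SparseTopicWordMatrix {
    private final Map<Long, Double> cells = new HashMap<>();
    private final int rows;
    private final int cols;

    SparseTopicWordMatrix(int rows, int cols) {
        this.rows = rows;
        this.cols = cols;
    }

    void set(int row, int col, double value) {
        cells.put((long) row * cols + col, value);
    }

    double get(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException(row + "," + col);
        }
        return cells.getOrDefault((long) row * cols + col, 0.0);
    }

    public static void main(String[] args) {
        // Bounds can be very large; only touched cells cost memory.
        SparseTopicWordMatrix m = new SparseTopicWordMatrix(20, Integer.MAX_VALUE);
        m.set(3, 50_000, -2.5);
        System.out.println(m.get(3, 50_000));  // -2.5
        System.out.println(m.get(0, 123));     // 0.0
    }
}
```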


Re: Mahout LDA Parameter: maxIter

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I agree it is not very friendly. It's impossible to tell the correct value
during options processing. It needs to be >= the actual number of unique
terms in the corpus, and that is hard to anticipate, though I think it is
known in seq2sparse. If it turns out to be the dictionary size (I'm
investigating), then it could be computed by adding a dictionary path
argument instead of the current option. The trouble with that is the
dictionary is not needed for anything else by LDA.



Re: Mahout LDA Parameter: maxIter

Posted by Sean Owen <sr...@gmail.com>.
Is there a way to catch that with a more descriptive error earlier? I always
think AIOOBE looks bad.


Re: Mahout LDA Parameter: maxIter

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Yes, your -numWords option is set too low and that's causing the array 
exception. Try -v 50000.

On 5/23/10 3:20 AM, 杨杰 wrote:
> Jeff and Robin,
>
> Thank you for your suggestion! There is another problem: having compiled
> the source from trunk and applied the MAHOUT-397 patch, I retried the LDA
> experiment, but another exception was thrown:
>
> 10/05/23 17:01:52 INFO common.HadoopUtil: Deleting mahout/seq-sparse-tf/lda-out
> 10/05/23 17:01:55 INFO lda.LDADriver: Iteration 1
> 10/05/23 17:01:55 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the
> same.
> 10/05/23 17:01:56 INFO input.FileInputFormat: Total input paths to process : 1
> 10/05/23 17:01:56 INFO mapred.JobClient: Running job: job_201005231654_0001
> 10/05/23 17:01:57 INFO mapred.JobClient:  map 0% reduce 0%
> 10/05/23 17:02:10 INFO mapred.JobClient: Task Id :
> attempt_201005231654_0001_m_000000_0, Status : FAILED
> java.lang.ArrayIndexOutOfBoundsException: 123
> 	at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:106)
> 	at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:45)
> 	at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
> 	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
> 	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
> 	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
> 	at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> The command is the same as the previous one, except for "-ow", which was
> "-w" in the 0.3 distribution; the dataset is also the same as with Mahout
> 0.3 (where the experiment works, except that each iteration runs *only
> one map*).
>
> Is it because some other patches are missing? Or is there some other
> mistake in my setup?
>
> Thank you!


Re: Mahout LDA Parameter: maxIter

Posted by Robin Anil <ro...@gmail.com>.
It's caused by not setting the correct word count, I believe. Use the same
value as the dictionary count. It has to be fixed one of these days.


Robin
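
One hedged sketch of deriving that value: if the seq2sparse dictionary has
been dumped to a plain text file (assumed here to be one term per line; the
real dictionary artifact is typically a binary SequenceFile), --numWords can
be set to at least its line count. The file name used below is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.stream.Stream;

// Sketch: derive a safe lower bound for --numWords from a plain-text
// dictionary dump (assumed format: one term per line).
public class NumWordsFromDictionary {
    static long countTerms(Path dictionaryFile) throws IOException {
        try (Stream<String> lines = Files.lines(dictionaryFile)) {
            return lines.filter(line -> !line.trim().isEmpty()).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a dumped dictionary file.
        Path dict = Files.createTempFile("dictionary", ".txt");
        Files.write(dict, Arrays.asList("cocoa", "grain", "wheat"));
        System.out.println(countTerms(dict));  // 3
    }
}
```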
On Sun, May 23, 2010 at 3:50 PM, 杨杰 <xt...@gmail.com> wrote:

> Jeff and Robin,
>
> Thank you for your suggestion! There is another problem: Compiled the
> source from trunk and applied the patch MAHOUT-397, I retried the lda
> experiment, but another exception was thrown:
>
> 10/05/23 17:01:52 INFO common.HadoopUtil: Deleting
> mahout/seq-sparse-tf/lda-out
> 10/05/23 17:01:55 INFO lda.LDADriver: Iteration 1
> 10/05/23 17:01:55 WARN mapred.JobClient: Use GenericOptionsParser for
> parsing the arguments. Applications should implement Tool for the
> same.
> 10/05/23 17:01:56 INFO input.FileInputFormat: Total input paths to process
> : 1
> 10/05/23 17:01:56 INFO mapred.JobClient: Running job: job_201005231654_0001
> 10/05/23 17:01:57 INFO mapred.JobClient:  map 0% reduce 0%
> 10/05/23 17:02:10 INFO mapred.JobClient: Task Id :
> attempt_201005231654_0001_m_000000_0, Status : FAILED
> java.lang.ArrayIndexOutOfBoundsException: 123
>        at
> org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:106)
>        at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:45)
>        at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
>        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> The COMMAND is the same as the former one except "-ow" which is "-w"
> in 0.3 distribution; dataset is also the same with mahout 0.3 (on
> which the experiment works ok except for *only one map* in each
> iteration~).
>
> Is it because of absence of some other patches? Or is there any other
> mistakes  in my operations?
>
> Thank you!
>
>

Re: Mahout LDA Parameter: maxIter

Posted by 杨杰 <xt...@gmail.com>.
Jeff and Robin,

Thank you for your suggestions! There is another problem: having compiled
the source from trunk and applied the patch MAHOUT-397, I retried the LDA
experiment, but another exception was thrown:

10/05/23 17:01:52 INFO common.HadoopUtil: Deleting mahout/seq-sparse-tf/lda-out
10/05/23 17:01:55 INFO lda.LDADriver: Iteration 1
10/05/23 17:01:55 WARN mapred.JobClient: Use GenericOptionsParser for
parsing the arguments. Applications should implement Tool for the
same.
10/05/23 17:01:56 INFO input.FileInputFormat: Total input paths to process : 1
10/05/23 17:01:56 INFO mapred.JobClient: Running job: job_201005231654_0001
10/05/23 17:01:57 INFO mapred.JobClient:  map 0% reduce 0%
10/05/23 17:02:10 INFO mapred.JobClient: Task Id :
attempt_201005231654_0001_m_000000_0, Status : FAILED
java.lang.ArrayIndexOutOfBoundsException: 123
	at org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:106)
	at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:45)
	at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:36)
	at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
	at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
	at org.apache.hadoop.mapred.Child.main(Child.java:170)

The COMMAND is the same as before except for "-ow", which was "-w" in the
0.3 distribution; the dataset is also the same as with Mahout 0.3 (where
the experiment ran fine, except that each iteration used *only one map*).

Is it because some other patches are missing? Or is there some other
mistake in my setup?

Thank you!


On Sun, May 23, 2010 at 8:01 AM, Robin Anil <ro...@gmail.com> wrote:
> David's rule of thumb was to let the iterations go until relative change in
> LL becomes around 10^-4
>
> Robin



-- 
Yang Jie(杨杰)
hi.baidu.com/thinkdifferent

Group of CLOUD, Xi'an Jiaotong University
Department of Computer Science and Technology, Xi’an Jiaotong University

PHONE: 86 1346888 3723
TEL: 86 29 82665263 EXT. 608
MSN: xtyangjie2004@yahoo.com.cn

once i didn't know software is not free, but found it days later; now
i realize that it's indeed free.

Re: Mahout LDA Parameter: maxIter

Posted by Robin Anil <ro...@gmail.com>.
David's rule of thumb was to let the iterations run until the relative
change in LL drops to around 10^-4.

Robin
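
As a sketch of that stopping rule (illustrative Python, assuming you collect
the per-iteration log-likelihood values the driver reports):

```python
def ll_converged(ll_history, tol=1e-4):
    """True once the relative change in log-likelihood between the
    last two iterations falls below tol (David's ~10^-4 rule)."""
    if len(ll_history) < 2:
        return False
    prev, curr = ll_history[-2], ll_history[-1]
    return abs(curr - prev) / abs(prev) < tol

print(ll_converged([-1.52e6, -1.487e6]))         # still moving: False
print(ll_converged([-1.48e6, -1.4799e6]))        # change ~6.8e-5: True
```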

On Sat, May 22, 2010 at 9:12 PM, Jeff Eastman <jd...@windwardsolutions.com>wrote:

> I suggest you try running with a trunk checkout and upgrading to Hadoop
> 0.20.2. Mahout is still in motion and I've run LDA on Reuters on trunk in
> the last few days. The maxIter parameter should not be an issue; you could
> try removing it entirely and LDA will default to running to convergence
> (about 100 iterations which can take some time). I've found the Reuters
> results don't change too much after 20. Even with a clean trunk checkout
> Reuters will only use a single node and the iterations should take about 5
> mins each. If you want to run on a multi-node cluster, install the patch in
> MAHOUT-397 (
>
>
> https://issues.apache.org/jira/browse/MAHOUT-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel)
> and use the same arguments as in examples/bin/build-reuters.sh. Even on a
> 3-node cluster this brings the iteration time down to about a minute and a
> half which is worth doing.
>
> Hope this helps,
> Jeff
>
> http://www.windwardsolutions.com
>

Re: Mahout LDA Parameter: maxIter

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I suggest you try running with a trunk checkout and upgrading to Hadoop 
0.20.2. Mahout is still in motion and I've run LDA on Reuters on trunk 
in the last few days. The maxIter parameter should not be an issue; you 
could try removing it entirely and LDA will default to running to 
convergence (about 100 iterations which can take some time). I've found 
the Reuters results don't change too much after 20. Even with a clean 
trunk checkout Reuters will only use a single node and the iterations 
should take about 5 mins each. If you want to run on a multi-node 
cluster, install the patch in MAHOUT-397 (

https://issues.apache.org/jira/browse/MAHOUT-397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel) and use the same arguments as in examples/bin/build-reuters.sh. Even on a 3-node cluster this brings the iteration time down to about a minute and a half which is worth doing.

Hope this helps,
Jeff

http://www.windwardsolutions.com



On 5/22/10 5:40 AM, 杨杰 wrote:
> Hi, everyone
>
> I'm trying mahout now. When running LDA on reuter corpus
> (http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/),
> A parameter refuses to work. This parameter is "maxIter", without
> which, i cannot decide the iteration to run~
>
> My CMD is:
> bin/mahout.hadoop lda --input mahout/seq-sparse-tf/vectors --output
> mahout/seq-sparse-tf/lda-out5 --numWords 34000 --numTopics 20
> --maxIter 1
>
> But got a exception:
> 10/05/22 20:32:11 ERROR lda.LDADriver: Exception
> org.apache.commons.cli2.OptionException: Unexpected 2 while processing Options
> 	at org.apache.commons.cli2.commandline.Parser.parse(Parser.java:100)
> 	at org.apache.mahout.clustering.lda.LDADriver.main(LDADriver.java:115)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
> 	at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
> 	at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
> ...
>
> What's the problem? I'm using version 0.3 & Hadoop 0.20.0.
>
> Thank you!
>
>
>