Posted to user@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2010/05/24 00:26:18 UTC

Re: Mahout LDA Parameter: maxIter (--numWords actually)

I added a try block in the mapper which catches the exception and outputs:

java.lang.IllegalStateException: This is probably because the --numWords 
argument is set too small.
     It needs to be >= than the number of words (terms actually) in the 
corpus and can be
     larger if some storage inefficiency can be tolerated.
     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
     at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
     at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
     at 
org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
     at 
org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
     at 
org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
     ... 5 more

I'll commit that for now while we explore a more elegant solution.
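
For what it's worth, a minimal, self-contained sketch of the pattern (not the committed LDAMapper change; the class, method, and variable names below are made up for illustration):

// Sketch only: translate a bare ArrayIndexOutOfBoundsException from a
// too-small term array into an IllegalStateException that names the likely
// misconfiguration (--numWords smaller than the corpus vocabulary).
public final class NumWordsGuardSketch {

  static double lookup(double[] topicTermLogProbs, int termIndex) {
    try {
      return topicTermLogProbs[termIndex];
    } catch (ArrayIndexOutOfBoundsException e) {
      throw new IllegalStateException(
          "Term index " + termIndex + " is out of range; --numWords is probably"
              + " set smaller than the number of unique terms in the corpus.", e);
    }
  }

  public static void main(String[] args) {
    double[] logProbs = new double[100];         // as if --numWords had been 100
    System.out.println(lookup(logProbs, 42));    // fine
    System.out.println(lookup(logProbs, 3816));  // fails with the friendlier message
  }
}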


On 5/23/10 2:45 PM, Sean Owen wrote:
> Even something as simple as checking that bound and throwing
> IllegalStateException with a custom message -- yeah I imagine it's
> hard to detect this anytime earlier. Just a thought.
>
> On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>    
>> I agree it is not very friendly. It's impossible to tell the correct value during
>> options processing. It needs to be >= the actual number of
>> unique terms in the corpus, and that is hard to anticipate, though I think it
>> is known in seq2sparse. If it turns out to be the dictionary size (I'm
>> investigating), then it could be computed by adding a dictionary path
>> argument instead of the current option. The trouble with that is the dictionary
>> is not needed for anything else by LDA.
>>
>>      
>    
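
A rough sketch of the dictionary-counting idea above, assuming the seq2sparse dictionary is a SequenceFile of Text term / IntWritable index pairs (the class and method names below are illustrative, not existing Mahout code):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative only: derive a safe --numWords by counting dictionary entries.
public final class DictionarySizeSketch {

  public static int countTerms(Configuration conf, Path dictionaryPath) throws IOException {
    FileSystem fs = FileSystem.get(dictionaryPath.toUri(), conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictionaryPath, conf);
    Text term = new Text();
    IntWritable index = new IntWritable();
    int count = 0;
    try {
      while (reader.next(term, index)) {
        count++;
      }
    } finally {
      reader.close();
    }
    return count;
  }

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path dictionary = new Path(args[0]);  // e.g. the dictionary file written by seq2sparse
    System.out.println("numWords should be >= " + countTerms(conf, dictionary));
  }
}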


Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Ted Dunning <te...@gmail.com>.
getQuick is a real problem and usually offers very, very little performance
benefit.

I think that keeping it is important, but using it is often not such a great
idea.  On reading new code, I would be quite suspicious of its use unless it
is obviously critical to performance.

Sean's suggestion that iterators be used instead is quite good.  Colt didn't
originally have good iterator patterns, which is why getQuick was so
important to have (and use) back then.  Now that we have a more modern Java,
we should strongly encourage the iterator style.
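
To make the iterator suggestion concrete, a small sketch contrasting the two styles. It assumes the Vector interface's iterateNonZero() returning Vector.Element, as I understand the current API; treat the exact signatures as assumptions rather than gospel:

import java.util.Iterator;

import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

// Sketch contrasting index-based getQuick() access with iterator-style access.
public final class IteratorVsGetQuickSketch {

  public static void main(String[] args) {
    Vector v = new DenseVector(new double[] {0.0, 1.5, 0.0, 2.5});

    // getQuick() style: no bounds check, so a bad index fails late and obscurely.
    double sumQuick = 0.0;
    for (int i = 0; i < v.size(); i++) {
      sumQuick += v.getQuick(i);
    }

    // Iterator style: the vector hands out valid elements itself, so there is
    // no index arithmetic for the caller to get wrong.
    double sumIter = 0.0;
    Iterator<Vector.Element> it = v.iterateNonZero();
    while (it.hasNext()) {
      sumIter += it.next().get();
    }

    System.out.println(sumQuick + " == " + sumIter);
  }
}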

On Sun, May 23, 2010 at 3:47 PM, Sean Owen <sr...@gmail.com> wrote:

> I might hijack this to question the existence of getQuick(). It's for
> situations where the caller "knows" the dimension is in bounds, but
> perhaps it's too tempting to just call this even when that's not
> known. Here calling get() would have just resulted in a different and
> only slightly-better exception. In another implementation it might
> have silently failed by returning a zero or something though.
>
> One thought is to remove getQuick(). It has performance implications.
> Maybe many uses of getQuick() could be better constructed as
> iterations, I don't know.
>

Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Sean Owen <sr...@gmail.com>.
It might be worth thinking about, since catching over a fair bit of code
might also catch other, unrelated conditions. Isn't the issue that phiW may
not have the same cardinality as nextGamma? If that's it, that's an easy
check, but maybe that's not it.
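
Something along these lines is presumably the easy check meant here; phiW and nextGamma below are just placeholders whose shapes I am guessing at, not the actual LDAInference fields:

// Illustrative only: fail fast, with a clear message, if two per-word
// structures disagree on cardinality instead of letting an index blow up later.
public final class CardinalityCheckSketch {

  static void checkSameCardinality(double[] phiW, double[] nextGamma) {
    if (phiW.length != nextGamma.length) {
      throw new IllegalStateException("phiW cardinality " + phiW.length
          + " != nextGamma cardinality " + nextGamma.length);
    }
  }

  public static void main(String[] args) {
    checkSameCardinality(new double[20], new double[20]);    // fine
    checkSameCardinality(new double[20], new double[3816]);  // fails with a clear message
  }
}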


I might hijack this to question the existence of getQuick(). It's for
situations where the caller "knows" the dimension is in bounds, but
perhaps it's too tempting to just call this even when that's not
known. Here calling get() would have just resulted in a different and
only slightly-better exception. In another implementation it might
have silently failed by returning a zero or something though.

One thought is to remove getQuick(). It has performance implications.
Maybe many uses of getQuick() could be better constructed as
iterations, I don't know.

But another, smaller change would be to use get() and "assert" the range
checks. Unit tests would enable these; production code wouldn't, since
they're off by default.

Is that compelling to anyone?
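
A concrete version of the assert idea, for what it's worth; the class below is just an illustration, not proposed Mahout code. Assertions run only when the JVM is started with -ea (which unit test runners typically do) and cost essentially nothing otherwise:

// Sketch of get()-with-assert: the assert documents and enforces the bound
// under 'java -ea' (e.g. in unit tests) while staying free in production
// runs, where assertions are disabled by default.
public final class AssertRangeCheckSketch {

  private final double[] values;

  AssertRangeCheckSketch(int size) {
    this.values = new double[size];
  }

  double get(int index) {
    assert index >= 0 && index < values.length
        : "index " + index + " out of range [0, " + values.length + ")";
    return values[index];
  }

  public static void main(String[] args) {
    AssertRangeCheckSketch m = new AssertRangeCheckSketch(100);
    System.out.println(m.get(42));    // fine
    System.out.println(m.get(3816));  // AssertionError under -ea, plain AIOOBE otherwise
  }
}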

On Sun, May 23, 2010 at 11:35 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> It's caused by a getQuick in the bowels of infer(). Adding the test to every
> access inside the loop seems more expensive than just catching the exception
> in the mapper.
>
> On 5/23/10 3:29 PM, Sean Owen wrote:
>>
>> Switching to dev list.
>>
>> I don't want to belabor a small point, but I'm wondering whether it's
>> just better to check whatever array access causes the problem before
>> it's accessed?
>>
>> Meaning...
>>
>> if (foo >= array.length) {
>>   throw new IllegalStateException();
>> }
>> ... array[foo] ...
>>
>> instead of
>>
>> try {
>> ... array[foo] ...
>> } catch (ArrayIndexOutOfBoundsException aioobe) {
>>   throw new IllegalStateException();
>> }
>>
>> On Sun, May 23, 2010 at 11:26 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>  wrote:
>>
>>>
>>> I added a try block in the mapper which catches the exception and
>>> outputs:
>>>
>>> java.lang.IllegalStateException: This is probably because the --numWords
>>> argument is set too small.
>>>    It needs to be >= than the number of words (terms actually) in the
>>> corpus
>>> and can be
>>>    larger if some storage inefficiency can be tolerated.
>>>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
>>>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
>>>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>>    at
>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
>>>    at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
>>>    at
>>>
>>> org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
>>>    at
>>>
>>> org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
>>>    at
>>>
>>> org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
>>>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
>>>    ... 5 more
>>>
>>> I'll commit that for now while we explore a more elegant solution.
>>>
>>>
>>> On 5/23/10 2:45 PM, Sean Owen wrote:
>>>
>>>>
>>>> Even something as simple as checking that bound and throwing
>>>> IllegalStateException with a custom message -- yeah I imagine it's
>>>> hard to detect this anytime earlier. Just a thought.
>>>>
>>>> On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
>>>> <jd...@windwardsolutions.com>    wrote:
>>>>
>>>>
>>>>>
>>>>> I agree it is not very friendly. It's impossible to tell the correct
>>>>> value during options processing. It needs to be >= the actual number of
>>>>> unique terms in the corpus, and that is hard to anticipate, though I
>>>>> think it is known in seq2sparse. If it turns out to be the dictionary
>>>>> size (I'm investigating), then it could be computed by adding a
>>>>> dictionary path argument instead of the current option. The trouble
>>>>> with that is the dictionary is not needed for anything else by LDA.
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>

Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
It's caused by a getQuick in the bowels of infer(). Adding the test to 
every access inside the loop seems more expensive than just catching the 
exception in the mapper.

On 5/23/10 3:29 PM, Sean Owen wrote:
> Switching to dev list.
>
> I don't want to belabor a small point, but I'm wondering whether it's
> just better to check whatever array access causes the problem before
> it's accessed?
>
> Meaning...
>
> if (foo >= array.length) {
>    throw new IllegalStateException();
> }
> ... array[foo] ...
>
> instead of
>
> try {
> ... array[foo] ...
> } catch (ArrayIndexOutOfBoundsException aioobe) {
>    throw new IllegalStateException();
> }
>
> On Sun, May 23, 2010 at 11:26 PM, Jeff Eastman
> <jd...@windwardsolutions.com>  wrote:
>    
>> I added a try block in the mapper which catches the exception and outputs:
>>
>> java.lang.IllegalStateException: This is probably because the --numWords
>> argument is set too small.
>>     It needs to be >= than the number of words (terms actually) in the corpus
>> and can be
>>     larger if some storage inefficiency can be tolerated.
>>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
>>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
>> Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
>>     at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
>>     at
>> org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
>>     at
>> org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
>>     at
>> org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
>>     at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
>>     ... 5 more
>>
>> I'll commit that for now while we explore a more elegant solution.
>>
>>
>> On 5/23/10 2:45 PM, Sean Owen wrote:
>>      
>>> Even something as simple as checking that bound and throwing
>>> IllegalStateException with a custom message -- yeah I imagine it's
>>> hard to detect this anytime earlier. Just a thought.
>>>
>>> On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
>>> <jd...@windwardsolutions.com>    wrote:
>>>
>>>        
>>>> I agree it is not very friendly. It's impossible to tell the correct
>>>> value during options processing. It needs to be >= the actual number of
>>>> unique terms in the corpus, and that is hard to anticipate, though I
>>>> think it is known in seq2sparse. If it turns out to be the dictionary
>>>> size (I'm investigating), then it could be computed by adding a
>>>> dictionary path argument instead of the current option. The trouble with
>>>> that is the dictionary is not needed for anything else by LDA.
>>>>
>>>>
>>>>          
>>>
>>>        
>>
>>      
>    


Re: Mahout LDA Parameter: maxIter (--numWords actually)

Posted by Sean Owen <sr...@gmail.com>.
Switching to dev list.

I don't want to belabor a small point, but I'm wondering whether it's
just better to check whatever array access causes the problem before
it's accessed?

Meaning...

if (foo >= array.length) {
  throw new IllegalStateException();
}
... array[foo] ...

instead of

try {
... array[foo] ...
} catch (ArrayIndexOutOfBoundsException aioobe) {
  throw new IllegalStateException();
}

On Sun, May 23, 2010 at 11:26 PM, Jeff Eastman
<jd...@windwardsolutions.com> wrote:
> I added a try block in the mapper which catches the exception and outputs:
>
> java.lang.IllegalStateException: This is probably because the --numWords
> argument is set too small.
>    It needs to be >= than the number of words (terms actually) in the corpus
> and can be
>    larger if some storage inefficiency can be tolerated.
>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:49)
>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:1)
>    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
>    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
>    at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 3816
>    at org.apache.mahout.math.DenseMatrix.getQuick(DenseMatrix.java:77)
>    at
> org.apache.mahout.clustering.lda.LDAState.logProbWordGivenTopic(LDAState.java:44)
>    at
> org.apache.mahout.clustering.lda.LDAInference.eStepForWord(LDAInference.java:205)
>    at
> org.apache.mahout.clustering.lda.LDAInference.infer(LDAInference.java:103)
>    at org.apache.mahout.clustering.lda.LDAMapper.map(LDAMapper.java:47)
>    ... 5 more
>
> I'll commit that for now while we explore a more elegant solution.
>
>
> On 5/23/10 2:45 PM, Sean Owen wrote:
>>
>> Even something as simple as checking that bound and throwing
>> IllegalStateException with a custom message -- yeah I imagine it's
>> hard to detect this anytime earlier. Just a thought.
>>
>> On Sun, May 23, 2010 at 6:29 PM, Jeff Eastman
>> <jd...@windwardsolutions.com>  wrote:
>>
>>>
>>> I agree it is not very friendly. It's impossible to tell the correct
>>> value during options processing. It needs to be >= the actual number of
>>> unique terms in the corpus, and that is hard to anticipate, though I
>>> think it is known in seq2sparse. If it turns out to be the dictionary
>>> size (I'm investigating), then it could be computed by adding a
>>> dictionary path argument instead of the current option. The trouble with
>>> that is the dictionary is not needed for anything else by LDA.
>>>
>>>
>>
>>
>
>