Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2009/12/19 23:07:14 UTC

Array out of bounds in the KMeans driver

../hadoop-0.20.1/bin/hadoop jar
examples/target/mahout-examples-0.3-SNAPSHOT.job
org.apache.mahout.clustering.kmeans.KMeansDriver  --input
testdata/he_mahout_vector -c clusters -o output -m
org.apache.mahout.common.distance.ManhattanDistanceMeasure -k 7


09/12/19 17:02:18 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new decompressor
09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
	at java.util.ArrayList.RangeCheck(ArrayList.java:547)
	at java.util.ArrayList.get(ArrayList.java:322)
	at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:94)
	at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:158)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
	at java.lang.reflect.Method.invoke(Method.java:597)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Re: Array out of bounds in the KMeans driver

Posted by Benson Margulies <bi...@gmail.com>.
Oh, duh. Well, being extremely short can't be a good sign.

On Sat, Dec 19, 2009 at 7:13 PM, Benson Margulies <bi...@gmail.com> wrote:
> Well, hmm. How would I inspect the result of the lucene Driver to see
> if, in fact, it favored me with any actual vectors?
>
> On Sat, Dec 19, 2009 at 5:31 PM, Drew Farris <dr...@gmail.com> wrote:
>> The stack trace below suggests that the input file may not contain any
>> vectors, because the code inside the while loop at line 77 (especially
>> the conditional starting at line 82) never fired and thus either (or
>> both) chosenTests or chosenClusters is empty.
>>
>> There should likely be some form of bounds check on line 98 of
>> RandomSeedGenerator, but that wouldn't solve the problem if the input
>> file is empty.
>>
>> On Sat, Dec 19, 2009 at 5:07 PM, Benson Margulies <bi...@gmail.com> wrote:
>>> ../hadoop-0.20.1/bin/hadoop jar
>>> examples/target/mahout-examples-0.3-SNAPSHOT.job
>>> org.apache.mahout.clustering.kmeans.KMeansDriver  --input
>>> testdata/he_mahout_vector -c clusters -o output -m
>>> org.apache.mahout.common.distance.ManhattanDistanceMeasure -k 7
>>>
>>>
>>> 09/12/19 17:02:18 WARN util.NativeCodeLoader: Unable to load
>>> native-hadoop library for your platform... using builtin-java classes
>>> where applicable
>>> 09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new decompressor
>>> 09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new compressor
>>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>>        at java.util.ArrayList.get(ArrayList.java:322)
>>>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:94)
>>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:158)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>
>>
>

Re: Array out of bounds in the KMeans driver

Posted by Benson Margulies <bi...@gmail.com>.
Well, hmm. How would I inspect the result of the lucene Driver to see
if, in fact, it favored me with any actual vectors?

On Sat, Dec 19, 2009 at 5:31 PM, Drew Farris <dr...@gmail.com> wrote:
> The stack trace below suggests that the input file may not contain any
> vectors, because the code inside the while loop at line 77 (especially
> the conditional starting at line 82) never fired and thus either (or
> both) chosenTests or chosenClusters is empty.
>
> There should likely be some form of bounds check on line 98 of
> RandomSeedGenerator, but that wouldn't solve the problem if the input
> file is empty.
>
> On Sat, Dec 19, 2009 at 5:07 PM, Benson Margulies <bi...@gmail.com> wrote:
>> ../hadoop-0.20.1/bin/hadoop jar
>> examples/target/mahout-examples-0.3-SNAPSHOT.job
>> org.apache.mahout.clustering.kmeans.KMeansDriver  --input
>> testdata/he_mahout_vector -c clusters -o output -m
>> org.apache.mahout.common.distance.ManhattanDistanceMeasure -k 7
>>
>>
>> 09/12/19 17:02:18 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new decompressor
>> 09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new compressor
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>>        at java.util.ArrayList.get(ArrayList.java:322)
>>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:94)
>>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:158)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>        at java.lang.reflect.Method.invoke(Method.java:597)
>>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>
>

Re: Array out of bounds in the KMeans driver

Posted by Benson Margulies <bi...@gmail.com>.
It didn't have term vectors.

On Sat, Dec 19, 2009 at 8:43 PM, Drew Farris <dr...@gmail.com> wrote:
> Does the IndexFiles class store term vectors for the contents field?
> If not, that could be the problem.
>
> Also, you can try dumping the vector file using
> o.a.m.utils.vectors.VectorDumper in mahout-utils and taking a look to
> see what's in there.
>
> Failing that, in mahout-examples, you can run ./bin/build-reuters.sh
> -- that will generate a known good set of vectors and you can try
> running clustering upon that. There is no need to let build-reuters.sh
> run to completion: watch stdout and kill it once the vectors are done,
> because it will then start running LDA, which you're not really
> interested in at this point. Once this is run, the vectors themselves
> can be found in work/vectors and the dictionary in work/dict.txt
> (relative to the mahout-examples directory).
>
> On Sat, Dec 19, 2009 at 7:41 PM, Benson Margulies <bi...@gmail.com> wrote:
>> So,
>>
>> I took the stock Lucene 'IndexFiles' class. I modified it to read
>> UTF-8. I ran it.
>>
>> I ran the following:
>>
>> java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver \
>>   --dir he_lucene_index \
>>   --output he_mahout_vector --field contents --dictOut he_mahout_dict \
>>   --idField path
>>
>> and am rewarded with a tiny file of vectors. Clearly I'm messing something up.
>>
>

Re: Array out of bounds in the KMeans driver

Posted by Drew Farris <dr...@gmail.com>.
Does the IndexFiles class store term vectors for the contents field?
If not, that could be the problem.

Also, you can try dumping the vector file using
o.a.m.utils.vectors.VectorDumper in mahout-utils and taking a look to
see what's in there.

Failing that, in mahout-examples, you can run ./bin/build-reuters.sh
-- that will generate a known good set of vectors and you can try
running clustering upon that. There is no need to let build-reuters.sh
run to completion: watch stdout and kill it once the vectors are done,
because it will then start running LDA, which you're not really
interested in at this point. Once this is run, the vectors themselves
can be found in work/vectors and the dictionary in work/dict.txt
(relative to the mahout-examples directory).

On Sat, Dec 19, 2009 at 7:41 PM, Benson Margulies <bi...@gmail.com> wrote:
> So,
>
> I took the stock Lucene 'IndexFiles' class. I modified it to read
> UTF-8. I ran it.
>
> I ran the following:
>
> java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver \
>   --dir he_lucene_index \
>   --output he_mahout_vector --field contents --dictOut he_mahout_dict \
>   --idField path
>
> and am rewarded with a tiny file of vectors. Clearly I'm messing something up.
>

Re: Array out of bounds in the KMeans driver

Posted by Benson Margulies <bi...@gmail.com>.
So,

I took the stock Lucene 'IndexFiles' class. I modified it to read
UTF-8. I ran it.

I ran the following:

java -cp $cp org.apache.mahout.utils.vectors.lucene.Driver \
   --dir he_lucene_index \
   --output he_mahout_vector --field contents --dictOut he_mahout_dict \
   --idField path

and am rewarded with a tiny file of vectors. Clearly I'm messing something up.

Re: Array out of bounds in the KMeans driver

Posted by Drew Farris <dr...@gmail.com>.
The stack trace below suggests that the input file may not contain any
vectors, because the code inside the while loop at line 77 (especially
the conditional starting at line 82) never fired and thus either (or
both) chosenTests or chosenClusters is empty.

There should likely be some form of bounds check on line 98 of
RandomSeedGenerator, but that wouldn't solve the problem if the input
file is empty.
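
To make the bounds-check idea concrete, here is a minimal Java sketch.
It is not the actual RandomSeedGenerator code: the class name
SeedSelectionSketch and the method chooseSeeds are hypothetical names
made up for illustration, and the reservoir sampling is a generic
stand-in technique, not necessarily what Mahout does. It only shows how
a guard could replace the bare IndexOutOfBoundsException with a message
that points at the empty input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; this is NOT Mahout's RandomSeedGenerator.
final class SeedSelectionSketch {

  // Picks up to k items uniformly at random from the input using
  // reservoir sampling, and fails with a descriptive message when the
  // input turns out to contain nothing at all.
  static <T> List<T> chooseSeeds(Iterable<T> input, int k, Random random) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive, got " + k);
    }
    List<T> chosen = new ArrayList<T>(k);
    int seen = 0;
    for (T item : input) {
      seen++;
      if (chosen.size() < k) {
        // Fill the reservoir with the first k items.
        chosen.add(item);
      } else {
        // Replace an existing entry with probability k / seen.
        int slot = random.nextInt(seen);
        if (slot < k) {
          chosen.set(slot, item);
        }
      }
    }
    // The guard: report the real cause instead of failing later on get(0).
    if (chosen.isEmpty()) {
      throw new IllegalStateException(
          "No input vectors were read; cannot choose " + k + " seeds");
    }
    return chosen;
  }

  public static void main(String[] args) {
    // An empty list simulates an input file that contains no vectors.
    List<String> vectors = new ArrayList<String>();
    try {
      chooseSeeds(vectors, 7, new Random());
    } catch (IllegalStateException e) {
      System.out.println("Caught: " + e.getMessage());
    }
  }
}
```

With an empty input this throws an IllegalStateException that names the
real problem. As noted above, such a check only improves the error
message; it does not repair an input file that has no vectors in it.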

On Sat, Dec 19, 2009 at 5:07 PM, Benson Margulies <bi...@gmail.com> wrote:
> ../hadoop-0.20.1/bin/hadoop jar
> examples/target/mahout-examples-0.3-SNAPSHOT.job
> org.apache.mahout.clustering.kmeans.KMeansDriver  --input
> testdata/he_mahout_vector -c clusters -o output -m
> org.apache.mahout.common.distance.ManhattanDistanceMeasure -k 7
>
>
> 09/12/19 17:02:18 WARN util.NativeCodeLoader: Unable to load
> native-hadoop library for your platform... using builtin-java classes
> where applicable
> 09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new decompressor
> 09/12/19 17:02:18 INFO compress.CodecPool: Got brand-new compressor
> Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
>        at java.util.ArrayList.RangeCheck(ArrayList.java:547)
>        at java.util.ArrayList.get(ArrayList.java:322)
>        at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:94)
>        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:158)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>