Posted to dev@mahout.apache.org by Chris Bush <cn...@gmail.com> on 2010/10/06 01:17:59 UTC

ClassCastException running kmeans job with RandomSeedGenerator on reuters example data

I am trying the k-means clustering on the Reuters example data (Reuters-21578
news collection) as covered in Mahout in Action. The following stack trace
occurs immediately, with or without HADOOP_HOME set (with it set, only the
"no HADOOP_HOME set" message is omitted):

$ bin/mahout kmeans -i reuters-vectors -c reuters-initial-clusters \
    -o reuters-kmeans-clusters \
    -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
    -r 1 -cd 1.0 -k 20 -x 10

no HADOOP_HOME set, running locally
Oct 5, 2010 2:27:28 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Command line arguments: {--clusters=reuters-initial-clusters,
--convergenceDelta=1.0,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --input=reuters-vectors, --maxIter=10, --maxRed=1,
--method=mapreduce, --numClusters=20, --output=reuters-kmeans-clusters,
--startPhase=0, --tempDir=temp}
Oct 5, 2010 2:27:29 PM org.slf4j.impl.JCLLoggerAdapter info
INFO: Deleting reuters-initial-clusters
Oct 5, 2010 2:27:29 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
WARNING: Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
Oct 5, 2010 2:27:29 PM org.apache.hadoop.io.compress.CodecPool getCompressor
INFO: Got brand-new compressor
Exception in thread "main" java.lang.ClassCastException: class org.apache.hadoop.io.IntWritable
at java.lang.Class.asSubclass(Class.java:3018)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:86)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:139)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:53)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:175)
$

The org.apache.mahout.clustering.kmeans.RandomSeedGenerator class casts the
key from SequenceFile.Reader to org.apache.hadoop.io.Writable successfully,
but then fails when it tries to cast the value to
org.apache.mahout.math.VectorWritable.

Thanks,

Chris
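
A SequenceFile records its key and value class names as plain text in the file header, so the mismatch described above can be checked before running the job. The following is only a heuristic sketch, not a Hadoop or Mahout tool: `seqfile_classes` is a hypothetical helper name, it assumes the part file is on the local filesystem, and it assumes both class names fall within the first 512 bytes. A class name long enough that its length prefix is itself a letter (65 characters or more) could confuse the pattern.

```shell
#!/bin/sh
# Hypothetical helper: print the key class and then the value class named in
# the header of a local Hadoop SequenceFile (heuristic, header bytes only).
seqfile_classes() {
    # Every SequenceFile starts with the magic bytes "SEQ".
    if [ "$(head -c 3 "$1")" != "SEQ" ]; then
        echo "not a SequenceFile: $1" >&2
        return 1
    fi
    # The first two dotted Java class names in the header are the key class
    # and the value class, in that order.
    head -c 512 "$1" |
        LC_ALL=C grep -aoE '[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*)+' |
        head -n 2
}
```

Running `seqfile_classes` on a part file from the k-means input directory should show org.apache.mahout.math.VectorWritable on the second line; a second line of org.apache.hadoop.io.IntWritable would match the exception in the trace.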

Re: ClassCastException running kmeans job with RandomSeedGenerator on reuters example data

Posted by Robin Anil <ro...@gmail.com>.
I have given out the new code to those who requested it recently. Hopefully it
will be pushed to everyone along with the new text.



Re: ClassCastException running kmeans job with RandomSeedGenerator on reuters example data

Posted by Sean Owen <sr...@gmail.com>.
I agree -- Robin can you follow up on that?
The text of the book is 1 chapter from completion. We are putting out 0.4
imminently. I think we should decide that the book is written for 0.4.


Re: ClassCastException running kmeans job with RandomSeedGenerator on reuters example data

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I looked at the Reuters example in MiA, and it has not yet been updated
to reflect recent changes in the file nomenclature in trunk. It was
actually incorrect for 0.3 too, as it shows the contents of
reuters-vectors after seq2sparse to be (on p132):

$ ls reuters-vectors/
dictionary.file-0
tfidf/
tokenized-documents/
vectors/
wordcount/

but then (on p144) it gives the input argument to k-means as:

-i reuters-vectors

which should have been:

-i reuters-vectors/tfidf (and maybe also /vectors after that, IIRC; it's
been a few months since it was changed)

As noted below, the current nomenclature after seq2sparse is:

$ ls reuters-out-seqdir-sparse/
df-count/
frequency.file-0
tfidf-vectors/
wordcount/
dictionary.file-0
tf-vectors/
tokenized-documents/

We will need to get the book examples and the code in sync with
whichever release coincides with its final publication. Both are moving
targets right now. Given the rate of change of Mahout, we always
recommend using trunk, and the trunk examples are the most likely to work.
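
Since the vector directory name differs between the two layouts listed above, a script can probe for it rather than hard-code it. This is a minimal sketch: `find_vector_dir` is a hypothetical helper name, and the candidate list (tfidf-vectors for the current layout, then tfidf/vectors and tfidf for the older one) is taken from this message, including the part recalled from memory, so it may need adjusting for a given release.

```shell
#!/bin/sh
# Hypothetical helper: print the first vector directory that exists under a
# seq2sparse output directory, trying the current layout before the older one.
find_vector_dir() {
    base=$1
    for sub in tfidf-vectors tfidf/vectors tfidf; do
        if [ -d "$base/$sub" ]; then
            printf '%s\n' "$base/$sub"
            return 0
        fi
    done
    echo "no vector directory found under $base" >&2
    return 1
}
```

For example, `find_vector_dir reuters-out-seqdir-sparse` prints reuters-out-seqdir-sparse/tfidf-vectors when that directory exists, and the result can be passed to k-means as the -i argument.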



Re: ClassCastException running kmeans job with RandomSeedGenerator on reuters example data

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
The random seed generator can't read the parts in the input folder
"reuters-vectors". What is in that directory? The program is expecting 
part files containing VectorWritable points. If you ran 
examples/bin/build-reuters.sh then the input to k-means (see the script) 
should be:

-i ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/

I suggest running the script with the k-means clustering uncommented 
before getting outside of the standard file nomenclature.
Jeff
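
Putting the suggestion above together with the original command gives roughly the following. It is a sketch under assumptions: `run_kmeans` and the `MAHOUT` override variable are hypothetical names introduced here, and every flag other than -i is copied unchanged from the original post, so whether those values suit a given Mahout version is not confirmed.

```shell
#!/bin/sh
# Hypothetical wrapper: run k-means with the input directory given as $1.
# All flags except -i are copied from the command in the original post.
# MAHOUT is an assumed override for the launcher path (default: bin/mahout).
run_kmeans() {
    input=$1
    ${MAHOUT:-bin/mahout} kmeans -i "$input" \
        -c reuters-initial-clusters \
        -o reuters-kmeans-clusters \
        -dm org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure \
        -r 1 -cd 1.0 -k 20 -x 10
}
```

After running examples/bin/build-reuters.sh, the call would be `run_kmeans ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/`. Setting MAHOUT=echo prints the assembled command instead of executing it.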

