Posted to user@mahout.apache.org by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk> on 2010/05/08 00:42:30 UTC

Creating Vectors for KMeans

Hi All,

I am trying to create a vector file to feed into the KMeans clustering
algorithm. The data I have is in Solr, and I have followed this
tutorial: https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text

and used this command

bin/mahout lucene.vector --dir <path>/solr/data/index --field body \
  --dictOut /tmp/dict.txt --output /tmp/out.txt --max 50 \
  --norm 2

I get two files out and tried to use the out file with this command

bin/mahout org.apache.mahout.clustering.kmeans.KMeansDriver \
  -i out.txt -o input-data-kmeans-clusters -c clusters \
  -m org.apache.mahout.common.distance.CosineDistanceMeasure \
  -v org.apache.mahout.matrix.SparseVector -x 50

I get an error about no clusters being found. Am I even using the right
vector file?


Regards

Dave



Re: Creating Vectors for KMeans

Posted by David Stuart <da...@progressivealliance.co.uk>.
Yep, works perfectly. Thanks for the help!

David Stuart

On 8 May 2010, at 10:24, Robin Anil <ro...@gmail.com> wrote:

> Oops, I figured it out. Please specify the -k (number of clusters)
> parameter and the distance threshold. :) KMeans needs to know either
> the cluster count or the initial clusters in the -c clusters folder.
> If it doesn't find k, it assumes you have already put initial clusters
> in the clusters folder.
>
> PS: Since Mahout 0.3 you can simply use "bin/mahout kmeans" to run KMeans.
>
> Robin

Re: Creating Vectors for KMeans

Posted by Robin Anil <ro...@gmail.com>.
Oops, I figured it out. Please specify the -k (number of clusters)
parameter and the distance threshold. :) KMeans needs to know either
the cluster count or the initial clusters in the -c clusters folder.
If it doesn't find k, it assumes you have already put initial clusters
in the clusters folder.

PS: Since Mahout 0.3 you can simply use "bin/mahout kmeans" to run KMeans.

Robin
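
The rule described in this message can be sketched as a small shell helper.
This is only an illustration, not part of Mahout: the function name
build_kmeans_cmd, its argument order, and the example paths are hypothetical.
Only the -i, -o, -c, -x and -k options come from this thread; the distance
threshold option is left out because its flag name is not given here. The
helper checks the clusters folder on the local filesystem (as with the local
job runner) and only prints the command; it does not run Mahout.

```shell
#!/bin/sh
# Hypothetical helper: prints a kmeans command line, or fails when neither
# a cluster count (-k) nor existing initial clusters are available.
# Usage: build_kmeans_cmd INPUT OUTPUT CLUSTERS_DIR MAX_ITER [K]
build_kmeans_cmd() {
    input=$1; output=$2; clusters=$3; max_iter=$4; k=${5:-}

    if [ -z "$k" ]; then
        # Without -k, KMeans expects initial clusters in the -c folder.
        # Assumption: the folder is on the local filesystem, not HDFS.
        if [ ! -d "$clusters" ] || [ -z "$(ls -A "$clusters" 2>/dev/null)" ]; then
            echo "error: pass a cluster count or put initial clusters in $clusters" >&2
            return 1
        fi
        echo "bin/mahout kmeans -i $input -o $output -c $clusters -x $max_iter"
    else
        echo "bin/mahout kmeans -i $input -o $output -c $clusters -x $max_iter -k $k"
    fi
}

# Example: ask for 10 clusters (the value 10 is arbitrary).
# Prints: bin/mahout kmeans -i /tmp/out.txt -o /tmp/foo -c clusters -x 5 -k 10
build_kmeans_cmd /tmp/out.txt /tmp/foo clusters 5 10
```

Calling the helper without the last argument and with an empty or missing
clusters folder returns a non-zero status, which mirrors the "Cluster is
empty!" failure reported in this thread.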


Re: Creating Vectors for KMeans

Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Hi Robin,
 
I'm using the latest from trunk, so 0.4.

Sequence file dump:

bin/mahout org.apache.mahout.utils.SequenceFileDumper --seqFile /tmp/out.txt
Input Path: /tmp/out.txt
Key class: class org.apache.hadoop.io.LongWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 1: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 2: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 3: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 4: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 5: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 6: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 7: Value: org.apache.mahout.math.VectorWritable@6bffc686

 


Re: Creating Vectors for KMeans

Posted by Robin Anil <ro...@gmail.com>.
David, a couple of things are needed to debug this:
1) Tell me which version of Mahout you are using.
2) Use o.a.m.utils.SequenceFileDumper to dump out.txt and see what
the key and value classes are.

Robin

Re: Creating Vectors for KMeans

Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Hi,
 
I tried again using:

bin/mahout org.apache.mahout.clustering.kmeans.KMeansDriver -i /tmp/out.txt \
  -o /tmp/foo -c clusters -x 5

The error output is below:
 
 
May 8, 2010 9:14:47 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
        ... 13 more
Caused by: java.lang.IllegalStateException: Cluster is empty!
        at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:73)
        ... 18 more
May 8, 2010 9:14:48 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO:  map 0% reduce 0%
May 8, 2010 9:14:48 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
May 8, 2010 9:14:48 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
May 8, 2010 9:14:48 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: java.io.IOException: Job failed!
java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:257)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:204)
        at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:162)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
May 8, 2010 9:14:48 AM org.slf4j.impl.JCLLoggerAdapter info
 

On 08 May 2010 at 09:56 Robin Anil <ro...@gmail.com> wrote:

> try without the -v o.a.m...SparseVector and tell me how it goes

Re: Creating Vectors for KMeans

Posted by Robin Anil <ro...@gmail.com>.
Try without the -v o.a.m...SparseVector option and tell me how it goes.


