Posted to user@mahout.apache.org by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk> on 2010/05/08 00:42:30 UTC
Creating Vectors for KMeans
Hi All,
I am trying to create a vector file to feed into the KMeans clustering
algorithm. The data I have is in Solr, and I followed this tutorial:
https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text
and used this command:
bin/mahout lucene.vector --dir <path>/solr/data/index --field body \
  --dictOut /tmp/dict.txt --output /tmp/out.txt --max 50 \
  --norm 2
I get two files out and tried to use the out file with this command
bin/mahout org.apache.mahout.clustering.kmeans.KMeansDriver \
  -i out.txt -o input-data-kmeans-clusters -c clusters \
  -m org.apache.mahout.common.distance.CosineDistanceMeasure \
  -v org.apache.mahout.matrix.SparseVector -x 50
I get an error about no clusters being found. Am I even using the right
vector file?
Regards
Dave
Re: Creating Vectors for KMeans
Posted by David Stuart <da...@progressivealliance.co.uk>.
Yep, works perfectly. Thanks for the help!
David Stuart
Re: Creating Vectors for KMeans
Posted by Robin Anil <ro...@gmail.com>.
Oops, I figured it out. Please specify the -k (number of clusters)
parameter and the distance threshold. :) KMeans needs to know either
the cluster count or the initial clusters in the -c clusters folder. If it
doesn't find k, it assumes you have already put initial clusters in that
folder.
PS: Since Mahout 0.3 you can simply use "bin/mahout kmeans" to run KMeans.
Robin
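Putting the advice above into a command, here is a minimal sketch that only prints the invocation (a dry run) rather than executing it. It uses the "kmeans" shorthand and the -i, -o, -c, -k and -x options named in this thread. The cluster count of 10 is an arbitrary placeholder, the helper name kmeans_cmd is hypothetical, and the option for the distance threshold is left out because the thread does not name it.

```shell
#!/bin/sh
# Sketch only: print a k-means invocation that supplies -k, so the driver
# does not depend on pre-existing clusters in the -c directory.
# kmeans_cmd is a hypothetical helper; the k value is a placeholder.
kmeans_cmd() {
    input=$1; output=$2; clusters=$3; k=$4; max_iter=$5
    printf 'bin/mahout kmeans -i %s -o %s -c %s -k %s -x %s\n' \
        "$input" "$output" "$clusters" "$k" "$max_iter"
}

# Dry run: show the command instead of executing it.
kmeans_cmd /tmp/out.txt /tmp/foo clusters 10 5
```

Printing the command first makes it easy to confirm that -k is present before starting a job.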
Re: Creating Vectors for KMeans
Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Hi Robin,
I'm using the latest from trunk, so 0.4.
Sequence file dump:
bin/mahout org.apache.mahout.utils.SequenceFileDumper --seqFile /tmp/out.txt
Input Path: /tmp/out.txt
Key class: class org.apache.hadoop.io.LongWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 1: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 2: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 3: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 4: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 5: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 6: Value: org.apache.mahout.math.VectorWritable@6bffc686
Key: 7: Value: org.apache.mahout.math.VectorWritable@6bffc686
Re: Creating Vectors for KMeans
Posted by Robin Anil <ro...@gmail.com>.
David, a couple of things are needed to debug this:
1) Tell me which version of Mahout you are using.
2) Use o.a.m.utils.SequenceFileDumper to dump out.txt and see what
the key and value classes are.
Robin
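One way to do the check in step 2 is to filter the dumper output for the value class. The sketch below assumes the "Value Class: class ..." wording that the dumper output takes elsewhere in this thread. The helper name value_class is hypothetical, and the sample input is copied from the thread rather than produced by a live run.

```shell
#!/bin/sh
# Print the value class reported in SequenceFileDumper-style output on stdin.
# Assumes the "Value Class: class <name>" wording seen in this thread; the
# class name may sit on the same line or be wrapped onto the next one.
value_class() {
    # Join all lines so a wrapped class name is still found.
    { tr '\n' ' '; echo; } |
        sed -n 's/.*Value Class: class *\([A-Za-z0-9_.$]*\).*/\1/p'
}

# Sample lines taken from the dump in this thread.
printf '%s\n' \
    'Input Path: /tmp/out.txt' \
    'Key class: class org.apache.hadoop.io.LongWritable Value Class: class' \
    'org.apache.mahout.math.VectorWritable' \
    'Key: 0: Value: org.apache.mahout.math.VectorWritable@6bffc686' |
    value_class
```

In practice the function would be fed by the dumper itself, for example: bin/mahout org.apache.mahout.utils.SequenceFileDumper --seqFile /tmp/out.txt | value_class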
Re: Creating Vectors for KMeans
Posted by "david.stuart@progressivealliance.co.uk" <da...@progressivealliance.co.uk>.
Hi,
I tried again using
bin/mahout org.apache.mahout.clustering.kmeans.KMeansDriver -i /tmp/out.txt \
  -o /tmp/foo -c clusters -x 5
Error results below
May 8, 2010 9:14:47 AM org.apache.hadoop.mapred.LocalJobRunner$Job run
WARNING: job_local_0001
java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:354)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 5 more
Caused by: java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:93)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    ... 10 more
Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88)
    ... 13 more
Caused by: java.lang.IllegalStateException: Cluster is empty!
    at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:73)
    ... 18 more
May 8, 2010 9:14:48 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: map 0% reduce 0%
May 8, 2010 9:14:48 AM org.apache.hadoop.mapred.JobClient monitorAndPrintJob
INFO: Job complete: job_local_0001
May 8, 2010 9:14:48 AM org.apache.hadoop.mapred.Counters log
INFO: Counters: 0
May 8, 2010 9:14:48 AM org.slf4j.impl.JCLLoggerAdapter warn
WARNING: java.io.IOException: Job failed!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:257)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:204)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:162)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:172)
May 8, 2010 9:14:48 AM org.slf4j.impl.JCLLoggerAdapter info
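The trace above wraps the real failure in several layers of reflection errors, and the innermost "Caused by:" line is usually the informative one. As a rough sketch, assuming the log has been saved or piped as text with each "Caused by:" entry at the start of a line, the last such line can be pulled out as follows. The helper name root_cause is hypothetical, and the sample input is abbreviated from this thread.

```shell
#!/bin/sh
# Print the innermost "Caused by:" line of a Java stack trace read from stdin.
# root_cause is a hypothetical helper name used for illustration.
root_cause() {
    grep '^Caused by:' | tail -n 1
}

# Abbreviated sample based on the trace in this thread.
printf '%s\n' \
    'java.lang.RuntimeException: Error in configuring object' \
    'Caused by: java.lang.reflect.InvocationTargetException' \
    'Caused by: java.lang.RuntimeException: Error in configuring object' \
    'Caused by: java.lang.IllegalStateException: Cluster is empty!' |
    root_cause
```

For this trace the result is the IllegalStateException raised in KMeansMapper.configure, which points at the empty clusters input rather than at the vector file.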
Re: Creating Vectors for KMeans
Posted by Robin Anil <ro...@gmail.com>.
Try without the -v o.a.m...SparseVector option and tell me how it goes.