Posted to user@mahout.apache.org by Cui tony <to...@gmail.com> on 2010/02/18 08:32:30 UTC
How to get results of K-means
Hi,
I'm a beginner with Mahout. I have figured out how to run Mahout's k-means,
but after that I have no idea how to get the clustered result.
My input data is the standard example data: synthetic_control.data
After running, I got a points folder which someone said contains the
result. The points folder has two main files: part-00000 and part-00001
File part-00000 looks like this:
EQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@�c?[34m~YYVb}?~\~UP_Z~N~V^@^@^C:^@^@^C8~N^C5{"class":"org.apache.mahout.matrix.Sparsor","vector":"{\"values\":{\"indices\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59],\"values\":[28.7812,34.4632,31.3381,31.2834,28.9207,33.7596,25.3969,27.7849,35.2479,27.1159,32.8717,29.2171,36.0253,32.337,34.5249,32.8717,34.1173,26.5235,27.6623,26.3693,25.7744,29.27,30.7326,29.5054,33.0292,25.04,28.9167,24.3437,26.1203,34.9424,25.0293,26.6311,35.6541,28.4353,29.1495,28.1584,26.1927,33.3182,30.9772,27.0443,35.5344,26.2353,28.9964,32.0036,31.0558,34.2553,28.0721,28.9402,35.4973,29.747,31.4333,24.5556,33.7431,25.0466,34.9318,34.9879,32.4721,33.3759,25.4652,25.8717],\"numMappings\":60},\"cardinality\":60,\"lengthSquared\":-1.0,\"name\":\"\"}"}^A2^@^@^C;^@^@^C9~N^C6{"class":"org.apache.mahout.matrix.SparseVector","vector":"{\"values\":{\"indices\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59],\"values\":[24.8923,25.741,27.5532,32.8217,27.8789,
I'm so confused by this result: how can I get the data with the cluster
labels?
Thanks
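Editorial sketch: the readable part of the dump above is a JSON record whose "vector" field is itself a JSON document stored as a string. Assuming that layout holds for every record (the field names are taken from the dump; the function name decode_vector and the shortened sample are made up for illustration), a record could be decoded roughly like this:

```python
import json


def decode_vector(record_json):
    """Turn one JSON-encoded vector record into (class name, dense list)."""
    outer = json.loads(record_json)
    # The "vector" field holds a second JSON document encoded as a string.
    inner = json.loads(outer["vector"])
    dense = [0.0] * inner["cardinality"]
    sparse = inner["values"]
    for index, value in zip(sparse["indices"], sparse["values"]):
        dense[index] = value
    return outer["class"], dense


# Shortened, illustrative record: only the first three values from the dump,
# with cardinality and numMappings reduced to 3 to match.
sample = (
    '{"class":"org.apache.mahout.matrix.SparseVector",'
    '"vector":"{\\"values\\":{\\"indices\\":[0,1,2],'
    '\\"values\\":[28.7812,34.4632,31.3381],\\"numMappings\\":3},'
    '\\"cardinality\\":3,\\"lengthSquared\\":-1.0,\\"name\\":\\"\\"}"}'
)

class_name, values = decode_vector(sample)
print(class_name, values)
```

This only works on the text of a single record; the surrounding binary SequenceFile framing still has to be removed first with a dumper tool.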
Re: How to get results of K-means
Posted by Cui tony <to...@gmail.com>.
Hi, Grant
I just installed 0.3, but there is still a problem dumping the clusters
from the output files.
In detail, this is what I did:
hadoop jar mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata
The output is:
[@master target]$ hadoop fs -ls output
Found 13 items
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:35 /user/hadoop/output/canopies
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:35 /user/hadoop/output/clusters-0
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:35 /user/hadoop/output/clusters-1
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:36 /user/hadoop/output/clusters-2
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:36 /user/hadoop/output/clusters-3
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:37 /user/hadoop/output/clusters-4
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:37 /user/hadoop/output/clusters-5
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:37 /user/hadoop/output/clusters-6
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:38 /user/hadoop/output/clusters-7
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:38 /user/hadoop/output/clusters-8
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:39 /user/hadoop/output/clusters-9
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:34 /user/hadoop/output/data
drwxr-xr-x - hadoop supergroup 0 2010-02-22 17:39 /user/hadoop/output/points
Then I ran: sh /data/hadoop/mahout/trunk/bin/mahout clusterdump -s clusters-9/ -p points/
Input Path: /data/hadoop/mahout/trunk/examples/target/clusters-9/part-00000
Exception in thread "main" java.lang.NullPointerException
at org.apache.mahout.utils.vectors.VectorHelper.vectorToString(VectorHelper.java:69)
at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:133)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:302)
I also ran: sh /data/hadoop/mahout/trunk/bin/mahout clusterdump -s testdata/synthetic_control.data -p points/
The error message is:
Exception in thread "main" java.lang.NullPointerException
at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:121)
at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:302)
Is there something wrong?
Thanks!
2010/2/19 Grant Ingersoll <gs...@apache.org>
>
> On Feb 19, 2010, at 8:33 AM, Cui tony wrote:
>
> > Thank you all you guys.
> > I know how to change seqFile to txt file now.
> >
> > I'm sorry, Grant, your example is still a little complicated to me. How
> can
> > I run in this command: bin/mahout ?
> > This is the command line which I used :
> > hadoop jar data/mahout-examples-0.2.job
> > org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>
> Ah, sorry. Didn't realize you were using 0.2. bin/mahout is in trunk.
> We're about to release 0.3, so don't be afraid of trunk.
>
> >
> > And, Grant, could you give me some information on how to use SequenceFile
> > Dumper or class dumper?
> >
>
> They have a main class in them. If you setup the classpath, etc. you can
> use --help to see the options.
>
> -Grant
Re: How to get results of K-means
Posted by Grant Ingersoll <gs...@apache.org>.
On Feb 19, 2010, at 8:33 AM, Cui tony wrote:
> Thank you all you guys.
> I know how to change seqFile to txt file now.
>
> I'm sorry, Grant, your example is still a little complicated to me. How can
> I run in this command: bin/mahout ?
> This is the command line which I used :
> hadoop jar data/mahout-examples-0.2.job
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Ah, sorry. Didn't realize you were using 0.2. bin/mahout is in trunk. We're about to release 0.3, so don't be afraid of trunk.
>
> And, Grant, could you give me some information on how to use SequenceFile
> Dumper or class dumper?
>
They each have a main class. If you set up the classpath, etc., you can use --help to see the options.
-Grant
Re: How to get results of K-means
Posted by Cui tony <to...@gmail.com>.
Thank you all.
I know how to convert a SequenceFile to a text file now.
I'm sorry, Grant, your example is still a little complicated for me. How can
I run this command: bin/mahout?
This is the command line I used:
hadoop jar data/mahout-examples-0.2.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
And, Grant, could you give me some information on how to use the SequenceFile
Dumper or class dumper?
Thank you!
On 2010/2/18 at 10:14, Grant Ingersoll <gs...@apache.org> wrote:
> You can use the ClusterDumper class (I just posted an example at
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/)
> or you can use the SequenceFile Dumper.
>
> If you generated from a Lucene index, you can also use a class in utilities
> (ClusterLabels?) to print out labels based on LLR calculations.
>
> HTH,
> Grant
Re: How to get results of K-means
Posted by tammuz <ra...@gmail.com>.
Hello
I used: hadoop dfs -text <your-file> > local-file.txt
I replaced <your-file> with <My output>/points/part-00000
It works and I get the output file, but I do not know what it means.
This is a sample of the file:
/reut2-000.sgm-0.txt 21059
/reut2-000.sgm-1.txt 21277
/reut2-000.sgm-10.txt 21255
/reut2-000.sgm-100.txt 21277
/reut2-000.sgm-101.txt 21277
/reut2-000.sgm-102.txt 21277
/reut2-000.sgm-103.txt 21277
/reut2-000.sgm-104.txt 21277
/reut2-000.sgm-105.txt 21277
/reut2-000.sgm-106.txt 21358
/reut2-000.sgm-107.txt 21277
I used k=20.
How do you read this?
Thanks
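Editorial sketch: one way to make sense of a listing like the one above is to group the lines by their second column. This assumes the first column is a document name and the second column is the identifier of the cluster that document was assigned to; the thread itself does not confirm that reading, and the function name group_by_cluster is made up for illustration.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group 'name cluster-id' lines into {cluster-id: [names]}."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        doc_id, cluster_id = parts
        clusters[cluster_id].append(doc_id)
    return dict(clusters)


# A few lines copied from the sample above.
sample = """\
/reut2-000.sgm-0.txt 21059
/reut2-000.sgm-1.txt 21277
/reut2-000.sgm-10.txt 21255
/reut2-000.sgm-100.txt 21277
/reut2-000.sgm-106.txt 21358
"""

groups = group_by_cluster(sample.splitlines())
for cluster_id, docs in sorted(groups.items()):
    print(cluster_id, len(docs), docs)
```

Under that assumption, with k=20 the full file should contain at most 20 distinct values in the second column.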
Re: How to get results of K-means
Posted by Grant Ingersoll <gs...@apache.org>.
You can use the ClusterDumper class (I just posted an example at http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/) or you can use the SequenceFile Dumper.
If you generated from a Lucene index, you can also use a class in utilities (ClusterLabels?) to print out labels based on LLR calculations.
HTH,
Grant
Re: How to get results of K-means
Posted by Ankur Goel <ga...@yahoo-inc.com>.
To get the data in text form, try:
hadoop dfs -text <your-file> > local-file.txt
-@nkur
Sean Owen wrote:
> Robin and others can answer in more detail, but I can start -- you are
> trying to view a raw SequenceFile as a text file. I think you will
> need to use a cluster dumper utility to produce a human readable form.
Re: How to get results of K-means
Posted by Sean Owen <sr...@gmail.com>.
Robin and others can answer in more detail, but I can start -- you are
trying to view a raw SequenceFile as a text file. I think you will
need to use a cluster dumper utility to produce a human readable form.
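Editorial sketch: the dump in the original post begins with "EQ^F^Yorg.apache.hadoop.io.Text", which is consistent with a SequenceFile header: the bytes "SEQ", a version byte, then the key and value class names, each preceded by a length byte. Assuming that layout and class names shorter than 128 bytes (so the length fits in one byte), a file can be checked before trying to read it as text. The function name read_sequencefile_header is made up for illustration.

```python
def read_sequencefile_header(data):
    """Return (version, key_class, value_class), or None if not a SequenceFile."""
    if len(data) < 4 or data[:3] != b"SEQ":
        return None
    version = data[3]
    pos = 4
    names = []
    for _ in range(2):
        # Assumption: the class name length is stored in a single byte.
        length = data[pos]
        pos += 1
        names.append(data[pos:pos + length].decode("utf-8"))
        pos += length
    return version, names[0], names[1]


# Header bytes rebuilt from what is visible in the dump: version 6 (^F) and
# org.apache.hadoop.io.Text for both key and value, each with length 25 (^Y).
text_class = b"org.apache.hadoop.io.Text"
header = (
    b"SEQ" + bytes([6])
    + bytes([len(text_class)]) + text_class
    + bytes([len(text_class)]) + text_class
)

print(read_sequencefile_header(header))
print(read_sequencefile_header(b"plain text file"))
```

If the check succeeds, the file is binary and needs a dumper or hadoop dfs -text rather than a text viewer.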