Posted to user@mahout.apache.org by Cui tony <to...@gmail.com> on 2010/02/18 08:32:30 UTC

How to get results of K-means

Hi,
  I'm a beginner with Mahout. I have figured out how to run Mahout's k-means,
but after that I have no idea how to get the clustered result.

My input data is the standard example data: synthetic_control.data

After running, I got a points folder which someone said contains the
result. The points folder has two main files: part-00000 and part-00001


File part-00000 looks like this:
EQ^F^Yorg.apache.hadoop.io.Text^Yorg.apache.hadoop.io.Text^@^@^@^@^@^@�c?[34m~YYVb}?~\~UP_Z~N~V^@^@^C:^@^@^C8~N^C5{"class":"org.apache.mahout.matrix.Sparsor","vector":"{\"values\":{\"indices\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59],\"values\":[28.7812,34.4632,31.3381,31.2834,28.9207,33.7596,25.3969,27.7849,35.2479,27.1159,32.8717,29.2171,36.0253,32.337,34.5249,32.8717,34.1173,26.5235,27.6623,26.3693,25.7744,29.27,30.7326,29.5054,33.0292,25.04,28.9167,24.3437,26.1203,34.9424,25.0293,26.6311,35.6541,28.4353,29.1495,28.1584,26.1927,33.3182,30.9772,27.0443,35.5344,26.2353,28.9964,32.0036,31.0558,34.2553,28.0721,28.9402,35.4973,29.747,31.4333,24.5556,33.7431,25.0466,34.9318,34.9879,32.4721,33.3759,25.4652,25.8717],\"numMappings\":60},\"cardinality\":60,\"lengthSquared\":-1.0,\"name\":\"\"}"}^A2^@^@^C;^@^@^C9~N^C6{"class":"org.apache.mahout.matrix.SparseVector","vector":"{\"values\":{\"indices\":[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59],\"values\":[24.8923,25.741,27.5532,32.8217,27.8789,

I'm confused by this result: how can I get the data with the cluster
labels?

thanks~~
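
The readable part of the dump above is JSON in which the "vector" field is itself a JSON document stored as a string, so it has to be parsed twice. The following is only a minimal Python sketch of that double decoding, assuming a record has already been extracted as clean text. The function name decode_vector_record and the shortened three-element sample are hypothetical, and the sketch does not recover cluster labels.

```python
import json


def decode_vector_record(record_text):
    """Decode one JSON record with the same layout as the dump above.

    The outer object carries a "class" name and a "vector" field; the
    "vector" field is a JSON document stored as a string, so it is
    parsed a second time.
    """
    outer = json.loads(record_text)
    inner = json.loads(outer["vector"])
    indices = inner["values"]["indices"]
    values = inner["values"]["values"]
    # Place each value at its index; positions not listed stay 0.0.
    dense = [0.0] * inner["cardinality"]
    for index, value in zip(indices, values):
        dense[index] = value
    return outer["class"], dense


# Hypothetical record, shortened to 3 elements, same field layout as the dump.
sample = (
    '{"class":"org.apache.mahout.matrix.SparseVector",'
    '"vector":"{\\"values\\":{\\"indices\\":[0,1,2],'
    '\\"values\\":[28.7812,34.4632,31.3381],\\"numMappings\\":3},'
    '\\"cardinality\\":3,\\"lengthSquared\\":-1.0,\\"name\\":\\"\\"}"}'
)
class_name, dense = decode_vector_record(sample)
print(class_name, dense)
```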

Re: How to get results of K-means

Posted by Cui tony <to...@gmail.com>.

Hi, Grant

I just installed 0.3, but I still have a problem dumping the clusters
from the output files.
Here is what I did in detail:

hadoop jar mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job -i testdata

The output is:
[@master target]$ hadoop fs -ls output
Found 13 items
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:35 /user/hadoop/output/canopies
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:35 /user/hadoop/output/clusters-0
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:35 /user/hadoop/output/clusters-1
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:36 /user/hadoop/output/clusters-2
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:36 /user/hadoop/output/clusters-3
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:37 /user/hadoop/output/clusters-4
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:37 /user/hadoop/output/clusters-5
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:37 /user/hadoop/output/clusters-6
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:38 /user/hadoop/output/clusters-7
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:38 /user/hadoop/output/clusters-8
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:39 /user/hadoop/output/clusters-9
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:34 /user/hadoop/output/data
drwxr-xr-x   - hadoop supergroup          0 2010-02-22 17:39 /user/hadoop/output/points

Then I ran:

sh /data/hadoop/mahout/trunk/bin/mahout clusterdump -s clusters-9/ -p points/

Input Path: /data/hadoop/mahout/trunk/examples/target/clusters-9/part-00000
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.utils.vectors.VectorHelper.vectorToString(VectorHelper.java:69)
        at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:133)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:302)

I also ran:

sh /data/hadoop/mahout/trunk/bin/mahout clusterdump -s testdata/synthetic_control.data -p points/

The error message is:
Exception in thread "main" java.lang.NullPointerException
        at org.apache.mahout.utils.clustering.ClusterDumper.printClusters(ClusterDumper.java:121)
        at org.apache.mahout.utils.clustering.ClusterDumper.main(ClusterDumper.java:302)

Is there something wrong?

Thanks!!


2010/2/19 Grant Ingersoll <gs...@apache.org>

>
> On Feb 19, 2010, at 8:33 AM, Cui tony wrote:
>
> > Thank you all you guys.
> > I know how to change seqFile to txt file now.
> >
> > I'm sorry, Grant, your example is still a little complicated to me. How can
> > I run in this command: bin/mahout ?
> > This is the command line which I used :
> > hadoop jar data/mahout-examples-0.2.job
> > org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>
> Ah, sorry.  Didn't realize you were using 0.2.  bin/mahout is in trunk.
>  We're about to release 0.3, so don't be afraid of trunk.
>
> >
> > And, Grant, could you give me some information on how to use SequenceFile
> > Dumper or class dumper?
> >
>
> They have a main class in them.  If you setup the classpath, etc. you can
> use --help to see the options.
>
> -Grant

Re: How to get results of K-means

Posted by Grant Ingersoll <gs...@apache.org>.

On Feb 19, 2010, at 8:33 AM, Cui tony wrote:

> Thank you all you guys.
> I know how to change seqFile to txt file now.
> 
> I'm sorry, Grant, your example is still a little complicated to me. How can
> I run in this command: bin/mahout ?
> This is the command line which I used :
> hadoop jar data/mahout-examples-0.2.job
> org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

Ah, sorry.  Didn't realize you were using 0.2.  bin/mahout is in trunk.  We're about to release 0.3, so don't be afraid of trunk.

> 
> And, Grant, could you give me some information on how to use SequenceFile
> Dumper or class dumper?
> 

They have a main class in them.  If you set up the classpath, etc., you can use --help to see the options.

-Grant

Re: How to get results of K-means

Posted by Cui tony <to...@gmail.com>.

Thank you all.
I know how to convert a SequenceFile to a text file now.

I'm sorry, Grant, your example is still a little complicated for me. How can
I run this command: bin/mahout?
This is the command line I used:
hadoop jar data/mahout-examples-0.2.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

And, Grant, could you give me some information on how to use the SequenceFile
Dumper or the ClusterDumper class?


Thank you!

On Feb 18, 2010, at 10:14 PM, Grant Ingersoll <gs...@apache.org> wrote:

> You can use the ClusterDumper class (I just posted an example at
> http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/)
> or you can use the SequenceFile Dumper.
>
> If you generated from a Lucene index, you can also use a class in utilities
> (ClusterLabels?) to print out labels based on LLR calculations.
>
> HTH,
> Grant

Re: How to get results of K-means

Posted by tammuz <ra...@gmail.com>.

Hello

I used: hadoop dfs -text <your-file>  > local-file.txt

I replaced <your-file> with <My output>/points/part-00000

It works and I get the output file, but I do not know what it means.

This is a sample of the file:

/reut2-000.sgm-0.txt	21059
/reut2-000.sgm-1.txt	21277
/reut2-000.sgm-10.txt	21255
/reut2-000.sgm-100.txt	21277
/reut2-000.sgm-101.txt	21277
/reut2-000.sgm-102.txt	21277
/reut2-000.sgm-103.txt	21277
/reut2-000.sgm-104.txt	21277
/reut2-000.sgm-105.txt	21277
/reut2-000.sgm-106.txt	21358
/reut2-000.sgm-107.txt	21277

I used k=20.

How do you read this?

Thanks
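
One plausible reading of the sample above, which the thread does not confirm, is that each line holds a point key (here a document name) followed by a tab and the ID of the cluster that point was assigned to. Under that assumption, a minimal Python sketch for grouping the keys by cluster ID could look like this. The function name group_by_cluster is hypothetical, and the sample rows are copied from the listing above.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Group point keys by the value in the last tab-separated column.

    Assumption: that last column is a cluster ID.
    """
    clusters = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        key, cluster_id = line.rsplit("\t", 1)
        clusters[cluster_id].append(key)
    return dict(clusters)


# Rows taken from the sample above (key, tab, assumed cluster ID).
sample = [
    "/reut2-000.sgm-0.txt\t21059",
    "/reut2-000.sgm-1.txt\t21277",
    "/reut2-000.sgm-10.txt\t21255",
    "/reut2-000.sgm-100.txt\t21277",
    "/reut2-000.sgm-106.txt\t21358",
]
groups = group_by_cluster(sample)
for cluster_id, keys in sorted(groups.items()):
    print(cluster_id, len(keys), keys)
```

To run it on the real file, the lines of local-file.txt can be passed in place of the sample list, for example with open("local-file.txt") as the iterable.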



Re: How to get results of K-means

Posted by Grant Ingersoll <gs...@apache.org>.

You can use the ClusterDumper class (I just posted an example at http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/) or you can use the SequenceFile Dumper.

If you generated from a Lucene index, you can also use a class in utilities (ClusterLabels?) to print out labels based on LLR calculations.

HTH,
Grant


Re: How to get results of K-means

Posted by Ankur Goel <ga...@yahoo-inc.com>.

To get the data in text form, try:

hadoop dfs -text <your-file>  > local-file.txt

-@nkur

Re: How to get results of K-means

Posted by Sean Owen <sr...@gmail.com>.

Robin and others can answer in more detail, but I can start -- you are
trying to view a raw SequenceFile as a text file. I think you will
need to use a cluster dumper utility to produce a human readable form.
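
The pasted dump is consistent with this: a SequenceFile starts with the bytes "SEQ" and a version byte, followed by the length-prefixed key and value class names, which is why org.apache.hadoop.io.Text appears twice at the start (the leading "S" seems to have been lost in the paste). The following is a minimal Python sketch, not a full reader, that checks only this part of the header. The function name read_sequencefile_header is hypothetical, and the sketch assumes class names shorter than 128 bytes so that each length fits in one byte.

```python
import io


def read_sequencefile_header(stream):
    """Read magic, version, and key/value class names from a binary stream."""
    magic = stream.read(3)
    if magic != b"SEQ":
        raise ValueError("not a SequenceFile: missing SEQ magic")
    version = stream.read(1)[0]

    def read_short_string():
        # Assumption: the length prefix is a single byte (name < 128 bytes).
        length = stream.read(1)[0]
        if length > 127:
            raise ValueError("multi-byte length not handled in this sketch")
        return stream.read(length).decode("utf-8")

    key_class = read_short_string()
    value_class = read_short_string()
    return version, key_class, value_class


# Header bytes rebuilt from the dump: "SEQ", version 6 (^F), then two
# length-prefixed copies of the Text class name (^Y is 0x19, i.e. 25 bytes).
name = b"org.apache.hadoop.io.Text"
header = b"SEQ" + bytes([6]) + bytes([len(name)]) + name + bytes([len(name)]) + name
result = read_sequencefile_header(io.BytesIO(header))
print(result)
```

A file that fails the magic check is probably not a SequenceFile at all; one that passes still needs hadoop dfs -text or a dumper utility to decode its records.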
