
Posted to user@mahout.apache.org by Valerio Ceraudo <va...@gmail.com> on 2010/08/30 01:12:05 UTC

retrieve k-means result

Hi,

I finally ran my first k-means, but now I have a small issue: how do I get the
result?

I ran this command:

bin/mahout clusterdump -s /home/vuvvo/clusters/ -o /home/vuvvo/finalOutput

That created a new file, finalOutput, containing this:

CL-21551{n=0 c=[575:9.188, 1808:3.181, 2547:4.371, 4163:1.124, 5124:4.596,
8861:4.434, 9283:3.338, 9890:8.271, 10037:8.539, 12965:4.542, 13168:5.738,
13565:4.949, 14237:8.155, 15400:4.632, 16624:7.935, 17506:8.146, 17539:5.145,
17618:9.048, 17911:3.863, 18054:8.035, 19236:6.017, 21095:4.222, 21693:7.932,
21748:3.895, 22512:4.292, 23003:5.238, 24173:7.173, 24613:3.429, 26094:1.133,
26587:1.335, 26989:4.451, 28079:7.765, 28086:5.722, 29221:7.316, 29518:3.197,
30388:2.589, 30554:6.252, 30555:7.173] r=/reut2-021.sgm-75.txt =]}
CL-21560{n=0 c=[1038:2.701, 1808:3.181, 1997:2.870, 2289:4.733, 2547:4.371,
4163:1.124, 4750:4.726, 5951:9.593, 13610:3.942, 13918:5.486, 14641:4.362,
17528:3.302, 18699:2.280, 20900:9.188, 22742:6.573, 23678:3.758, 23692:5.211,
24645:4.078, 25024:4.584, 25451:3.575, 26094:1.133, 27130:4.263, 30790:2.870]
r=/reut2-021.sgm-83.txt =]}
CL-21576{n=0 c=[1736:8.677, 1808:3.181, 2425:3.548, 2547:4.371, 3147:4.257,
4163:1.124, 8873:5.482, 8902:3.481, 9324:6.045, 9702:5.867, 9712:7.424,
10605:2.932, 11857:5.217, 12762:8.359, 12763:5.171, 12880:8.900, 12909:7.088,
13419:4.930, 14143:4.105, 14257:4.622, 16915:4.902, 17376:3.317, 17525:6.689,
17673:4.020, 17911:2.732, 17934:7.048, 18094:4.005, 18822:3.928, 19081:3.448,
19200:4.315, 19319:1.992, 19377:8.567, 20746:3.430, 20881:3.581, 22751:5.024,
23605:6.447, 24326:8.582, 24863:6.209, 24913:4.002, 25701:5.781, 26094:1.133,
26587:1.888, 26787:7.545, 26978:5.011, 26980:5.533, 27631:4.850, 27986:3.793,
29213:16.616, 29649:3.400, 29660:3.857, 30023:6.344, 30322:4.009, 31232:12.271,
31402:3.719] r=/reut2-021.sgm-98.txt =]}

Now, how should I read it? Are the CL entries the clusters, and is c=[.......]
the data of the cluster?

Thanks all

Re: retrieve k-means result

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  When you specify -k to k-means it will randomly sample k values from 
your input data set to use as the initial cluster centers. Those 
clusters will be written to the -c directory and form the prior of the 
k-means iteration. Each iteration will then produce a revised set of 
clusters in clusters-x; from your clusters-5 dump it looks like the 
computation has not yet converged (CL-21551 has not converged whereas 
VL-21560 has). If you are running this against Reuters, I suggest your 
initial -k value is too low and you need to increase the --maxIter value 
to obtain convergence.
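
As a rough illustration of the procedure described above (this is not Mahout's actual implementation), here is a minimal single-machine sketch in Python. It samples k input points as the initial centers, then iterates until no center moves by more than a small tolerance or the iteration limit is reached. The function name, the tol parameter, and the use of dense lists with Euclidean distance are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter, tol=1e-6, seed=None):
    """Toy k-means: returns (centers, counts, converged)."""
    rng = random.Random(seed)
    # Randomly sample k input points as the initial centers (the "prior").
    centers = [list(p) for p in rng.sample(points, k)]
    counts = [0] * k  # no points observed yet, like n=0 in the dump
    converged = False

    for _ in range(max_iter):
        # Assign each point to its nearest center and accumulate sums.
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        counts = [0] * k
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            counts[nearest] += 1
            for d, value in enumerate(p):
                sums[nearest][d] += value

        # Recompute each center as the mean of its assigned points.
        new_centers = []
        for j in range(k):
            if counts[j] == 0:
                new_centers.append(centers[j])  # keep an empty cluster in place
            else:
                new_centers.append([s / counts[j] for s in sums[j]])

        # Converged when no center moved by more than tol.
        moved = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if moved <= tol:
            converged = True
            break

    return centers, counts, converged
```

With a small max_iter the loop can stop before the centers settle, in which case the sketch returns converged as False; raising max_iter gives it more passes to settle.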

In terms of importing into Weka, the cluster dumper probably won't be of 
much use; I suggest you write your own job to convert the clusters into 
a format you can use more directly.
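
One possible shape for such a conversion, sketched in plain Python rather than as a Mahout job: it assumes the cluster centers are already available as {index: weight} dictionaries and renders them in Weka's sparse ARFF row syntax ({index value, ...} with 0-based attribute indices). The function name, the relation name, and the dim_N attribute names are made up for the example.

```python
def centers_to_sparse_arff(centers, relation="kmeans_centers"):
    """Render {index: weight} dicts as a sparse ARFF document (a string)."""
    # One numeric attribute per vector dimension, up to the largest index seen.
    dimensions = 1 + max((max(c) for c in centers if c), default=-1)
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute dim_{i} numeric" for i in range(dimensions)]
    lines += ["", "@data"]
    for center in centers:
        # Sparse rows list only the non-zero entries as "index value".
        entries = ", ".join(f"{i} {center[i]}" for i in sorted(center))
        lines.append("{" + entries + "}")
    return "\n".join(lines) + "\n"
```

Each center becomes one data row, so a real conversion would probably also want to carry the cluster id along, for example as an extra attribute.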

On 8/30/10 11:20 AM, Valerio Ceraudo wrote:
> is now converged and correct?
>
> is there a more comfortable way to read this file?because than i need to convert
> it in data for weka.

Re: retrieve k-means result

Posted by Valerio Ceraudo <va...@gmail.com>.

The clusters folder that I used contains this file: part-randomSeed

It was created by the command:

bin/mahout kmeans -i /home/vuvvo/reuters-out-seqdir-sparse/tfidf-vectors/ -c 
/home/vuvvo/clusters -o /home/vuvvo/reuters-kmeans -k 3 --maxIter 5

Do I need to use the files in the reuters-kmeans folder? Inside it there are
some other subdirectories called cluster-x, where x goes from 1 to 5.

I tried giving cluster-5 as input, and inside finalOutput I got a 1.4 MB file
that is very hard to open, even with 4 MB of RAM on a 64-bit machine ^^

I can read the first and the second row:
CL-21551 {n=1855 c =[1:0.011,2:0.005...to 31:0.012
and the second row:
VL-21560{n=19722 c[0:0.012 etc etc...

Has it now converged, and is it correct?

Is there a more comfortable way to read this file? I then need to convert
it into data for Weka.

Tonight I will try to convert an ARFF data file into an .sgm file to see what
I get with it.

Re: retrieve k-means result

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  On the n=0 point and its associated r= problem: Where did you get the 
clusters file? If it was from a clusters-0 directory then this would be 
explained by the fact that these clusters are inputs to the algorithm 
and have not yet observed any of the points. Also, the "CL" encoding 
indicates the clusters have not converged (would be "VL" otherwise). 
This also indicates you may be dumping the wrong clusters directory. It's 
not possible to tell from the command line you have.
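
One way to avoid dumping the seed clusters by mistake is to pick the highest-numbered clusters-N directory under the k-means output path. A small Python sketch, assuming the output lives on the local filesystem (not HDFS) and the subdirectories are named clusters-N; the helper name is made up for the example:

```python
import re
from pathlib import Path


def latest_clusters_dir(kmeans_output):
    """Return the clusters-N subdirectory with the largest N, or None."""
    best, best_n = None, -1
    for child in Path(kmeans_output).iterdir():
        match = re.fullmatch(r"clusters-(\d+)", child.name)
        # Compare N numerically so that clusters-10 sorts after clusters-9.
        if child.is_dir() and match and int(match.group(1)) > best_n:
            best, best_n = child, int(match.group(1))
    return best
```

The returned path could then be given as the -s argument of the clusterdump command quoted in this thread.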

On 8/30/10 4:45 AM, Jeff Eastman wrote:
> Precisely. CL-n is the clusterId and n= the number of points observed 
> (not sure why yours are all 0) during the last iteration and c= the 
> center of the cluster in sparse notation. Within c= each of the [i:d] 
> pairs denote an index and an associated value. The r= values are the 
> radius of the cluster (std of the observed points). The code in 
> AbstractCluster which formats r= could be improved (formatVector and 
> asFormatString) to better handle the empty radius vectors produced 
> when n=0.
>
> And congratulations on your first k-Means!
> Jeff

Re: retrieve k-means result

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

  On 8/29/10 4:12 PM, Valerio Ceraudo wrote:
> now,how i must read it? the cl are the clusters and the c=[.......] are the data
> of the cluster?
>
> Thanks all
>
>
Precisely. CL-n is the clusterId, n= is the number of points observed 
during the last iteration (not sure why yours are all 0), and c= is the 
center of the cluster in sparse notation. Within c=, each [i:d] pair 
denotes an index and its associated value. The r= values are the radius 
of the cluster (standard deviation of the observed points). The code in 
AbstractCluster which formats r= (formatVector and asFormatString) could 
be improved to better handle the empty radius vectors produced when n=0.
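
To make that layout concrete, here is a small Python sketch that pulls the id, the n= count, and the sparse c= center out of one dumped cluster. The regular expressions are written against the sample output in this thread only; the exact format may differ between Mahout versions, values in scientific notation are not handled, and the function name is made up for the example.

```python
import re

# Header looks like "CL-21551{n=0 c=[575:9.188, ...] r=...}" in the sample output.
HEADER = re.compile(
    r"(?P<prefix>[A-Z]+)-(?P<id>\d+)\s*\{\s*n=(?P<n>\d+)\s+c=\[(?P<center>[^\]]*)\]"
)
PAIR = re.compile(r"(\d+):(-?\d+(?:\.\d+)?)")


def parse_cluster(text):
    """Parse one dumped cluster into a dict; returns None if it does not match."""
    match = HEADER.search(text)
    if match is None:
        return None
    return {
        "id": int(match.group("id")),
        "converged": match.group("prefix") == "VL",  # "CL" = not converged
        "n": int(match.group("n")),
        # Sparse center: only the listed indices carry a non-zero value.
        "center": {int(i): float(v) for i, v in PAIR.findall(match.group("center"))},
    }
```

Because the c= list is matched up to its closing bracket, an entry that wraps over several lines is still parsed as one cluster.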

And congratulations on your first k-Means!
Jeff