Posted to user@mahout.apache.org by Valerio Ceraudo <va...@gmail.com> on 2010/08/30 01:12:05 UTC
retrieve k-means result
Hi,
I finally ran my first k-means, but now I have a small problem: how do I get the
result?
I ran this command:
bin/mahout clusterdump -s /home/vuvvo/clusters/ -o /home/vuvvo/finalOutput
That created a new file, finalOutput, containing this:
CL-21551{n=0 c=[575:9.188, 1808:3.181, 2547:4.371, 4163:1.124, 5124:4.596,
8861:4.434, 9283:3.338, 9890:8.271, 10037:8.539, 12965:4.542, 13168:5.738,
13565:4.949, 14237:8.155, 15400:4.632, 16624:7.935, 17506:8.146, 17539:5.145,
17618:9.048, 17911:3.863, 18054:8.035, 19236:6.017, 21095:4.222, 21693:7.932,
21748:3.895, 22512:4.292, 23003:5.238, 24173:7.173, 24613:3.429, 26094:1.133,
26587:1.335, 26989:4.451, 28079:7.765, 28086:5.722, 29221:7.316, 29518:3.197,
30388:2.589, 30554:6.252, 30555:7.173] r=/reut2-021.sgm-75.txt =]}
CL-21560{n=0 c=[1038:2.701, 1808:3.181, 1997:2.870, 2289:4.733, 2547:4.371,
4163:1.124, 4750:4.726, 5951:9.593, 13610:3.942, 13918:5.486, 14641:4.362,
17528:3.302, 18699:2.280, 20900:9.188, 22742:6.573, 23678:3.758, 23692:5.211,
24645:4.078, 25024:4.584, 25451:3.575, 26094:1.133, 27130:4.263, 30790:2.870]
r=/reut2-021.sgm-83.txt =]}
CL-21576{n=0 c=[1736:8.677, 1808:3.181, 2425:3.548, 2547:4.371, 3147:4.257,
4163:1.124, 8873:5.482, 8902:3.481, 9324:6.045, 9702:5.867, 9712:7.424,
10605:2.932, 11857:5.217, 12762:8.359, 12763:5.171, 12880:8.900, 12909:7.088,
13419:4.930, 14143:4.105, 14257:4.622, 16915:4.902, 17376:3.317, 17525:6.689,
17673:4.020, 17911:2.732, 17934:7.048, 18094:4.005, 18822:3.928, 19081:3.448,
19200:4.315, 19319:1.992, 19377:8.567, 20746:3.430, 20881:3.581, 22751:5.024,
23605:6.447, 24326:8.582, 24863:6.209, 24913:4.002, 25701:5.781, 26094:1.133,
26587:1.888, 26787:7.545, 26978:5.011, 26980:5.533, 27631:4.850, 27986:3.793,
29213:16.616, 29649:3.400, 29660:3.857, 30023:6.344, 30322:4.009, 31232:12.271,
31402:3.719] r=/reut2-021.sgm-98.txt =]}
Now, how should I read it? Are the CL entries the clusters, and are the
c=[.......] values the data of the cluster?
Thanks all
Re: retrieve k-means result
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
When you specify -k to k-means it will randomly sample k values from
your input data set to use as the initial cluster centers. Those
clusters will be written to the -c directory and form the prior of the
k-means iteration. Each iteration will then produce a revised set of
clusters in clusters-x; from your clusters-5 dump it looks like the
computation has not yet converged (CL-21551 has not converged whereas
VL-21560 has). If you are running this against Reuters, I suggest your
initial -k value is too low and you need to increase the --maxIter value
to obtain convergence.
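The iteration scheme described here (sample k input points as initial
centers, revise the centers on every pass, stop at convergence or at the
iteration limit) can be sketched in a few lines of plain Python. This is
only an illustration and not Mahout's implementation: Euclidean distance,
the convergence_delta threshold and the seed parameter are assumptions
made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two dense vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def kmeans(points, k, max_iter, convergence_delta=1e-6, seed=0):
    """Toy k-means; max_iter must be at least 1."""
    rng = random.Random(seed)
    # Seed step: sample k input points to use as the initial centers.
    centers = [list(p) for p in rng.sample(points, k)]
    for iteration in range(1, max_iter + 1):
        # Assign every point to its nearest current center.
        members = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            members[nearest].append(p)
        # Revised centers; a cluster that observed no points keeps its center.
        revised = [mean(m) if m else centers[i] for i, m in enumerate(members)]
        moved = [squared_distance(c, r) ** 0.5 for c, r in zip(centers, revised)]
        centers = revised
        # One flag per cluster, comparable to the CL/VL distinction in the dump.
        converged = [shift <= convergence_delta for shift in moved]
        if all(converged):
            break
    return centers, converged, iteration


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centers, converged, iterations = kmeans(points, k=2, max_iter=5)
print(sorted(centers), converged, iterations)
```

With a small iteration limit the loop can end while some flags are still
False, which is the situation described for the clusters-5 dump.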
As for importing into Weka, the cluster dumper probably won't be of much
use; I suggest you write your own job to convert the clusters into a
format you can use more directly.
On 8/30/10 11:20 AM, Valerio Ceraudo wrote:
> I tried giving cluster-5 as input, and inside finalOutput I now have a 1.4 MB
> file that is very hard to open, even with 4 MB of RAM on a 64-bit machine ^^
>
> I can read the first and the second row:
> CL-21551 {n=1855 c =[1:0.011,2:0.005...to 31:0.012
> and the second row:
> VL-21560{n=19722 c[0:0.012 etc etc...
>
> Is it now converged and correct?
>
> Is there a more comfortable way to read this file? I then need to convert
> it into data for Weka.
Re: retrieve k-means result
Posted by Valerio Ceraudo <va...@gmail.com>.
In the clusters folder that I used there is this file: part-randomSeed
created by the command:
bin/mahout kmeans -i /home/vuvvo/reuters-out-seqdir-sparse/tfidf-vectors/ -c
/home/vuvvo/clusters -o /home/vuvvo/reuters-kmeans -k 3 --maxIter 5
Do I need to use the files in the reuters-kmeans folder? Inside it there are
some subdirectories called cluster-x, where x goes from 1 to 5.
I tried giving cluster-5 as input, and inside finalOutput I now have a 1.4 MB
file that is very hard to open, even with 4 MB of RAM on a 64-bit machine ^^
I can read the first and the second row:
CL-21551 {n=1855 c =[1:0.011,2:0.005...to 31:0.012
and the second row:
VL-21560{n=19722 c[0:0.012 etc etc...
Is it now converged and correct?
Is there a more comfortable way to read this file? I then need to convert
it into data for Weka.
Tonight I will try to convert ARFF data to SGM to see what I get with
it.
Re: retrieve k-means result
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On the n=0 point and its associated r= problem: Where did you get the
clusters file? If it was from a clusters-0 directory then this would be
explained by the fact that these clusters are inputs to the algorithm
and have not yet observed any of the points. Also, the "CL" encoding
indicates the clusters have not converged (would be "VL" otherwise).
This also indicates you may be dumping the wrong clusters directory. It's
not possible to tell from the command line you have.
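A quick way to apply the two checks above to a dump file is to scan it for
cluster headers and look at the prefix and the n= count. The following is
a minimal sketch that assumes the textual layout shown in this thread
(CL-<id>{n=<count> ... or VL-<id>{n=<count> ...); the function names are
made up for the example and the sample centers are abbreviated.

```python
import re

# A dumped cluster starts with CL-<id>{n=<count> or VL-<id>{n=<count>.
HEADER_RE = re.compile(r"\b([CV]L)-(\d+)\s*\{n=(\d+)")


def summarize_dump(dump_text):
    """Return (cluster id, converged?, n) for every cluster header found."""
    return [
        (int(cluster_id), prefix == "VL", int(n))
        for prefix, cluster_id, n in HEADER_RE.findall(dump_text)
    ]


def looks_like_seed_clusters(summary):
    """True when no cluster has observed any point, as with initial centers."""
    return bool(summary) and all(n == 0 for _, _, n in summary)


# Abbreviated sample in the layout of the dump posted in this thread.
sample = """CL-21551{n=0 c=[575:9.188, 1808:3.181] r=...}
CL-21560{n=0 c=[1038:2.701, 1808:3.181] r=...}"""

summary = summarize_dump(sample)
print(summary)
print(looks_like_seed_clusters(summary))
```

If every entry reports n=0, the dump most likely came from the initial
centers rather than from the output of an iteration.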
On 8/30/10 4:45 AM, Jeff Eastman wrote:
> Precisely. CL-n is the clusterId and n= the number of points observed
> (not sure why yours are all 0) during the last iteration and c= the
> center of the cluster in sparse notation. Within c= each of the [i:d]
> pairs denotes an index and an associated value. The r= values are the
> radius of the cluster (std of the observed points). The code in
> AbstractCluster which formats r= could be improved (formatVector and
> asFormatString) to better handle the empty radius vectors produced
> when n=0.
>
> And congratulations on your first k-Means!
> Jeff
Re: retrieve k-means result
Posted by Jeff Eastman <jd...@windwardsolutions.com>.
On 8/29/10 4:12 PM, Valerio Ceraudo wrote:
> Now, how should I read it? Are the CL entries the clusters, and are the
> c=[.......] values the data of the cluster?
>
> Thanks all
>
>
Precisely. CL-n is the clusterId and n= the number of points observed
(not sure why yours are all 0) during the last iteration and c= the
center of the cluster in sparse notation. Within c= each of the [i:d]
pairs denotes an index and an associated value. The r= values are the
radius of the cluster (std of the observed points). The code in
AbstractCluster which formats r= could be improved (formatVector and
asFormatString) to better handle the empty radius vectors produced when n=0.
And congratulations on your first k-Means!
Jeff
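To make the layout described above concrete, here is a small Python sketch
that reads one dumped entry into its id, its n= count and its sparse
center. It is based only on the text format visible in this thread, not on
Mahout's own classes; the function name and the returned dictionary keys
are assumptions made for the example. The r= part is skipped, since it
comes out malformed when n=0.

```python
import re

# Matches one entry such as: CL-21551{n=0 c=[575:9.188, 1808:3.181] r=...}
ENTRY_RE = re.compile(r"([CV]L)-(\d+)\s*\{n=(\d+)\s+c=\[([^\]]*)\]")


def parse_cluster_entry(text):
    """Parse one dumped cluster into id, convergence flag, n and center."""
    # Entries wrap across several lines in the dump, so collapse whitespace.
    flat = " ".join(text.split())
    match = ENTRY_RE.search(flat)
    if match is None:
        raise ValueError("not a cluster entry: %r" % flat[:40])
    prefix, cluster_id, n, pairs = match.groups()
    center = {}
    for pair in pairs.split(","):
        pair = pair.strip()
        if not pair:
            continue
        # Each i:d pair is a vector index and its value.
        index, value = pair.split(":")
        center[int(index)] = float(value)
    return {
        "id": int(cluster_id),
        "converged": prefix == "VL",
        "n": int(n),
        "center": center,
    }


entry = parse_cluster_entry(
    "CL-21560{n=0 c=[1038:2.701, 1808:3.181,\n"
    "1997:2.870] r=/reut2-021.sgm-83.txt =]}"
)
print(entry["id"], entry["converged"], entry["n"], entry["center"][1808])
# prints: 21560 False 0 3.181
```

Keeping the center as an index-to-value dictionary mirrors the sparse
notation: any index that is not listed is treated as zero.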