Posted to user@mahout.apache.org by Baoqiang Cao <bq...@gmail.com> on 2012/03/14 18:52:57 UTC
can't get thru "-p"
Hi,
Very sorry for such a trivial question, but I have run out of luck. I'm
trying to see which points (by point ID) belong to which cluster center.
Here is what I did:
mahout clusterdump -s /mahout/kmeans/clusters-15-final \
-d /mahout/sparse/dictionary.file-0 -dt sequencefile \
-p /mahout/points > out
The onscreen output is:
12/03/14 12:39:52 INFO common.AbstractJob: Command line arguments:
{--dictionary=/mahout/sparse/dictionary.file-0,
--dictionaryType=sequencefile,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --outputFormat=TEXT,
--pointsDir=/mahout/points,
--seqFileDir=/mahout/kmeans/clusters-15-final, --startPhase=0,
--tempDir=temp}
12/03/14 12:39:55 WARN snappy.LoadSnappy: Snappy native library is available
12/03/14 12:39:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/03/14 12:39:55 INFO snappy.LoadSnappy: Snappy native library loaded
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:42:07 INFO clustering.ClusterDumper: Wrote 5188 clusters
12/03/14 12:42:07 INFO driver.MahoutDriver: Program took 135276 ms (Minutes: 2.2546)
There is nothing under "/mahout/points". Any help on why and how?
Thanks in advance.
Baoqiang
What to do with empty docs
Posted by Pat Ferrel <pa...@occamsmachete.com>.
You may want to tune your analyzer to let more tokens through. The empty
docs may still be counted in the TF-IDF part of the analysis when IDF is
calculated, but I'm not sure.
As to clusters with empty docs, I don't see how you can avoid that
unless you drop them before running clustering. I can't think of any
useful information they add to clustering.
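To illustrate the IDF point, here is a rough Python sketch. It assumes the textbook formula idf = log(N / df), which is not necessarily the weighting Mahout applies, and the document counts are made up for the example. It only shows the direction of the effect if empty docs are counted in N.

```python
import math


def idf(total_docs: int, docs_with_term: int) -> float:
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(total_docs / docs_with_term)


# Hypothetical corpus: 1,000 documents, 100 of them empty after analysis,
# and a term that appears in 50 of the non-empty documents.
with_empty = idf(1000, 50)     # empty docs still counted in N
without_empty = idf(900, 50)   # empty docs dropped before weighting

print(f"IDF counting empty docs: {with_empty:.3f}")
print(f"IDF after dropping them: {without_empty:.3f}")
```

Under this formula, counting the empty docs inflates every term's IDF by the same additive amount, log(1000 / 900) here, so dropping them should not change the relative ranking of terms.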
Re: can't get thru "-p"
Posted by Baoqiang Cao <bq...@gmail.com>.
This is extremely helpful! Thanks a lot.
Indeed, after running seqdumper I got output like this:
Key: 1774184: Value: wt: 1.0distance: 0.8839410915753125 vec: [123426:1.000]
Key: 1705919: Value: wt: 1.0distance: 0.0 vec: []
Key: 1705919: Value: wt: 1.0distance: 0.0 vec: []
Key: 1705919: Value: wt: 1.0distance: 0.0 vec: []
And after redoing the seq2sparse step with "-nv", I got the post ID as
well. So it is clear that, because of stop words and rarely occurring
terms, some documents end up with no words left: empty documents. That is
why I got "vec: []". Are there any suggestions for dealing with them?
Should I leave them as they are and just be aware that there will always
be one trivial cluster (the empty vectors)? Or...
Thanks again for the wonderful help!
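One way to see how many records are affected is to scan the seqdumper text for empty vectors. The following is a minimal Python sketch, assuming each record sits on one line in the format shown above; the regular expression and the function name find_empty_vectors are illustrative and not part of Mahout.

```python
import re

# Pattern for one record as it appears in the seqdumper text above.
# Note that the output has no space between the weight and "distance:".
RECORD = re.compile(
    r"^Key: (?P<cluster>\S+): Value: wt: (?P<wt>[0-9.]+)\s*"
    r"distance: (?P<dist>[0-9.eE+-]+)\s+vec: (?P<vec>.*)$"
)


def find_empty_vectors(lines):
    """Return the keys (cluster IDs) of records whose vector is empty."""
    empty = []
    for line in lines:
        match = RECORD.match(line.strip())
        if match and match.group("vec").endswith("[]"):
            empty.append(match.group("cluster"))
    return empty


sample = [
    "Key: 1774184: Value: wt: 1.0distance: 0.8839410915753125 vec: [123426:1.000]",
    "Key: 1705919: Value: wt: 1.0distance: 0.0 vec: []",
    "Key: 1705919: Value: wt: 1.0distance: 0.0 vec: []",
]
print(find_empty_vectors(sample))
```

In the sample, both empty vectors carry the same key, which matches the observation that they collapse into one trivial cluster.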
Re: can't get thru "-p"
Posted by Pat Ferrel <pa...@occamsmachete.com>.
What I have done is to use "seq2sparse -nv" to get named vectors.
mahout seq2sparse \
-i reuters-seqfiles/ \
-o reuters-vectors/ \
-ow -chunk 100 \
-x 90 \
-seq \
-a com.finderbots.analyzers.LuceneStemmingAnalyzer \
-ml 50 \
-n 2 \
-nv
This will use the filename as the key in the vector sequence file. The
keys will remain the same through the clustering phase.
Run "kmeans -cl" to create the clusteredPoints directory; in it you will
find a part-m-00000 file.
mahout kmeans \
-i reuters-vectors/tfidf-vectors/ \
-c reuters-kmeans-centroids \
-cl \
-o reuters-kmeans-clusters \
-k 20 \
-ow \
-x 10 \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure
Use seqdumper:
mahout seqdumper -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more
You will see that the file contains:
key: the cluster ID
value: wt (the likelihood that the vector is in the cluster), the
distance from the centroid, the name of the vector belonging to the
cluster, and the vector data.
For kmeans the likelihood will be 1.0 or 0. For example:
Key: 21477: Value: wt: 1.0distance: 0.9420744909793364 vec:
/-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230,
8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270,
14371:0.413]
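To answer "which texts belong to which cluster" from this output, the records can be grouped by key. Below is a minimal Python sketch that works on the seqdumper text; the pattern is inferred from the example record above, the second sample record is made up, and names containing spaces would not match. The function name names_by_cluster is hypothetical.

```python
import re
from collections import defaultdict

# Matches "Key: <cluster>: Value: wt: ... distance: ... vec: <name> = [".
# \s+ after "vec:" tolerates a line wrap between "vec:" and the name.
RECORD = re.compile(
    r"Key: (?P<cluster>\S+): Value: wt: [0-9.]+\s*"
    r"distance: [0-9.eE+-]+\s+vec:\s+(?P<name>\S+) = \["
)


def names_by_cluster(text):
    """Map each cluster ID to the list of vector names assigned to it."""
    clusters = defaultdict(list)
    for match in RECORD.finditer(text):
        clusters[match.group("cluster")].append(match.group("name"))
    return dict(clusters)


sample = (
    "Key: 21477: Value: wt: 1.0distance: 0.9420744909793364 vec: "
    "/-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396]\n"
    # Hypothetical second record, added only to show the grouping.
    "Key: 21477: Value: wt: 1.0distance: 0.5 vec: "
    "/-tmp/example-doc.txt = [372:0.100]\n"
)
for cluster, names in names_by_cluster(sample).items():
    print(cluster, names)
```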
Clusters, of course, do not come with names. A simple solution is to
construct a name from the top terms in the centroid output from clusterdump.
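As a rough illustration of that naming idea, the Python sketch below joins the highest-weighted terms into a label. It assumes you have already extracted a term-to-weight mapping for a centroid from the clusterdump output; the weights shown and the function name cluster_label are hypothetical.

```python
def cluster_label(term_weights, top_n=3):
    """Join the top_n highest-weighted terms into a label.

    Ties are broken alphabetically so the label is deterministic.
    """
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return "-".join(term for term, _ in ranked[:top_n])


# Hypothetical centroid weights; real values would come from clusterdump.
centroid = {"oil": 0.45, "crude": 0.41, "barrel": 0.31, "price": 0.27}
print(cluster_label(centroid))  # oil-crude-barrel
```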
I also recommend /Mahout in Action/ from Manning Publications. You can buy
it here:
http://manning.com/owen/
Re: can't get thru "-p"
Posted by Baoqiang Cao <bq...@gmail.com>.
Thanks again for the reference!
One more question, if you don't mind. How do I get the text keys for
each cluster? At the beginning we have the input file for seq2sparse,
and we know its format is
textkey1 text1
textkey2 text2
..
After running mahout kmeans and clusterdump, how do I find out from the
command line which texts belong to, say, cluster 1? I thought the
output from clusterdump would have what I want, but so far it is still
elusive.
Best,
Baoqiang
Re: can't get thru "-p"
Posted by Pat Ferrel <pa...@occamsmachete.com>.
I guess you figured this out, but the cluster drivers take "-cl", which
tells them to assign points to the calculated clusters and write the
result to the clusteredPoints directory. You then pass that directory to
clusterdump.
Instructions are here:
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
The --help output for the Mahout cluster drivers is incomplete; check the
cwiki for differences.
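Putting the pieces of this thread together, here is a small Python sketch that assembles a clusterdump command line. It uses only flags that appear in this thread (-s, -p, -d, -dt, -o). The clusteredPoints path is an assumption about where "kmeans -cl" wrote its output, and the output file name is made up. The sketch only prints the command; it does not run Mahout.

```python
import shlex


def clusterdump_command(clusters_dir, points_dir, dictionary, output=None):
    """Build the argument list for 'mahout clusterdump'.

    points_dir is an INPUT: the clusteredPoints directory produced by a
    cluster driver that was run with -cl, not an empty directory to fill.
    """
    cmd = [
        "mahout", "clusterdump",
        "-s", clusters_dir,
        "-p", points_dir,
        "-d", dictionary,
        "-dt", "sequencefile",
    ]
    if output is not None:
        cmd += ["-o", output]
    return cmd


cmd = clusterdump_command(
    "/mahout/kmeans/clusters-15-final",
    "/mahout/kmeans/clusteredPoints",   # assumed location of the -cl output
    "/mahout/sparse/dictionary.file-0",
    output="clusterdump.txt",           # hypothetical output file name
)
print(shlex.join(cmd))
```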
Re: can't get thru "-p"
Posted by Baoqiang Cao <bq...@gmail.com>.
Thanks a lot. But I don't know whether I'm missing something right in
front of my teary eyes (it is Wednesday afternoon, after all). My inputs
are equivalent to yours:
mahout clusterdump -s /mahout/kmeans/clusters-15-final \
-d /mahout/sparse/dictionary.file-0 -dt sequencefile -p /mahout/points
The cluster files after 15 iterations are in
/mahout/kmeans/clusters-15-final, and /mahout/points is a directory I
created beforehand. On screen, the output is something like
"VL-1721020{n=186 c=[...". There just aren't any output files under that
directory.
Any help, please?
On Wed, Mar 14, 2012 at 2:13 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> The -p parameter is an input. You should pass in the clusteredPoints/
> directory that was generated by the cluster driver you used.
>
> My use of fkmeans might be an example:
>
> mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ -c
> wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters -k 100 -m
> 2 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>
> This will create wikipedia-fkmeans-clusters/clusteredPoints/part-m-00000,
> which is the file with the clustered points. I then did a clusterdump:
>
> mahout clusterdump -s
> wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 -p
> wikipedia-fkmeans-clusters/clusteredPoints/ -d
> wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile -dm
> org.apache.mahout.common.distance.CosineDistanceMeasure
>
> This will output to the screen. Use -o to specify an output file.
>
> Good advice for any user of mahout is to read the help output very
> carefully. IMHO it is very easy to misunderstand the parameters, inputs, and
> outputs. I think I only understand about 10%. Try:
>
> mahout fkmeans --help
Re: can't get thru "-p"
Posted by Pat Ferrel <pa...@occamsmachete.com>.
The -p parameter is an input. You should pass in the clusteredPoints/
directory that was generated by the cluster driver you used.
My use of fkmeans might be an example:
mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ -c
wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters -k 100 -m
2 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
This will create
wikipedia-fkmeans-clusters/clusteredPoints/part-m-00000, which is the
file with the clustered points. I then did a clusterdump:
mahout clusterdump -s
wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 -p
wikipedia-fkmeans-clusters/clusteredPoints/ -d
wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile -dm
org.apache.mahout.common.distance.CosineDistanceMeasure
This will output to the screen. Use -o to specify an output file.
Good advice for any user of mahout is to read the help output very
carefully. IMHO it is very easy to misunderstand the parameters, inputs,
and outputs. I think I only understand about 10%. Try:
mahout fkmeans --help
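Applied to the paths in the original question, the advice above might look
like the sketch below. It assumes the k-means job wrote its output under
/mahout/kmeans and was run with the clustering step enabled (as far as I
know, the -cl flag in Mahout releases of that time), so that a
clusteredPoints directory exists there. The clusterdump.txt file name and
the RUN switch are made up for illustration. By default the script only
prints the command so it can be checked before running.

```shell
#!/bin/sh
# Sketch: -p is an INPUT to clusterdump, so it must point at the
# clusteredPoints directory written by the clustering driver,
# not at an empty directory created by hand.

# Assumed layout: the k-means job wrote to /mahout/kmeans (path from the thread).
KMEANS_OUT=/mahout/kmeans
POINTS_DIR="$KMEANS_OUT/clusteredPoints"

# Build the clusterdump command line; -o names an output file
# (clusterdump.txt is a hypothetical name).
clusterdump_cmd() {
  echo "mahout clusterdump -s $KMEANS_OUT/clusters-15-final" \
       "-d /mahout/sparse/dictionary.file-0 -dt sequencefile" \
       "-p $POINTS_DIR -o clusterdump.txt"
}

# Print the command by default; set RUN=1 to actually execute it.
if [ "${RUN:-0}" = "1" ]; then
  # Stop early if the clustering job never wrote point assignments.
  hadoop fs -test -d "$POINTS_DIR" || { echo "missing $POINTS_DIR" >&2; exit 1; }
  eval "$(clusterdump_cmd)"
else
  clusterdump_cmd
fi
```

If the clusteredPoints directory is missing, the likely fix is to rerun the
clustering job with the clustering step enabled; check "mahout kmeans --help"
for the exact flag in your version.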