Posted to user@mahout.apache.org by Baoqiang Cao <bq...@gmail.com> on 2012/03/14 18:52:57 UTC

can't get thru "-p"

Hi,

Very sorry for such a trivial question, but I have run out of luck. I'm
trying to see which points (by point ID) belong to which cluster center.
Here is what I did:

mahout clusterdump -s /mahout/kmeans/clusters-15-final \
    -d /mahout/sparse/dictionary.file-0 -dt sequencefile \
    -p /mahout/points > out

The onscreen output is:

12/03/14 12:39:52 INFO common.AbstractJob: Command line arguments:
{--dictionary=/mahout/sparse/dictionary.file-0,
--dictionaryType=sequencefile,
--distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
--endPhase=2147483647, --outputFormat=TEXT,
--pointsDir=/mahout/points,
--seqFileDir=/mahout/kmeans/clusters-15-final, --startPhase=0,
--tempDir=temp}
12/03/14 12:39:55 WARN snappy.LoadSnappy: Snappy native library is available
12/03/14 12:39:55 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/03/14 12:39:55 INFO snappy.LoadSnappy: Snappy native library loaded
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
12/03/14 12:42:07 INFO clustering.ClusterDumper: Wrote 5188 clusters
12/03/14 12:42:07 INFO driver.MahoutDriver: Program took 135276 ms
(Minutes: 2.2546)


There is nothing under "/mahout/points". Any help on why and how?

Thanks in advance.
Baoqiang

What to do with empty docs

Posted by Pat Ferrel <pa...@occamsmachete.com>.
You may want to tune your analyzer to let more tokens through. The empty
docs may still be counted in the TF-IDF step when calculating IDF, but I
am not sure.

As to clusters with empty docs, I don't see how you can avoid that 
unless you drop them before running clustering. I can't think of any 
useful information they add to clustering.
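
To see how many empty documents there are and which cluster absorbed them, the seqdumper text output can be filtered. This is only a rough sketch: it assumes one record per line in the format quoted in this thread, and points_dump.txt and count_empty_vectors are hypothetical names used for illustration.

```shell
#!/bin/sh
# Sample records in the text format quoted in this thread (saved seqdumper output).
cat > points_dump.txt <<'EOF'
Key: 1774184: Value: wt: 1.0distance: 0.8839410915753125  vec: [123426:1.000]
Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
EOF

# Print "<cluster id> <number of empty vectors>" for each cluster holding any.
# Assumption: an empty vector is printed as a final "[]" field.
count_empty_vectors() {
  awk '$1 == "Key:" && $NF == "[]" {
         id = $2
         sub(/:$/, "", id)        # strip the trailing colon from the key
         empty[id]++
       }
       END { for (id in empty) print id, empty[id] }' "$1" | LC_ALL=C sort
}

count_empty_vectors points_dump.txt
```

This only reports the empty vectors; actually dropping them before clustering would have to happen on the vector sequence files themselves.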

On 3/21/12 12:06 PM, Baoqiang Cao wrote:
> This is extremely helpful! Thanks a lot.
> Indeed, after running seqdumper I got this:
>
> Key: 1774184: Value: wt: 1.0distance: 0.8839410915753125  vec: [123426:1.000]
> Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
> Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
> Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
>
> After redoing the seq2sparse step with "-nv", I got the post ID as
> well. So it is clear that, because of stop-word removal and the pruning
> of rarely occurring terms, some documents end up with no words left:
> empty documents. That is why I got "vec: []". Is there any suggestion
> for dealing with them? Leave them as is and just be aware of one trivial
> cluster (empty vectors)? Or...
>
> Thanks again for wonderful help!

Re: can't get thru "-p"

Posted by Baoqiang Cao <bq...@gmail.com>.
This is extremely helpful! Thanks a lot.
Indeed, after running seqdumper I got this:

Key: 1774184: Value: wt: 1.0distance: 0.8839410915753125  vec: [123426:1.000]
Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []

After redoing the seq2sparse step with "-nv", I got the post ID as
well. So it is clear that, because of stop-word removal and the pruning
of rarely occurring terms, some documents end up with no words left:
empty documents. That is why I got "vec: []". Is there any suggestion
for dealing with them? Leave them as is and just be aware of one trivial
cluster (empty vectors)? Or...

Thanks again for wonderful help!


On Tue, Mar 20, 2012 at 10:35 AM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> What I have done is to use "seq2sparse -nv" to get named vectors.
>
>   mahout seq2sparse \
>       -i reuters-seqfiles/ \
>       -o reuters-vectors/ \
>       -ow -chunk 100 \
>       -x 90 \
>       -seq \
>       -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
>       -ml 50 \
>       -n 2 \
>       -nv
>
> This will use the filename as the key in the vector sequence file. The keys
> will remain the same through the clustering phase.
>
> Run "kmeans -cl" to get the clusteredPoints dir created; in it you will
> find a part-m-00000 file:
>
>   mahout kmeans \
>       -i reuters-vectors/tfidf-vectors/ \
>       -c reuters-kmeans-centroids \
>       -cl \
>       -o reuters-kmeans-clusters \
>       -k 20 \
>       -ow \
>       -x 10 \
>       -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>
> Use seqdumper:
>
>   mahout seqdumper -s \
>       reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more
>
> You will see that the file contains
>
>   key: clusterid, value: wt = % likelihood the vector is in cluster,
>   distance from centroid, named vector belonging to the cluster,
>   vector data.
>
> For kmeans the likelihood will be 1.0 or 0. For example:
>
>   Key: 21477: Value: wt: 1.0distance: 0.9420744909793364  vec:
>   /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230,
>   8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270,
>   14371:0.413]
>
> Clusters, of course, cannot have names. A simple solution is to construct a
> name from the top terms in the centroid output from clusterdump.
>
> I also recommend /Mahout in Action/ from Manning Publications. You can buy
> it here:
> http://manning.com/owen/

Re: can't get thru "-p"

Posted by Pat Ferrel <pa...@occamsmachete.com>.
What I have done is to use "seq2sparse -nv" to get named vectors.

    mahout seq2sparse \
        -i reuters-seqfiles/ \
        -o reuters-vectors/ \
        -ow -chunk 100 \
        -x 90 \
        -seq \
        -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
        -ml 50 \
        -n 2 \
        -nv

This will use the filename as the key in the vector sequence file. The 
keys will remain the same through the clustering phase.

Run "kmeans -cl" to get the clusteredPoints dir created; in it you
will find a part-m-00000 file:

    mahout kmeans \
        -i reuters-vectors/tfidf-vectors/ \
        -c reuters-kmeans-centroids \
        -cl \
        -o reuters-kmeans-clusters \
        -k 20 \
        -ow \
        -x 10 \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Use seqdumper:

    mahout seqdumper -s \
        reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more

You will see that the file contains

    key: clusterid, value: wt = % likelihood the vector is in cluster,
    distance from centroid, named vector belonging to the cluster,
    vector data.

For kmeans the likelihood will be 1.0 or 0. For example:

    Key: 21477: Value: wt: 1.0distance: 0.9420744909793364  vec:
    /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230,
    8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270,
    14371:0.413]
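
To turn that output into a plain "cluster id, document name" listing, something like the following could work. It is a minimal sketch, not a Mahout feature: it assumes seqdumper output was saved to a file with one record per line (the example above is only wrapped by the mail client) and that vector names contain no spaces. The file name sample_points.txt, the function cluster_members, and every sample record after the first are made up for illustration.

```shell
#!/bin/sh
# Saved seqdumper output; the first record is abridged from the example above,
# the other two are invented sample data.
cat > sample_points.txt <<'EOF'
Key: 21477: Value: wt: 1.0distance: 0.9420744909793364  vec: /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396]
Key: 21477: Value: wt: 1.0distance: 0.5  vec: /-tmp/reut2-000.sgm-200.txt = [12:0.700]
Key: 300: Value: wt: 1.0distance: 0.25  vec: /-tmp/reut2-001.sgm-7.txt = [5:1.000]
EOF

# Print "<cluster id><TAB><vector name>" for every named vector, sorted so
# that the members of one cluster end up on adjacent lines.
cluster_members() {
  awk '$1 == "Key:" && $7 == "vec:" && $9 == "=" {
         id = $2
         sub(/:$/, "", id)        # strip the trailing colon from the key
         print id "\t" $8         # field 8 is the name given by seq2sparse -nv
       }' "$1" | LC_ALL=C sort
}

cluster_members sample_points.txt
```

Records without a name (vectors created without -nv) are skipped by the `$9 == "="` test.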

Clusters, of course, cannot have names. A simple solution is to 
construct a name from the top terms in the centroid output from clusterdump.
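
A possible way to build such names from saved clusterdump output is sketched below. Please treat the input layout as an assumption: the sketch expects each cluster header (for example "VL-21477{...") to be followed by a "Top Terms:" section with one "term => weight" line per term, and the cluster ids, terms, and weights in the sample are invented. clusters_dump.txt and name_clusters are hypothetical names.

```shell
#!/bin/sh
# ASSUMED layout of clusterdump text output; all values below are made up.
cat > clusters_dump.txt <<'EOF'
VL-21477{n=2 c=[...] r=[...]}
    Top Terms:
        oil                                     =>  0.452
        price                                   =>  0.413
        barrel                                  =>  0.396
        opec                                    =>  0.318
VL-300{n=1 c=[...] r=[...]}
    Top Terms:
        wheat                                   =>  1.000
EOF

# Usage: name_clusters FILE N
# Print "<cluster id><TAB><label>", where the label joins the first N top terms.
name_clusters() {
  awk -v n="$2" '
    function emit() { if (id != "") print id "\t" label }
    /^[A-Z]+-[0-9]+[{]/ {          # start of a new cluster record
      emit()
      id = $1
      sub(/[{].*/, "", id)         # keep only the id in front of the brace
      label = ""
      taken = 0
      next
    }
    $2 == "=>" && taken < n {      # a term line inside the Top Terms section
      label = (label == "" ? $1 : label "_" $1)
      taken++
    }
    END { emit() }' "$1"
}

name_clusters clusters_dump.txt 3
```

The labels are only as good as the top terms, so a stricter analyzer or stop-word list may be needed before they read well.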

I also recommend /Mahout in Action/ from Manning Publications. You can buy
it here:
http://manning.com/owen/

On 3/19/12 7:45 PM, Baoqiang Cao wrote:
> Thanks again for the reference!
>
> One more question, if you don't mind. How do I get the text keys for
> each cluster? At the beginning we have the input file for seq2sparse,
> and we know its format is
>
> textkey1 text1
> textkey2 text2
> ..
>
> After mahout kmeans and clusterdump, how do I find out from the command
> line which texts belong to, for example, cluster 1? I thought the
> output of clusterdump would have what I want, but so far it is still
> elusive.
>
> Best,
> Baoqiang

Re: can't get thru "-p"

Posted by Baoqiang Cao <bq...@gmail.com>.
Thanks again for the reference!

One more question, if you don't mind. How do I get the text keys for
each cluster? At the beginning we have the input file for seq2sparse,
and we know its format is

textkey1 text1
textkey2 text2
..

After mahout kmeans and clusterdump, how do I find out from the command
line which texts belong to, for example, cluster 1? I thought the
output of clusterdump would have what I want, but so far it is still
elusive.

Best,
Baoqiang



On Mon, Mar 19, 2012 at 4:01 PM, Pat Ferrel <pa...@occamsmachete.com> wrote:
> I guess you figured this out, but the cluster drivers take "-cl", which
> tells them to put points into the calculated clusters and write them to the
> clusteredPoints directory. Then you pass that in to clusterdump with -p.
>
> instructions here:
> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>
> the --help for the mahout cluster drivers is incomplete, check cwiki for
> differences

Re: can't get thru "-p"

Posted by Pat Ferrel <pa...@occamsmachete.com>.
I guess you figured this out, but the cluster drivers take "-cl", which
tells them to put points into the calculated clusters and write them to the
clusteredPoints directory. Then you pass that in to clusterdump with -p.

instructions here: 
https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering

the --help for the mahout cluster drivers is incomplete, check cwiki for 
differences
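
Put together, the two steps might look roughly like this. It is a dry-run sketch that only prints the commands (drop the `echo` to run them). The paths under /mahout are guesses based on the ones used earlier in this thread, kmeans_cmd and clusterdump_cmd are hypothetical helper names, and the name of the final clusters directory depends on how many iterations your run actually took.

```shell
#!/bin/sh
# Hypothetical locations; adjust them to your own layout.
VECTORS=/mahout/sparse/tfidf-vectors
DICT=/mahout/sparse/dictionary.file-0
SEEDS=/mahout/kmeans-seeds
OUT=/mahout/kmeans

# Step 1: cluster with -cl so the driver also writes $OUT/clusteredPoints.
kmeans_cmd() {
  echo mahout kmeans -i "$VECTORS" -c "$SEEDS" -o "$OUT" \
       -k 20 -x 15 -ow -cl \
       -dm org.apache.mahout.common.distance.CosineDistanceMeasure
}

# Step 2: dump the final clusters. -p is an INPUT: the clusteredPoints
# directory produced by step 1. -o names the output file.
clusterdump_cmd() {
  echo mahout clusterdump -s "$1" -p "$OUT/clusteredPoints" \
       -d "$DICT" -dt sequencefile -o clusters.txt
}

kmeans_cmd
clusterdump_cmd "$OUT/clusters-15-final"
```

Printing the commands first makes it easy to check that -p points at a directory the cluster driver wrote, not at an empty directory created by hand.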


On 3/14/12 3:18 PM, Baoqiang Cao wrote:
> Thanks a lot. But I don't know whether I am missing something in front of
> my teary eyes because it is Wednesday afternoon. I have inputs equivalent
> to yours:
>
> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
> /mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points
>
> The cluster files after 15 iterations are in
> /mahout/kmeans/clusters-15-final. /mahout/points is a directory I
> created beforehand. On screen, the output is something like
> "VL-1721020{n=186 c=[...". There are just no output files under that
> directory.
>
> Any help, please.

Re: can't get thru "-p"

Posted by Baoqiang Cao <bq...@gmail.com>.
Thanks a lot. Maybe I'm missing something obvious (tired eyes on a
Wednesday afternoon), but my inputs look equivalent to
yours:

mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
/mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points

The cluster files after 15 iterations are in
/mahout/kmeans/clusters-15-final, and /mahout/points is a directory I
created beforehand. On screen, the output looks something like
"VL-1721020{n=186 c=[...". There are just no output files under that
directory.

Any help, please.


Re: can't get thru "-p"

Posted by Pat Ferrel <pa...@occamsmachete.com>.
The -p parameter is an input. You should pass in the clusteredPoints/
directory that was generated by the cluster driver you used.

My use of fkmeans might be an example:

    mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ -c
    wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters -k 100 -m
    2 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure

This will create
wikipedia-fkmeans-clusters/clusteredPoints/part-m-00000, which is the
file with the clustered points. I then did a clusterdump:

    mahout clusterdump -s
    wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 -p
    wikipedia-fkmeans-clusters/clusteredPoints/ -d 
    wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile -dm
    org.apache.mahout.common.distance.CosineDistanceMeasure

This will output to the screen. Use -o to specify an output file.

Good advice for any Mahout user: read the help output very carefully.
IMHO it is very easy to misunderstand the parameters, inputs, and
outputs. I think I only understand about 10% of them. Try:

    mahout fkmeans --help
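
To make the input/output distinction above concrete, here is a small
Python sketch that assembles a clusterdump command line using only the
flags that appear in this thread (-s, -p, -d, -dt, -o). The helper name,
the clusteredPoints location, and the output file name are assumptions
for illustration, not anything Mahout prescribes. From memory, kmeans
only writes clusteredPoints/ when run with its clustering option (-cl);
please confirm that with "mahout kmeans --help" for your version.

```python
import shlex


def build_clusterdump_cmd(clusters_dir, points_dir, dictionary, output_file=None):
    """Hypothetical helper: build a `mahout clusterdump` argument list.

    Only flags quoted in this thread are used. -s, -p and -d are inputs;
    -o is the only flag here that names an output.
    """
    cmd = [
        "mahout", "clusterdump",
        "-s", clusters_dir,    # input: final clusters directory
        "-p", points_dir,      # input: clusteredPoints/ from the clustering run
        "-d", dictionary,      # input: dictionary file
        "-dt", "sequencefile",
    ]
    if output_file is not None:
        cmd += ["-o", output_file]  # output: file that receives the text dump
    return cmd


cmd = build_clusterdump_cmd(
    "/mahout/kmeans/clusters-15-final",
    "/mahout/kmeans/clusteredPoints",  # assumed location, not from the thread
    "/mahout/sparse/dictionary.file-0",
    output_file="clusterdump.txt",     # hypothetical output file name
)

# Print a shell-safe version of the command for copy and paste.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The point of the sketch is only that an empty directory created by hand
and passed to -p gives clusterdump nothing to read, and nothing is ever
written there; the dump goes to the screen or to the file given with -o.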

