Posted to user@mahout.apache.org by Pat Ferrel <pa...@occamsmachete.com> on 2012/03/21 21:13:39 UTC

What to do with empty docs

You may want to tune your analyzer to let more tokens through. The empty 
docs may still be counted in the TFIDF part of the analysis when IDF is 
calculated, but I'm not sure.

As to clusters with empty docs, I don't see how you can avoid that 
unless you drop them before running clustering. I can't think of any 
useful information they add to clustering.
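
One way to see which documents would need dropping is to scan the text that
seqdumper prints for the clusteredPoints file and separate the records whose
vector is empty. The following is a minimal sketch, not part of Mahout. It
assumes one record per line in the shape quoted further down this thread
("Key: <clusterid>: Value: ... vec: [<name> = ][...]"), and the function name
split_points is made up for illustration. It only reports the empty documents;
removing them from the input to seq2sparse or kmeans is a separate step.

```python
import re
from collections import defaultdict

# Assumed record shape, taken from the seqdumper output quoted in this thread:
#   Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
#   Key: 21477: Value: wt: 1.0distance: 0.94  vec: /-tmp/doc.txt = [372:0.318]
# The "<name> = " part only appears when the vectors were built with -nv.
LINE_RE = re.compile(
    r"^Key:\s*(?P<cluster>\d+):\s*Value:.*?vec:\s*"
    r"(?:(?P<name>.*?)\s*=\s*)?\[(?P<body>[^\]]*)\]\s*$"
)


def split_points(lines):
    """Group document names by cluster id, separating empty vectors.

    Returns (kept, empty): two dicts mapping cluster id (as a string) to
    the list of document names whose vector is non-empty / empty.
    Lines that do not look like a record (log output, etc.) are skipped.
    """
    kept = defaultdict(list)
    empty = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a "Key: ... vec: [...]" record
        target = kept if match.group("body").strip() else empty
        # Unnamed vectors (no -nv) carry no document name in the dump.
        target[match.group("cluster")].append(match.group("name") or "<unnamed>")
    return dict(kept), dict(empty)
```

To use it, save the seqdumper output to a text file and pass the open file
object to split_points. The names listed under "empty" are then the documents
to leave out, or to recover by loosening the analyzer.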

On 3/21/12 12:06 PM, Baoqiang Cao wrote:
> This is extremely helpful! Thanks a lot.
> Indeed, after seqdumper I got such:
>
> Key: 1774184: Value: wt: 1.0distance: 0.8839410915753125  vec: [123426:1.000]
> Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
> Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
> Key: 1705919: Value: wt: 1.0distance: 0.0  vec: []
>
> And after I redid the seq2sparse part with "-nv", I got the postid as
> well. So it is clear that, due to stop words and rare occurrences,
> some documents end up with no words left: empty documents. That is why
> I got "vec: []". Are there any suggestions for dealing with them? Leave
> them as is and always be aware of one trivial cluster (empty vectors)? Or...
>
> Thanks again for wonderful help!
>
>
> On Tue, Mar 20, 2012 at 10:35 AM, Pat Ferrel<pa...@occamsmachete.com>  wrote:
>> What I have done is to use "seq2sparse -nv" to get named vectors.
>>
>>    mahout seq2sparse \
>>        -i reuters-seqfiles/ \
>>        -o reuters-vectors/ \
>>        -ow -chunk 100 \
>>        -x 90 \
>>        -seq \
>>        -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
>>        -ml 50 \
>>        -n 2 \
>>        -nv
>>
>> This will use the filename as the key in the vector sequence file. The keys
>> will remain the same through the clustering phase.
>>
>> Run "kmeans -cl" to get the clusteredPoint dir created and in it you will
>> find a part-m-00000
>>
>>    mahout kmeans \
>>        -i reuters-vectors/tfidf-vectors/ \
>>        -c reuters-kmeans-centroids \
>>        -cl \
>>        -o reuters-kmeans-clusters \
>>        -k 20 \
>>        -ow \
>>        -x 10 \
>>        -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>
>> Use seqdumper:
>>
>>    mahout seqdumper -s \
>>        reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more
>>
>> You will see that the file contains
>>
>>    key: clusterid, value: wt = % likelihood the vector is in cluster,
>>    distance from centroid, named vector belonging to the cluster,
>>    vector data.
>>
>> For kmeans the likelihood will be 1.0 or 0. For example:
>>
>>    Key: 21477: Value: wt: 1.0distance: 0.9420744909793364  vec:
>>    /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230,
>>    8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270,
>>    14371:0.413]
>>
>> Clusters, of course, cannot have names. A simple solution is to construct a
>> name from the top terms in the centroid output from clusterdump.
>>
>> I also recommend /Mahout in Action/ from Manning Publications. You can buy
>> it here:
>> http://manning.com/owen/
>>
>>
>> On 3/19/12 7:45 PM, Baoqiang Cao wrote:
>>> Thanks again for the reference!
>>>
>>> One more question if you don't mind. How do I get the text keys for
>>> each cluster? I mean, so at beginning we have input file for
>>> seq2sparse, and we knew the format is
>>>
>>> textkey1 text1
>>> textkey2 text2
>>> ..
>>>
>>> after mahout kmeans, and clusterdump, how do I know, for example,
>>> which texts belong to cluster 1 from command line? I thought the
>>> outputs from clusterdump have what I want, but so far, it is still
>>> elusive.
>>>
>>> Best,
>>> Baoqiang
>>>
>>>
>>>
>>> On Mon, Mar 19, 2012 at 4:01 PM, Pat Ferrel<pa...@occamsmachete.com>    wrote:
>>>> I guess you figured this out, but the cluster drivers take "-cl", which
>>>> tells them to put points into the calculated clusters and output to the
>>>> clusteredPoints directory. Then you pass that in to clusterdump.
>>>>
>>>> instructions here:
>>>> https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering
>>>>
>>>> The --help for the mahout cluster drivers is incomplete; check cwiki for
>>>> differences.
>>>>
>>>>
>>>>
>>>> On 3/14/12 3:18 PM, Baoqiang Cao wrote:
>>>>> Thanks a lot. But I don't know if I am missing something in front of my
>>>>> teary eyes because it is Wednesday afternoon, or what. My inputs are
>>>>> equivalent to yours:
>>>>>
>>>>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>>>>> /mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points
>>>>>
>>>>> The cluster files after 15 iterations are in
>>>>> /mahout/kmeans/clusters-15-final. /mahout/points is a directory I
>>>>> created beforehand. On screen, the output is something like
>>>>> "VL-1721020{n=186 c=[...". There are just no output files under that
>>>>> directory.
>>>>>
>>>>> Any help, please?
>>>>>
>>>>> On Wed, Mar 14, 2012 at 2:13 PM, Pat Ferrel<pa...@occamsmachete.com>
>>>>>   wrote:
>>>>>> The -p parameter is an input. You should pass in the clusteredPoints/
>>>>>> directory that was generated by the cluster driver you used.
>>>>>>
>>>>>> My use of fkmeans might be an example:
>>>>>>
>>>>>>    mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ \
>>>>>>        -c wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters \
>>>>>>        -k 100 -m 2 -ow -x 10 \
>>>>>>        -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>>
>>>>>> This will create
>>>>>> wikipedia-clusters/clusters/clusteredPoints/part-m-00000
>>>>>> which is the file with the clustered points. I then did a clusterdump
>>>>>>
>>>>>>    mahout clusterdump \
>>>>>>        -s wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 \
>>>>>>        -p wikipedia-fkmeans-clusters/clusteredPoints/ \
>>>>>>        -d wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile \
>>>>>>        -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>>>>>
>>>>>> This will output to the screen. Use -o to specify an output file.
>>>>>>
>>>>>> Good advice for any user of mahout is to read the output of the help
>>>>>> very carefully. IMHO it is very easy to misunderstand the parameters,
>>>>>> inputs, and outputs. I think I only understand about 10%. Try:
>>>>>>
>>>>>>    mahout fkmeans --help
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 3/14/12 10:52 AM, Baoqiang Cao wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Very sorry for such a trivial question but ran out of luck. I'm trying
>>>>>>> to see which points (thru point-ids) belong to which cluster center.
>>>>>>> Here is what I did:
>>>>>>>
>>>>>>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>>>>>>> /mahout/sparse/dictionary.file-0 -dt sequencefile -p /mahout/points > out
>>>>>>>
>>>>>>> The onscreen output is:
>>>>>>>
>>>>>>> 12/03/14 12:39:52 INFO common.AbstractJob: Command line arguments:
>>>>>>> {--dictionary=/mahout/sparse/dictionary.file-0,
>>>>>>> --dictionaryType=sequencefile,
>>>>>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>>>>> --endPhase=2147483647, --outputFormat=TEXT,
>>>>>>> --pointsDir=/mahout/points,
>>>>>>> --seqFileDir=/mahout/kmeans/clusters-15-final, --startPhase=0,
>>>>>>> --tempDir=temp}
>>>>>>> 12/03/14 12:39:55 WARN snappy.LoadSnappy: Snappy native library is
>>>>>>> available
>>>>>>> 12/03/14 12:39:55 INFO util.NativeCodeLoader: Loaded the native-hadoop
>>>>>>> library
>>>>>>> 12/03/14 12:39:55 INFO snappy.LoadSnappy: Snappy native library loaded
>>>>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>>>>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>>>>>>> 12/03/14 12:42:07 INFO clustering.ClusterDumper: Wrote 5188 clusters
>>>>>>> 12/03/14 12:42:07 INFO driver.MahoutDriver: Program took 135276 ms
>>>>>>> (Minutes: 2.2546)
>>>>>>>
>>>>>>>
>>>>>>> There is nothing under "/mahout/points". Any help on why and how?
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>> Baoqiang
>>>>>>>