Posted to dev@mahout.apache.org by "Pavan Kumar N (JIRA)" <ji...@apache.org> on 2014/04/03 12:17:16 UTC

[jira] [Comment Edited] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

    [ https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13958685#comment-13958685 ] 

Pavan Kumar N edited comment on MAHOUT-1450 at 4/3/14 10:16 AM:
----------------------------------------------------------------

[~ssc] Thanks for your comment. As [~smarthi] mentioned in earlier comments, using Canopy to seed initial clusters for k-means is an old-school and inefficient approach; kindly reconsider whether the implementation section needs that documentation. It currently describes how to use Canopy's initial clusters to run k-means, followed by a flowchart of the same. I suggest adding a flowchart for StreamingKMeans instead.

This is for your consideration. Other than that, everything else seems fine to me. The Reuters k-means example (cluster-reuters.sh) worked for me. I suggest adding a note that a first-time user should run "mvn install" (with -DskipTests if necessary) before running these examples.
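For example, from the top of the Mahout source tree (a minimal sketch, assuming MAHOUT_HOME points at the checkout):

cd $MAHOUT_HOME
mvn install -DskipTests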


> Cleaning up clustering documentation on mahout website 
> -------------------------------------------------------
>
>                 Key: MAHOUT-1450
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1450
>             Project: Mahout
>          Issue Type: Documentation
>          Components: Documentation
>         Environment: This affects all mahout versions
>            Reporter: Pavan Kumar N
>              Labels: documentation, newbie
>             Fix For: 1.0
>
>
> On the canopy clustering page, the "Strategy for parallelization" section has some dead links. They need to be removed or replaced with working links (if any exist). Here is the page:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are some details of the dead links on the k-means clustering pages:
> On the k-Means clustering - basics page:
> In the first line of the Quickstart section, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the "Strategy for parallelization" section, the hyperlink "Cluster computing and MapReduce" in the first sentence, the hyperlink "here" in the second sentence, and the hyperlink "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" in the last sentence are all dead:
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> On the page http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html,
> in the second sentence of the Pre-prep section, the hyperlink "setup mahout" is dead:
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous; I recommend making the following changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it in the examples folder under the Mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
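> The steps below assume the Mahout and Hadoop launch scripts are reachable; a minimal environment sketch (the install paths are assumptions, adjust them to your machine):
> export MAHOUT_HOME=/home/bigdata/mahout-distribution-0.7
> export HADOOP_HOME=/home/bigdata/hadoop
> export PATH=$MAHOUT_HOME/bin:$HADOOP_HOME/bin:$PATH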
> Mahout-specific commands
> #1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class to unpack the articles into one text file per document
> ${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-out reuters-text
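> To sanity-check the extraction, the output directory should now hold one .txt file per article (the listing command is just an illustration):
> ls reuters-text | head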
> #2 Copy the extracted files to HDFS
> bin/hadoop fs -copyFromLocal /home/bigdata/mahout-distribution-0.7/examples/reuters-text hdfs://localhost:54310/user/bigdata/
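> To confirm the copy landed where expected (an illustrative check):
> bin/hadoop fs -ls hdfs://localhost:54310/user/bigdata/reuters-text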
> #3 Generate sequence files from the text
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → chunk size in megabytes for each generated sequence-file part
> -c → character encoding of the input files (UTF-8 here)
> #4 Check the generated sequence files
> mahout-0.7$ ./bin/mahout seqdumper -i /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
> #5 Generate sparse vectors from the sequence files
> mahout seq2sparse -i hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite the output directory if it already exists
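> seq2sparse also takes weighting and vector-naming options that matter for clustering quality; a sketch (the flags exist in 0.7, the combination here is only illustrative):
> mahout seq2sparse -i hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o hdfs://localhost:54310/user/bigdata/reuters-vectors -ow -wt tfidf -nv
> -wt → term weighting scheme (tf or tfidf)
> -nv → emit named vectors, which makes the clusterdump output in step #10 easier to read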
> #6 List the output directory; it should contain these 7 items
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 Check the vectors in reuters-vectors/tf-vectors/part-r-00000
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get good initial centroids for k-means
> mahout canopy -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 2000 -t2 1500
> -dm → the distance measure used while clustering (here the cosine distance measure)
> -t1, -t2 → the loose and tight canopy thresholds (T1 must be larger than T2)
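> A side note worth adding to the docs: if I recall Mahout's CosineDistanceMeasure correctly, it returns 1 - cosine similarity, so distances lie between 0 and 2 (between 0 and 1 for non-negative term vectors). Thresholds in the thousands only make sense for unbounded measures such as EuclideanDistanceMeasure; with cosine distance a sketch like this is more plausible (the values are assumptions to tune):
> mahout canopy -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors -o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 0.5 -t2 0.3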
> #9 Run the k-means clustering algorithm
> mahout kmeans -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow -x 20 -k 10
> -i → input
> -o → output
> -c → initial centroids for k-means (omitting this parameter makes k-means generate random initial centroids)
> -cd → convergence delta
> -ow → overwrite
> -x → maximum number of k-means iterations
> -k → number of clusters
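> One caveat the docs should state: if I remember the 0.7 driver correctly, passing -k makes k-means pick k random points and write them over the -c path, discarding the canopy centroids. To actually seed from canopy, drop -k, e.g.:
> mahout kmeans -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors -c hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids -o hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters -cd 0.1 -ow -x 20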
> #10 Export the k-means output using the cluster dumper tool
> mahout clusterdump -dt sequencefile -d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-* -i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final -o clusters.txt -b 15
> -dt → dictionary type (a sequence file here)
> -b → truncate each formatted cluster string to this many characters
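> The dump is a plain local text file, so the top terms per cluster can be inspected directly, for example:
> head -50 clusters.txt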
> Mahout 0.7 did have some problems with the DisplayKMeans module, which should ideally display the clusters in a 2D graph: it gave me the same output for different input datasets. I was using a dataset of recent news items crawled from various websites.


