Posted to dev@mahout.apache.org by "Pavan Kumar N (JIRA)" <ji...@apache.org> on 2014/03/21 11:28:45 UTC

[jira] [Comment Edited] (MAHOUT-1450) Cleaning up clustering documentation on mahout website

    [ https://issues.apache.org/jira/browse/MAHOUT-1450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942949#comment-13942949 ] 

Pavan Kumar N edited comment on MAHOUT-1450 at 3/21/14 10:27 AM:
-----------------------------------------------------------------

Thanks Andrew. I just modified the issue so that MAHOUT-1450 itself covers all the clustering algorithms. In canopy clustering, the "Strategy for parallelization" section seems to have some dead links. They need to be cleaned up and replaced with new links (if any exist). Here is the link:

http://mahout.apache.org/users/clustering/canopy-clustering.html

I also added a new issue for the streamingkmeans documentation; please take a look.


> Cleaning up clustering documentation on mahout website 
> -------------------------------------------------------
>
>                 Key: MAHOUT-1450
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1450
>             Project: Mahout
>          Issue Type: Documentation
>          Components: Documentation
>         Environment: This affects all mahout versions
>            Reporter: Pavan Kumar N
>              Labels: documentation, newbie
>             Fix For: 1.0
>
>
> In canopy clustering, the "Strategy for parallelization" section seems to have some dead links. They need to be cleaned up and replaced with new links (if any exist). Here is the page:
> http://mahout.apache.org/users/clustering/canopy-clustering.html
> Here are the details of the dead links on the k-means clustering pages:
> On the k-Means clustering - basics page, in the first line of the Quickstart section, the hyperlink "Here" is dead:
> http://mahout.apache.org/users/clustering/k-means-clustering%5Equickstart-kmeans.sh.html
> In the first sentence of the "Strategy for parallelization" section, the hyperlink "Cluster computing and MapReduce" is dead; in the second sentence, the hyperlink "here" is dead; and in the last sentence, the hyperlink "http://www2.chass.ncsu.edu/garson/PA765/cluster.htm" is dead:
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/listing.html
> http://code.google.com/edu/content/submissions/mapreduce-minilecture/lec4-clustering.ppt
> http://www2.chass.ncsu.edu/garson/PA765/cluster.htm
> On the page http://mahout.apache.org/users/clustering/visualizing-sample-clusters.html, in the second sentence of the Pre-prep section, the hyperlink "setup mahout" is dead:
> http://mahout.apache.org/users/clustering/users/basics/quickstart.html
> The existing documentation is too ambiguous, and I recommend making the following changes so that new users can use it as a tutorial.
> The Quickstart should be replaced with the following:
> Get the data from:
> wget http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
> Place it in the examples folder under the Mahout home directory:
> mahout-0.7/examples/reuters
> mkdir reuters
> cd reuters
> mkdir reuters-out
> mv reuters21578.tar.gz reuters-out
> cd reuters-out
> tar -xzvf reuters21578.tar.gz
> cd ..
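> As a quick sanity check (assuming the archive unpacked directly into reuters-out), the Reuters-21578 collection should contain 22 .sgm files:
> ls reuters-out/*.sgm | wc -l   # expect 22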
> Mahout-specific commands
> #1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class
> ${MAHOUT_HOME}/bin/mahout org.apache.lucene.benchmark.utils.ExtractReuters \
>   reuters-out reuters-text
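> ExtractReuters writes one plain-text file per article into reuters-text; a quick listing (run from the reuters directory, as above) confirms the extraction:
> ls reuters-text | head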
> #2 Copy the extracted files to your HDFS
> bin/hadoop fs -copyFromLocal /home/bigdata/mahout-distribution-0.7/examples/reuters-text \
>   hdfs://localhost:54310/user/bigdata/
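> To verify the upload before moving on, list the directory on HDFS (same path as above):
> bin/hadoop fs -ls hdfs://localhost:54310/user/bigdata/reuters-text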
> #3 Generate sequence files
> mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text \
>   -o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
> -chunk → chunk size in MB for the output sequence files
> -c UTF-8 → character encoding of the input files
> #4 Check the generated sequence file
> mahout-0.7$ ./bin/mahout seqdumper -i /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less
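> To just count the documents instead of paging through them, seqdumper should also accept a count option (-c here is an assumption; confirm with ./bin/mahout seqdumper --help on your version):
> mahout-0.7$ ./bin/mahout seqdumper -i /your-hdfs-path-to/reuters-seqfiles/chunk-0 -c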
> #5 Generate vector files from the sequence files
> mahout seq2sparse -i hdfs://localhost:54310/user/bigdata/reuters-seqfiles \
>   -o hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
> -ow → overwrite
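> If you would like human-readable document IDs in the final cluster dump, seq2sparse can emit named vectors; the -nv (--namedVector) flag below is an assumption worth verifying against your Mahout version:
> mahout seq2sparse -i hdfs://localhost:54310/user/bigdata/reuters-seqfiles \
>   -o hdfs://localhost:54310/user/bigdata/reuters-vectors -ow -nv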
> #6 Take a look at the output; it should contain 7 items:
> bin/hadoop fs -ls reuters-vectors
> reuters-vectors/df-count
> reuters-vectors/dictionary.file-0
> reuters-vectors/frequency.file-0
> reuters-vectors/tf-vectors
> reuters-vectors/tfidf-vectors
> reuters-vectors/tokenized-documents
> reuters-vectors/wordcount
> #7 Check the generated vector files under reuters-vectors/tf-vectors (e.g. part-r-00000)
> mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors
> #8 Run canopy clustering to get optimal initial centroids for k-means
> mahout canopy -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors \
>   -o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 1500 -t2 2000
> -dm → distance measure to use while clustering (here, the cosine distance measure)
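> Note that T1 is the loose (outer) canopy threshold and T2 the tight (inner) one, so T1 should normally be larger than T2. Also, cosine distances between tf vectors fall between 0 and 1, so far smaller thresholds may behave better; the values below are only a sketch to experiment with, not recommended settings:
> mahout canopy -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors \
>   -o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure -t1 0.5 -t2 0.3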
> #9 Run k-means clustering algorithm
> mahout kmeans -i hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors \
>   -c hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
>   -o hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters \
>   -cd 0.1 -ow -x 20 -k 10
> -i → input
> -o → output
> -c → path to the initial centroids (here, the canopy centroids); note that when -k is also specified, Mahout samples k random points and writes them to this path, overriding the canopy results
> -cd → convergence delta parameter
> -ow → overwrite
> -x → maximum number of k-means iterations
> -k → number of clusters
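> Because of the -k behavior noted above, dropping -k keeps the canopy output as the initial centroids. Either way, once k-means converges the output should contain a clusters-*-final directory (the iteration number varies by run):
> bin/hadoop fs -ls hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters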
> #10 Export k-means output using Cluster Dumper tool
> mahout clusterdump -dt sequencefile \
>   -d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-* \
>   -i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final \
>   -o clusters.txt -b 15
> -dt → dictionary file type (text or sequencefile)
> -b → truncate each cluster's formatted string to the given number of characters
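> If you also want each document's cluster assignment, one approach (a sketch, assuming Mahout 0.7 option names) is to re-run kmeans with the -cl flag, which writes a clusteredPoints directory, and pass that to clusterdump with -p:
> mahout clusterdump -dt sequencefile \
>   -d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-* \
>   -i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final \
>   -p hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusteredPoints \
>   -o clusters.txt -b 15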
> Mahout 0.7 did have some problems with the DisplayKmeans module, which should ideally display the clusters in a 2-D graph: it gave me the same output for different input datasets. I was using a dataset of recent news items crawled from various websites.
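> For anyone trying to reproduce this, one way to launch the demo (a sketch: it assumes the mahout-examples module is built from source and a graphical display is available) is:
> cd ${MAHOUT_HOME}/examples
> mvn -q exec:java -Dexec.mainClass=org.apache.mahout.clustering.display.DisplayKMeans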



--
This message was sent by Atlassian JIRA
(v6.2#6252)