Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/20 21:30:07 UTC

svn commit: r1543937 - /mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext

Author: isabel
Date: Wed Nov 20 20:30:07 2013
New Revision: 1543937

URL: http://svn.apache.org/r1543937
Log:
MAHOUT-1245 - fix formatting

Modified:
    mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext

Modified: mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext?rev=1543937&r1=1543936&r2=1543937&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext Wed Nov 20 20:30:07 2013
@@ -1,72 +1,38 @@
 Title: ClusteringYourData
+
 +*Mahout_0.8*+
 
-After you've done the [Quickstart](quickstart.html)
- and are familiar with the basics of Mahout, it is time to cluster your own
-data. 
+After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
+data. See also [Wikipedia on cluster analysis](en.wikipedia.org/wiki/Cluster_analysis) for more background.
 
 The following pieces *may* be useful for in getting started:
 
 <a name="ClusteringYourData-Input"></a>
 # Input
 
-For starters, you will need your data in an appropriate Vector format
-(which has changed since Mahout 0.1)
-
-* See [Creating Vectors](creating-vectors.html)
+For starters, you will need your data in an appropriate Vector format, see [Creating Vectors](../basics/creating-vectors.html).
+In particular for text preparation check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html).
 
-<a name="ClusteringYourData-TextPreparation"></a>
-## Text Preparation
-
-* See [Creating Vectors from Text](creating-vectors-from-text.html)
-*
-http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
 
 <a name="ClusteringYourData-RunningtheProcess"></a>
 # Running the Process
 
-<a name="ClusteringYourData-Canopy"></a>
-## Canopy
-
-Background: [canopy ](-canopy-clustering.html)
-
-Documentation of running canopy from the command line: [canopy-commandline](canopy-commandline.html)
-
-<a name="ClusteringYourData-kMeans"></a>
-## kMeans
-
-Background: [K-Means Clustering](k-means-clustering.html)
-
-Documentation of running kMeans from the command line: [k-means-commandline](k-means-commandline.html)
-
-Documentation of running fuzzy kMeans from the command line: [fuzzy-k-means-commandline](fuzzy-k-means-commandline.html)
-
-<a name="ClusteringYourData-Dirichlet"></a>
-## Dirichlet
-
-Background: [dirichlet ](-dirichlet-process-clustering.html)
-
-Documentation of running dirichlet from the command line: [dirichlet-commandline](dirichlet-commandline.html)
-
-<a name="ClusteringYourData-Mean-shift"></a>
-## Mean-shift
+* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html).
 
-Background:  [meanshift ](-mean-shift-clustering.html)
+* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and
+[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).
 
-Documentation of running mean shift from the command line: [mean-shift-commandline](mean-shift-commandline.html)
+* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html).
 
-<a name="ClusteringYourData-LatentDirichletAllocation"></a>
-## Latent Dirichlet Allocation
+* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html).
 
-Background and documentation: [LDA](-latent-dirichlet-allocation.html)
+* [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html).
 
-Documentation of running LDA from the command line: [lda-commandline](lda-commandline.html)
 
 <a name="ClusteringYourData-RetrievingtheOutput"></a>
 # Retrieving the Output
 
-Mahout has a cluster dumper utility that can be used to retrieve and
-evaluate your clustering data.
+Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.
 
     ./bin/mahout clusterdump <OPTIONS>
 
@@ -74,39 +40,43 @@ evaluate your clustering data.
 <a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
 ## The cluster dumper options are:
 
-      --help (-h)				   Print out help		    
-      --input (-i) input			   The directory containing
-Sequence    
+      --help (-h)				   Print out help	
+	    
+      --input (-i) input			   The directory containing Sequence    
     					   Files for the Clusters	    
-      --output (-o) output			   The output file.  If not
-specified,  
+
+      --output (-o) output			   The output file.  If not specified,  
     					   dumps to the console.
-      --outputFormat (-of) outputFormat	   The optional output format to
-write
-    					   the results as. Options: TEXT,
-CSV, or GRAPH_ML		 
+
+      --outputFormat (-of) outputFormat	   The optional output format to write
+    					   the results as. Options: TEXT, CSV, or GRAPH_ML		 
+
       --substring (-b) substring		   The number of chars of the	    
-    					   asFormatString() to print	    
+    					   asFormatString() to print	
+    
       --pointsDir (-p) pointsDir		   The directory containing points  
-    					   sequence files mapping input
-vectors 
-    					   to their cluster.  If specified, 
+ 					   sequence files mapping input vectors     					   to their cluster.  If specified, 
     					   then the program will output the 
     					   points associated with a cluster 
+
       --dictionary (-d) dictionary		   The dictionary file. 	    
+
       --dictionaryType (-dt) dictionaryType    The dictionary file type	    
     					   (text|sequencefile)
-      --distanceMeasure (-dm) distanceMeasure  The classname of the
-DistanceMeasure.
+
+      --distanceMeasure (-dm) distanceMeasure  The classname of the DistanceMeasure.
     					   Default is SquaredEuclidean.     
+
       --numWords (-n) numWords		   The number of top terms to print 
+
       --tempDir tempDir			   Intermediate output directory
+
       --startPhase startPhase		   First phase to run
+
       --endPhase endPhase			   Last phase to run
-      --evaluate (-e)			   Run ClusterEvaluator and
-CDbwEvaluator over the
-    					   input. The output will be
-appended to the rest of
+
+      --evaluate (-e)			   Run ClusterEvaluator and CDbwEvaluator over the
+    					   input. The output will be appended to the rest of
     					   the output at the end.   
 
 
@@ -115,10 +85,8 @@ More information on using clusterdump ut
 <a name="ClusteringYourData-ValidatingtheOutput"></a>
 # Validating the Output
 
-From Ted Dunning's response on See
-http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
 {quote}
-A principled approach to cluster evaluation is to measure how well the
+Ted Dunning: A principled approach to cluster evaluation is to measure how well the
 cluster membership captures the structure of unseen data.  A natural
 measure for this is to measure how much of the entropy of the data is
 captured by cluster membership.  For k-means and its natural L_2 metric,
@@ -149,8 +117,3 @@ is working using this kind of inspection
 good at seeing (making up) patterns.
 {quote}
 
-
-<a name="ClusteringYourData-References"></a>
-# References
-
-* [Mahout archive references](http://www.lucidimagination.com/search/p:mahout?q=clustering)
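
For readers of the option list in the diff above, here is a minimal sketch of how a clusterdump invocation might be assembled from those options. All four paths are hypothetical placeholders, not values taken from this commit, and the command is only printed, so nothing here depends on a Mahout installation.

```shell
#!/bin/sh
# Hypothetical locations: replace them with the paths from your own clustering job.
CLUSTERS_DIR="output/clusters-10-final"   # assumed directory of cluster sequence files
POINTS_DIR="output/clusteredPoints"       # assumed directory of points sequence files
DICTIONARY="vectors/dictionary.file-0"    # assumed dictionary written during vector creation
DUMP_FILE="clusterdump.txt"               # assumed local file to receive the dump

# Assemble the command from options documented in the list above.
# A backslash-newline inside double quotes is a line continuation,
# so the result is a single space-separated command line.
CLUSTERDUMP_CMD="./bin/mahout clusterdump \
--input $CLUSTERS_DIR \
--output $DUMP_FILE \
--pointsDir $POINTS_DIR \
--dictionary $DICTIONARY \
--dictionaryType sequencefile \
--numWords 20"

# Print the command for review instead of executing it.
echo "$CLUSTERDUMP_CMD"

# To run it from the Mahout home directory, uncomment the next line
# (safe only because none of the paths contain spaces):
# $CLUSTERDUMP_CMD
```

Leaving out --output would dump to the console instead, and leaving out --pointsDir would skip the listing of points per cluster, as described in the option list.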