Posted to commits@mahout.apache.org by is...@apache.org on 2013/11/20 21:30:07 UTC
svn commit: r1543937 -
/mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext
Author: isabel
Date: Wed Nov 20 20:30:07 2013
New Revision: 1543937
URL: http://svn.apache.org/r1543937
Log:
MAHOUT-1245 - fix formatting
Modified:
mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext?rev=1543937&r1=1543936&r2=1543937&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/clustering/clusteringyourdata.mdtext Wed Nov 20 20:30:07 2013
@@ -1,72 +1,38 @@
Title: ClusteringYourData
+
+*Mahout_0.8*+
-After you've done the [Quickstart](quickstart.html)
- and are familiar with the basics of Mahout, it is time to cluster your own
-data.
+After you've done the [Quickstart](quickstart.html) and are familiar with the basics of Mahout, it is time to cluster your own
+data. See also [Wikipedia on cluster analysis](http://en.wikipedia.org/wiki/Cluster_analysis) for more background.
The following pieces *may* be useful in getting started:
<a name="ClusteringYourData-Input"></a>
# Input
-For starters, you will need your data in an appropriate Vector format
-(which has changed since Mahout 0.1)
-
-* See [Creating Vectors](creating-vectors.html)
+For starters, you will need your data in an appropriate Vector format; see [Creating Vectors](../basics/creating-vectors.html).
+For text preparation in particular, check out [Creating Vectors from Text](../basics/creating-vectors-from-text.html).
-<a name="ClusteringYourData-TextPreparation"></a>
-## Text Preparation
-
-* See [Creating Vectors from Text](creating-vectors-from-text.html)
-*
-http://www.lucidimagination.com/search/document/4a0e528982b2dac3/document_clustering
<a name="ClusteringYourData-RunningtheProcess"></a>
# Running the Process
-<a name="ClusteringYourData-Canopy"></a>
-## Canopy
-
-Background: [canopy ](-canopy-clustering.html)
-
-Documentation of running canopy from the command line: [canopy-commandline](canopy-commandline.html)
-
-<a name="ClusteringYourData-kMeans"></a>
-## kMeans
-
-Background: [K-Means Clustering](k-means-clustering.html)
-
-Documentation of running kMeans from the command line: [k-means-commandline](k-means-commandline.html)
-
-Documentation of running fuzzy kMeans from the command line: [fuzzy-k-means-commandline](fuzzy-k-means-commandline.html)
-
-<a name="ClusteringYourData-Dirichlet"></a>
-## Dirichlet
-
-Background: [dirichlet ](-dirichlet-process-clustering.html)
-
-Documentation of running dirichlet from the command line: [dirichlet-commandline](dirichlet-commandline.html)
-
-<a name="ClusteringYourData-Mean-shift"></a>
-## Mean-shift
+* [Canopy background](canopy-clustering.html) and [canopy-commandline](canopy-commandline.html).
-Background: [meanshift ](-mean-shift-clustering.html)
+* [K-Means background](k-means-clustering.html), [k-means-commandline](k-means-commandline.html), and
+[fuzzy-k-means-commandline](fuzzy-k-means-commandline.html).
-Documentation of running mean shift from the command line: [mean-shift-commandline](mean-shift-commandline.html)
+* [Dirichlet background](dirichlet-process-clustering.html) and [dirichlet-commandline](dirichlet-commandline.html).
-<a name="ClusteringYourData-LatentDirichletAllocation"></a>
-## Latent Dirichlet Allocation
+* [Meanshift background](mean-shift-clustering.html) and [mean-shift-commandline](mean-shift-commandline.html).
-Background and documentation: [LDA](-latent-dirichlet-allocation.html)
+* [LDA (Latent Dirichlet Allocation) background](-latent-dirichlet-allocation.html) and [lda-commandline](lda-commandline.html).
-Documentation of running LDA from the command line: [lda-commandline](lda-commandline.html)
<a name="ClusteringYourData-RetrievingtheOutput"></a>
# Retrieving the Output
-Mahout has a cluster dumper utility that can be used to retrieve and
-evaluate your clustering data.
+Mahout has a cluster dumper utility that can be used to retrieve and evaluate your clustering data.
./bin/mahout clusterdump <OPTIONS>
@@ -74,39 +40,43 @@ evaluate your clustering data.
<a name="ClusteringYourData-Theclusterdumperoptionsare:"></a>
## The cluster dumper options are:
- --help (-h) Print out help
- --input (-i) input The directory containing
-Sequence
+ --help (-h) Print out help
+
+ --input (-i) input The directory containing Sequence
Files for the Clusters
- --output (-o) output The output file. If not
-specified,
+
+ --output (-o) output The output file. If not specified,
dumps to the console.
- --outputFormat (-of) outputFormat The optional output format to
-write
- the results as. Options: TEXT,
-CSV, or GRAPH_ML
+
+ --outputFormat (-of) outputFormat The optional output format to write
+ the results as. Options: TEXT, CSV, or GRAPH_ML
+
--substring (-b) substring The number of chars of the
- asFormatString() to print
+ asFormatString() to print
+
--pointsDir (-p) pointsDir The directory containing points
- sequence files mapping input
-vectors
- to their cluster. If specified,
+ sequence files mapping input vectors to their cluster. If specified,
then the program will output the
points associated with a cluster
+
--dictionary (-d) dictionary The dictionary file.
+
--dictionaryType (-dt) dictionaryType The dictionary file type
(text|sequencefile)
- --distanceMeasure (-dm) distanceMeasure The classname of the
-DistanceMeasure.
+
+ --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure.
Default is SquaredEuclidean.
+
--numWords (-n) numWords The number of top terms to print
+
--tempDir tempDir Intermediate output directory
+
--startPhase startPhase First phase to run
+
--endPhase endPhase Last phase to run
- --evaluate (-e) Run ClusterEvaluator and
-CDbwEvaluator over the
- input. The output will be
-appended to the rest of
+
+ --evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the
+ input. The output will be appended to the rest of
the output at the end.
@@ -115,10 +85,8 @@ More information on using clusterdump ut
<a name="ClusteringYourData-ValidatingtheOutput"></a>
# Validating the Output
-From Ted Dunning's response on See
-http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
{quote}
-A principled approach to cluster evaluation is to measure how well the
+Ted Dunning: A principled approach to cluster evaluation is to measure how well the
cluster membership captures the structure of unseen data. A natural
measure for this is to measure how much of the entropy of the data is
captured by cluster membership. For k-means and its natural L_2 metric,
@@ -149,8 +117,3 @@ is working using this kind of inspection
good at seeing (making up) patterns.
{quote}
-
-<a name="ClusteringYourData-References"></a>
-# References
-
-* [Mahout archive references](http://www.lucidimagination.com/search/p:mahout?q=clustering)