Posted to commits@mahout.apache.org by ap...@apache.org on 2015/04/03 03:36:01 UTC
svn commit: r1670995 - /mahout/site/mahout_cms/trunk/content/users/clustering/cluster-dumper.mdtext
Author: apalumbo
Date: Fri Apr 3 01:36:01 2015
New Revision: 1670995
URL: http://svn.apache.org/r1670995
Log:
removed link to missing eclips info. added CLI usage
Modified:
mahout/site/mahout_cms/trunk/content/users/clustering/cluster-dumper.mdtext
Modified: mahout/site/mahout_cms/trunk/content/users/clustering/cluster-dumper.mdtext
URL: http://svn.apache.org/viewvc/mahout/site/mahout_cms/trunk/content/users/clustering/cluster-dumper.mdtext?rev=1670995&r1=1670994&r2=1670995&view=diff
==============================================================================
--- mahout/site/mahout_cms/trunk/content/users/clustering/cluster-dumper.mdtext (original)
+++ mahout/site/mahout_cms/trunk/content/users/clustering/cluster-dumper.mdtext Fri Apr 3 01:36:01 2015
@@ -1,7 +1,7 @@
Title: Cluster Dumper
<a name="ClusterDumper-Introduction"></a>
-# Cluster Dumper - Introduction
+## Cluster Dumper - Introduction
Clustering tasks in Mahout will output data in the format of a SequenceFile
(Text, Cluster) and the Text is a cluster identifier string. To analyze
@@ -9,39 +9,58 @@ this output we need to convert the seque
format and this is achieved using the clusterdump utility.
<a name="ClusterDumper-Stepsforanalyzingclusteroutputusingclusterdumputility"></a>
-# Steps for analyzing cluster output using clusterdump utility
+## Steps for analyzing cluster output using clusterdump utility
After you've executed a clustering tasks (either examples or real-world),
-you can run clusterdumper in 2 modes.
+you can run clusterdumper in 2 modes:
+
+
+1. Hadoop Environment
+1. Standalone Java Program
-1. [Hadoop Environment](#hadoopenvironment.html)
-1. [Standalone Java Program ](#standalonejavaprogram.html)
<a name="ClusterDumper-HadoopEnvironment{anchor:HadoopEnvironment}"></a>
### Hadoop Environment
If you have setup your HADOOP_HOME environment variable, you can use the
-command line utility "mahout" to execute the ClusterDumper on Hadoop. In
+command line utility `mahout` to execute the ClusterDumper on Hadoop. In
this case we wont need to get the output clusters to our local machines.
The utility will read the output clusters present in HDFS and output the
human-readable cluster values into our local file system. Say you've just
executed the [synthetic control example ](clustering-of-synthetic-control-data.html)
- and want to analyze the output, you can execute
-
-
-### Standalone Java Program {anchor:StandaloneJavaProgram}
-
-ClusterDumper can be run using CLI. If your HADOOP_HOME environment
-variable is not set, you can execute ClusterDumper using "mahout" command
-line utility.
+ and want to analyze the output, you can execute the `mahout clusterdumper` utility from the command line.
-Get the output data from hadoop into your local machine. For example, in
-the case where you've executed a clustering example use
+#### CLI options:
+ --help Print out help
+ --input (-i) input The directory containing Sequence
+ Files for the Clusters
+ --output (-o) output The output file. If not specified,
+ dumps to the console.
+ --outputFormat (-of) outputFormat The optional output format to write
+ the results as. Options: TEXT, CSV, or GRAPH_ML
+ --substring (-b) substring The number of chars of the
+ asFormatString() to print
+ --pointsDir (-p) pointsDir The directory containing points
+ sequence files mapping input vectors
+ to their cluster. If specified,
+ then the program will output the
+ points associated with a cluster
+ --dictionary (-d) dictionary The dictionary file.
+ --dictionaryType (-dt) dictionaryType The dictionary file type
+ (text|sequencefile)
+ --distanceMeasure (-dm) distanceMeasure The classname of the DistanceMeasure.
+ Default is SquaredEuclidean.
+ --numWords (-n) numWords The number of top terms to print
+ --tempDir tempDir Intermediate output directory
+ --startPhase startPhase First phase to run
+ --endPhase endPhase Last phase to run
+ --evaluate (-e) Run ClusterEvaluator and CDbwEvaluator over the
+ input. The output will be appended to the rest of
+ the output at the end.
-This will create a folder called output inside your $MAHOUT_HOME/examples
-and will have sub-folders for each cluster outputs and ClusteredPoints
+### Standalone Java Program
-Run the clusterdump utility as follows as a standalone Java Program through Eclipse - if you are using eclipse, setup mahout-utils as a project as specified in [Working with Maven in Eclipse](../../developers/buildingmahout.html).
+Run the clusterdump utility as follows as a standalone Java Program through Eclipse. <!-- - if you are using eclipse, setup mahout-utils as a project as specified in [Working with Maven in Eclipse](../../developers/buildingmahout.html). -->
To execute ClusterDumper.java,
* Under mahout-utils, Right-Click on ClusterDumper.java