Posted to commits@mahout.apache.org by co...@apache.org on 2010/03/30 01:07:00 UTC

[CONF] Apache Lucene Mahout > SyntheticControlData

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: SyntheticControlData (http://cwiki.apache.org/confluence/display/MAHOUT/SyntheticControlData)


Edited by Jeff Eastman:
---------------------------------------------------------------------
h1. Introduction

This quick start page shows how to run the clustering Synthetic Control Data example. The data is described [here | http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html].


h1. Steps

* Download the data at http://archive.ics.uci.edu/ml/datasets/Synthetic+Control+Chart+Time+Series. 
* In $MAHOUT_HOME/, build the Job file
** The same job is used for all examples so this only needs to be done once
** mvn install
** The job will be generated in $MAHOUT_HOME/examples/target/ and its name will contain the $MAHOUT_VERSION number. For example, when using the Mahout 0.3 release, the job will be mahout-examples-0.3.job
* (Optional){footnote}This step should be skipped when using standalone Hadoop{footnote} Start up Hadoop: $HADOOP_HOME/bin/start-all.sh
* Put the data: $HADOOP_HOME/bin/hadoop fs -put <PATH TO DATA> testdata
* Run the Job: $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job {footnote}Substitute in whichever clustering Job you want here: KMeans, Canopy, etc. See subdirectories of $MAHOUT_HOME/examples/src/main/java/org/apache/mahout/clustering/syntheticcontrol/.{footnote}
** For [kmeans | k-Means]:  $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
** For [canopy | Canopy Clustering]:  $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job  org.apache.mahout.clustering.syntheticcontrol.canopy.Job
** For [dirichlet | Dirichlet Process Clustering]: $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.dirichlet.Job
** For [meanshift | Mean Shift]: $HADOOP_HOME/bin/hadoop jar  $MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job org.apache.mahout.clustering.syntheticcontrol.meanshift.Job
* Get the data out of HDFS{footnote}See [HDFS Shell | http://hadoop.apache.org/core/docs/current/hdfs_shell.html]{footnote}{footnote}The output directory is cleared when a new run starts so the results must be retrieved before a new run{footnote} and have a look{footnote}Dirichlet also prints data to console{footnote}
** All example jobs read their input from the _testdata_ directory and write their results to the _output_ directory
** Use _bin/hadoop fs -lsr output_ to view all outputs
** Output:
*** KMeans results are placed into _output/points_
*** Canopy and MeanShift results are placed into _output/clustered-points_ 
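The steps above can be sketched as a single shell script. This is an illustrative summary, not part of the page: MAHOUT_VERSION=0.3 and the data filename are example values taken from the text, and the hadoop invocations are left commented out so the sketch can be read without a running cluster; uncomment them once $HADOOP_HOME and $MAHOUT_HOME point at your installation.

```shell
# Sketch of the whole workflow, assuming MAHOUT_HOME and HADOOP_HOME are set.
MAHOUT_VERSION=0.3   # illustrative value from the page; use your release
JOB="$MAHOUT_HOME/examples/target/mahout-examples-$MAHOUT_VERSION.job"

# Pick one of the synthetic-control driver classes (kmeans shown here;
# canopy, dirichlet and meanshift follow the same naming pattern):
JOB_CLASS=org.apache.mahout.clustering.syntheticcontrol.kmeans.Job

# $HADOOP_HOME/bin/start-all.sh                          # skip for standalone Hadoop
# $HADOOP_HOME/bin/hadoop fs -put synthetic_control.data testdata
# $HADOOP_HOME/bin/hadoop jar "$JOB" "$JOB_CLASS"
# $HADOOP_HOME/bin/hadoop fs -lsr output                 # inspect the results

echo "hadoop jar $JOB $JOB_CLASS"
```

Swapping in a different algorithm only changes $JOB_CLASS; input and output directories stay the same, which is why the results must be copied out of _output_ before starting another run.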

{display-footnotes}

