Posted to dev@mahout.apache.org by Tharindu Mathew <mc...@gmail.com> on 2012/02/16 18:31:48 UTC

Running cluster dumper from trunk build

Hi,

I'm trying out the synthetic control example and noticed the cluster dumper
command located at [1] does not work.

I'd appreciate it if anyone could correct my command. It seems --seqFileDir
has been deprecated; I tried a few other combinations of options, and none
of them worked either.

[1] - https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper

Files in HDFS at output:

$ bin/hadoop fs -lsr output
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:32
/user/mackie/output/clusteredPoints
-rw-r--r--   1 mackie supergroup          0 2012-02-16 21:32
/user/mackie/output/clusteredPoints/_SUCCESS
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:31
/user/mackie/output/clusteredPoints/_logs
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:31
/user/mackie/output/clusteredPoints/_logs/history
-rw-r--r--   1 mackie supergroup       7105 2012-02-16 21:31
/user/mackie/output/clusteredPoints/_logs/history/job_201202162112_0005_1329408095893_mackie_Canopy+Driver+running+clusterData+over+input%3A+outp
-rw-r--r--   1 mackie supergroup      20634 2012-02-16 21:31
/user/mackie/output/clusteredPoints/_logs/history/job_201202162112_0005_conf.xml
-rw-r--r--   1 mackie supergroup     340891 2012-02-16 21:31
/user/mackie/output/clusteredPoints/part-m-00000
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:31
/user/mackie/output/clusters-0-final
-rw-r--r--   1 mackie supergroup          0 2012-02-16 21:31
/user/mackie/output/clusters-0-final/_SUCCESS
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:30
/user/mackie/output/clusters-0-final/_logs
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:30
/user/mackie/output/clusters-0-final/_logs/history
-rw-r--r--   1 mackie supergroup      10696 2012-02-16 21:30
/user/mackie/output/clusters-0-final/_logs/history/job_201202162112_0004_1329408047297_mackie_Canopy+Driver+running+buildClusters+over+input%3A+ou
-rw-r--r--   1 mackie supergroup      20920 2012-02-16 21:30
/user/mackie/output/clusters-0-final/_logs/history/job_201202162112_0004_conf.xml
-rw-r--r--   1 mackie supergroup       6747 2012-02-16 21:31
/user/mackie/output/clusters-0-final/part-r-00000
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:30
/user/mackie/output/data
-rw-r--r--   1 mackie supergroup          0 2012-02-16 21:30
/user/mackie/output/data/_SUCCESS
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:30
/user/mackie/output/data/_logs
drwxr-xr-x   - mackie supergroup          0 2012-02-16 21:30
/user/mackie/output/data/_logs/history
-rw-r--r--   1 mackie supergroup       7063 2012-02-16 21:30
/user/mackie/output/data/_logs/history/job_201202162112_0003_1329408010408_mackie_Input+Driver+running+over+input%3A+testdata
-rw-r--r--   1 mackie supergroup      19845 2012-02-16 21:30
/user/mackie/output/data/_logs/history/job_201202162112_0003_conf.xml
-rw-r--r--   1 mackie supergroup     335470 2012-02-16 21:30
/user/mackie/output/data/part-m-00000

Here's my output:

$ $MAHOUT_HOME/bin/mahout clusterdump --seqFileDir output/clusters-10
--pointsDir output/clusteredPoints --output
$MAHOUT_HOME/examples/output/clusteranalyze.txt
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using
HADOOP_HOME=/Users/mackie/devtools/hadoop-0.20.204.0
No HADOOP_CONF_DIR set, using /Users/mackie/devtools/hadoop-0.20.204.0/conf
MAHOUT-JOB:
/Users/mackie/source-checkouts/mahout-trunk/examples/target/mahout-examples-0.7-SNAPSHOT-job.jar
12/02/16 22:50:31 ERROR common.AbstractJob: Unexpected --seqFileDir while
processing Job-Specific Options:
usage: <command> [Generic Options] [Job-Specific Options]
Generic Options:
 -archives <paths>              comma separated archives to be unarchived
                                on the compute machines.
 -conf <configuration file>     specify an application configuration file
 -D <property=value>            use value for given property
 -files <paths>                 comma separated files to be copied to the
                                map reduce cluster
 -fs <local|namenode:port>      specify a namenode
 -jt <local|jobtracker:port>    specify a job tracker
 -libjars <paths>               comma separated jar files to include in
                                the classpath.
 -tokenCacheFile <tokensFile>   name of the file with the tokens
Unexpected --seqFileDir while processing Job-Specific Options:
Usage:
 [--input <input> --output <output> --outputFormat <outputFormat>
  --substring <substring> --numWords <numWords> --pointsDir <pointsDir>
  --samplePoints <samplePoints> --dictionary <dictionary>
  --dictionaryType <dictionaryType> --evaluate
  --distanceMeasure <distanceMeasure> --help --tempDir <tempDir>
  --startPhase <startPhase> --endPhase <endPhase>]
Job-Specific Options:
  --input (-i) input                         Path to job input directory.
  --output (-o) output                       The directory pathname for
                                             output.
  --outputFormat (-of) outputFormat          The optional output format to
                                             write the results as. Options:
                                             TEXT, CSV or GRAPH_ML
  --substring (-b) substring                 The number of chars of the
                                             asFormatString() to print
  --numWords (-n) numWords                   The number of top terms to
                                             print
  --pointsDir (-p) pointsDir                 The directory containing points
                                             sequence files mapping input
                                             vectors to their cluster. If
                                             specified, then the program will
                                             output the points associated
                                             with a cluster
  --samplePoints (-sp) samplePoints          Specifies the maximum number of
                                             points to include _per_ cluster.
                                             The default is to include all
                                             points
  --dictionary (-d) dictionary               The dictionary file
  --dictionaryType (-dt) dictionaryType      The dictionary file type
                                             (text|sequencefile)
  --evaluate (-e)                            Run ClusterEvaluator and
                                             CDbwEvaluator over the input.
                                             The output will be appended to
                                             the rest of the output at the
                                             end.
  --distanceMeasure (-dm) distanceMeasure    The classname of the
                                             DistanceMeasure. Default is
                                             SquaredEuclidean
  --help (-h)                                Print out help
  --tempDir tempDir                          Intermediate output directory
  --startPhase startPhase                    First phase to run
  --endPhase endPhase                        Last phase to run
12/02/16 22:50:31 INFO driver.MahoutDriver: Program took 308 ms (Minutes:
0.0051333333333333335)

-- 
Regards,

Tharindu

blog: http://mackiemathew.com/

Re: Running cluster dumper from trunk build

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Looks like it was just changed to -i (--input), likely for uniformity 
with other CLI operations. The documentation needs to be updated.
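
Untested, but based on the usage output above, something like this should
work against the trunk build (assuming -i/--input is a drop-in replacement
for --seqFileDir, and pointing it at the clusters-0-final directory from
your listing rather than clusters-10):

  $MAHOUT_HOME/bin/mahout clusterdump \
    --input output/clusters-0-final \
    --pointsDir output/clusteredPoints \
    --output $MAHOUT_HOME/examples/output/clusteranalyze.txt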

On 2/16/12 10:31 AM, Tharindu Mathew wrote:
> Hi,
>
> I'm trying out the synthetic control example and noticed the cluster dumper
> command located at [1] does not work.
>
> I'd appreciate it if anyone could correct my command. It seems --seqFileDir
> has been deprecated; I tried a few other combinations of options, and none
> of them worked either.
>
> [1] - https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper