Posted to commits@mahout.apache.org by co...@apache.org on 2011/10/02 15:17:00 UTC

[CONF] Apache Mahout > Creating Vectors from Weka's ARFF Format

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Creating Vectors from Weka's ARFF Format (https://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Weka%27s+ARFF+Format)


Edited by Joe Prasanna Kumar:
---------------------------------------------------------------------
h1. Introduction

Mahout can convert files in Weka's [ARFF|http://www.cs.waikato.ac.nz/~ml/weka/arff.html] (2.1) format to Mahout's Vector format.
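
For readers who have not seen the format, the sketch below writes a very small ARFF file that can be used to try the converter. The directory ./arff-sample, the file name sample.arff, and the relation, attributes, and values are all made up for illustration. It only shows the general ARFF layout (a header of @relation and @attribute declarations followed by @data rows) and has not been checked against the converter itself.

```shell
# Write a tiny, made-up ARFF file for experimenting with the converter.
# The directory, file name, relation, attributes, and values are illustrative only.
mkdir -p ./arff-sample
cat > ./arff-sample/sample.arff <<'EOF'
% Lines starting with % are comments.
@relation weather-sample

@attribute temperature numeric
@attribute humidity numeric
@attribute outlook {sunny,overcast,rainy}

@data
85,85,sunny
83,86,overcast
70,96,rainy
EOF
```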

h1. Running the Converter

ARFF files are converted with the org.apache.mahout.utils.arff.Driver program. To list its input arguments, run it with the \--help argument, which produces output similar to:
{noformat}
Usage:
 [--input <input> --output <output> --max <max> --help --dictOut <dictOut>
--outputWriter <outputWriter> --delimiter <delimiter>]
Options
  --input (-d) input                  The file or directory containing the ARFF
                                      files.  If it is a directory, all .arff
                                      files will be converted. (Mandatory parameter)
  --output (-o) output                The output directory.  Files will have
                                      the same name as the input, but with the
                                      extension .mvc (Mandatory parameter)
  --max (-m) max                      The maximum number of vectors to output.
                                      If not specified, then it will loop over
                                      all docs (Optional parameter)
  --help (-h)                         Print out help (Optional parameter)
  --dictOut (-t) dictOut              The file to output the label bindings
                                      (Mandatory parameter)
  --outputWriter (-e) outputWriter    The VectorWriter to use, either seq
                                      (SequenceFileVectorWriter - default) or
                                      file (Writes to a File using JSON format)
                                      (Optional parameter)
  --delimiter (-l) delimiter          The delimiter for outputting the
                                      dictionary (Optional parameter)

{noformat}

Each parameter can be given in its long form, such as \--input, or by the equivalent short name, such as \-d. From here, running the Driver is as simple as pointing it at the ARFF file or directory and supplying the mandatory output directory and dictionary file:
{noformat}
$MAHOUT_HOME/bin/mahout arff.vector -d ./content/reuters-modapte/ \
      -t ./content/reuters-modapte/output/dict.txt -o ./content/reuters-modapte/output/convert
{noformat}
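
The same invocation can be written with the long option names from the help output. The sketch below wraps it in a small shell function. The function name arff_to_vectors and the ARFF_RUN variable are hypothetical helpers invented for this example and are not part of Mahout. By default the function only prints the command it would run, so it can be tried without a Mahout installation.

```shell
# Hypothetical helper (not part of Mahout): builds the long-form equivalent
# of the command above. Arguments: input, output dir, dictionary file, and an
# optional maximum number of vectors.
arff_to_vectors() {
  input=$1; output=$2; dict=$3; max=$4

  # Mandatory parameters, using the long option names from the help output.
  set -- "${MAHOUT_HOME:-.}/bin/mahout" arff.vector \
    --input "$input" --output "$output" --dictOut "$dict"

  # --max is optional; add it only when a limit was given.
  if [ -n "$max" ]; then
    set -- "$@" --max "$max"
  fi

  # Dry run by default: print the command. Set ARFF_RUN=1 to execute it.
  if [ "${ARFF_RUN:-0}" = 1 ]; then
    "$@"
  else
    echo "$@"
  fi
}

arff_to_vectors ./content/reuters-modapte/ \
  ./content/reuters-modapte/output/convert \
  ./content/reuters-modapte/output/dict.txt 100
```

Leaving off the fourth argument omits \--max, in which case the help text says the converter loops over all documents.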
