Posted to commits@mahout.apache.org by co...@apache.org on 2011/09/12 04:45:00 UTC

[CONF] Apache Mahout > Import Export Sequence File Formats

Space: Apache Mahout (https://cwiki.apache.org/confluence/display/MAHOUT)
Page: Import Export Sequence File Formats (https://cwiki.apache.org/confluence/display/MAHOUT/Import+Export+Sequence+File+Formats)

Added by Lance Norskog:
---------------------------------------------------------------------
h5. Status
This is a talk page.
h1. Scope of Project
There are different kinds of import/export problems. One class of problem is defining the set of SequenceFile formats that a "Mahout Job" will import and export. This page is limited to the SequenceFile problem.
h1. Use Cases
h3. Lucene "Bag-of-words" vector
This is a NamedVector file containing a String key and a sparse-encoded vector. There may be an external dictionary defining documents and/or terms.
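To make the record layout concrete, here is a minimal sketch of a named, sparse-encoded bag-of-words vector built with an external term dictionary. It uses plain JDK collections only. The class `NamedSparseVector` and the method `fromTokens` are hypothetical names for illustration, not Mahout's NamedVector API, and the SequenceFile serialization itself is not shown.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: a stand-in for a named sparse vector, not a Mahout class.
final class NamedSparseVector {
    final String name;                       // String key, for example a document id
    final TreeMap<Integer, Double> entries;  // term index -> weight; zero entries are not stored

    NamedSparseVector(String name, TreeMap<Integer, Double> entries) {
        this.name = name;
        this.entries = entries;
    }

    // Builds a term-count vector for one document.
    // The dictionary maps each term to its vector index; unknown tokens are skipped.
    static NamedSparseVector fromTokens(String name, List<String> tokens,
                                        Map<String, Integer> dictionary) {
        TreeMap<Integer, Double> entries = new TreeMap<>();
        for (String token : tokens) {
            Integer index = dictionary.get(token);
            if (index != null) {
                entries.merge(index, 1.0, Double::sum);
            }
        }
        return new NamedSparseVector(name, entries);
    }

    public static void main(String[] args) {
        Map<String, Integer> dictionary = Map.of("mahout", 0, "vector", 1, "lucene", 2);
        NamedSparseVector v = fromTokens("doc-1",
                List.of("lucene", "vector", "vector", "unknown"), dictionary);
        System.out.println(v.name + " " + v.entries);
    }
}
```

Only the non-zero term indexes are kept, which is the point of a sparse encoding for documents that use a small fraction of the dictionary.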
h5. Import
The various Bayes text classification jobs, such as the Wikipedia example, import Lucene bag-of-words Vector files.
h5. Export
Feature vectors derived from text vectors are useful in text-oriented machine learning research. An example:
* Compare a feature vector to all of the original text vectors. This searches for "exemplar" documents that most comprehensively match the given feature. Several papers discuss this approach for creating document abstracts from sentence vectors.
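The exemplar search can be sketched as a similarity scan over the original text vectors. The sketch below assumes cosine similarity as the matching measure, which this page does not specify, and represents sparse vectors as plain `Map<Integer, Double>` instances. `ExemplarSearch` and its methods are hypothetical names, not Mahout code.

```java
import java.util.Map;

// Illustration only: finds the document vector closest to a feature vector.
final class ExemplarSearch {
    // Dot product of two sparse vectors; indexes missing from either side contribute zero.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Cosine similarity; defined as 0.0 here when either vector has zero length.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double norm = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return norm == 0.0 ? 0.0 : dot(a, b) / norm;
    }

    // Returns the name of the best-matching document, or null if there are no documents.
    static String bestExemplar(Map<Integer, Double> feature,
                               Map<String, Map<Integer, Double>> documents) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Map<Integer, Double>> doc : documents.entrySet()) {
            double score = cosine(feature, doc.getValue());
            if (score > bestScore) {
                bestScore = score;
                best = doc.getKey();
            }
        }
        return best;
    }
}
```

A full job would read the feature vector and the text vectors from their SequenceFiles and would likely return the top few matches instead of a single one.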
h3. Confusion Matrix
A classification job creates, among other things, a Confusion Matrix. The current example jobs log a text version of the confusion matrix.
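For readers unfamiliar with the structure, the following sketch shows how a confusion matrix accumulates (actual, predicted) label pairs and how a text version might be rendered for a log. `ConfusionMatrixSketch` is a hypothetical stand-in written against the plain JDK; it is not the class the Mahout jobs use, and the text layout is an assumption.

```java
import java.util.List;

// Illustration only: counts[actual][predicted] for a fixed list of labels.
final class ConfusionMatrixSketch {
    final List<String> labels;
    final int[][] counts;

    ConfusionMatrixSketch(List<String> labels) {
        this.labels = labels;
        this.counts = new int[labels.size()][labels.size()];
    }

    // Records one classified instance; both labels must be in the label list.
    void add(String actual, String predicted) {
        int row = labels.indexOf(actual);
        int col = labels.indexOf(predicted);
        if (row < 0 || col < 0) {
            throw new IllegalArgumentException("unknown label");
        }
        counts[row][col]++;
    }

    // Renders one line per actual label: the label, then the count for each predicted label.
    String toText() {
        StringBuilder sb = new StringBuilder();
        for (int row = 0; row < labels.size(); row++) {
            sb.append(labels.get(row));
            for (int col = 0; col < labels.size(); col++) {
                sb.append('\t').append(counts[row][col]);
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```

A text rendering like this is readable in a log but awkward to parse back, which is one motivation for a defined SequenceFile format.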
h5. Import
Importing confusion matrices saved from different classification runs lets you compare them and evaluate tuning knobs for a classifier.
h5. Export
Exporting the confusion matrix from each classification run, rather than only logging a text version, makes it possible to compare runs and evaluate tuning knobs for a classifier.
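As a sketch of what such a comparison could look like once the matrices are available as data, the code below ranks runs by overall accuracy. Accuracy is only one possible measure and is an assumption here; the page does not say which metric to use. `RunComparison` and the run names in the usage note are hypothetical.

```java
import java.util.Map;

// Illustration only: compares classification runs by the accuracy of their confusion matrices.
final class RunComparison {
    // Accuracy = correctly classified instances (the diagonal) / all instances.
    // The matrix is assumed square, indexed as counts[actual][predicted].
    static double accuracy(int[][] counts) {
        long correct = 0;
        long total = 0;
        for (int row = 0; row < counts.length; row++) {
            for (int col = 0; col < counts[row].length; col++) {
                total += counts[row][col];
                if (row == col) {
                    correct += counts[row][col];
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }

    // Returns the name of the run with the highest accuracy, or null if there are no runs.
    static String bestRun(Map<String, int[][]> runs) {
        String best = null;
        double bestAccuracy = -1.0;
        for (Map.Entry<String, int[][]> run : runs.entrySet()) {
            double acc = accuracy(run.getValue());
            if (acc > bestAccuracy) {
                bestAccuracy = acc;
                best = run.getKey();
            }
        }
        return best;
    }
}
```

For example, calling `bestRun` on a map holding the matrices of two runs, each made with a different setting of one tuning knob, returns the name of the run whose setting classified more instances correctly.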




