Posted to dev@mahout.apache.org by Drew Farris <dr...@gmail.com> on 2010/01/13 03:38:07 UTC

Re: [jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Hi Robin,

I'm seeing some strangeness from this. I've got a directory with 100k
documents. I build a sequence file using SequenceFilesFromDirectory, which
emits 4 chunks for this particular dataset, and then dump each of the
chunks using SequenceFileDumper, but I only see 75,964 documents in the
resulting dump. I've tried with 10k files and it works fine as long as all
of the documents fit into a single chunk, but once I get beyond a single
chunk it seems to lose documents. In this particular case I can fit about
24k files per chunk using the default chunk size.

The commands I'm using are:

To create the sequence file:
mvn -e exec:java
-Dexec.mainClass=org.apache.mahout.text.SequenceFilesFromDirectory
-Dexec.args="--parent /u01/test0-10k --outputDir /u01/test0-10k-seq
--keyPrefix test-10k --charset UTF-8"

Then for each chunk:
mvn exec:java -Dexec.mainClass=org.apache.mahout.utils.SequenceFileDumper
-Dexec.args="-s /u01/test0-10k-seq/chunk-0 -o
/u01/test0-10k-dump/chunk-0.dump"

Any ideas? If I find anything in particular I'll follow up.

Drew

(Thanks for the commit, Sean.)

On Tue, Jan 12, 2010 at 8:14 PM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Sean Owen resolved MAHOUT-237.
> ------------------------------
>
>    Resolution: Fixed
>
> > Map/Reduce Implementation of Document Vectorizer
> > ------------------------------------------------
> >
> >                 Key: MAHOUT-237
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
> >             Project: Mahout
> >          Issue Type: New Feature
> >    Affects Versions: 0.3
> >            Reporter: Robin Anil
> >            Assignee: Robin Anil
> >             Fix For: 0.3
> >
> >         Attachments: DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> SparseVector-VIntWritable.patch
> >
> >
> > The current Vectorizer uses a Lucene index to convert documents into
> > SparseVectors.
> > Ted is working on a hash-based Vectorizer which can map features into
> > vectors of fixed size and sum them up to get the document vector.
> > This is a pure bag-of-words Vectorizer written in Map/Reduce.
> > The input documents are in a SequenceFile<Text,Text>, with key = docid,
> > value = content.
> > First: Map/Reduce over the document collection and generate the
> > feature counts.
> > Second: a sequential pass reads the output of that map/reduce and
> > converts it to SequenceFile<Text, LongWritable> where key = feature,
> > value = unique id. This second stage should create shards of features
> > of a given split size.
> > Third: Map/Reduce over the document collection, using each shard, and
> > create partial SparseVectors (containing only the features of the
> > given shard).
> > Fourth: Map/Reduce over the partial shards, group by docid, and create
> > the full document vector.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
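
For reference, here is a minimal, hand-rolled example of the
SequenceFile<Text,Text> input the vectorizer described above expects
(key = docid, value = content). The path and docids below are made up;
in practice SequenceFilesFromDirectory produces the equivalent, split
into chunks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class WriteVectorizerInput {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path out = new Path("/u01/vectorizer-input/chunk-0"); // hypothetical path
    SequenceFile.Writer writer =
        new SequenceFile.Writer(fs, conf, out, Text.class, Text.class);
    try {
      // key = docid, value = raw document text, as the issue describes
      writer.append(new Text("/doc-00001"), new Text("the quick brown fox"));
      writer.append(new Text("/doc-00002"), new Text("jumps over the lazy dog"));
    } finally {
      writer.close();
    }
  }
}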