Posted to dev@mahout.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2010/01/13 02:14:54 UTC
[jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
[ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved MAHOUT-237.
------------------------------
Resolution: Fixed
> Map/Reduce Implementation of Document Vectorizer
> ------------------------------------------------
>
> Key: MAHOUT-237
> URL: https://issues.apache.org/jira/browse/MAHOUT-237
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.3
> Reporter: Robin Anil
> Assignee: Robin Anil
> Fix For: 0.3
>
> Attachments: DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, DictionaryVectorizer.patch, SparseVector-VIntWritable.patch
>
>
> The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
> Ted is working on a hash-based Vectorizer which maps features into vectors of a fixed size and sums them up to get the document vector.
> This one is a pure bag-of-words Vectorizer written in Map/Reduce.
> The input is a SequenceFile<Text,Text> with key = docid, value = content.
> First: Map/Reduce over the document collection and generate the feature counts.
> Second: a sequential pass reads the output of the first job and converts it to SequenceFile<Text,LongWritable> where key = feature, value = unique id. This stage should create shards of features of a given split size.
> Third: Map/Reduce over the document collection, using each shard to create partial SparseVectors (containing only the features of that shard).
> Fourth: Map/Reduce over the partial vectors, grouping by docid, to create the full document vector.
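The four stages described above can be sketched in plain Java by collapsing each Map/Reduce job into a sequential loop. This is only an illustration of the dictionary-vectorizer logic; the class and method names below are made up for the sketch and are not Mahout's actual API.

```java
import java.util.*;

// Plain-Java sketch of the dictionary vectorizer's stages (no Hadoop):
// stage 1 counts features, stage 2 assigns ids (the dictionary),
// stages 3-4 build sparse id->frequency vectors per document.
public class DictionaryVectorizerSketch {

    // Stage 1: count feature (term) frequencies across the collection.
    static Map<String, Long> countFeatures(Map<String, String> docs) {
        Map<String, Long> counts = new HashMap<>();
        for (String content : docs.values()) {
            for (String term : content.toLowerCase().split("\\s+")) {
                counts.merge(term, 1L, Long::sum);
            }
        }
        return counts;
    }

    // Stage 2: assign each feature a unique id, in sorted term order.
    static Map<String, Integer> buildDictionary(Map<String, Long> counts) {
        Map<String, Integer> dict = new LinkedHashMap<>();
        int id = 0;
        for (String term : new TreeSet<>(counts.keySet())) {
            dict.put(term, id++);
        }
        return dict;
    }

    // Stages 3-4: map each document to a sparse id -> frequency vector.
    static Map<String, Map<Integer, Integer>> vectorize(
            Map<String, String> docs, Map<String, Integer> dict) {
        Map<String, Map<Integer, Integer>> vectors = new HashMap<>();
        for (Map.Entry<String, String> e : docs.entrySet()) {
            Map<Integer, Integer> vec = new TreeMap<>();
            for (String term : e.getValue().toLowerCase().split("\\s+")) {
                Integer id = dict.get(term);
                if (id != null) vec.merge(id, 1, Integer::sum);
            }
            vectors.put(e.getKey(), vec);
        }
        return vectors;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "the quick brown fox");
        docs.put("doc2", "the lazy dog");
        Map<String, Long> counts = countFeatures(docs);
        Map<String, Integer> dict = buildDictionary(counts);
        System.out.println(dict);
        System.out.println(vectorize(docs, dict));
    }
}
```

In the real implementation each stage is a separate Hadoop job, and sharding the dictionary in stage 2 keeps any single mapper's in-memory dictionary bounded.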
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
Re: [jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer
Posted by Drew Farris <dr...@gmail.com>.
Hi Robin,
I'm seeing some strangeness from this. I've got a directory with 100k
documents. I build a sequence file using SequenceFilesFromDirectory, which
emits 4 chunks for this particular dataset. I then dump each of the chunks
using SequenceFileDumper, but I only see 75,964 documents in the resulting dump.
I've tried with 10k files and it seems to work fine as long as all of the
documents fit into a single chunk, but once I get beyond a single chunk
it seems to lose documents. In this particular case I can fit about 24k
files per chunk using the default chunk size.
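One place worth checking is the chunk roll-over, since the losses only show up past one chunk. Here's a plain-Java sketch of that pattern and a count check; the names and the byte threshold are hypothetical, this is not the Mahout code.

```java
import java.util.*;

// Sketch of the chunk roll-over pattern: start a new chunk once the current
// one reaches a size threshold. Every document is appended exactly once, so
// the per-chunk counts must sum to the input count; if a dump shows fewer,
// documents are being dropped somewhere around a chunk boundary.
public class ChunkRollOverSketch {
    static final int CHUNK_SIZE_BYTES = 64; // tiny threshold for the demo

    static List<List<String>> chunk(List<String> docs) {
        List<List<String>> chunks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        int bytes = 0;
        for (String doc : docs) {
            if (bytes >= CHUNK_SIZE_BYTES && !current.isEmpty()) {
                chunks.add(current);          // roll over to a new chunk
                current = new ArrayList<>();
                bytes = 0;
            }
            current.add(doc);
            bytes += doc.length();
        }
        if (!current.isEmpty()) chunks.add(current); // flush the last chunk
        return chunks;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 100; i++) docs.add("document-" + i + " body text");
        List<List<String>> chunks = chunk(docs);
        int total = chunks.stream().mapToInt(List::size).sum();
        System.out.println(chunks.size() + " chunks, " + total + " documents");
    }
}
```

Forgetting that final flush, or resetting the buffer before writing it, would silently drop everything after (or in) the last partial chunk, which is the kind of symptom I'm seeing.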
The commands I'm using are:
To create the sequence file:
mvn -e exec:java
-Dexec.mainClass=org.apache.mahout.text.SequenceFilesFromDirectory
-Dexec.args="--parent /u01/test0-10k --outputDir /u01/test0-10k-seq
--keyPrefix test-10k --charset UTF-8"
Then for each chunk:
mvn exec:java -Dexec.mainClass=org.apache.mahout.utils.SequenceFileDumper
-Dexec.args="-s /u01/test0-10k-seq/chunk-0 -o
/u01/test0-10k-dump/chunk-0.dump"
Any ideas? If I find anything in particular I'll follow up.
Drew
(Thanks for the commit Sean)
On Tue, Jan 12, 2010 at 8:14 PM, Sean Owen (JIRA) <ji...@apache.org> wrote:
>
> [
> https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Sean Owen resolved MAHOUT-237.
> ------------------------------
>
> Resolution: Fixed
>
> > Map/Reduce Implementation of Document Vectorizer
> > ------------------------------------------------
> >
> > Key: MAHOUT-237
> > URL: https://issues.apache.org/jira/browse/MAHOUT-237
> > Project: Mahout
> > Issue Type: New Feature
> > Affects Versions: 0.3
> > Reporter: Robin Anil
> > Assignee: Robin Anil
> > Fix For: 0.3
> >
> > Attachments: DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> DictionaryVectorizer.patch, DictionaryVectorizer.patch,
> SparseVector-VIntWritable.patch
> >
> >
> > Current Vectorizer uses Lucene Index to convert documents into
> SparseVectors
> > Ted is working on a Hash based Vectorizer which can map features into
> Vectors of fixed size and sum it up to get the document Vector
> > This is a pure bag-of-words based Vectorizer written in Map/Reduce.
> > The input document is in SequenceFile<Text,Text> . with key = docid,
> value = content
> > First Map/Reduce over the document collection and generate the feature
> counts.
> > Second Sequential pass reads the output of the map/reduce and converts
> them to SequenceFile<Text, LongWritable> where key=feature, value = unique
> id
> > Second stage should create shards of features of a given split size
> > Third Map/Reduce over the document collection, using each shard and
> create Partial(containing the features of the given shard) SparseVectors
> > Fourth Map/Reduce over partial shard, group by docid, create full
> document Vector
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>