Posted to dev@mahout.apache.org by "Robin Anil (JIRA)" <ji...@apache.org> on 2010/01/05 03:46:55 UTC

[jira] Created: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Map/Reduce Implementation of Document Vectorizer
------------------------------------------------

                 Key: MAHOUT-237
                 URL: https://issues.apache.org/jira/browse/MAHOUT-237
             Project: Mahout
          Issue Type: New Feature
    Affects Versions: 0.3
            Reporter: Robin Anil
            Assignee: Robin Anil
             Fix For: 0.3


The current Vectorizer uses a Lucene index to convert documents into SparseVectors.
Ted is working on a hash-based Vectorizer which maps features into Vectors of fixed size and sums them up to get the document Vector.
This is a pure bag-of-words Vectorizer written in Map/Reduce.

The input documents are in a SequenceFile<Text,Text>, with key = docid and value = content.
First, Map/Reduce over the document collection and generate the feature counts (a sketch of this pass follows below).
Second, a sequential pass reads the output of that Map/Reduce and converts it to a SequenceFile<Text, LongWritable> where key = feature and value = unique id.
    This second stage should create shards of features of a given split size.
Third, Map/Reduce over the document collection using each shard and create partial SparseVectors (each containing only the features of the given shard).
Fourth, Map/Reduce over the partial vectors, group by docid, and create the full document Vector.
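
For illustration, a minimal sketch of what the first pass could look like. This is not the attached patch: the class names are hypothetical, Hadoop's org.apache.hadoop.mapreduce API is assumed, and naive whitespace tokenization stands in for a real Lucene Analyzer.

{code}
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class FeatureCountSketch {

  // Pass 1 mapper: emit (feature, 1) for every token occurrence in a document.
  public static class TermMapper extends Mapper<Text,Text,Text,LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(Text docId, Text content, Context ctx)
        throws IOException, InterruptedException {
      StringTokenizer tok = new StringTokenizer(content.toString());
      while (tok.hasMoreTokens()) {
        term.set(tok.nextToken());
        ctx.write(term, ONE);
      }
    }
  }

  // Pass 1 reducer: sum the ones, yielding key = feature, value = corpus count.
  public static class SumReducer extends Reducer<Text,LongWritable,Text,LongWritable> {
    @Override
    protected void reduce(Text term, Iterable<LongWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable count : counts) {
        sum += count.get();
      }
      ctx.write(term, new LongWritable(sum));
    }
  }
}
{code}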

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by Drew Farris <dr...@gmail.com>.
Hi Robin,

I'm seeing some strangeness from this. I've got a directory with 100k
documents. I build a sequence file using SequenceFilesFromDirectory, which
emits 4 chunks for this particular dataset. I then dump each of the chunks
using SequenceFileDumper, but I only see 75,964 documents in the resulting dump.
I've tried with 10k files and it seems to work fine as long as all of the
documents can fit into a single chunk, but once I get beyond a single chunk
it seems to lose documents. In this particular case I can fit about 24k
files per chunk using the default chunk size.

The commands I'm using are:

To create the sequence file:
mvn -e exec:java
-Dexec.mainClass=org.apache.mahout.text.SequenceFilesFromDirectory
-Dexec.args="--parent /u01/test0-10k --outputDir /u01/test0-10k-seq
--keyPrefix test-10k --charset UTF-8"

Then for each chunk:
mvn exec:java -Dexec.mainClass=org.apache.mahout.utils.SequenceFileDumper
-Dexec.args="-s /u01/test0-10k-seq/chunk-0 -o
/u01/test0-10k-dump/chunk-0.dump"

Any ideas? If I find anything in particular I'll follow up.

Drew

(Thanks for the commit Sean)

On Tue, Jan 12, 2010 at 8:14 PM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>     [
> https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel]
>
> Sean Owen resolved MAHOUT-237.
> ------------------------------
>
>    Resolution: Fixed
>

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by Ted Dunning <te...@gmail.com>.
That was my first thought as well.

But I think a better answer is to mark the vector as stretchy so that it
reports the high water size as the actual size, but if you insert a non-zero
above that size, it will report the new high water mark thereafter.

This makes the code simple and clear.  The only change needed is to soften
the out of bounds checks for put.
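
A toy illustration of that "stretchy" behavior (a hypothetical standalone class, not Mahout's Vector API):

{code}
import java.util.HashMap;
import java.util.Map;

// Sketch of the "stretchy" semantics: size() reports the high-water mark,
// and an out-of-bounds put stretches the mark instead of throwing.
class StretchySparseVector {
  private final Map<Integer,Double> values = new HashMap<Integer,Double>();
  private int size = 0; // high-water mark

  int size() { return size; }

  double get(int index) {
    Double v = values.get(index);
    return v == null ? 0.0 : v.doubleValue();
  }

  void set(int index, double value) {
    if (index < 0) {
      throw new IndexOutOfBoundsException("index: " + index);
    }
    if (value != 0.0 && index >= size) {
      size = index + 1; // report the new high-water mark thereafter
    }
    values.put(index, Double.valueOf(value));
  }
}
{code}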

On Tue, Feb 9, 2010 at 5:57 AM, Sean Owen (JIRA) <ji...@apache.org> wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831457#action_12831457]
>
> Sean Owen commented on MAHOUT-237:
> ----------------------------------
>
> Sounds like what you really need (and what I could use) is something like
> getHighestNonZeroIndex() ?
>


-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by Ted Dunning <te...@gmail.com>.
:-)

On Tue, Feb 2, 2010 at 1:13 PM, Jake Mannix <ja...@gmail.com> wrote:

> You volunteering to port to avro, Ted?  Awesome! :)
>



-- 
Ted Dunning, CTO
DeepDyve

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by Drew Farris <dr...@gmail.com>.
I'm going to get back to it eventually, honest!

On Tue, Feb 2, 2010 at 4:13 PM, Jake Mannix <ja...@gmail.com> wrote:
> You volunteering to port to avro, Ted?  Awesome! :)
>
>  -jake

Re: [jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by Jake Mannix <ja...@gmail.com>.
You volunteering to port to avro, Ted?  Awesome! :)

  -jake

On Feb 2, 2010 1:10 PM, "Ted Dunning (JIRA)" <ji...@apache.org> wrote:


   [
https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828763#action_12828763]

Ted Dunning commented on MAHOUT-237:
------------------------------------

{quote}
Seems like the Text field Vector class name (i.e. RandomAccessSparseVector
etc.) is taking most of the space in the sequence file. Can't we compact
it (with an integer id and a factory)?
{quote}

What about switching to Avro to avoid this?


[jira] Resolved: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-237.
------------------------------

    Resolution: Fixed



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Status: Patch Available  (was: Reopened)

Working implementation of DictionaryVectorizer with tf and tfidf weighting and normalization.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831410#action_12831410 ] 

Sean Owen commented on MAHOUT-237:
----------------------------------

I dunno, I think of it as exactly that flag, doesn't seem bad to me. How about defining a constant "INFINITE_DIMENSION" that has this value?



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831396#action_12831396 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

{code}
    RandomAccessSparseVector vector =
        new RandomAccessSparseVector(key.toString(), Integer.MAX_VALUE,
            valueString.length() / 5); // guess at initial size
{code}

This whole Integer.MAX_VALUE thing is killing me whenever I try to move back and forth between sparse and dense vectors (which is necessary for performance in the DistributedLanczos I'm working on).  Ugh.  

We really need to have a vector flag which says "I'm infinite dimensional, I just return 0 whenever you ask me about dimensions I don't know about", so we don't have to have this hack of Integer.MAX_VALUE as the dimension.  I've suggested it to people myself, but it's such a baaaaad hack.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831428#action_12831428 ] 

Robin Anil commented on MAHOUT-237:
-----------------------------------

You just needed the count? You could always map/reduce it and store it. The TFIDF method is very sneaky (keeping the count as a static key) and not very extensible.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799625#action_12799625 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

Looking at this a little: 

is there a reason why the termFrequency map needs to exist?  

if instead of:

{code}

      SparseVector vector;
      Map<String,MutableInt> termFrequency = new HashMap<String,MutableInt>();

      token = new Token();
      ts.reset();
      while ((token = ts.next(token)) != null) {
        String tk = new String(token.termBuffer(), 0, token.termLength());
        if(dictionary.containsKey(tk) == false) continue;
        if (termFrequency.containsKey(tk) == false) {
          count += tk.length() + 1;
          termFrequency.put(tk, new MutableInt(0));
        }
        termFrequency.get(tk).increment();
      }

      vector =
          new SparseVector(key.toString(), Integer.MAX_VALUE, termFrequency.size());

      for (Map.Entry<String,MutableInt> pair : termFrequency.entrySet()) {
        String tk = pair.getKey();
        if (dictionary.containsKey(tk) == false) continue;
        vector.setQuick(dictionary.get(tk).intValue(), pair.getValue()
            .doubleValue());
      }
{code}
 
we instead just built it up on the vector itself:

{code}
      String valueStr = value.toString();
      vector =
          new SparseVector(key.toString(), Integer.MAX_VALUE, valueStr.length() / 5); // guess at initial size

      token = new Token();
      ts.reset();
      while ((token = ts.next(token)) != null) {
        String tk = new String(token.termBuffer(), 0, token.termLength());
        if(dictionary.containsKey(tk) == false) continue;
        int tokenKey = dictionary.get(tk);
        vector.setQuick(tokenKey, vector.getQuick(tokenKey) + 1);
      }
{code}

At least when I micro-benchmark this, it's about 10% faster this way.  Not much, but it's also simpler code.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831681#action_12831681 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

Yes, this is actually what I was suggesting as well.  



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831680#action_12831680 ] 

Sean Owen commented on MAHOUT-237:
----------------------------------

PS I think Ted's suggestion is that we need 'stretchable' and non-stretchable versions of the implementations, perhaps some boolean flag that causes a vector to expand versus error. A 'stretchable' vector is exactly what I need, actually.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831432#action_12831432 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

Yeah, well, I need the count, and I'm also modifying the vectorizer and the tfidf version to a) create vectors of the proper dimension (after doing createDictionaryChunks(), we know what the dimension of the output vectors is bound to be), and b) take a "--sequentialAccessOutput" cmdline flag to allow for the possibility of (in the final reducer) converting the vectors from their mutation-friendly RandomAccessSparseVector form and sealing them up in the not-very-mutation-friendly, but zippily faster for some tasks, SequentialAccessSparseVector form (a sketch of that conversion follows below).

I'll put up a patch with all my DistributedLanczosSolver stuff, because that's where this is needed (plus it helps anyone who wants finite dimensional vectors, sometimes in the SequentialAccess-optimized form [like the clusterers]).
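
A rough sketch of that sealing step, assuming Mahout's SequentialAccessSparseVector(Vector) copy constructor (the helper itself is made up):

{code}
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;

// In the final reducer: build in the mutation-friendly random-access form,
// then optionally seal into the sequential-access form before writing.
static Vector sealForOutput(Vector randomAccess, boolean sequentialAccessOutput) {
  return sequentialAccessOutput
      ? new SequentialAccessSparseVector(randomAccess)
      : randomAccess;
}
{code}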



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Isabel Drost (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12803275#action_12803275 ] 

Isabel Drost commented on MAHOUT-237:
-------------------------------------

Hmm, Robin, your last comment is "ok. done", but the issue is still open?



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831457#action_12831457 ] 

Sean Owen commented on MAHOUT-237:
----------------------------------

Sounds like what you really need (and what I could use) is something like getHighestNonZeroIndex() ?



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: DictionaryVectorizer.patch

Uses a StringReader. Removes unused imports and unused variables, and adds license headers.
Still bzip-decompressing the Wikipedia dump. Wish there were a map/reduce for that.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799508#action_12799508 ] 

Sean Owen commented on MAHOUT-237:
----------------------------------

I'll commit -- still seeing some code inspection warnings but we can look at it later.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831413#action_12831413 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

I think of it as that flag as well, but when doing decompositions of matrices, you will have a matrix with a bunch (N = numRows == 10^{8+}) of sparse vectors, each of which has some dimension (M = numCols).  Your goal is to find a matrix which has some (k = desiredRank = 100s) *dense* vectors, each of which has cardinality M.  If M is Integer.MAX_VALUE, you need a couple TB of RAM to even construct this final product.  You most certainly do not want eigenvectors represented as sparse vectors, because that is a) horribly inefficient storage for them (they're dense) and b) horribly inefficient CPU-wise (ditto).

We could certainly *use* Vector.size() == Integer.MAX_VALUE as an effective "flag" that it's unbounded, but the important thing is what we do with that information: when you create a DenseVector of this size, the key would be to *not* initialize the array this large, but instead initialize it to some small value and set an internal flag saying that whenever someone does a get/getQuick outside of range, return 0, and if someone does a set/setQuick outside of range, the vector automagically resizes internally to that size and then sets.  So it becomes a "grow as needed" dense vector of infinite dimension.
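
A toy illustration of that "grow as needed" behavior (a hypothetical standalone class, not Mahout's DenseVector):

{code}
import java.util.Arrays;

// Sketch: a dense vector that reads 0 beyond its backing array and grows
// the array lazily on out-of-range writes.
class GrowableDenseVector {
  private double[] values = new double[16]; // start small despite the "infinite" dimension

  double get(int index) {
    return index < values.length ? values[index] : 0.0; // out of range reads as 0
  }

  void set(int index, double value) {
    if (index >= values.length) {
      // Amortized doubling: grow to at least index + 1.
      values = Arrays.copyOf(values, Math.max(index + 1, values.length * 2));
    }
    values[index] = value;
  }
}
{code}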



[jira] Reopened: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil reopened MAHOUT-237:
-------------------------------


Reopening this to allow further review.



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: SparseVector-VIntWritable.patch
                DictionaryVectorizer.patch

Working patch. 20newsgroups takes about a minute to convert to vectors with a single mapper/reducer. Will test with larger collections and see.



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: MAHOUT-237-tfidf.patch

4 main entry points:
DocumentProcessor - converts SequenceFile => StringTuple (later to be replaced by a StructuredDocumentWritable backed by AvroWritable)
DictionaryVectorizer - converts a StringTuple of documents => tf Vector
PartialVectorMerger - merges partial vectors based on their doc id, with optional normalization (used by both DictionaryVectorizer (no normalization) and TfidfConverter (optional normalization))
TfidfConverter - converts a tf vector to a tfidf vector, with optional normalization

An example which uses all of them:
hadoop jar examples/target/mahout-examples-0.3-SNAPSHOT.job org.apache.mahout.text.SparseVectorsFromSequenceFiles -i reuters-seqfiles -o reuters-vectors -w (tfidf|tf) --norm 2 (--norm works only when tfidf is enabled, not with tf)



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: DictionaryVectorizer.patch

Example code which converts an input directory to sequence files recursively, assigning each docid as the relative path from the parent directory along with a prefix.
The code also chunks the sequence files as specified by the chunk size.

The dictionary vectorizer also chunks the dictionary file and runs multiple map/reduces over the chunks to create partial vectors, which then get summed to create the final document vector (a sketch of that merge pass follows below).
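
A minimal sketch of what that merge pass could look like (the class name is hypothetical; Mahout's VectorWritable and Hadoop's org.apache.hadoop.mapreduce API are assumed):

{code}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Partial vectors arriving under the same docid are summed into the
// full document vector.
public class PartialVectorMergeSketch
    extends Reducer<Text,VectorWritable,Text,VectorWritable> {

  @Override
  protected void reduce(Text docId, Iterable<VectorWritable> partials, Context ctx)
      throws IOException, InterruptedException {
    Vector merged = null;
    for (VectorWritable partial : partials) {
      // Copy the first partial so we never hold on to Hadoop's reused instance;
      // plus() returns a fresh vector on every later iteration.
      merged = merged == null
          ? new RandomAccessSparseVector(partial.get())
          : merged.plus(partial.get());
    }
    if (merged != null) {
      ctx.write(docId, new VectorWritable(merged));
    }
  }
}
{code}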




[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: MAHOUT-237-tfidf.patch

Added an IDF job which takes a sequence file of doc-id => Vector and calculates tf-idf using the TFIDF class (internally it uses Lucene's DefaultSimilarity class, not yet modifiable).

Has similar options to the Lucene driver (minDf, maxDfPercent).

A purely Map/Reduce solution: it chunks the document-frequency sequence file and does multiple map/reduces over the input vectors, as specified by the chunk size.

Seems like the Text field Vector class name (i.e. RandomAccessSparseVector etc.) is taking most of the space in the sequence file. Can't we compact it (with an integer id and a factory)?
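
For reference, the per-element weighting this computes would look roughly like the sketch below, mirroring Lucene's DefaultSimilarity (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))); the helper name is made up:

{code}
// Per-element tf-idf weight in the DefaultSimilarity style.
static double tfIdf(double termFreqInDoc, long docFreq, long numDocs) {
  double tf = Math.sqrt(termFreqInDoc);
  double idf = 1.0 + Math.log((double) numDocs / (double) (docFreq + 1));
  return tf * idf;
}
{code}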





[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799557#action_12799557 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

Given the following code in PartialVectorGenerator:

{code}
for (Entry<String,MutableInt> pair : termFrequency.entrySet()) {
  String tk = pair.getKey();
  if (dictionary.containsKey(tk) == false) continue; // skips terms missing from the dictionary shard
  vector.setQuick(dictionary.get(tk).intValue(), pair.getValue().doubleValue());
}
assert (vector.getNumNondefaultElements() == termFrequency.size());
{code}

Why is this assert expected to pass?  If dictionary.containsKey(tk) ever returns false, this will fail...
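
(To make that concrete with a made-up example: if termFrequency holds two entries but the dictionary shard contains only one of them, the loop sets a single element, so vector.getNumNondefaultElements() is 1 while termFrequency.size() is 2, and the assert trips.)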



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: DictionaryVectorizer.patch

Single-shard-based document vectorizer. Needs some tidying up (work in progress).

TODO: Split the output of the first Map/Reduce into shards of a given chunk size (see the sketch below)
TODO: Map/Reduce job for the merge phase
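
For the first TODO, a rough sketch of what the splitting could look like, assuming fs, conf, dictionaryPath and outputDir are in scope; the chunking policy, paths and size estimate are illustrative, not the patch's code:

{code}
// Roll the feature => id sequence file into dictionary chunks of bounded size.
Text key = new Text();
LongWritable value = new LongWritable();
SequenceFile.Reader reader = new SequenceFile.Reader(fs, dictionaryPath, conf);
long chunkSizeLimit = 64L * 1024 * 1024; // e.g. 64MB per shard
long currentChunkSize = 0;
int chunkIndex = 0;
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
    new Path(outputDir, "dictionary-chunk-" + chunkIndex), Text.class, LongWritable.class);
while (reader.next(key, value)) {
  if (currentChunkSize > chunkSizeLimit) { // start a new shard
    writer.close();
    currentChunkSize = 0;
    writer = SequenceFile.createWriter(fs, conf,
        new Path(outputDir, "dictionary-chunk-" + (++chunkIndex)), Text.class, LongWritable.class);
  }
  writer.append(key, value);
  currentChunkSize += key.getLength() + 8; // feature bytes plus one long for the id
}
writer.close();
reader.close();
{code}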



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12831420#action_12831420 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

I do notice that the TFIDFConverter, recently added to this set of classes, does keep track of featureCount; that actually solves most of my issue in this particular case, I think. Looking into it further.



[jira] Updated: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-237:
------------------------------

    Attachment: DictionaryVectorizer.patch

Some tidying up. The large-output bug still remains.



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Jake Mannix (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799560#action_12799560 ] 

Jake Mannix commented on MAHOUT-237:
------------------------------------

It appears that there is just a line missing above, where the termFrequency map is created:

{code}
if (dictionary.containsKey(tk) == false) continue;
{code}

needs to be inserted so that the map and the vector end up with the same sizes.
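
Putting that guard where the map is built would look roughly like this; a sketch assuming the names from the snippet above, with the tokenization loop paraphrased rather than copied from the patch (MutableInt is Commons Lang):

{code}
Map<String,MutableInt> termFrequency = new HashMap<String,MutableInt>();
for (String tk : tokens) {
  if (dictionary.containsKey(tk) == false) continue; // the missing guard
  MutableInt count = termFrequency.get(tk);
  if (count == null) {
    termFrequency.put(tk, new MutableInt(1));
  } else {
    count.increment();
  }
}
// with the guard in place, the assert in PartialVectorGenerator holds:
// vector.getNumNondefaultElements() == termFrequency.size()
{code}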




[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Robin Anil (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799631#action_12799631 ] 

Robin Anil commented on MAHOUT-237:
-----------------------------------

Ok. Done



[jira] Commented: (MAHOUT-237) Map/Reduce Implementation of Document Vectorizer

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828763#action_12828763 ] 

Ted Dunning commented on MAHOUT-237:
------------------------------------

{quote}
It seems like the Text field holding the vector class name (i.e. RandomAccessSparseVector etc.) is taking most of the space in the sequence file. Can't we compact it (with an integer id and a factory)?
{quote}

What about switching to Avro to avoid this?
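
For scale: Avro writes the schema once per container file instead of a class name per record. A sparse vector record might look like this sketch (namespace and field names are illustrative, not an agreed format):

{code}
{"type": "record",
 "name": "SparseVector",
 "namespace": "org.apache.mahout.math",
 "fields": [
   {"name": "cardinality", "type": "int"},
   {"name": "indices", "type": {"type": "array", "items": "int"}},
   {"name": "values",  "type": {"type": "array", "items": "double"}}
 ]}
{code}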
