Posted to user@mahout.apache.org by Burke Webster <bu...@gmail.com> on 2014/10/09 05:59:45 UTC

DictionaryVectorizer performance

I'm trying to turn a corpus of around 2.3 million docs into sparse
vectors for input into RowSimilarityJob, and I seem to be running into
performance issues with DictionaryVectorizer.createDictionaryChunks.
It seems the goal of that method is to assign a sequential integer id to
each "term" (bi-grams, in my case). This is done in-memory and attempts
to enforce a max chunk size.
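
For context, my mental model of that step is roughly the following (a
simplified sketch of my understanding, not the actual Mahout source; I'm
assuming the wordcount input is a <Text, LongWritable> SequenceFile and
ignoring the chunk-size rollover):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  // Sequential dictionary numbering: one pass, one machine, one counter.
  public class SequentialDictionarySketch {
    public static void main(String[] args) throws IOException {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      Path wordCounts = new Path(args[0]);  // <Text, LongWritable> term counts
      Path dictChunk = new Path(args[1]);   // <Text, IntWritable> dictionary

      SequenceFile.Reader reader = new SequenceFile.Reader(fs, wordCounts, conf);
      SequenceFile.Writer writer = SequenceFile.createWriter(
          fs, conf, dictChunk, Text.class, IntWritable.class);

      Text term = new Text();
      LongWritable count = new LongWritable();
      int nextId = 0;
      // Every distinct term gets the next integer id, strictly in order.
      while (reader.next(term, count)) {
        writer.append(term, new IntWritable(nextId++));
      }
      reader.close();
      writer.close();
    }
  }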

Is there a reason we wouldn't update this code to use an approach
similar to the one presented here:
http://waredingen.nl/monotonically-increasing-row-ids-with-mapredu.
Using a special comparator and a grouping partitioner allows us to
parallelize this operation across a MapReduce cluster.
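
To make the idea concrete, here is the general shape of what I have in
mind (my own sketch of the technique, not code from that post; the
"dictionary" counter group and the follow-up offset pass are assumptions
of this example):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  // Each reducer numbers its own terms locally and records how many it saw.
  // A cheap follow-up step turns (partition, localId) into a global id by
  // adding the cumulative sizes of the earlier partitions.
  public class LocalIdReducer extends Reducer<Text, LongWritable, Text, Text> {

    private int partition;  // which reduce partition this task is handling
    private long localId;   // next id within this partition

    @Override
    protected void setup(Context context) {
      partition = context.getTaskAttemptID().getTaskID().getId();
      localId = 0;
    }

    @Override
    protected void reduce(Text term, Iterable<LongWritable> counts, Context context)
        throws IOException, InterruptedException {
      // Emit (term, "partition:localId") instead of a final global id.
      context.write(term, new Text(partition + ":" + localId));
      localId++;
      // Record this partition's size so per-partition offsets can be computed.
      context.getCounter("dictionary", "partition-" + partition + "-size").increment(1);
    }
  }

After the job completes, the driver can read those per-partition sizes
from the job counters, compute a running offset per partition, and apply
it in a map-only pass (or while writing the dictionary chunks), so the
ids come out globally unique and contiguous without a single-threaded
numbering step.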

I'm happy to incorporate these changes (and am testing them locally).
I'm just curious whether I might be missing something that forces the
current single-threaded approach. Also, if this is better suited for the
mahout-dev list, I'm happy to take it there.

Thanks,
Burke