Posted to user@mahout.apache.org by Matt Molek <mp...@gmail.com> on 2012/10/09 20:48:26 UTC
Trying to modify maxDFPercent pruning in seq2sparse

I am working with the source code from the 0.7 release. I've been looking
at the code related to pruning words with a high document frequency because
for my uses I'd like to be able to set the max document frequency to a
value smaller than 1%. The current implementation uses an int to represent
maxDFPercent. I want to set maxDFPercent to something like 0.2% of
documents. I'm using the tfidf vectors from seq2sparse as input to a
rowsimilarity job which needs very sparse vectors to perform well. I'm
working with a corpus of several million documents, so limiting
maxDFPercent to 0.2% would still allow words that had occurred in ~10,000
documents. I haven't been able to test it yet, but I think I'll still get
acceptable comparison results with those numbers.
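The arithmetic above can be sketched as follows. This is only an illustration: the corpus size of 5,000,000 is an assumed stand-in for "several million", and `threshold` is a hypothetical helper, not a Mahout method.

```java
public class MaxDfThreshold {
    // Hypothetical helper: converts a percentage of documents into an
    // absolute document count, rounding to the nearest whole document.
    static long threshold(long docCount, double maxDfPercent) {
        return Math.round(docCount * maxDfPercent / 100.0);
    }

    public static void main(String[] args) {
        long docCount = 5000000L; // assumed corpus size, not a measured figure

        // 0.2% of 5,000,000 documents
        System.out.println(threshold(docCount, 0.2)); // prints 10000

        // An int percentage cannot hold 0.2: the cast truncates toward zero.
        System.out.println((int) 0.2); // prints 0
    }
}
```

The second line of output is the reason an int maxDFPercent cannot express anything below 1%.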

To that end, I am trying to modify
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles (and its
related classes) to represent maxDFPercent as a float instead of an int.
Initially it seemed very simple, but I'm having doubts about one part of
the code now.

A quick rundown of my understanding:
1) SparseVectorsFromSequenceFiles parses the maxDFPercentOpt and passes it
to org.apache.mahout.vectorizer.HighDFWordsPruner. As long as maxDFSigma is
not set, SparseVectorsFromSequenceFiles does not alter the value of maxDF
from the original maxDFPercentOpt.
2) In HighDFWordsPruner.pruneVectorsPartial(...) the unaltered maxDF from
SparseVectorsFromSequenceFiles is added to a Configuration with
conf.setInt(MAX_DF,maxDf)
3) That Configuration is read by
org.apache.mahout.vectorizer.pruner.WordsPrunerReducer where for each word
in the dictionary, if(maxDF > -1) and if(df > maxDF), that word's df is set
to 0.
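The check described in step 3 can be written out as a small standalone sketch. It reflects only my reading of the reducer, not the actual Mahout source; `prunedDf` is a hypothetical name, and the Hadoop types are left out so the snippet runs on its own.

```java
public class PruneRuleSketch {
    // Hypothetical stand-in for the comparison described in step 3:
    // a word's df is zeroed when pruning is enabled (maxDf > -1)
    // and the word's df exceeds maxDf.
    static long prunedDf(long df, long maxDf) {
        if (maxDf > -1 && df > maxDf) {
            return 0;
        }
        return df;
    }

    public static void main(String[] args) {
        System.out.println(prunedDf(5, 10));  // prints 5  (at or below the limit, kept)
        System.out.println(prunedDf(11, 10)); // prints 0  (above the limit, pruned)
        System.out.println(prunedDf(11, -1)); // prints 11 (pruning disabled)
    }
}
```

Note that the rule itself says nothing about whether maxDf is a percentage or a document count; it simply compares the two numbers it is given.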

Is that correct? If so, that comparison in WordsPrunerReducer looks wrong
to me. To give this some numbers, say I have 200 documents, and a word
"two" that occurs in 2 of them, and a word "three" occurs in 3 of them. Now
I will run seq2sparse with a maxDFPercent of 1%. As I understand the code,
a maxDF value of 1 will be passed through to WordsPrunerReducer where the
following comparisons will happen:

For word "two":
if(1 > -1) and if(2 > 1) set df of "two" to 0.

For word "three":
if(1 > -1) and if(3 > 1) set df of "three" to 0.

So they'll both be set to 0 when only "three" should have been. That's not
the result that the unmodified code gives though. It performs as expected,
pruning "three" but not "two."
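To make the discrepancy concrete, here is a sketch that runs the 200-document example under two readings: one where the raw percentage value (1) is compared against df directly, as in my rundown above, and one where the percentage is first converted to a document count (1% of 200 = 2). Both method names are hypothetical, and whether the 0.7 code performs such a conversion anywhere is an assumption I have not confirmed.

```java
public class TwoReadings {
    // Reading A: the raw percentage value is compared against df directly.
    static boolean prunedRawPercent(long df, long maxDfPercent) {
        return maxDfPercent > -1 && df > maxDfPercent;
    }

    // Reading B (assumption): the percentage is converted to a document
    // count first, and df is compared against that count.
    static boolean prunedAsCount(long df, long maxDfPercent, long docCount) {
        long maxDfCount = docCount * maxDfPercent / 100;
        return maxDfPercent > -1 && df > maxDfCount;
    }

    public static void main(String[] args) {
        long docCount = 200;
        long maxDfPercent = 1;

        // Word "two" has df = 2, word "three" has df = 3.
        System.out.println(prunedRawPercent(2, maxDfPercent));        // prints true
        System.out.println(prunedRawPercent(3, maxDfPercent));        // prints true
        System.out.println(prunedAsCount(2, maxDfPercent, docCount)); // prints false
        System.out.println(prunedAsCount(3, maxDfPercent, docCount)); // prints true
    }
}
```

Only reading B reproduces the behavior I actually observe (pruning "three" but not "two"), which is why I suspect I am missing a step somewhere between the option parsing and the reducer.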

So I feel like I just must be missing something. Could anybody help clear
this up for me? Thanks for the help!
-Matt