Posted to user@mahout.apache.org by Dave Byrne <db...@mdb.com> on 2012/09/20 19:55:31 UTC

TFIDFPartialVectorReducer minDf

In TFIDFPartialVectorReducer.java:

If docFreq > maxDocFreq then the vector at that index is not set (ignored)
If docFreq < minDocFreq then the vector at that index is set to the TfIdf calculation using minDocFreq instead of the actual document frequency.

Should minDocFreq not be treated the same as maxDocFreq by skipping setting the vector at that index?

In both cases, does the vector length remain the same, so that these settings have no effect on pruning the vector length or reducing the number of terms?
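
To make the two branches concrete, here is a minimal sketch of the behavior described above. It is not the Mahout source: the class name, the `reweight` and `weight` helpers, the plain `tf * ln(numDocs / df)` formula, and the use of a map in place of a sparse vector are all assumptions made for illustration, and maxDocFreq is compared as a raw count as in the description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TfIdfDfBoundsSketch {

    // Hypothetical weighting, tf * ln(numDocs / df); an assumption for this
    // sketch, not necessarily the formula Mahout uses.
    static double weight(int tf, long df, long numDocs) {
        return tf * Math.log((double) numDocs / df);
    }

    // Maps term index -> TF-IDF weight for a single document.
    // The map stands in for a sparse vector: a missing key is an unset entry.
    static Map<Integer, Double> reweight(Map<Integer, Integer> termFreqs,
                                         Map<Integer, Long> docFreqs,
                                         long minDocFreq,
                                         long maxDocFreq,
                                         long numDocs) {
        Map<Integer, Double> out = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : termFreqs.entrySet()) {
            int index = e.getKey();
            // Terms missing from the dictionary default to df = 1 (assumption).
            long df = docFreqs.getOrDefault(index, 1L);
            if (df > maxDocFreq) {
                continue;            // too common: the entry is never set
            }
            if (df < minDocFreq) {
                df = minDocFreq;     // too rare: df is rounded up, entry is still set
            }
            out.put(index, weight(e.getValue(), df, numDocs));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> termFreqs = new LinkedHashMap<>();
        termFreqs.put(0, 2);   // rare term
        termFreqs.put(1, 1);   // ordinary term
        termFreqs.put(2, 3);   // very common term

        Map<Integer, Long> docFreqs = new LinkedHashMap<>();
        docFreqs.put(0, 1L);
        docFreqs.put(1, 5L);
        docFreqs.put(2, 90L);

        // minDocFreq = 2, maxDocFreq = 50, 100 documents in total
        System.out.println(reweight(termFreqs, docFreqs, 2, 50, 100));
    }
}
```

In this sketch, index 2 is left unset while index 0 is kept with a weight computed as if it appeared in minDocFreq documents. Neither branch changes the index space itself, which is the point behind the question about vector length.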


NOTICE: This message and any attachments are intended only for the use of the addressee and may contain confidential, proprietary and/or privileged information. If you are not the intended recipient, any review, use, distribution, dissemination or copying of this email is prohibited. If you have received this email in error, please notify the sender by replying to this message and delete this email immediately. Securities trading, account management, and investment banking services are offered by MDB Capital Group LLC, a registered broker-dealer and member of FINRA and SIPC. Unless clearly stated, nothing herein shall be construed to be an offer to sell, nor a solicitation of an offer to buy, any financial product.

Re: TFIDFPartialVectorReducer minDf

Posted by Grant Ingersoll <gs...@apache.org>.
On Sep 20, 2012, at 1:55 PM, Dave Byrne wrote:

> In TFIDFPartialVectorReducer.java:
> 
> If docFreq > maxDocFreq then the vector at that index is not set (ignored)
> If docFreq < minDocFreq then the vector at that index is set to the TfIdf calculation using minDocFreq instead of the actual document frequency.
> 
> Should minDocFreq not be treated the same as maxDocFreq by skipping setting the vector at that index?

I think the idea is that the document frequency is being rounded up to provide some minimum level of input. It's always a bit of a hedge with these rare terms: sometimes they are just garbage, other times they are valuable. My leaning would be towards keeping it as is.

> 
> In both cases, the vector length remains the same and these settings have no effect on pruning the vector length / term reduction?

--------------------------------------------
Grant Ingersoll
http://www.lucidworks.com