Posted to user@mahout.apache.org by Pat Ferrel <pa...@farfetchers.com> on 2012/03/31 19:02:54 UTC

Filter out small docs

After calculating TFIDF vectors, I find that some docs in my collection 
are very small. They have 0 or only a few tokens; they are usually very 
small docs to begin with, and some are even empty. I'd like to drop them 
from the doc collection. My question is: should I drop them before 
calculating TFIDF, by removing them from the input, or after? (The 
choice will affect the IDF calculation.) And is there a way to do that 
with Mahout, or do I need to create a custom step in TFIDF to prune the 
small vectors/docs?
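
To illustrate why the order matters for IDF, here is a rough sketch. It 
assumes the textbook formula idf = ln(N / df), which is not necessarily 
the exact formula Mahout applies, and the document counts are made up 
for illustration only.

```python
import math


def idf(num_docs, doc_freq):
    """Textbook inverse document frequency (an assumption, not Mahout's exact formula)."""
    return math.log(num_docs / doc_freq)


# Hypothetical corpus: 1000 docs, 200 of them near-empty.
# A term occurs in 50 docs, none of which are among the near-empty ones.
idf_with_small_docs = idf(1000, 50)     # small docs still counted in N
idf_without_small_docs = idf(800, 50)   # small docs removed before TFIDF

print(idf_with_small_docs, idf_without_small_docs)
```

Removing the small docs from the input lowers N, so every term's IDF 
shifts; pruning the vectors afterwards leaves the weights as computed 
over the full collection.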

Re: Filter out small docs

Posted by Robin Anil <ro...@gmail.com>.
I would suggest adding a preprocessing step that generates the input
sequence file Mahout reads, instead of relying on the seqdirectory tool.
Most of the time you spend tuning will go into tweaking your processed
documents.
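
A minimal sketch of the filtering part of such a preprocessing step 
follows. Everything in it is an assumption for illustration: the 
MIN_TOKENS threshold, the function names, and the regex tokenizer, which 
is much simpler than the Lucene analyzer Mahout uses, so token counts 
will not match exactly. Writing the surviving docs out as a sequence 
file is not shown.

```python
import re
from pathlib import Path

MIN_TOKENS = 5                     # hypothetical threshold; tune for your corpus
TOKEN_RE = re.compile(r"\w+")      # crude tokenizer, not Mahout's analyzer


def count_tokens(text):
    """Count word-like tokens in a document."""
    return len(TOKEN_RE.findall(text))


def read_docs(input_dir):
    """Yield (doc_id, text) for every file under input_dir."""
    root = Path(input_dir)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            yield str(path.relative_to(root)), text


def filter_small_docs(docs, min_tokens=MIN_TOKENS):
    """Yield only the (doc_id, text) pairs with at least min_tokens tokens."""
    for doc_id, text in docs:
        if count_tokens(text) >= min_tokens:
            yield doc_id, text


# In-memory example; in practice pass read_docs("your/input/dir") instead.
sample = [
    ("a.txt", ""),
    ("b.txt", "too short"),
    ("c.txt", "this document has enough tokens to keep"),
]
kept = list(filter_small_docs(sample))
print([doc_id for doc_id, _ in kept])
```

The surviving (doc_id, text) pairs would then be written as the 
key/value records of the input sequence file, so the empty and 
near-empty docs never reach the TFIDF step.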
------
Robin Anil


On Sat, Mar 31, 2012 at 12:02 PM, Pat Ferrel <pa...@farfetchers.com> wrote:

> After calculating TFIDF vectors some docs are very small in my collection.
> They have 0 or only a few tokens, they are usually very small docs to begin
> some are even empty. I'd like to drop them from the doc collection. My
> question is should I drop them before calculating TFIDF by removing them
> from the input or after (this will affect IDF calculation) and is there a
> way to do that with mahout or do I need to create a custom step in TFIDF to
> prune the small vectors/docs?
>
>