Posted to java-user@lucene.apache.org by David Spencer <da...@tropo.com> on 2005/01/14 01:03:57 UTC

stop words and index size

Does anyone know how much stop words are supposed to affect the index size?

I did an experiment of building an index once with, and once without, 
stop words.

The corpus is the English Wikipedia, and I indexed the title and body of 
the articles. I used a list of 525 stop words.

With stopwords removed the index is 227MB.
With stopwords kept the index is 331MB.

Thus, the index grows by 45% in this case, which I found surprising, as I 
expected it to grow less. I haven't dug into the details of the 
Lucene file formats but thought compression (fields/term vectors/sparse 
lists/VInts) would negate the effect of stopwords to a large extent.
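Lucene's real on-disk format has more moving parts (frequencies, positions, skip data), but the core intuition can be sketched with variable-byte (VInt) encoding of delta-compressed doc IDs. The class and method names below are illustrative, not Lucene's; the point is that for a stop word present in nearly every document, the gaps are all 1, so compression bottoms out at one byte per document per term:

```java
import java.io.ByteArrayOutputStream;

/** Sketch of Lucene-style VInt encoding of a delta-compressed posting list.
 *  Names are illustrative; this is not Lucene's actual API. */
public class VIntSketch {
    // Encode one non-negative int as a VInt: 7 data bits per byte,
    // high bit set means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Size in bytes of the delta-encoded doc-ID list for one term.
    static int postingBytes(int[] docIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int prev = 0;
        for (int id : docIds) {
            writeVInt(out, id - prev);  // store the gap, not the absolute ID
            prev = id;
        }
        return out.size();
    }

    public static void main(String[] args) {
        // A stop word present in every one of 400,000 docs: all gaps are 1,
        // so even compressed, the doc-ID list alone costs one byte per doc.
        int[] docs = new int[400_000];
        for (int i = 0; i < docs.length; i++) docs[i] = i + 1;
        System.out.println(postingBytes(docs));  // 400000
    }
}
```

So VInts help a lot with rare terms (large, few gaps) but can't make a ubiquitous term cheap: its posting list is long no matter how small each entry is.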

Some more details + a link to my stopword list are here:
http://www.searchmorph.com/weblog/index.php?id=36

-- Dave

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-user-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-user-help@jakarta.apache.org


Re: stop words and index size

Posted by Doug Cutting <cu...@apache.org>.
David Spencer wrote:
> Does anyone know how much stop words are supposed to affect the index size?
> 
> I did an experiment of building an index once with, and once without, 
> stop words.
> 
> The corpus is the English Wikipedia, and I indexed the title and body of 
> the articles. I used a list of 525 stop words.
> 
> With stopwords removed the index is 227MB.
> With stopwords kept the index is 331MB.

The unstopped version is indeed bigger and slower to build, but it's 
only slower to search when folks search on stop words.  One approach to 
minimizing stopwords in searches (used by, e.g. Nutch & Google) is to 
index all stop words but remove them from queries unless they're (a) in 
a phrase or (b) explicitly required with a "+".  (It might be nice if 
Lucene included a query parser that had this feature.)
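The policy Doug describes can be sketched in a few lines. This is not Lucene's query parser; the tokenization here (quote-delimited phrases, a leading "+" for required terms) is a deliberate simplification just to show the keep/drop rule:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch: index all stop words, but drop them from queries unless they
 *  appear inside a quoted phrase or are explicitly required with '+'. */
public class StopAwareQuery {
    static List<String> filterQuery(String query, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        // Splitting on '"' alternates: even indexes are outside quotes,
        // odd indexes are inside a phrase.
        String[] parts = query.split("\"", -1);
        for (int i = 0; i < parts.length; i++) {
            boolean inPhrase = (i % 2 == 1);
            for (String tok : parts[i].trim().split("\\s+")) {
                if (tok.isEmpty()) continue;
                boolean required = tok.startsWith("+");
                String word = required ? tok.substring(1) : tok;
                if (inPhrase || required || !stopWords.contains(word)) {
                    kept.add(word);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<String> stop = Set.of("the", "to", "be", "or", "not");
        System.out.println(filterQuery("the quick fox", stop));          // [quick, fox]
        System.out.println(filterQuery("\"to be or not to be\"", stop)); // phrase kept intact
    }
}
```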

Nutch also optimizes phrase searches involving a few very common stop 
words (e.g., "the", "a", "to") by indexing these as bigrams and 
converting phrases involving them to bigram phrases.  So, if someone 
searches for "to be or not to be" then this turns into a search for 
"to-be be or not-to to-be" which is considerably faster since it 
involves rarer terms.  But the more words you bigram the bigger the 
index gets and the slower updates get, so you probably can't afford to 
do this for your full stop list.  (It might be nice if Lucene included 
support for this technique too!)
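The conversion Doug describes can be sketched as follows. This is a simplified stand-in for what Nutch does, not its actual code: at each phrase position, if the current or next token is a common word, emit the joined bigram; a trailing token already covered by the preceding bigram is dropped, which is what turns "to be or not to be" into "to-be be or not-to to-be":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Sketch of bigramming very common words inside phrase queries.
 *  Illustrative only; Nutch's real implementation differs in detail. */
public class Bigrams {
    static List<String> toBigramPhrase(List<String> tokens, Set<String> common) {
        List<String> out = new ArrayList<>();
        int n = tokens.size();
        for (int i = 0; i < n; i++) {
            if (i < n - 1 && (common.contains(tokens.get(i)) || common.contains(tokens.get(i + 1)))) {
                // current or next token is common: emit the bigram for this position
                out.add(tokens.get(i) + "-" + tokens.get(i + 1));
            } else if (i == n - 1 && n > 1
                    && (common.contains(tokens.get(i - 1)) || common.contains(tokens.get(i)))) {
                // final token is already covered by the preceding bigram; skip it
            } else {
                out.add(tokens.get(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> phrase = List.of("to", "be", "or", "not", "to", "be");
        System.out.println(toBigramPhrase(phrase, Set.of("to")));
        // [to-be, be, or, not-to, to-be]
    }
}
```

Each bigram like "to-be" is far rarer than "to" alone, so the phrase search touches much shorter posting lists.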

Doug



Re: stop words and index size

Posted by Chris Hostetter <ho...@fucit.org>.

: The corpus is the English Wikipedia, and I indexed the title and body of
: the articles. I used a list of 525 stop words.
:
: With stopwords removed the index is 227MB.
: With stopwords kept the index is 331MB.

That doesn't seem horribly surprising.

Consider that for every Term in the index, Lucene is keeping track of the
list of <docId, freq> pairs for every document that contains that term.

Assume that something has to be in at least 25% of the docs before you
decide it's worth making it a stop word.  Your URL indicates you are
dealing with 400k docs, which means that for each stop word, the space
needed to store the int pairs for <docId, freq> is...

        (4B + 4B) * 100,000 =~ 780KB  (per stop word Term, minimum)

...not counting any indexing structures that may be used internally to
improve the lookup of a Term.  Assuming some of those words are in more or
less than 25% of your documents, that could easily account for a
difference of 100MB.

I suspect that an interesting exercise would be to use some of the code
I've seen tossed around on this list that lets you iterate over all Terms
and find the most common ones, to help you determine your stopword list
programmatically.  Then remove/reindex any documents that contain each word
as you add it to your stoplist (one word at a time) and watch your index
shrink.
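The first half of that exercise can be sketched without an index at all: rank terms by document frequency and take the top N as stop-word candidates. This toy version works over an in-memory corpus; against a real index you would walk Lucene's term enumeration instead (the names below are illustrative, not Lucene's API):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

/** Sketch: rank terms by document frequency to find stop-word candidates. */
public class StopWordCandidates {
    static List<String> topByDocFreq(List<Set<String>> docs, int n) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                docFreq.merge(term, 1, Integer::sum); // one count per doc, not per occurrence
            }
        }
        return docFreq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("the", "cat", "sat"),
                Set.of("the", "dog", "ran"),
                Set.of("the", "a", "cat"));
        System.out.println(topByDocFreq(docs, 2)); // [the, cat]
    }
}
```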




-Hoss

