Posted to solr-user@lucene.apache.org by blargy <zm...@hotmail.com> on 2010/03/17 02:51:38 UTC

Stopwords

I was reading "Scaling Lucene and Solr"
(http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
and I came across the StopWords section.

It mentions that removing stop words at index time is not recommended.
Why is this the case? Don't all the extraneous stopwords bloat the
index and lead to less relevant results? Can someone please explain this
to me? Thanks.

Re: Stopwords

Posted by Mark Miller <ma...@gmail.com>.

On 03/17/2010 12:03 PM, Robert Muir wrote:
> On Wed, Mar 17, 2010 at 11:48 AM, Grant Ingersoll <gs...@apache.org> wrote:
>
>> Yes and no.  Putting our historian hat on, stop words were often seen as contributing very little to scores and also taking up a lot of room on disk back in the days when disk was very precious.  Times, as they say, have changed.  Disk is cheap, so that is no longer a concern.
>>
> Yes, and the take-away from the Dolamic and Savoy paper is that,
> performance-aside, removing stopwords is still a necessary evil for
> good relevance, at least for some languages.
>
> Ideally we wouldn't have to remove information to have good relevance,
> and a good step forward would be to support relevance-ranking
> algorithms such as the BM25* mentioned in the paper, that provide good
> relevance without the need to remove stopwords.
>
> For now, at least, the CommonGrams solution available in Solr
> provides an alternative that can address both concerns (performance
> and relevance) to some degree.

In general I prefer to have the option of removing stopwords at query 
time (common grams solution aside).

Too many times have I removed stopwords and had user complaints about 
phrase and proximity queries, and no server downtime to reindex and fix 
the issue.

It was never fun supporting Librarians.

-- 
- Mark

http://www.lucidimagination.com

Re: Stopwords

Posted by Robert Muir <rc...@gmail.com>.

On Wed, Mar 17, 2010 at 11:48 AM, Grant Ingersoll <gs...@apache.org> wrote:

> Yes and no.  Putting our historian hat on, stop words were often seen as contributing very little to scores and also taking up a lot of room on disk back in the days when disk was very precious.  Times, as they say, have changed.  Disk is cheap, so that is no longer a concern.
>

Yes, and the take-away from the Dolamic and Savoy paper is that,
performance-aside, removing stopwords is still a necessary evil for
good relevance, at least for some languages.

Ideally we wouldn't have to remove information to have good relevance,
and a good step forward would be to support relevance-ranking
algorithms such as the BM25* mentioned in the paper, that provide good
relevance without the need to remove stopwords.

For now, at least, the CommonGrams solution available in Solr
provides an alternative that can address both concerns (performance
and relevance) to some degree.
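
To make the CommonGrams idea concrete, here is a minimal Python sketch,
not Solr's actual implementation. It keeps every token and additionally
emits a combined token whenever a word or its right-hand neighbour is in
a list of common words, so a phrase query can look up the rarer combined
terms instead of the very frequent single words. The COMMON list, the
"_" separator, and the function name are assumptions made for
illustration.

```python
# Toy list of common words, assumed for illustration only.
COMMON = {"the", "of", "to", "is"}

def common_grams(tokens, common=COMMON, sep="_"):
    """Return the original tokens plus word pairs that involve a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)  # the single word is always kept
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            # Pair the word with its neighbour if either one is common.
            if tok in common or nxt in common:
                out.append(tok + sep + nxt)
    return out

if __name__ == "__main__":
    print(common_grams("the president of the united states".split()))
```

For the words of "the president of the united states", this emits pairs
such as the_president and of_the alongside the original words, while
united and states stay unpaired.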

-- 
Robert Muir
rcmuir@gmail.com

Re: Stopwords

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 16, 2010, at 9:51 PM, blargy wrote:

> 
> I was reading "Scaling Lucene and Solr"
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
> and I came across the section StopWords.
>
> In there it mentioned that it's not recommended to remove stop words at index
> time. Why is this the case? Don't all the extraneous stopwords bloat the
> index and lead to less relevant results? Can someone please explain this to
> me? Thanks

Yes and no.  Putting our historian hat on, stop words were often seen as contributing very little to scores and also taking up a lot of room on disk back in the days when disk was very precious.  Times, as they say, have changed.  Disk is cheap, so that is no longer a concern.  

Think about stop words a little from a language perspective.  While it is true that they are of little value in search, they are not of "no value" (if they were of no value in a language, one could argue that the words shouldn't even exist, right?).  This is especially true when the user enters a query that consists entirely of stop words (for instance, there is a band called "The The").  Thus, the trick becomes knowing when to use stop words and when not to.  If you remove them at indexing time, you have no choice, as the information is lost.  That is why more and more people keep them during indexing and then deal with them at query time.

It turns out that stop words are often also useful as part of phrases.  Consider the following two documents:

1. The President of the United States went to China last week.
2. Joe is the President.  The United States is investigating him for corruption.

If the user enters the query "The President of the United States" and stop words are removed at both indexing and search time, then both documents will match.  With stop words kept, only the first document matches, which is the correct result, at least based on my intent.
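
The difference can be checked with a small Python sketch. It is only an
illustration under stated assumptions: the stop word list is a toy one,
and the matcher simply drops removed words and closes the gaps, which a
real analyzer need not do.

```python
import re

# Toy stop word list, assumed for illustration only.
STOPWORDS = {"the", "of", "is", "to", "for"}

DOCS = {
    1: "The President of the United States went to China last week.",
    2: "Joe is the President.  The United States is investigating him for corruption.",
}

def tokenize(text, remove_stopwords):
    """Lowercase, split into words, and optionally drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

def phrase_match(doc_tokens, query_tokens):
    """True if the query tokens occur as a contiguous run in the document."""
    n = len(query_tokens)
    return any(doc_tokens[i:i + n] == query_tokens
               for i in range(len(doc_tokens) - n + 1))

def search(query, remove_stopwords):
    """Return the ids of documents matching the query as an exact phrase."""
    q = tokenize(query, remove_stopwords)
    return [doc_id for doc_id, text in DOCS.items()
            if phrase_match(tokenize(text, remove_stopwords), q)]

if __name__ == "__main__":
    query = "The President of the United States"
    print("stop words removed:", search(query, remove_stopwords=True))
    print("stop words kept:   ", search(query, remove_stopwords=False))
```

With stop word removal the query matches both documents; with stop words
kept it matches only the first.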

To deal with them at query time, you need an intelligent query parser that:
1. Recognizes when the query is all stop words
2. Keeps stop words as part of phrases

Unfortunately, none of the existing Solr Query Parsers address these two things.
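
A minimal Python sketch of those two rules follows. It is not an
existing Solr query parser; the function name, the clause format, and
the stop word list are hypothetical. Quoted phrases keep all their
words, and bare words lose their stop words unless that would leave the
query empty.

```python
import re

# Toy stop word list, assumed for illustration only.
STOPWORDS = {"the", "of", "is", "to", "in", "a", "an", "and"}

def analyze_query(query):
    """Hypothetical query-time analysis.

    Returns a list of ('phrase', [words]) and ('term', word) clauses.
    """
    clauses = []
    bare_words = []
    # Each match is either a double-quoted phrase or a run of non-space characters.
    for phrase, word in re.findall(r'"([^"]*)"|(\S+)', query):
        if phrase:
            words = re.findall(r"\w+", phrase.lower())
            if words:
                # Rule 2: stop words inside a phrase are kept.
                clauses.append(("phrase", words))
        else:
            bare_words.extend(re.findall(r"\w+", word.lower()))
    kept = [w for w in bare_words if w not in STOPWORDS]
    if not kept and not clauses:
        # Rule 1: the query is all stop words, so keep them
        # rather than search for nothing.
        kept = bare_words
    clauses.extend(("term", w) for w in kept)
    return clauses

if __name__ == "__main__":
    print(analyze_query("the the"))
    print(analyze_query('"The President of the United States" in China'))
```

Because the index still contains the stop words, both decisions can be
changed later without reindexing.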

HTH,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search

Re: Stopwords

Posted by Anthony Serfes <as...@optonline.net>.

They apparently moved it; it's here now:
http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf


--------------------------------------------------
From: "Glen Newton" <gl...@gmail.com>
Sent: Wednesday, March 17, 2010 11:13 AM
To: <so...@lucene.apache.org>
Subject: Re: Stopwords

> That discussion cites a paper via a URL:
> http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana__When_Stopword_Lists_Make_the_Difference_20091218.pdf
>
> Unfortunately when I go to this URL I get:
> "L'accès à ce document est limité." ("Access to this document is restricted.")
>
> But I tracked down the paper. Here is its reference (which may require
> a subscription: sorry):
> US: http://dx.doi.org/10.1002/asi.21186
> AU: Ljiljana Dolamic
> AU: Jacques Savoy
> TI: When stopword lists make the difference
> SO: Journal of the American Society for Information Science and Technology
> VL: 61
> NO: 1
> PG: 200-203
> YR: 2010
> CP: © 2009 ASIS&T
> ON: 1532-2890
> PN: 1532-2882
> AD: Computer Science Department, University of Neuchâtel, 2009
> Neuchâtel, Switzerland
> DOI: 10.1002/asi.21186
>
> -Glen
>
> On 17 March 2010 06:02, Ahmet Arslan <io...@yahoo.com> wrote:
>>
>>> I was reading "Scaling Lucene and Solr"
>>> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
>>> and I came across the section StopWords.
>>>
>>> In there it mentioned that it's not recommended to remove stop words
>>> at index time. Why is this the case? Don't all the extraneous
>>> stopwords bloat the index and lead to less relevant results? Can
>>> someone please explain this to me? Thanks
>>
>> There was a discussion about stopwords (remove them, don't remove them,
>> or index them with CommonGramsFilterFactory), with good references, in
>> this thread:
>>
>> http://search-lucene.com/m/QvJtF1mIPP22/When+Stopword+Lists+Make+the+Difference

Re: Stopwords

Posted by Glen Newton <gl...@gmail.com>.

That discussion cites a paper via a URL:
http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana__When_Stopword_Lists_Make_the_Difference_20091218.pdf

Unfortunately when I go to this URL I get:
"L'accès à ce document est limité." ("Access to this document is restricted.")

But I tracked down the paper. Here is its reference (which may require
a subscription: sorry):
US: http://dx.doi.org/10.1002/asi.21186
AU: Ljiljana Dolamic
AU: Jacques Savoy
TI: When stopword lists make the difference
SO: Journal of the American Society for Information Science and Technology
VL: 61
NO: 1
PG: 200-203
YR: 2010
CP: © 2009 ASIS&T
ON: 1532-2890
PN: 1532-2882
AD: Computer Science Department, University of Neuchâtel, 2009
Neuchâtel, Switzerland
DOI: 10.1002/asi.21186

-Glen

On 17 March 2010 06:02, Ahmet Arslan <io...@yahoo.com> wrote:
>
>> I was reading "Scaling Lucene and Solr"
>> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
>> and I came across the section StopWords.
>>
>> In there it mentioned that it's not recommended to remove stop words
>> at index time. Why is this the case? Don't all the extraneous
>> stopwords bloat the index and lead to less relevant results? Can
>> someone please explain this to me? Thanks
>
> There was a discussion about stopwords (remove them, don't remove them, or index them with CommonGramsFilterFactory), with good references, in this thread:
>
> http://search-lucene.com/m/QvJtF1mIPP22/When+Stopword+Lists+Make+the+Difference

Re: Stopwords

Posted by Ahmet Arslan <io...@yahoo.com>.

> I was reading "Scaling Lucene and Solr"
> (http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr/)
> and I came across the section StopWords.
>
> In there it mentioned that it's not recommended to remove stop words
> at index time. Why is this the case? Don't all the extraneous
> stopwords bloat the index and lead to less relevant results? Can
> someone please explain this to me? Thanks

There was a discussion about stopwords (remove them, don't remove them, or index them with CommonGramsFilterFactory), with good references, in this thread:

http://search-lucene.com/m/QvJtF1mIPP22/When+Stopword+Lists+Make+the+Difference