Posted to java-user@lucene.apache.org by David Causse <dc...@spotter.com> on 2008/11/27 14:34:00 UTC

[OT] About stopwords

Hi,

Look at this Google query:
http://www.google.fr/search?q=%22HOW+at+at+of+a+A+a%22

What does it tell us about stop words?
Does Google have no stop words?

David.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: [OT] About stopwords

Posted by Michael McCandless <lu...@mikemccandless.com>.
That's a phrase search, so it's conceivable that Google is doing
something similar to Nutch, whereby adjacent n-grams are indexed as
unique terms.
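
To make the n-gram idea concrete, here is a minimal sketch of indexing adjacent word pairs as single terms and answering a phrase query by intersecting their postings. It is not Nutch's or Google's actual implementation; the function names and the toy documents are made up for illustration.

```python
from collections import defaultdict

def bigrams(text):
    """Return adjacent word pairs, lowercased, each joined into one index term."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

def build_index(docs):
    """Map each bigram term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in bigrams(text):
            index[term].add(doc_id)
    return index

def phrase_candidates(index, phrase):
    """Intersect the postings of every bigram in the phrase.

    No positions are stored, so this only yields candidates: a document
    could contain all the bigrams without containing the whole phrase.
    """
    terms = bigrams(phrase)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Toy corpus (hypothetical documents, not real search results).
docs = {
    1: "how at at of a a a",
    2: "the sound and the fury",
    3: "a tale of a tub",
}
index = build_index(docs)
print(phrase_candidates(index, "HOW at at of a A a"))
```

The point of the sketch is that a pair such as "of a" is far rarer than "of" or "a" alone, so the posting lists to intersect are much shorter than those of the individual stop words.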

But if you do the same search without quotes:

     http://www.google.fr/search?hl=fr&q=HOW+at+at+of+a+A+a&btnG=Rechercher&meta=

they still find many matches (though, curiously, the one result
returned for the phrase search seems not to make the first page for
the non-phrase search).

So it does seem like Google has no stop words.

It actually makes some sense: Google obviously has to deal with
non-stopword terms that have tremendous frequency (e.g. "1" and "2",
which occur more frequently than "a" or "the") by scaling out across
machines. Since they have already solved that scale-out problem
anyway, the incremental cost of including stopwords is probably minor.

Mike





Re: [OT] About stopwords

Posted by David Causse <dc...@spotter.com>.
Thanks for the tip,

but I can't imagine the number of documents Google has to join in order
to process such results...
There must be a trick.
Maybe stopwords are not indexed alone but twice, once paired with the
previous token and once with the next, as some sort of 2-gram index?
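
A rough sketch of what such a token stream could look like: ordinary words stay as single terms, while each stop word is emitted only as part of a pair with its neighbours. The stop word list, the underscore separator and the function name are assumptions for illustration, not anything Google or Lucene is known to use.

```python
# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "at", "of", "the", "and", "how"}

def common_gram_terms(text, stop_words=STOP_WORDS):
    """Emit index terms where stop words never appear alone."""
    words = text.lower().split()
    terms = []
    for i, word in enumerate(words):
        if word not in stop_words:
            terms.append(word)  # ordinary words are kept as unigrams
        has_next = i + 1 < len(words)
        if has_next and (word in stop_words or words[i + 1] in stop_words):
            # Any pair that touches a stop word becomes one fused term.
            terms.append(f"{word}_{words[i + 1]}")
    return terms

print(common_gram_terms("The Sound and the Fury"))
```

With this scheme a stop word in the middle of a text shows up in two fused terms (one with the previous token, one with the next), and the huge posting list for the bare stop word is never built.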

David.

Aleksander M. Stensby wrote:
> Your query includes quotation marks, which tell Google to include
> common words in the query.
> But if you remove the quotation marks, you will still get results, as
> Google states:
>
> "Google ignores stop words when they're placed in searches alongside 
> less common words. For example, a search for [ The Sound and the Fury 
> ] will only return results for the terms "Sound" and "Fury." However, 
> a search that only includes stop words -- [ The Who ], for example -- 
> will be processed as is."
>
> The key here is "when they're placed in searches alongside less common 
> words".
> http://www.google.com/support/bin/answer.py?hl=en&answer=981
>
>
> Hope that answers your questions.
> Regards,
>  Aleks




Re: [OT] About stopwords

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Your query includes quotation marks, which tell Google to include common
words in the query.
But if you remove the quotation marks, you will still get results, as
Google states:

"Google ignores stop words when they're placed in searches alongside less  
common words. For example, a search for [ The Sound and the Fury ] will  
only return results for the terms "Sound" and "Fury." However, a search  
that only includes stop words -- [ The Who ], for example -- will be  
processed as is."

The key here is "when they're placed in searches alongside less common  
words".
http://www.google.com/support/bin/answer.py?hl=en&answer=981
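
The quoted rule can be sketched as a small query-time filter. This only illustrates the behaviour described in the help text above, under the assumption of a fixed stop word list; the list and the function name are hypothetical, and nothing here reflects how Google actually implements it.

```python
# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "and", "a", "of", "at", "who", "how"}

def effective_terms(query, stop_words=STOP_WORDS):
    """Return the terms a search would actually use under the quoted rule."""
    # A query wrapped in double quotes is a phrase search: keep every word.
    if len(query) >= 2 and query.startswith('"') and query.endswith('"'):
        return query[1:-1].lower().split()
    words = query.lower().split()
    content = [w for w in words if w not in stop_words]
    # Stop words are dropped only when less common words remain;
    # a query made up entirely of stop words is processed as is.
    return content if content else words

print(effective_terms("The Sound and the Fury"))
print(effective_terms("The Who"))
```

Under these assumptions the first query is reduced to "sound" and "fury", while the second keeps both of its words because nothing else would be left to search for.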


Hope that answers your questions.
Regards,
  Aleks





-- 
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no
