Posted to java-user@lucene.apache.org by David Causse <dc...@spotter.com> on 2008/11/27 14:34:00 UTC
[OT] About stopwords
Hi,
Look at this Google query:
http://www.google.fr/search?q=%22HOW+at+at+of+a+A+a%22
What does this tell us about stop words?
Does Google have no stop words?
David.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: [OT] About stopwords
Posted by Michael McCandless <lu...@mikemccandless.com>.
That's a phrase search, so it's conceivable Google could be doing
something similar to Nutch, whereby adjacent n-grams are indexed as
unique terms.
But if you do the same search without quotes:
http://www.google.fr/search?hl=fr&q=HOW+at+at+of+a+A+a&btnG=Rechercher&meta=
they still find many matches (though, curiously, the one result
returned for the phrase search does not seem to make the first page for
the non-phrase search).
So it does seem like Google has no stop words.
It actually makes some sense: Google obviously has to deal
with non-stopword terms that have tremendous frequency (e.g. "1" and
"2", which occur more frequently than "a" or "the") by scaling out
across machines. Since they have already solved that scale-out problem,
the incremental cost of including stopwords is probably minor.
Mike
Re: [OT] About stopwords
Posted by David Causse <dc...@spotter.com>.
Thanks for the tip,
but I can't imagine the number of documents Google has to join in order
to process such results...
There must be a trick.
Maybe stopwords are not indexed on their own but twice, once paired with
the previous token and once with the next, as in some sort of 2-gram index?
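
As a rough illustration of the 2-gram idea above (and of the Nutch-style
approach mentioned earlier in the thread), here is a minimal Python sketch.
It is not how Google or Nutch actually implement this: the stop word list,
the "_" separator, and the function names index_terms, build_index,
query_terms and phrase_search are all hypothetical. Every token is indexed
as a unigram, and a bigram term is added whenever either neighbour is a
stop word, so a phrase made of stop words can be answered from the bigram
postings alone.

```python
from collections import defaultdict

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"a", "at", "of", "how", "the"}


def index_terms(tokens):
    """Yield (term, position): each unigram, plus a bigram whenever
    the token or its successor is a stop word."""
    for i, tok in enumerate(tokens):
        yield tok, i
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in STOP_WORDS or nxt in STOP_WORDS:
                yield tok + "_" + nxt, i


def build_index(docs):
    """Build term -> doc_id -> [positions] from {doc_id: text}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for term, pos in index_terms(text.lower().split()):
            index[term][doc_id].append(pos)
    return index


def query_terms(tokens):
    """Rewrite a phrase as (term, offset) pairs, preferring bigrams;
    tokens not covered by any bigram are kept as unigrams."""
    covered, terms = set(), []
    for i in range(len(tokens) - 1):
        a, b = tokens[i], tokens[i + 1]
        if a in STOP_WORDS or b in STOP_WORDS:
            terms.append((a + "_" + b, i))
            covered.update((i, i + 1))
    for i, tok in enumerate(tokens):
        if i not in covered:
            terms.append((tok, i))
    return terms


def phrase_search(index, phrase):
    """Return sorted doc ids in which the phrase occurs contiguously."""
    terms = query_terms(phrase.lower().split())
    if not terms:
        return []
    first_term, first_off = terms[0]
    hits = []
    for doc_id, positions in index.get(first_term, {}).items():
        for pos in positions:
            start = pos - first_off
            # Every query term must appear at its offset from `start`.
            if all(start + off in index.get(t, {}).get(doc_id, ())
                   for t, off in terms):
                hits.append(doc_id)
                break
    return sorted(hits)
```

With this layout, a phrase query such as "HOW at at of a A a" is rewritten
into the bigram terms how_at, at_at, at_of, of_a and a_a, so the lookup
never has to walk the (huge) posting list of "a" on its own. The price is
a larger term dictionary.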
David.
Re: [OT] About stopwords
Posted by "Aleksander M. Stensby" <al...@integrasco.no>.
Your query includes quotation marks, which tell Google to include common
words in the query.
But if you remove the quotation marks, you will still get results. As Google
states:
"Google ignores stop words when they're placed in searches alongside less
common words. For example, a search for [ The Sound and the Fury ] will
only return results for the terms "Sound" and "Fury." However, a search
that only includes stop words -- [ The Who ], for example -- will be
processed as is."
The key here is "when they're placed in searches alongside less common
words".
http://www.google.com/support/bin/answer.py?hl=en&answer=981
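
The rule quoted above can be sketched in a few lines of Python. This only
mirrors the behaviour described in the help text; Google's real stop word
list and query processing are not public, so the STOP_WORDS set and the
effective_terms function below are assumptions made for illustration.

```python
# Hypothetical stop word list; the real one is unknown.
STOP_WORDS = {"the", "and", "a", "of", "who"}


def effective_terms(query):
    """Drop stop words when the query also contains less common words;
    keep the query as is when it consists only of stop words."""
    terms = query.lower().split()
    content = [t for t in terms if t not in STOP_WORDS]
    return content if content else terms
```

Under these assumptions, effective_terms("The Sound and the Fury") keeps
only "sound" and "fury", while effective_terms("The Who") keeps both words
because nothing else is left to search on.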
Hope that answers your questions.
Regards,
Aleks
--
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no