You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by John Byrne <jo...@propylon.com> on 2008/03/03 11:40:32 UTC

bigram analysis

Hi,

I need to use stop-word bigrams, liike the Nutch analyzer, as described 
in LIA 4.8 (Nutch Analysis). What I don't understand is, why does it 
keep the original stop word intact? I can see great advantage to being 
able to search for a combination of stop word + real word, but I don't 
see the point of keeping the stop word as a token on it's own. Searches 
with just that word would be as pointless as ever.

Is the idea to allow searching on all stop words, even on their own, and 
the bigrams are just an optimization that will improve things 90% of the 
time? Or is it just a side effect of the bigram analyzer that it 
produces a token from the stop word, and therefore it could just be 
filtered out by a stop word filter afterwards, leaving only the bigram 
and the original (non-stop) word?

I'm sure either way would work fr me - just wondering what is normally 
done, and if I'm missing something important here...

Thanks!
-John

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: bigram analysis

Posted by John Byrne <jo...@propylon.com>.

Yes, this makes sense to me. I think I'll just keep all words, including 
stop words, and if performance ever becomes an issue, I'll look at 
bigrams again. But I think there's a good chance that I'll never see 
significant impact either way.

Thanks guys!

Grant Ingersoll wrote:
> Yep, still good reasons like I said, but becoming less important as 
> the hardware, etc. gets faster and cheaper, IMO, especially in the 
> context of more advanced search capabilities.
>
> On Mar 3, 2008, at 10:49 AM, Mathieu Lecarme wrote:
>
>>
>>> Not sure, you might want to ask on Nutch.  From a strict language 
>>> standpoint, the notion of a stopword in my mind is a bit dubious.  
>>> If the word really has no meaning, then why does the language have 
>>> it to begin with?  In a search context, it has been treated as of 
>>> minimal use in the early days mostly because of space and memory 
>>> considerations.  Now a days, as we get more sophisticated in our 
>>> search capabilities, I think it can be useful for doing better 
>>> phrase matching, etc. as well as more advanced NLP search.  Now it 
>>> seems like the general response is disk is cheap, why throw away 
>>> information?
>> To limit writing on disk, to simplify merge ?
>>
>> I don't know the ratio of stop word in current texts.
>>
>> M.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>
>


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: bigram analysis

Posted by Grant Ingersoll <gs...@apache.org>.

Yep, still good reasons like I said, but becoming less important as  
the hardware, etc. gets faster and cheaper, IMO, especially in the  
context of more advanced search capabilities.

On Mar 3, 2008, at 10:49 AM, Mathieu Lecarme wrote:

>
>> Not sure, you might want to ask on Nutch.  From a strict language  
>> standpoint, the notion of a stopword in my mind is a bit dubious.   
>> If the word really has no meaning, then why does the language have  
>> it to begin with?  In a search context, it has been treated as of  
>> minimal use in the early days mostly because of space and memory  
>> considerations.  Now a days, as we get more sophisticated in our  
>> search capabilities, I think it can be useful for doing better  
>> phrase matching, etc. as well as more advanced NLP search.  Now it  
>> seems like the general response is disk is cheap, why throw away  
>> information?
> To limit writing on disk, to simplify merge ?
>
> I don't know the ratio of stop word in current texts.
>
> M.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: bigram analysis

Posted by Mathieu Lecarme <ma...@garambrogne.net>.

> Not sure, you might want to ask on Nutch.  From a strict language 
> standpoint, the notion of a stopword in my mind is a bit dubious.  If 
> the word really has no meaning, then why does the language have it to 
> begin with?  In a search context, it has been treated as of minimal 
> use in the early days mostly because of space and memory 
> considerations.  Now a days, as we get more sophisticated in our 
> search capabilities, I think it can be useful for doing better phrase 
> matching, etc. as well as more advanced NLP search.  Now it seems like 
> the general response is disk is cheap, why throw away information?
To limit writing on disk, to simplify merge ?

I don't know the ratio of stop word in current texts.

M.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: bigram analysis

Posted by Grant Ingersoll <gs...@apache.org>.

On Mar 3, 2008, at 5:40 AM, John Byrne wrote:

> Hi,
>
> I need to use stop-word bigrams, liike the Nutch analyzer, as  
> described in LIA 4.8 (Nutch Analysis). What I don't understand is,  
> why does it keep the original stop word intact? I can see great  
> advantage to being able to search for a combination of stop word +  
> real word, but I don't see the point of keeping the stop word as a  
> token on it's own. Searches with just that word would be as  
> pointless as ever.

I don't know, Google allows for stopword searches.  Just try "the" as  
a query (although it is kind of funny what the results are: "The  
Onion" is the top use of the in the world?  And it is even more  
curious that people actually bought ads for the word "the", but that  
is a digression).  I don't exactly know Nutch's analyzer, but it could  
be that it helps with phrases.  I suppose one would have to look at  
Nutch's query parser as well to get a sense of how they are used.

>
>
> Is the idea to allow searching on all stop words, even on their own,  
> and the bigrams are just an optimization that will improve things  
> 90% of the time? Or is it just a side effect of the bigram analyzer  
> that it produces a token from the stop word, and therefore it could  
> just be filtered out by a stop word filter afterwards, leaving only  
> the bigram and the original (non-stop) word?

Not sure, you might want to ask on Nutch.  From a strict language  
standpoint, the notion of a stopword in my mind is a bit dubious.  If  
the word really has no meaning, then why does the language have it to  
begin with?  In a search context, it has been treated as of minimal  
use in the early days mostly because of space and memory  
considerations.  Now a days, as we get more sophisticated in our  
search capabilities, I think it can be useful for doing better phrase  
matching, etc. as well as more advanced NLP search.  Now it seems like  
the general response is disk is cheap, why throw away information?

>
>
> I'm sure either way would work fr me - just wondering what is  
> normally done, and if I'm missing something important here...

"It depends".  I think most Lucene users just use the generally held  
assumption that you should remove stopwords, but I am not sure.  At a  
minimum, I think the answer is it depends on the application.  If you  
want to do what you describe above, I would keep them.  In the end,  
the IDF factor should handle the commonality of them quite nicely so  
as any use of them as a general term (and not part of a phrase) will  
not affect relevance all that much.

-Grant

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org