Posted to dev@lucene.apache.org by robert engels <re...@ix.netcom.com> on 2007/12/09 18:46:06 UTC

RE: Phrase Query Performance Question and score threshold

This subject brings up an interesting idea.

I question the value of any search that returns 100k-200k hits. What  
is the point?

The question then becomes: when is such a broad query relevant? It
seems that it is only relevant when combined with other terms.

For example, I search for "hurricane katrina" and I get 100k-200k
hits. Anything beyond the top 1000 or so is probably irrelevant.

But, you still need to search/score those hits in order to find the  
top hits.

But if I search for both "hurricane katrina" and "president bush",
maybe I only get 1000 documents, and possibly a far different set than
the top 1000 when searching only on "hurricane katrina".

It seems that an efficient fix for this would be to add a "relevancy
bit" to each document in the posting for the term. It is basically a
single-bit norm per document and term.
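
To make the storage side concrete, here is a rough sketch of a postings
list that carries one extra bit per entry. The class and method names
are made up for illustration and are not Lucene APIs; packing the flag
into the low bit of the doc delta is just one assumed encoding, and how
the bit gets chosen at index time is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: postings for one term, each with a one-bit relevancy flag. */
final class RelevancyPostings {
    // Each entry packs (docDelta << 1) | relevancyBit, so the flag costs one bit per posting.
    private final List<Integer> packed = new ArrayList<>();
    private int lastDoc = 0;

    /** Appends a posting; doc ids must be non-negative and strictly increasing. */
    void add(int docId, boolean relevant) {
        if (docId < 0 || (!packed.isEmpty() && docId <= lastDoc)) {
            throw new IllegalArgumentException("doc ids must be added in increasing order");
        }
        int delta = docId - lastDoc;
        packed.add((delta << 1) | (relevant ? 1 : 0));
        lastDoc = docId;
    }

    /** Decodes the doc ids, optionally keeping only postings whose bit is set. */
    List<Integer> docIds(boolean relevantOnly) {
        List<Integer> out = new ArrayList<>();
        int doc = 0;
        for (int code : packed) {
            doc += code >>> 1;                 // undo the delta encoding
            boolean relevant = (code & 1) == 1; // low bit is the relevancy flag
            if (!relevantOnly || relevant) {
                out.add(doc);
            }
        }
        return out;
    }
}
```

Decoding with relevantOnly set to true yields only the flagged
documents, which is the view a query would see while skipping is on.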

When a query is run, it ignores any document without the relevancy  
bit set for that document/term in skipTo(), and sets a flag that  
documents were skipped.

If the query completes without finding the requested number of
documents, and documents were skipped, the query is rerun without the
skipping. Also, if during scoring it looks as if the requested number
of documents will not be reached, the query can disable the skipping at
that point, and re-enable it once the number is reached. In order to
make this work, you should score on the least frequent terms first. The
query would also only check the 'relevancy bit' for high frequency
terms.
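
The skip-then-rerun part could be sketched roughly as below. This is
only an illustration under simplifying assumptions: a single term, an
in-memory list instead of a real posting stream, and invented names
(Posting, collect, topDocs) that are not Lucene classes or methods. It
covers only the full rerun, not the mid-query toggling or the
least-frequent-term ordering.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a top-N search that honors a per-posting relevancy bit. */
final class RelevancySkipSearch {
    /** One posting: a document id, its score for the term, and the relevancy bit. */
    static final class Posting {
        final int doc;
        final float score;
        final boolean relevant;
        Posting(int doc, float score, boolean relevant) {
            this.doc = doc;
            this.score = score;
            this.relevant = relevant;
        }
    }

    /** Outcome of one pass: the collected postings and whether any were skipped. */
    static final class Pass {
        final List<Posting> hits;
        final boolean skipped;
        Pass(List<Posting> hits, boolean skipped) {
            this.hits = hits;
            this.skipped = skipped;
        }
    }

    /** Collects postings; when honorBit is true, unflagged postings are skipped. */
    static Pass collect(List<Posting> postings, boolean honorBit) {
        List<Posting> hits = new ArrayList<>();
        boolean skipped = false;
        for (Posting p : postings) {
            if (honorBit && !p.relevant) {
                skipped = true; // remember that the result may be incomplete
                continue;
            }
            hits.add(p);
        }
        return new Pass(hits, skipped);
    }

    /** Returns up to n doc ids by descending score, rerunning without skipping if short. */
    static List<Integer> topDocs(List<Posting> postings, int n) {
        Pass pass = collect(postings, true);
        if (pass.hits.size() < n && pass.skipped) {
            pass = collect(postings, false); // fallback: rerun with skipping disabled
        }
        List<Posting> sorted = new ArrayList<>(pass.hits);
        sorted.sort((a, b) -> Float.compare(b.score, a.score));
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            out.add(sorted.get(i).doc);
        }
        return out;
    }
}
```

Note that in this sketch an unflagged document is never returned while
the flagged ones are enough to fill the request, even if it would have
scored higher, so the quality of the result depends entirely on how
the bit is assigned.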

Has anyone implemented something like this? Thoughts on this?

