Posted to java-user@lucene.apache.org by qaz zaq <fo...@yahoo.com> on 2006/10/03 18:50:24 UTC

Lucene scoring question (how to boost leading terms match)

Hi,
   
  I have a question about Lucene scoring. In the following example, how can I ensure that doc1 gets a higher score than doc2 if I search for "A*"? In other words, I want to boost the docs that match on their leading terms.
   
  doc1: Aterm  Bterm  Cterm
  doc2: Bterm  Aterm  Cterm


Re: Lucene scoring question (how to boost leading terms match)

Posted by Chris Hostetter <ho...@fucit.org>.

Take a look at SpanFirstQuery ... to do "A*" type searches inside of a
SpanQuery you'll either need to use SpanRegexQuery, or roll your own
SpanPrefixQuery out of a SpanOrQuery containing SpanTermQueries.
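
For illustration, here is a toy model in plain Java of what that combination would do: expand the prefix into the matching terms (the SpanOrQuery part) and accept a document only if one of them occurs before a given end position (the SpanFirstQuery part). This is not Lucene code; the class and method names are hypothetical, and the sorted set merely stands in for the index's term dictionary.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model (not Lucene code) of a prefix search restricted to the first positions. */
final class PrefixFirstSketch {

    /** Expands a prefix such as "A" into every dictionary term that starts with it. */
    static List<String> expandPrefix(SortedSet<String> dictionary, String prefix) {
        List<String> matches = new ArrayList<>();
        // tailSet starts at the first term >= prefix; matching terms are contiguous,
        // so we can stop at the first term that no longer starts with the prefix.
        for (String term : dictionary.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break;
            }
            matches.add(term);
        }
        return matches;
    }

    /** True if any expanded term occurs at a position strictly below end. */
    static boolean matchesFirst(String[] docTokens, List<String> terms, int end) {
        int limit = Math.min(end, docTokens.length);
        for (int pos = 0; pos < limit; pos++) {
            if (terms.contains(docTokens[pos])) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        SortedSet<String> dictionary =
                new TreeSet<>(Arrays.asList("Aterm", "Bterm", "Cterm"));
        List<String> expanded = expandPrefix(dictionary, "A");
        String[] doc1 = {"Aterm", "Bterm", "Cterm"};
        String[] doc2 = {"Bterm", "Aterm", "Cterm"};
        System.out.println("doc1 matches: " + matchesFirst(doc1, expanded, 1));
        System.out.println("doc2 matches: " + matchesFirst(doc2, expanded, 1));
    }
}
```

With an end of 1 only doc1 is accepted; with a larger end both are accepted, and this model says nothing about how they would be ranked.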

:   I have a question about Lucene scoring. In the following example,
: how can I ensure that doc1 gets a higher score than doc2 if I search
: for "A*"? In other words, I want to boost the docs that match on their
: leading terms.
:
:   doc1: Aterm  Bterm  Cterm
:   doc2: Bterm  Aterm  Cterm



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Lucene scoring question (how to boost leading terms match)

Posted by Chris Hostetter <ho...@fucit.org>.

: does not factor match position into the score - i.e. both doc1 and doc2
: in your example would get the same score, and the SpanFirstQuery would only
: allow you to limit the set of returned documents - Hoss, do you agree with
: this?

Oh ... hmmm ... i think you're right.  SpanScorer scores smaller spans
higher, and I keep expecting SpanFirstQuery to create Spans that include
the "0" position, but you just reminded me that it doesn't (i discovered
that when writing some test cases for SpanScorer.explain) ... it just
returns the Span of the nested query.

I recall thinking that it would be really easy to add a boolean arg to
SpanFirstQuery that would make the Span start at 0 to change that behavior
... but i never tried it because i don't personally use Spans so i
couldn't really judge if it was worthwhile/effective ... not to mention i
had bigger fish to fry.
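
To make the difference concrete, here is a toy model in plain Java, not Lucene code. It assumes each span is weighted by 1 / (length + 1), which is my understanding of how the default sloppyFreq behaves; the class name, the method names, and the startAtZero flag are all hypothetical.

```java
/** Toy model (not Lucene code) of why the span's start position matters for scoring. */
final class SpanLengthSketch {

    /** Weight that shrinks as the span grows; the 1 / (length + 1) curve is an assumption. */
    static float spanWeight(int start, int end) {
        int length = end - start;
        return 1.0f / (length + 1);
    }

    /**
     * Scores a single term match at position pos. The span is either just the
     * term itself (the nested query's span) or stretched back to position 0.
     */
    static float score(int pos, boolean startAtZero) {
        int start = startAtZero ? 0 : pos;
        return spanWeight(start, pos + 1);
    }

    public static void main(String[] args) {
        int doc1Pos = 0; // doc1: Aterm Bterm Cterm
        int doc2Pos = 1; // doc2: Bterm Aterm Cterm
        System.out.println("nested span:  doc1=" + score(doc1Pos, false)
                + " doc2=" + score(doc2Pos, false));
        System.out.println("span from 0:  doc1=" + score(doc1Pos, true)
                + " doc2=" + score(doc2Pos, true));
    }
}
```

With the nested span both docs get the same weight, because the span is one position long wherever the term sits. With the span stretched back to 0, doc2's span is longer, so doc1 comes out ahead.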



-Hoss



Re: Lucene scoring question (how to boost leading terms match)

Posted by Doron Cohen <DO...@il.ibm.com>.

If I understand the question, you do not want to boost in advance a certain
doc, but rather score higher those documents containing the search term
closer to the start of the document.

There is more to define here - for instance, if doc1 has 5 words but doc2
has 1,000,000 words, would you still prefer doc1? There is a field norm
factor in Lucene that assigns higher scores to matches in shorter fields -
would you like to override this as well?
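
To give a feel for the size of that factor, here is a small plain-Java calculation. It assumes the default length norm of 1 / sqrt(number of terms in the field) and ignores both field boosts and the lossy one-byte encoding of stored norms; the class and method names are hypothetical.

```java
/** Plain-Java arithmetic sketch of a 1 / sqrt(numTerms) length norm (not Lucene code). */
final class LengthNormSketch {

    /** Length norm under the assumed 1 / sqrt(numTerms) formula. */
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    public static void main(String[] args) {
        System.out.println("5 words:         " + lengthNorm(5));
        System.out.println("1,000,000 words: " + lengthNorm(1_000_000));
    }
}
```

That is roughly 0.447 for the 5-word doc against 0.001 for the 1,000,000-word doc, so under this assumption the short doc is already strongly favored before term position is considered at all.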

To your question, I can think of these possibilities:

  (1) Write your own query, with a scorer that scores based on term
position (possibly with some relation to field length). This is not
straightforward, and I'm not sure this is the solution you were hoping for.

  (2) Use SpanFirstQuery - something like: new SpanFirstQuery(new
SpanTermQuery(new Term("fieldName","word")),8) - as Hoss suggested. But I
think that here again you would need to modify the scorer to score first
matches higher, because as far as I can see the SpanScorer in use there
does not factor match position into the score - i.e. both doc1 and doc2
in your example would get the same score, and the SpanFirstQuery would only
allow you to limit the set of returned documents - Hoss, do you agree with
this?

  (3) When adding the documents to the index, add a special <doc-start>
token to each document - for instance by prepending this special token to
the text of the indexed document's field. Then use a Lucene query that
gives higher scores to terms that are not "too" far apart, for instance a
PhraseQuery with a slop factor greater than 0.
Lastly, modify the query
    ABC
to a phrase query:
    "<doc-start> ABC"
with a slop factor that suits your needs.
One problem I see with this is that all the documents in your index would
have this token.
Another problem is that I don't think prefix queries (e.g. A*) are supported
in a phrase, so you would need to extend it a bit.
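
The idea behind (3) can be sketched with a toy model in plain Java, not Lucene code. It prepends the marker token, finds the first token that starts with the prefix, and weights the match by 1 / (distance + 1), where distance is how far the term sits from the slot right after the marker. The weighting curve, the class name and the method name are assumptions made for illustration only.

```java
/** Toy model (not Lucene code) of the doc-start marker plus sloppy phrase idea. */
final class DocStartSketch {

    static final String DOC_START = "<doc-start>";

    /**
     * Scores a doc for the sloppy phrase "<doc-start> prefix*".
     * Returns 0 when no term matches or the nearest match needs more than slop moves.
     */
    static float score(String[] docTokens, String prefix, int slop) {
        // Index-time step: prepend the marker so it sits at position 0.
        String[] padded = new String[docTokens.length + 1];
        padded[0] = DOC_START;
        System.arraycopy(docTokens, 0, padded, 1, docTokens.length);

        for (int pos = 1; pos < padded.length; pos++) {
            if (padded[pos].startsWith(prefix)) {
                // Distance 0 means the term directly follows the marker.
                int distance = pos - 1;
                return distance <= slop ? 1.0f / (distance + 1) : 0.0f;
            }
        }
        return 0.0f;
    }

    public static void main(String[] args) {
        String[] doc1 = {"Aterm", "Bterm", "Cterm"};
        String[] doc2 = {"Bterm", "Aterm", "Cterm"};
        System.out.println("doc1: " + score(doc1, "A", 5));
        System.out.println("doc2: " + score(doc2, "A", 5));
    }
}
```

In this model doc1 scores higher than doc2 for any slop of 1 or more, and with a slop of 0 only doc1 matches. The prefix test inside the loop stands in for the extension a real phrase query would need to support A*.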

Hope this helps,
Doron

qaz zaq <fo...@yahoo.com> wrote on 03/10/2006 09:50:24:

> Hi,
>
>   I have a question about Lucene scoring. In the following
> example, how can I ensure that doc1 gets a higher score than doc2
> if I search for "A*"? In other words, I want to boost the docs
> that match on their leading terms.
>
>   doc1: Aterm  Bterm  Cterm
>   doc2: Bterm  Aterm  Cterm

