Posted to dev@lucene.apache.org by Doron Cohen <DO...@il.ibm.com> on 2006/09/26 03:52:02 UTC

highlight - scoring fragments with more of the same token

This question was raised on the user list -
http://www.nabble.com/highlighting-tf2322109.html

Assume three fragments and two queries:
  f1 = aa  11  bb  33  cc
  f2 = aa  11  bb  11  cc
  f3 = aa  11  bb  22  cc
  q1 = 11 22
  q2 = 11
Now we call highlighter.getBestFragment(q);
For q1, f3 is returned, as expected.
For q2, f1 is returned, although "11" appears twice in f2 but only once in
f1.

This is because QueryScorer.getTokenScore(Token) counts only unique
fragment tokens.
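
A minimal sketch of the behavior described above, written as a toy model and not as the actual QueryScorer code. The class name UniqueTermScoring, whitespace tokenization, and the "first fragment wins on a tie" rule are assumptions made for illustration; only the idea of counting each query term once per fragment comes from the message.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy model of unique-term fragment scoring (hypothetical, not Lucene code). */
public class UniqueTermScoring {

    /** Counts how many distinct query terms occur in the fragment. */
    public static int score(String fragment, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>();
        for (String token : fragment.trim().split("\\s+")) {
            if (queryTerms.contains(token)) {
                seen.add(token); // a repeated term is only counted once
            }
        }
        return seen.size();
    }

    /** Returns the index of the first fragment with the highest score. */
    public static int best(List<String> fragments, Set<String> queryTerms) {
        int bestIndex = 0;
        int bestScore = -1;
        for (int i = 0; i < fragments.size(); i++) {
            int s = score(fragments.get(i), queryTerms);
            if (s > bestScore) { // strict '>' keeps the earliest fragment on ties
                bestScore = s;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "aa  11  bb  33  cc",   // f1
                "aa  11  bb  11  cc",   // f2
                "aa  11  bb  22  cc");  // f3
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        Set<String> q2 = new HashSet<>(Arrays.asList("11"));
        System.out.println("q1 -> f" + (best(fragments, q1) + 1)); // f3
        System.out.println("q2 -> f" + (best(fragments, q2) + 1)); // f1
    }
}
```

Under these assumptions all three fragments score 1 for q2, so the tie falls to f1, which matches the result reported above.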

Would it make sense to make this behavior controllable?
(It is easily done but I am not sure about the consequences.)

Or perhaps there is a way to achieve this behavior (preferring f2 over f1 for
q2 above) that I missed?



---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: highlight - scoring fragments with more of the same token

Posted by Chris Hostetter <ho...@fucit.org>.

: TF is not a factor in fragment scores because I found it's typically more
: useful to look for fragments containing a strong mix of the query terms
: - not merely repetitions of the same term. The idea is the choice of
: scorer is pluggable if you don't like the default behaviour.

Taking a "coord" factor into consideration in that case may help balance
out the benefits of tf weighting vs mixed terms.  (Maybe the default
highlighting options already do that, I'm not sure ... just tossing it out
as a comment from the peanut gallery.)
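
One way to read the "coord" suggestion, as a rough sketch only: count every occurrence of a query term (a raw tf), then multiply by the fraction of distinct query terms the fragment contains. The class name CoordScoring, the raw-count tf, and whitespace tokenization are assumptions for illustration; this is not what the Highlighter does by default.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical tf-times-coord fragment score (illustration only). */
public class CoordScoring {

    /** Sum of query-term occurrences, scaled by the share of query terms matched. */
    public static double score(String fragment, Set<String> queryTerms) {
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        int occurrences = 0;
        Set<String> distinct = new HashSet<>();
        for (String token : fragment.trim().split("\\s+")) {
            if (queryTerms.contains(token)) {
                occurrences++;       // tf part: every occurrence counts
                distinct.add(token); // coord part: which query terms were seen
            }
        }
        double coord = (double) distinct.size() / queryTerms.size();
        return occurrences * coord;
    }

    public static void main(String[] args) {
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        Set<String> q2 = new HashSet<>(Arrays.asList("11"));
        String f1 = "aa  11  bb  33  cc";
        String f2 = "aa  11  bb  11  cc";
        String f3 = "aa  11  bb  22  cc";
        System.out.println("q1: f1=" + score(f1, q1) + " f2=" + score(f2, q1) + " f3=" + score(f3, q1));
        System.out.println("q2: f1=" + score(f1, q2) + " f2=" + score(f2, q2) + " f3=" + score(f3, q2));
    }
}
```

With the fragments from the original question this ranks f3 first for q1 and f2 first for q2. Coord alone is not a guarantee, though: with a raw tf, a fragment that repeats one term often enough can still outscore a fragment containing both terms, so the tf part would likely need damping as well.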



-Hoss



Re: highlight - scoring fragments with more of the same token

Posted by markharw00d <ma...@yahoo.co.uk>.

>>I was somewhat surprised to find that highlighting scoring simply counts
>>how many unique query terms appear in the fragment. Guess was expecting a

See QueryScorer(Query query, IndexReader reader, String fieldName) constructor - this will factor IDF into weighting for terms. Query boosts are automatically factored in too.
TF is not a factor in fragment scores because I found it's typically more useful to look for fragments containing a strong mix of the query terms - not merely repetitions of the same term. The idea is that the choice of scorer is pluggable if you don't like the default behaviour.

The possibility of adding smarter fragmenting is also enabled by the interface for Fragmenter - no "smarter" alternatives to the simple one have been implemented as yet though (as far as I am aware).

Cheers
Mark




		

Re: highlight - scoring fragments with more of the same token

Posted by Doron Cohen <DO...@il.ibm.com>.

markharw00d <ma...@yahoo.co.uk> wrote on 26/09/2006 00:11:12:
> If you were to score repeated terms then I suspect it would have to be
> done so that the repetitions didn't score as highly as the first
> occurrence - otherwise f2 could be selected as a better fragment than f3
> for the query q1 in your example.
> Repetitions of a term in a fragment could be scored as a very small
> fraction of the score given to the first occurrence. This would at least
> rank  f2 higher than f1 for query q2.
> Another potentially useful ranking factor may be to boost fragments
> found at the beginning of a document - that's where people tend to write
> summaries or introductions.

Yes, it makes sense to add these heuristics.

I was somewhat surprised to find that highlighting scoring simply counts
how many unique query terms appear in the fragment. I guess I was expecting
a more Similarity-like ranking of fragments - something that would perhaps
have tf related to the frequency of a term in a fragment, and idf related
to the frequency of the term in the entire text. Idf would be meaningless
for a single-term query. Possibly, idf could relate to "iff" ~ inverse
number of fragments containing the term. I am not sure if this is worth the
effort, but it seems more correct...?
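
A rough sketch of what such a tf/"iff" ranking could look like. The formulas are assumptions chosen for illustration: tf is the square root of the term count in the fragment, and iff is 1 + ln(N / (n_t + 1)), where N is the number of fragments and n_t the number of fragments containing the term. The class name FragmentTfIff and whitespace tokenization are likewise hypothetical.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical tf * iff fragment scoring (iff = inverse fragment frequency). */
public class FragmentTfIff {

    static List<String> tokens(String fragment) {
        return Arrays.asList(fragment.trim().split("\\s+"));
    }

    /** Returns one score per fragment, in the same order as the input list. */
    public static double[] scores(List<String> fragments, Set<String> queryTerms) {
        int n = fragments.size();

        // Count, for each query term, how many fragments contain it.
        Map<String, Integer> fragFreq = new HashMap<>();
        for (String fragment : fragments) {
            Set<String> distinct = new HashSet<>(tokens(fragment));
            for (String term : queryTerms) {
                if (distinct.contains(term)) {
                    fragFreq.merge(term, 1, Integer::sum);
                }
            }
        }

        double[] result = new double[n];
        for (int i = 0; i < n; i++) {
            List<String> toks = tokens(fragments.get(i));
            double score = 0.0;
            for (String term : queryTerms) {
                int tf = Collections.frequency(toks, term);
                if (tf == 0) {
                    continue; // term absent from this fragment
                }
                // Assumed formula: rarer terms (fewer fragments) weigh more.
                double iff = 1.0 + Math.log((double) n / (fragFreq.get(term) + 1));
                score += Math.sqrt(tf) * iff;
            }
            result[i] = score;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> fragments = Arrays.asList(
                "aa  11  bb  33  cc", "aa  11  bb  11  cc", "aa  11  bb  22  cc");
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        Set<String> q2 = new HashSet<>(Arrays.asList("11"));
        System.out.println("q1: " + Arrays.toString(scores(fragments, q1)));
        System.out.println("q2: " + Arrays.toString(scores(fragments, q2)));
    }
}
```

For the single-term query q2 the iff factor is the same constant for every fragment, so only tf separates them, which is consistent with the remark that idf is meaningless for a single-term query.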

Another thing I saw is that Highlighter seems to break the text arbitrarily
by max-fragment-size, so for the text:
  1 2 x 4 a b x d y B C D
if it happens to be broken into four-token fragments, then for the query
"x y" the result would be:
  1 2 x 4 - score 1
  a b x d - score 1
  y B C D - score 1
and the first fragment would be selected as 'best', although the fragment
"x d y B" that appears in that text is better. Again, I am not sure if this
is worth the effort - allowing overlap between candidate fragments - just
something to think about.
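
The overlapping-candidates idea can be sketched with a sliding window over the token stream. This is an illustration under assumptions, not a Fragmenter implementation: the class name SlidingWindowFragments, the step parameter, and the unique-term score are all hypothetical. A step equal to the window size reproduces the fixed chunks above; a step of 1 considers every overlapping window.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sliding-window candidate selection (illustration only). */
public class SlidingWindowFragments {

    /** Number of distinct query terms in tokens[from, to). */
    static int uniqueHits(String[] tokens, int from, int to, Set<String> queryTerms) {
        Set<String> seen = new HashSet<>();
        for (int i = from; i < to; i++) {
            if (queryTerms.contains(tokens[i])) {
                seen.add(tokens[i]);
            }
        }
        return seen.size();
    }

    /** Start index of the earliest best-scoring window of the given size. */
    public static int bestStart(String[] tokens, int size, int step, Set<String> queryTerms) {
        if (size < 1 || step < 1) {
            throw new IllegalArgumentException("size and step must be >= 1");
        }
        int best = 0;
        int bestScore = -1;
        for (int start = 0; start < tokens.length; start += step) {
            int end = Math.min(start + size, tokens.length);
            int s = uniqueHits(tokens, start, end, queryTerms);
            if (s > bestScore) { // strict '>' keeps the earliest window on ties
                bestScore = s;
                best = start;
            }
            if (end == tokens.length) {
                break; // the window already reaches the end of the text
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] tokens = "1 2 x 4 a b x d y B C D".split(" ");
        Set<String> query = new HashSet<>(Arrays.asList("x", "y"));
        int fixed = bestStart(tokens, 4, 4, query);   // non-overlapping chunks
        int sliding = bestStart(tokens, 4, 1, query); // overlapping windows
        System.out.println(String.join(" ", Arrays.copyOfRange(tokens, fixed, fixed + 4)));
        System.out.println(String.join(" ", Arrays.copyOfRange(tokens, sliding, sliding + 4)));
    }
}
```

With a step of 1 the sketch selects "b x d y", which contains both query terms and scores the same as the "x d y B" window mentioned above. The cost is scoring roughly one candidate per token instead of one per fragment.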



Re: highlight - scoring fragments with more of the same token

Posted by markharw00d <ma...@yahoo.co.uk>.

If you were to score repeated terms then I suspect it would have to be 
done so that the repetitions didn't score as highly as the first 
occurrence - otherwise f2 could be selected as a better fragment than f3 
for the query q1 in your example.
Repetitions of a term in a fragment could be scored as a very small 
fraction of the score given to the first occurrence. This would at least 
rank  f2 higher than f1 for query q2.
Another potentially useful ranking factor may be to boost fragments 
found at the beginning of a document - that's where people tend to write 
summaries or introductions.
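
The damped-repetition idea can be sketched as follows. This is a minimal illustration, not a proposed patch: the class name DampedRepetitionScoring, the repeatWeight parameter, and the value 0.01 are assumptions. The first occurrence of each query term scores 1.0 and every further occurrence scores only repeatWeight.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical scoring where repeats count as a small fraction (illustration only). */
public class DampedRepetitionScoring {

    /** First occurrence of a query term scores 1.0, each repeat scores repeatWeight. */
    public static double score(String fragment, Set<String> queryTerms, double repeatWeight) {
        Set<String> seen = new HashSet<>();
        double total = 0.0;
        for (String token : fragment.trim().split("\\s+")) {
            if (!queryTerms.contains(token)) {
                continue;
            }
            // Set.add returns true only the first time the term is seen.
            total += seen.add(token) ? 1.0 : repeatWeight;
        }
        return total;
    }

    public static void main(String[] args) {
        Set<String> q1 = new HashSet<>(Arrays.asList("11", "22"));
        Set<String> q2 = new HashSet<>(Arrays.asList("11"));
        String f1 = "aa  11  bb  33  cc";
        String f2 = "aa  11  bb  11  cc";
        String f3 = "aa  11  bb  22  cc";
        double w = 0.01; // assumed weight for repeated occurrences
        System.out.println("q1: f2=" + score(f2, q1, w) + " f3=" + score(f3, q1, w));
        System.out.println("q2: f1=" + score(f1, q2, w) + " f2=" + score(f2, q2, w));
    }
}
```

With a small weight, f2 ranks above f1 for q2 while f3 still ranks above f2 for q1. That ordering only holds as long as the repeats in a fragment add up to less than one first occurrence, so the weight has to be small relative to the fragment size.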

