Posted to dev@lucene.apache.org by Pierre GOSSE <pi...@arisem.com> on 2011/01/17 16:26:46 UTC
Highlighting overlapping tokens
Hi all,
I'm having an issue when highlighting fields that have overlapping tokens. A bug was opened in Jira some years ago, https://issues.apache.org/jira/browse/LUCENE-627, but I'm a bit confused about it: its status in Jira is "resolved", yet I still get the exact same problem with an unmodified Lucene 2.9.3.
Looking into what was going on, I checked org.apache.lucene.search.highlight.TokenSources, which rebuilds a token stream from term vectors, and found that the tokens were not sorted by offset as one would expect.
When sorting tokens, the following comparator is used:
public int compare(Object o1, Object o2)
{
    Token t1=(Token) o1;
    Token t2=(Token) o2;
    if(t1.startOffset()>t2.endOffset())
        return 1;
    if(t1.startOffset()<t2.startOffset())
        return -1;
    return 0;
}
I'm not sure why endOffset is used instead of startOffset in the first test (it looks like a typo). With non-overlapping tokens this works just fine.
But with overlapping tokens, the longest tokens get pushed to the end of their "overlapping zone": (big,3,6), (fish,7,11), ({big fish},3,11) end up sorted in this exact order, whereas I would have expected (big,3,6) ({big fish},3,11) (fish,7,11) or ({big fish},3,11) (big,3,6) (fish,7,11).
Highlighting with the term "{big fish}" builds a fragment by concatenating "big", "{big fish}", and "fish", giving this phrase: "big<em>big fish</em> fish".
I tested a quick fix by changing the preceding comparator like this:
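One way to see why the original comparator misbehaves on overlapping tokens is to call it in both directions on the same pair. The sketch below is only an illustration: Tok is a hypothetical stand-in for Lucene's Token that keeps just the text and the two offsets, and the comparator body copies the logic quoted above.

```java
import java.util.Comparator;

public class OriginalComparatorCheck {
    // Hypothetical stand-in for Lucene's Token (assumption: only offsets matter here).
    public static final class Tok {
        public final String text;
        public final int start;
        public final int end;

        public Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Same tests as the comparator quoted from TokenSources, typed for Tok.
    public static final Comparator<Tok> ORIGINAL = new Comparator<Tok>() {
        public int compare(Tok t1, Tok t2) {
            if (t1.start > t2.end)
                return 1;
            if (t1.start < t2.start)
                return -1;
            return 0;
        }
    };

    public static void main(String[] args) {
        Tok fish = new Tok("fish", 7, 11);
        Tok bigFish = new Tok("{big fish}", 3, 11);

        // 7 > 11 is false and 7 < 3 is false, so the pair is reported as equal.
        System.out.println(ORIGINAL.compare(fish, bigFish));   // prints 0
        // 3 > 11 is false but 3 < 7 is true, so the reverse call reports "less than".
        System.out.println(ORIGINAL.compare(bigFish, fish));   // prints -1
    }
}
```

The two calls disagree (0 one way, -1 the other), which breaks the java.util.Comparator requirement that swapping the arguments flips the sign of the result. With such a comparator the final order can depend on the input order and on the sort algorithm, which would be consistent with the ordering described above.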
public int compare(Object o1, Object o2)
{
    Token t1 = (Token)o1;
    Token t2 = (Token)o2;
    // Order by start offset first.
    if (t1.startOffset() > t2.startOffset())
        return 1;
    if (t1.startOffset() < t2.startOffset())
        return -1;
    // Same start offset: the shorter token (smaller end offset) comes first.
    if (t1.endOffset() < t2.endOffset())
        return -1;
    if (t1.endOffset() > t2.endOffset())
        return 1;
    return 0;
}
Highlight behavior is now correct as far as I tested it.
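As a small self-contained check of the ordering this comparator produces, here is a sketch that sorts the three example tokens by start offset, then by end offset. Tok is a hypothetical stand-in for Lucene's Token, and the class and method names are made up for the illustration; it does not exercise the highlighter itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class OffsetSortSketch {
    // Hypothetical stand-in for Lucene's Token (assumption: only offsets matter here).
    public static final class Tok {
        public final String text;
        public final int start;
        public final int end;

        public Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    // Start offset ascending; ties broken by end offset ascending.
    public static final Comparator<Tok> BY_START_THEN_END = new Comparator<Tok>() {
        public int compare(Tok t1, Tok t2) {
            if (t1.start != t2.start)
                return t1.start < t2.start ? -1 : 1;
            if (t1.end != t2.end)
                return t1.end < t2.end ? -1 : 1;
            return 0;
        }
    };

    // Returns a sorted copy so the caller's list is left untouched.
    public static List<Tok> sorted(List<Tok> tokens) {
        List<Tok> copy = new ArrayList<Tok>(tokens);
        Collections.sort(copy, BY_START_THEN_END);
        return copy;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                new Tok("big", 3, 6),
                new Tok("fish", 7, 11),
                new Tok("{big fish}", 3, 11));
        // prints [(big,3,6), ({big fish},3,11), (fish,7,11)]
        System.out.println(sorted(tokens));
    }
}
```

Because this comparison is a consistent total order on (startOffset, endOffset), the result no longer depends on the order in which the tokens arrive.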
Maybe the original sort order has a purpose I don't understand, but to me this slight modification seems to fix everything. What should I do? (I'm very new to this list and this community.)
If someone with a better understanding of the Lucene highlighter could give me some feedback, I would be grateful.
Thanks for your time.
Pierre