Posted to dev@lucene.apache.org by Pierre GOSSE <pi...@arisem.com> on 2011/01/17 16:26:46 UTC

Highlighting overlapping tokens

Hi all,

I'm having an issue when highlighting fields that have overlapping tokens. A bug was opened in Jira some years ago (https://issues.apache.org/jira/browse/LUCENE-627), but I'm a bit confused about it: in Jira the bug's status is "resolved", yet I still get the exact same problem with an unmodified Lucene 2.9.3.

Looking into what was going on, I checked org.apache.lucene.search.highlight.TokenSources, which rebuilds a token stream from term vectors, and I found that the tokens were not sorted by offset as one would expect.

When sorting tokens, the following comparator is used:

	public int compare(Object o1, Object o2)
	{
		Token t1=(Token) o1;
		Token t2=(Token) o2;
		if(t1.startOffset()>t2.endOffset())
			return 1;
		if(t1.startOffset()<t2.startOffset())
			return -1;
		return 0;
	}

I'm not sure why endOffset is used instead of startOffset in the first test (it looks like a typo). With non-overlapping tokens this works just fine.

But with overlapping tokens, the longest tokens get pushed to the end of their "overlapping zone": (big,3,6), (fish,7,11), ({big fish},3,11) end up sorted in this exact order, where I would have expected (big,3,6) ({big fish},3,11) (fish,7,11) or ({big fish},3,11) (big,3,6) (fish,7,11).
Highlighting with the term "{big fish}" then builds a fragment by concatenating "big", "{big fish}", and "fish", giving this phrase: "big<em>big fish</em> fish".
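
To illustrate why this comparator can misorder overlapping tokens, here is a minimal sketch that runs outside of Lucene. The Tok class is a hypothetical stand-in for Lucene's Token (text plus start and end offsets only), and ORIGINAL copies the logic of the comparator quoted above. It shows that the comparison is not symmetric for (fish,7,11) and ({big fish},3,11), so the result of a sort can depend on the sort algorithm and on the input order.

```java
import java.util.Comparator;

class OverlapComparatorCheck {
    // Hypothetical stand-in for Lucene's Token: text and offsets only.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Same logic as the comparator quoted above: the first test
    // compares a start offset against an end offset.
    static final Comparator<Tok> ORIGINAL = (t1, t2) -> {
        if (t1.start > t2.end) return 1;
        if (t1.start < t2.start) return -1;
        return 0;
    };

    public static void main(String[] args) {
        Tok fish = new Tok("fish", 7, 11);
        Tok bigFish = new Tok("{big fish}", 3, 11);

        // "fish" is reported as equal to "{big fish}" ...
        System.out.println(ORIGINAL.compare(fish, bigFish)); // prints 0
        // ... but "{big fish}" is reported as smaller than "fish".
        System.out.println(ORIGINAL.compare(bigFish, fish)); // prints -1
    }
}
```

Since compare(a, b) and compare(b, a) disagree here, the comparator does not define a consistent ordering for overlapping tokens.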

I tested a quick fix by changing the preceding comparator like this:

	public int compare(Object o1, Object o2)
	{
		Token t1 = (Token)o1;
		Token t2 = (Token)o2;
		if (t1.startOffset() > t2.startOffset())
			return 1;
		if (t1.startOffset() < t2.startOffset())
			return -1;
		if (t1.endOffset() < t2.endOffset())
			return -1;
		if (t1.endOffset() > t2.endOffset())
			return 1;
		return 0;
	}

Highlight behavior is now correct as far as I tested it. 
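
For reference, the same ordering (start offset first, then end offset) can be sketched in a self-contained way, without Lucene. This is only an illustration: Tok is a hypothetical stand-in for Token, the names BY_OFFSETS and sortedTexts are made up for the example, and it uses the Comparator helpers from newer Java versions rather than the style of the Lucene 2.9 code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OverlapComparatorFix {
    // Hypothetical stand-in for Lucene's Token: text and offsets only.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Order by start offset, then by end offset (shorter token first on ties).
    static final Comparator<Tok> BY_OFFSETS =
            Comparator.<Tok>comparingInt(t -> t.start).thenComparingInt(t -> t.end);

    // Sorts a copy of the tokens and returns their texts in sorted order.
    static List<String> sortedTexts(List<Tok> tokens) {
        List<Tok> copy = new ArrayList<>(tokens);
        copy.sort(BY_OFFSETS);
        List<String> texts = new ArrayList<>();
        for (Tok t : copy) {
            texts.add(t.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                new Tok("big", 3, 6),
                new Tok("fish", 7, 11),
                new Tok("{big fish}", 3, 11));
        System.out.println(sortedTexts(tokens)); // prints [big, {big fish}, fish]
    }
}
```

With this ordering the result no longer depends on the order in which the tokens were read: any permutation of the three tokens sorts to (big,3,6) ({big fish},3,11) (fish,7,11).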

Maybe the original sorting order has a purpose I don't understand, but to me this slight modification seems to fix everything. What should I do? (I'm very new to this list and this community.)

If someone with a better understanding of the Lucene highlighter could give me some feedback, I would be grateful.

Thanks for your time.

Pierre

