Posted to dev@lucene.apache.org by "Kerang Lv (JIRA)" <ji...@apache.org> on 2006/09/12 10:13:25 UTC
[jira] Commented: (LUCENE-627) highlighter problems with overlapping tokens
[ http://issues.apache.org/jira/browse/LUCENE-627?page=comments#action_12434087 ]
Kerang Lv commented on LUCENE-627:
----------------------------------
Hi Yonik,
I'm trying to extend your patch to support overlapping bigram analyzers, e.g. the CJKAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/).
With your patch, the following test fails with:
Expected :一<B>二三</B>四五<B>六七</B>八九十
Actual :一<B>二三四五六七</B>
public void testOverlapAnalyzer4() throws Exception
{
    String s = "一二三四五六七八九十";
    // the token stream for the string above:
    // overlapping bigrams, each starting one character after the previous one
    TokenStream ts = new TokenStream() {
        Iterator iter;
        {
            List lst = new ArrayList();
            Token t;
            t = new Token("一二",0,2);
            lst.add(t);
            t = new Token("二三",1,3);
            lst.add(t);
            t = new Token("三四",2,4);
            lst.add(t);
            t = new Token("四五",3,5);
            lst.add(t);
            t = new Token("五六",4,6);
            lst.add(t);
            t = new Token("六七",5,7);
            lst.add(t);
            t = new Token("七八",6,8);
            lst.add(t);
            t = new Token("八九",7,9);
            lst.add(t);
            t = new Token("九十",8,10);
            lst.add(t);
            iter = lst.iterator();
        }
        public Token next() throws IOException {
            return iter.hasNext() ? (Token)iter.next() : null;
        }
    };
    String srchkey = "二三 六七";
    QueryParser parser = new QueryParser("text", new WhitespaceAnalyzer());
    Query query = parser.parse(srchkey);
    Highlighter highlighter = new Highlighter(new QueryScorer(query));
    // Get 3 best fragments and separate with a "..."
    String result = highlighter.getBestFragments(ts, s, 3, "...");
    String expectedResult = "一<B>二三</B>四五<B>六七</B>八九十";
    assertEquals(expectedResult, result);
}
With an overlapping bigram analyzer, each token's startOffset is the previous token's endOffset - 1, so TokenGroup.isDistinct(token) returns false most of the time, which leads to tokenText covering the wrong range.
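To illustrate the grouping problem described above, here is a minimal, self-contained sketch. It is neither the Highlighter code nor the attached patch. It assumes the grouping rule is "a token starts a new group only if its startOffset is >= the current group's end offset", and the names OverlapGroupingSketch, countGroups and bigramOffsets are made up for this example.

```java
// Illustrative sketch only: models token grouping by offsets, not Lucene's actual classes.
final class OverlapGroupingSketch {

    // Counts groups under the ASSUMED rule: a token is distinct from the
    // current group only when its start offset is >= the group's end offset.
    // Each entry of offsets is {startOffset, endOffset}.
    static int countGroups(int[][] offsets) {
        int groups = 0;
        int groupEnd = -1;
        for (int[] o : offsets) {
            boolean distinct = groups == 0 || o[0] >= groupEnd;
            if (distinct) {
                groups++;            // start a new group
                groupEnd = o[1];
            } else {
                // overlapping token: the current group grows to cover it
                groupEnd = Math.max(groupEnd, o[1]);
            }
        }
        return groups;
    }

    // Offsets of overlapping character bigrams over a string of the given
    // length: (0,2), (1,3), (2,4), ...
    static int[][] bigramOffsets(int length) {
        int n = Math.max(0, length - 1);
        int[][] out = new int[n][];
        for (int i = 0; i < n; i++) {
            out[i] = new int[] {i, i + 2};
        }
        return out;
    }

    public static void main(String[] args) {
        // Nine overlapping bigrams over a 10-character string
        System.out.println(countGroups(bigramOffsets(10)));                         // prints 1
        // Adjacent, non-overlapping tokens for comparison
        System.out.println(countGroups(new int[][] {{0, 2}, {2, 4}, {4, 6}}));      // prints 3
    }
}
```

Under that assumed rule, all nine bigrams of the ten-character test string collapse into a single group, while adjacent non-overlapping tokens stay separate. That is consistent with the merged highlight seen in the Actual output.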
Here is a patch that makes the tests work.
> highlighter problems with overlapping tokens
> --------------------------------------------
>
> Key: LUCENE-627
> URL: http://issues.apache.org/jira/browse/LUCENE-627
> Project: Lucene - Java
> Issue Type: Bug
> Components: Other
> Affects Versions: 2.0.1
> Reporter: Yonik Seeley
> Fix For: 2.0.1
>
> Attachments: highlight_overlap.diff
>
>
> The lucene highlighter has problems when tokens that overlap are generated.
> For example, if analysis of iPod generates the tokens "i", "pod", "ipod" (with pod and ipod in the same position),
> then the highlighter will output this as iipod, regardless of if any of those tokens are highlighted.
> Discovered via http://issues.apache.org/jira/browse/SOLR-24