Posted to dev@lucene.apache.org by "Kerang Lv (JIRA)" <ji...@apache.org> on 2006/09/12 10:13:25 UTC
[jira] Commented: (LUCENE-627) highlighter problems with overlapping tokens

    [ http://issues.apache.org/jira/browse/LUCENE-627?page=comments#action_12434087 ] 
            
Kerang Lv commented on LUCENE-627:
----------------------------------

Hi Yonik, 
I'm trying to add support for overlapping bigram analyzers, e.g. the CJKAnalyzer (http://svn.apache.org/repos/asf/lucene/java/trunk/contrib/analyzers/src/java/org/apache/lucene/analysis/cjk/), on top of your patch.

With your patch, the following test fails with:
Expected :一<B>二三</B>四五<B>六七</B>八九十
Actual :一<B>二三四五六七</B>

public void testOverlapAnalyzer4() throws Exception
{
    String s = "一二三四五六七八九十";
    // the token stream for the string above:
    TokenStream ts = new TokenStream() {
      Iterator iter;
      {
        List lst = new ArrayList();
        Token t;
        t = new Token("一二",0,2);
        lst.add(t);
        t = new Token("二三",1,3);
        lst.add(t);
        t = new Token("三四",2,4);
        lst.add(t);
        t = new Token("四五",3,5);
        lst.add(t);
        t = new Token("五六",4,6);
        lst.add(t);
        t = new Token("六七",5,7);
        lst.add(t);
        t = new Token("七八",6,8);
        lst.add(t);
        t = new Token("八九",7,9);
        lst.add(t);
        t = new Token("九十",8,10);
        lst.add(t);
        iter = lst.iterator();
      }
      public Token next() throws IOException {
        return iter.hasNext() ? (Token)iter.next() : null;
      }
    };

    String srchkey = "二三 六七";

    QueryParser parser=new QueryParser("text",new WhitespaceAnalyzer());
    Query query = parser.parse(srchkey);

    Highlighter highlighter = new Highlighter(new QueryScorer(query));

    // Get 3 best fragments and separate with a "..."
    String result = highlighter.getBestFragments(ts, s, 3, "...");
    String expectedResult="一<B>二三</B>四五<B>六七</B>八九十";
    assertEquals(expectedResult,result);
} 

With an overlapping bigram analyzer, each token's startOffset is the previous token's endOffset - 1, so TokenGroup.isDistinct(token) returns false most of the time, which leads to a bad range for tokenText.
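
To illustrate the grouping problem in isolation, here is a minimal standalone sketch that does not use the Lucene classes. It assumes the distinctness check has the form "token start >= current group end"; the class OverlapGroupingSketch, the Span type, and the helper methods are hypothetical names invented for this example, not part of the highlighter or of the patch.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapGroupingSketch {

    /** Hypothetical stand-in for a token's character offsets (not Lucene's Token). */
    public static final class Span {
        public final int start;
        public final int end;
        public Span(int start, int end) { this.start = start; this.end = end; }
    }

    /** Assumed rule: a token is distinct only if it starts at or after the group's end. */
    public static boolean isDistinct(Span token, int groupEnd) {
        return token.start >= groupEnd;
    }

    /** Merges consecutive non-distinct tokens and returns each group's offset range. */
    public static List<Span> group(List<Span> tokens) {
        List<Span> groups = new ArrayList<Span>();
        int start = -1;
        int end = -1;
        for (Span t : tokens) {
            if (start < 0) {
                // first token opens the first group
                start = t.start;
                end = t.end;
            } else if (isDistinct(t, end)) {
                // close the current group and open a new one
                groups.add(new Span(start, end));
                start = t.start;
                end = t.end;
            } else {
                // overlapping token: extend the current group
                end = Math.max(end, t.end);
            }
        }
        if (start >= 0) {
            groups.add(new Span(start, end));
        }
        return groups;
    }

    /** Builds overlapping bigram spans (0,2), (1,3), ... for a text of the given length. */
    public static List<Span> bigrams(int length) {
        List<Span> out = new ArrayList<Span>();
        for (int i = 0; i + 2 <= length; i++) {
            out.add(new Span(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Same offsets as the nine bigram tokens in the test above.
        for (Span g : group(bigrams(10))) {
            System.out.println(g.start + "-" + g.end); // prints "0-10"
        }
    }
}
```

Under that assumption, all nine bigrams of the ten-character string fall into a single group covering offsets 0 to 10, so the two matching tokens are never handled as separate groups.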

Here is a patch that makes the tests work.

> highlighter problems with overlapping tokens
> --------------------------------------------
>
>                 Key: LUCENE-627
>                 URL: http://issues.apache.org/jira/browse/LUCENE-627
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: Other
>    Affects Versions: 2.0.1
>            Reporter: Yonik Seeley
>             Fix For: 2.0.1
>
>         Attachments: highlight_overlap.diff
>
>
> The lucene highlighter has problems when tokens that overlap are generated.
> For example, if analysis of iPod generates the tokens "i", "pod", "ipod" (with pod and ipod in the same position),
> then the highlighter will output this as iipod, regardless of if any of those tokens are highlighted.
> Discovered via http://issues.apache.org/jira/browse/SOLR-24
