Posted to dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2008/12/12 02:36:44 UTC

[jira] Created: (LUCENE-1489) highlighter problem with n-gram tokens

highlighter problem with n-gram tokens
--------------------------------------

                 Key: LUCENE-1489
                 URL: https://issues.apache.org/jira/browse/LUCENE-1489
             Project: Lucene - Java
          Issue Type: Bug
          Components: contrib/highlighter
            Reporter: Koji Sekiguchi
            Priority: Minor


I have a problem when using n-gram tokenizers with the highlighter. I thought this had been solved in LUCENE-627...

Actually, I found this problem while using CJKTokenizer on Solr; here is a Lucene program that reproduces it using NGramTokenizer(min=2,max=2) instead of CJKTokenizer:

{code:java}
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;

public class TestNGramHighlighter {

  public static void main(String[] args) throws Exception {
    Analyzer analyzer = new NGramAnalyzer();
    final String TEXT = "Lucene can make index. Then Lucene can search.";
    final String QUERY = "can";
    QueryParser parser = new QueryParser("f", analyzer);
    Query query = parser.parse(QUERY);
    QueryScorer scorer = new QueryScorer(query, "f");
    Highlighter h = new Highlighter(scorer);
    System.out.println(h.getBestFragment(analyzer, "f", TEXT));
  }

  static class NGramAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, Reader input) {
      return new NGramTokenizer(input, 2, 2);
    }
  }
}
{code}

The expected output is:
Lucene <B>can</B> make index. Then Lucene <B>can</B> search.

But the actual output is:
Lucene <B>can make index. Then Lucene can</B> search.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org


[jira] Updated: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "David Bowen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Bowen updated LUCENE-1489:
--------------------------------

    Attachment: lucene1489.patch

Here's a patch to Highlighter.java that fixes the examples.  The basic idea is to throw away (or ignore) overlapping tokens when they don't have a score, so that a token group doesn't get expanded beyond a sequence of tokens that should be highlighted.
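A minimal sketch of the idea the patch describes, using simplified stand-in classes (Tok and scoredSpan are hypothetical names, not the real Highlighter code): when computing the span a token group covers, skip overlapping tokens whose score is zero, so the highlighted region never grows past the scored tokens.

```java
// Simplified illustration of "ignore unscored overlapping tokens".
// Tok is a stand-in for a token's offsets and query score.
public class ScoredGroupSketch {
    static class Tok {
        final int start, end;
        final float score;
        Tok(int start, int end, float score) {
            this.start = start; this.end = end; this.score = score;
        }
    }

    /** Returns {start, end} of the highlight span, extended only over scored tokens. */
    static int[] scoredSpan(java.util.List<Tok> overlapping) {
        int s = -1, e = -1;
        for (Tok t : overlapping) {
            if (t.score <= 0) continue;          // unscored overlaps don't widen the span
            if (s < 0 || t.start < s) s = t.start;
            if (t.end > e) e = t.end;
        }
        return new int[] { s, e };
    }

    public static void main(String[] args) {
        // Bigrams over "...e can m...": the "ca"/"an" tokens score, the rest don't.
        java.util.List<Tok> toks = java.util.Arrays.asList(
            new Tok(7, 9, 1f),   // "ca" - scored
            new Tok(8, 10, 1f),  // "an" - scored
            new Tok(9, 11, 0f),  // "n " - unscored overlap
            new Tok(10, 12, 0f)  // " m" - unscored overlap
        );
        int[] span = scoredSpan(toks);
        System.out.println(span[0] + ".." + span[1]); // prints "7..10"
    }
}
```

With this rule, only "can" (offsets 7..10) would be tagged instead of the whole run of overlapping bigrams.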



[jira] Updated: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "David Bowen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Bowen updated LUCENE-1489:
--------------------------------

    Attachment: LUCENE-1489.patch

Updated patch to work with tokenizer API changes.




[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "David Bowen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761441#action_12761441 ] 

David Bowen commented on LUCENE-1489:
-------------------------------------

By the way, here is the output from Chris's test program with this patch:
{code}
Testing analyzer Bigram shingle analyzer (bigrams and unigrams)...
---------------------------------
<B>Lucene</B> can index and can search [query='Lucene']
Lucene <B>can</B> make an index [query='can']
Lucene <B>can</B> index and <B>can</B> search [query='can']
Lucene <B>can</B> index <B>can</B> search and <B>can</B> highlight [query='can']
Lucene can <B>index</B> can <B>search</B> and can highlight [query='+index +search']

Testing analyzer Bigram (non-shingle) analyzer (bigrams only)...
---------------------------------
<B>Lucene</B> can index and can search [query='Lucene']
Lucene <B>can</B> make an index [query='can']
Lucene <B>can</B> index and <B>can</B> search [query='can']
Lucene <B>can</B> index <B>can</B> search and <B>can</B> highlight [query='can']
Lucene can <B>index</B> can <B>search</B> and can highlight [query='+index +search']
{code}





[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "Mark Harwood (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667654#action_12667654 ] 

Mark Harwood commented on LUCENE-1489:
--------------------------------------

It looks to me like this could be fixed in the "Formatter" classes when marking up the output string.

Currently, classes such as SimpleHTMLFormatter, in their "highlightTerm" method, put a tag around the whole section of text if it contains a hit, i.e.:

{code:title=SimpleHTMLFormatter.java|borderStyle=solid}
	public String highlightTerm(String originalText, TokenGroup tokenGroup)
	{
		StringBuffer returnBuffer;
		if(tokenGroup.getTotalScore()>0)
		{
			returnBuffer=new StringBuffer();
			returnBuffer.append(preTag);
			returnBuffer.append(originalText);
			returnBuffer.append(postTag);
			return returnBuffer.toString();
		}
		return originalText;
	}
{code}

The TokenGroup object passed to this method contains all of the tokens and their scores so it should be possible to use this information to deconstruct the originalText parameter and inject markup according to which tokens in the group had a match rather than putting a tag around the whole block.  Some complexity may lie in handling token streams that produce tokens that "rewind" to earlier offsets.
SimpleHTMLFormatter suddenly seems less simple!

TokenStreams that produce entirely overlapping streams of tokens will automatically be broken into multiple TokenGroups because TokenGroup has a maximum number of linked Tokens it will ever hold in a single group.
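The per-token markup idea above can be sketched roughly as follows. This is not the real Formatter/TokenGroup API: TokenSpan, groupStart, and highlight are hypothetical stand-ins, and the real fix would also have to cope with rewinding offsets.

```java
// Sketch: tag only the scored token spans inside the group's text,
// instead of wrapping the whole block.
public class PerTokenFormatterSketch {
    static class TokenSpan {
        final int start, end;   // offsets into the full original document text
        final float score;
        TokenSpan(int s, int e, float sc) { start = s; end = e; score = sc; }
    }

    static String highlight(String groupText, int groupStart,
                            java.util.List<TokenSpan> tokens,
                            String preTag, String postTag) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;                              // position within groupText
        for (TokenSpan t : tokens) {
            if (t.score <= 0) continue;           // only matched tokens get tags
            int s = Math.max(t.start - groupStart, pos);  // clip overlapping/rewound offsets
            int e = t.end - groupStart;
            if (e <= pos) continue;               // already covered by a previous tag
            sb.append(groupText, pos, s)
              .append(preTag).append(groupText, s, e).append(postTag);
            pos = e;
        }
        sb.append(groupText, pos, groupText.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // A group covering "can index and can", where only the "can" tokens scored.
        String text = "can index and can";
        java.util.List<TokenSpan> toks = java.util.Arrays.asList(
            new TokenSpan(0, 3, 1f), new TokenSpan(4, 9, 0f), new TokenSpan(14, 17, 1f));
        System.out.println(highlight(text, 0, toks, "<B>", "</B>"));
        // prints: <B>can</B> index and <B>can</B>
    }
}
```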

I haven't got the time to fix this right now, but if someone has a burning need to leap in, the above seems like what may be required.

Cheers
Mark








[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "David Bowen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761439#action_12761439 ] 

David Bowen commented on LUCENE-1489:
-------------------------------------

Mark, I tried the approach you suggested of using the Formatter interface.  I found it didn't work because the Formatter did not have a way to map the tokens in the token group into the text.  This could be fixed by providing a public accessor function for TokenGroup's matchStartOffset field.  However, it seems convoluted to go to the trouble of constructing a TokenGroup only to have every Formatter have to take it all apart again to find the places within it that need highlighting.  It seems to me that the purpose of a TokenGroup is to identify (up to) one span of characters that needs to be highlighted.



[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "Jens Muecke (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797501#action_12797501 ] 

Jens Muecke commented on LUCENE-1489:
-------------------------------------

I tried this patch. After applying it, the following testcase fails:

{noformat}
    [junit] Testcase: testOverlapAnalyzer2(org.apache.lucene.search.highlight.HighlighterTest):	FAILED
    [junit] null expected:<<B>Hi[-]Speed</B>10 foo> but was:<<B>Hi[</B>-<B>]Speed</B>10 foo>
    [junit] junit.framework.ComparisonFailure: null expected:<<B>Hi[-]Speed</B>10 foo> but was:<<B>Hi[</B>-<B>]Speed</B>10 foo>
    [junit] 	at org.apache.lucene.search.highlight.HighlighterTest$30.run(HighlighterTest.java:1558)
    [junit] 	at org.apache.lucene.search.highlight.SynonymTokenizer$TestHighlightRunner.start(HighlighterTest.java:1947)
    [junit] 	at org.apache.lucene.search.highlight.HighlighterTest.testOverlapAnalyzer2(HighlighterTest.java:1594)
    [junit] 	at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:212)
    [junit] 
    [junit] 
    [junit] Test org.apache.lucene.search.highlight.HighlighterTest FAILED
{noformat}



[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667469#action_12667469 ] 

Chris Harris commented on LUCENE-1489:
--------------------------------------

As I mentioned on the Solr list, I've discovered similar problems when highlighting with the ShingleFilter. (ShingleFilter does n-gram processing on _Tokens_, whereas NGramAnalyzer does n-gram processing on _characters_.) Here's a variation on Koji's demo program that exhibits some problems with ShingleFilter, as well as offering a slightly more textured example of how things work with NGramAnalyzer:

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import java.io.Reader;

public class Main {

    public static void main(String[] args) throws Exception {
        testAnalyzer(new BigramShingleAnalyzer(true), "Bigram shingle analyzer (bigrams and unigrams)");
        testAnalyzer(new NGramAnalyzer(), "Bigram (non-shingle) analyzer (bigrams only)");
    }

    static void testAnalyzer(Analyzer analyzer, String analyzerDescription) throws Exception {
        System.out.println("Testing analyzer " + analyzerDescription + "...");
        System.out.println("---------------------------------");
        test(analyzer, "Lucene can index and can search", "Lucene");
        test(analyzer, "Lucene can make an index", "can");
        test(analyzer, "Lucene can index and can search", "can");
        test(analyzer, "Lucene can index can search and can highlight", "can");
        test(analyzer, "Lucene can index can search and can highlight", "+index +search");
        System.out.println();
    }

    static void test(Analyzer analyzer, String text, String queryStr) throws Exception {
        QueryParser parser = new QueryParser("f", analyzer);
        Query query = parser.parse(queryStr);
        QueryScorer scorer = new QueryScorer(query, "f");
        Highlighter h = new Highlighter(scorer);
        h.setTextFragmenter(new NullFragmenter()); // We're not testing fragmenter here.
        System.out.println(h.getBestFragment(analyzer, "f", text) + " [query='" + queryStr + "']");
    }

    static class NGramAnalyzer extends Analyzer {
        public TokenStream tokenStream(String field, Reader input) {
            return new NGramTokenizer(input, 2, 2);
        }
    }

    static class BigramShingleAnalyzer extends Analyzer {
        boolean outputUnigrams;

        public BigramShingleAnalyzer(boolean outputUnigrams) {
            this.outputUnigrams = outputUnigrams;
        }

        public TokenStream tokenStream(String field, Reader input) {
            ShingleFilter sf = new ShingleFilter(new WhitespaceTokenizer(input));
            sf.setOutputUnigrams(outputUnigrams);
            return sf;
        }
    }
}
{code}

Here's the current output, with commentary:
{code}
Testing analyzer Bigram shingle analyzer (bigrams and unigrams)...
---------------------------------
// works ok:
<B>Lucene</B> can index and can search [query='Lucene']
// works ok:
Lucene <B>can</B> make an index [query='can']
// same as Koji's example:
Lucene <B>can index and can</B> search [query='can']
// if there are three matches, they all get bundled into a single highlight:
Lucene <B>can index can search and can</B> highlight [query='can']
// it doesn't have to be the same search term that matches:
Lucene can <B>index can search</B> and can highlight [query='+index +search']

Testing analyzer Bigram (non-shingle) analyzer (bigrams only)...
---------------------------------
// works ok:
<B>Lucene</B> can index and can search [query='Lucene']
// is 'an' being treated as a match for 'can'(?):
Lucene <B>can make an</B> index [query='can']
// same as Koji's example:
Lucene <B>can index and can</B> search [query='can']
// if there are three matches, they all get bundled into a single highlight:
Lucene <B>can index can search and can</B> highlight [query='can']
// not sure what's happening here:
Lucene can <B>index can search and</B> can highlight [query='+index +search']
{code}

I'm interested in what others think, but for me it makes sense to classify both of these as the same issue. From a high-level perspective, the problem in each case seems to be that Highlighter.getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments) makes use of a TokenGroup abstraction that doesn't really work for the n-gram or the bigram shingle case:

A TokenGroup is supposed to represent "one, or several overlapping tokens, along with the score(s) and the scope of the original text". (I assume TokenGroup was introduced to deal with synonym filter expansions.) Tokens are determined to overlap or not basically by seeing whether tokenB.startOffset() >= tokenA.endOffset(). (It's slightly more complex than this, but that's approximately what the test in TokenGroup.isDistinct() amounts to.) With the two analyzers under discussion, that criterion basically means that each token "overlaps" with the next.

In Koji's bigram case, consider how "dogs" would get tokenized:

"do" (startOffset=0, endOffset=2)
"og" (startOffset=1, endOffset=3)
"gs" (startOffset=2, endOffset=4)

Or in my shingle case, consider how "I love Lucene" would get tokenized:

"I" (startOffset=0, endOffset=1)
"I love" (startOffset=0, endOffset=6)
"love" (startOffset=2, endOffset=6)
"love Lucene" (startOffset=2, endOffset=13)
"Lucene" (startOffset=7, endOffset=13)

In both cases, you never have a token whose startOffset is >= the preceding token's endOffset. So all these tokens are part of the same TokenGroup. That should mean these tokens all "overlap", but that would make for a rather mysterious notion of "overlapping".
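The grouping test described above can be illustrated with a small stand-alone program (countGroups is a simplified stand-in, not the actual TokenGroup.isDistinct() code): a token starts a new group only when its startOffset is >= the running end offset, so every character bigram of "dogs" lands in one group, while ordinary non-overlapping tokens each start their own.

```java
// Simplified illustration of the "distinct token starts a new group" test.
public class GroupingDemo {
    /** Counts token groups; each row of offsets is {startOffset, endOffset}. */
    static int countGroups(int[][] offsets) {
        int groups = 0, prevEnd = -1;
        for (int[] o : offsets) {
            if (o[0] >= prevEnd) groups++;        // distinct token: starts a new group
            prevEnd = Math.max(prevEnd, o[1]);
        }
        return groups;
    }

    public static void main(String[] args) {
        int[][] bigramsOfDogs = { {0, 2}, {1, 3}, {2, 4} };  // "do", "og", "gs"
        int[][] wordTokens    = { {0, 4}, {5, 7} };          // e.g. "dogs", "on"
        System.out.println(countGroups(bigramsOfDogs));  // prints 1: one group swallows all
        System.out.println(countGroups(wordTokens));     // prints 2: one group per word
    }
}
```

Under this criterion the n-gram and shingle streams never produce a "distinct" token, which is why the highlight balloons to cover the whole chain.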

