Posted to dev@lucene.apache.org by "Chris Harris (JIRA)" <ji...@apache.org> on 2009/01/26 23:27:59 UTC

[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

    [ https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667469#action_12667469 ] 

Chris Harris commented on LUCENE-1489:
--------------------------------------

As I mentioned on the Solr list, I've discovered similar problems when highlighting with the ShingleFilter. (ShingleFilter does n-gram processing on _Tokens_, whereas NGramAnalyzer does n-gram processing on _characters_.) Here's a variation on Koji's demo program that exhibits some problems with ShingleFilter, as well as offering a slightly more textured example of how things work with NGramAnalyzer:

{code}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.NullFragmenter;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import java.io.Reader;

public class Main {

    public static void main(String[] args) throws Exception {
        testAnalyzer(new BigramShingleAnalyzer(true), "Bigram shingle analyzer (bigrams and unigrams)");
        testAnalyzer(new NGramAnalyzer(), "Bigram (non-shingle) analyzer (bigrams only)");
    }

    static void testAnalyzer(Analyzer analyzer, String analyzerDescription) throws Exception {
        System.out.println("Testing analyzer " + analyzerDescription + "...");
        System.out.println("---------------------------------");
        test(analyzer, "Lucene can index and can search", "Lucene");
        test(analyzer, "Lucene can make an index", "can");
        test(analyzer, "Lucene can index and can search", "can");
        test(analyzer, "Lucene can index can search and can highlight", "can");
        test(analyzer, "Lucene can index can search and can highlight", "+index +search");
        System.out.println();
    }

    static void test(Analyzer analyzer, String text, String queryStr) throws Exception {
        QueryParser parser = new QueryParser("f", analyzer);
        Query query = parser.parse(queryStr);
        QueryScorer scorer = new QueryScorer(query, "f");
        Highlighter h = new Highlighter(scorer);
        h.setTextFragmenter(new NullFragmenter()); // We're not testing the fragmenter here.
        System.out.println(h.getBestFragment(analyzer, "f", text) + " [query='" + queryStr + "']");
    }

    static class NGramAnalyzer extends Analyzer {
        // Emits character bigrams, e.g. "dogs" -> "do", "og", "gs".
        public TokenStream tokenStream(String field, Reader input) {
            return new NGramTokenizer(input, 2, 2);
        }
    }

    static class BigramShingleAnalyzer extends Analyzer {
        // Emits token bigrams (shingles), e.g. "I love Lucene" ->
        // "I love", "love Lucene", optionally plus the original unigrams.
        boolean outputUnigrams;

        public BigramShingleAnalyzer(boolean outputUnigrams) {
            this.outputUnigrams = outputUnigrams;
        }

        public TokenStream tokenStream(String field, Reader input) {
            ShingleFilter sf = new ShingleFilter(new WhitespaceTokenizer(input));
            sf.setOutputUnigrams(outputUnigrams);
            return sf;
        }
    }
}
{code}

Here's the current output, with commentary:
{code}
Testing analyzer Bigram shingle analyzer (bigrams and unigrams)...
---------------------------------
// works ok:
<B>Lucene</B> can index and can search [query='Lucene']
// works ok:
Lucene <B>can</B> make an index [query='can']
// same as Koji's example:
Lucene <B>can index and can</B> search [query='can']
// if there are three matches, they all get bundled into a single highlight:
Lucene <B>can index can search and can</B> highlight [query='can']
// it doesn't have to be the same search term that matches:
Lucene can <B>index can search</B> and can highlight [query='+index +search']

Testing analyzer Bigram (non-shingle) analyzer (bigrams only)...
---------------------------------
// works ok:
<B>Lucene</B> can index and can search [query='Lucene']
// is 'an' being treated as a match for 'can'?:
Lucene <B>can make an</B> index [query='can']
// same as Koji's example:
Lucene <B>can index and can</B> search [query='can']
// if there are three matches, they all get bundled into a single highlight:
Lucene <B>can index can search and can</B> highlight [query='can']
// not sure what's happening here:
Lucene can <B>index can search and</B> can highlight [query='+index +search']
{code}

I'm interested in what others think, but to me it makes sense to classify both of these as the same issue. From a high-level perspective, the problem in each case seems to be that Highlighter.getBestTextFragments(TokenStream tokenStream, String text, boolean mergeContiguousFragments, int maxNumFragments) relies on a TokenGroup abstraction that doesn't really work for either the n-gram case or the bigram shingle case:

A TokenGroup is supposed to represent "one, or several overlapping tokens, along with the score(s) and the scope of the original text". (I assume TokenGroup was introduced to deal with synonym filter expansions.) Tokens are determined to overlap or not essentially by checking whether tokenB.startOffset() >= tokenA.endOffset(). (It's slightly more complicated than that, but it's approximately what the test in TokenGroup.isDistinct() amounts to.) With the two analyzers under discussion, that criterion means that each token "overlaps" the next.
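
Here's roughly what that test looks like (my paraphrase of the contrib highlighter's TokenGroup logic, with names simplified; not the verbatim source):

{code}
// Paraphrase of the distinctness test in TokenGroup.isDistinct(): a token
// starts a new group only if it begins at or after the end of everything
// accumulated in the current group so far.
boolean isDistinct(Token token, int groupEndOffset) {
    return token.startOffset() >= groupEndOffset;
}
{code}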

In Koji's bigram case, consider how "dogs" would get tokenized:

"do" (startOffset=0, endOffset=2)
"og" (startOffset=1, endOffset=3)
"gs" (startOffset=2, endOffset=4)

Or in my shingle case, consider how "I love Lucene" would get tokenized:

"I" (startOffset=0, endOffset=1)
"I love" (startOffset=0, endOffset=6)
"love" (startOffset=2, endOffset=6)
"love Lucene" (startOffset=2, endOffset=13)
"Lucene" (startOffset=7, endOffset=13)

In both cases, no token ever has a startOffset >= the preceding token's endOffset, so all of these tokens end up in the same TokenGroup. Nominally that means they all "overlap", but that makes for a rather mysterious notion of "overlapping".
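
To make that concrete, here's a hand trace of the isDistinct() test over the shingle tokens above (my own walk-through, not program output):

{code}
// group starts with "I"                      -> groupEndOffset = 1
// "I love":      startOffset 0  <  1,  joins -> groupEndOffset = 6
// "love":        startOffset 2  <  6,  joins -> groupEndOffset = 6
// "love Lucene": startOffset 2  <  6,  joins -> groupEndOffset = 13
// "Lucene":      startOffset 7  <  13, joins -> groupEndOffset = 13
// Net effect: every token folds into one TokenGroup spanning the whole string.
{code}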


> highlighter problem with n-gram tokens
> --------------------------------------
>
>                 Key: LUCENE-1489
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1489
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>
> I have a problem when using n-grams with the highlighter. I thought it had been solved in LUCENE-627...
> I actually found this problem while using CJKTokenizer on Solr; here is a Lucene program that reproduces it using NGramTokenizer(min=2,max=2) instead of CJKTokenizer:
> {code:java}
> import java.io.Reader;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.ngram.NGramTokenizer;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
>
> public class TestNGramHighlighter {
>   public static void main(String[] args) throws Exception {
>     Analyzer analyzer = new NGramAnalyzer();
>     final String TEXT = "Lucene can make index. Then Lucene can search.";
>     final String QUERY = "can";
>     QueryParser parser = new QueryParser("f",analyzer);
>     Query query = parser.parse(QUERY);
>     QueryScorer scorer = new QueryScorer(query,"f");
>     Highlighter h = new Highlighter( scorer );
>     System.out.println( h.getBestFragment(analyzer, "f", TEXT) );
>   }
>   static class NGramAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String field, Reader input) {
>       return new NGramTokenizer(input,2,2);
>     }
>   }
> }
> {code}
> The expected output is:
> Lucene <B>can</B> make index. Then Lucene <B>can</B> search.
> but the actual output is:
> Lucene <B>can make index. Then Lucene can</B> search.
