You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@lucene.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2009/03/12 11:28:51 UTC
[jira] Commented: (LUCENE-1522) another highlighter
[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12681249#action_12681249 ]
Michael McCandless commented on LUCENE-1522:
--------------------------------------------
This highlighter looks very interesting! I love the colored tags, and
the fast performance on large docs, and the extensive unit tests.
When I applied the patch to current trunk, I see some tests failing,
eg:
{code}
[junit] Testcase: test1PhraseLongMVB(org.apache.lucene.search.highlight2.FieldPhraseListTest): FAILED
[junit] expected:<sppeeeed(1.0)((8[8,93]))> but was:<sppeeeed(1.0)((8[7,92]))>
[junit] junit.framework.ComparisonFailure: expected:<sppeeeed(1.0)((8[8,93]))> but was:<sppeeeed(1.0)((8[7,92]))>
[junit] at org.apache.lucene.search.highlight2.FieldPhraseListTest.test1PhraseLongMVB(FieldPhraseListTest.java:175)
{code}
Is this approach guaranteed to only highlight term occurrences that
actually contribute to the document match? Can it handle all /
arbitrary Query subclasses? How does it score fragments?
I also like that you first generate hits in the document, and from
those hits you generate fragments (if I'm reading the code correctly);
this is a nicely scalable approach.
> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Priority: Minor
> Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream (general token stream (e.g. WhitespaceTokenizer) also supported. see test code in patch). The idea was inherited from my previous project with my colleague and LUCENE-644. This approach needs highlight fields to be TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
> // fieldName="content", fragCharSize=100, numFragments=3
> String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
> if( fragments != null ){
> for( String fragment : fragments )
> System.out.println( fragment );
> }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org