Posted to dev@lucene.apache.org by "Koji Sekiguchi (JIRA)" <ji...@apache.org> on 2012/06/12 04:32:43 UTC
[jira] [Commented] (LUCENE-4133) FastVectorHighlighter: A weighted
approach for ordered fragments
[ https://issues.apache.org/jira/browse/LUCENE-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293295#comment-13293295 ]
Koji Sekiguchi commented on LUCENE-4133:
----------------------------------------
Hi Sebastian, thank you for the patch and the description as always!
I applied the patch to trunk and ran the tests, and got the following failure:
{code}
$ cd lucene/highlighter
$ ant clean test
:
common.test:
[mkdir] Created dir: /Users/koji/Project/lucene/lusolr/trunk/COMMIT-NEW/lucene/build/highlighter/test
[junit4] <JUnit4> says hi! Master seed: F7909ECA1C4FD1AE
[junit4] Executing 16 suites with 4 JVMs.
[junit4] Suite: org.apache.lucene.search.highlight.custom.HighlightCustomQueryTest
[junit4] Completed on J2 in 0.36s, 1 test
[junit4]
[junit4] Suite: org.apache.lucene.search.vectorhighlight.SingleFragListBuilderTest
[junit4] Completed on J1 in 1.02s, 3 tests
[junit4]
[junit4] Suite: org.apache.lucene.search.vectorhighlight.WeightedFragListBuilderTest
[junit4] FAILURE 0.85s J0 | WeightedFragListBuilderTest.test2WeightedFragList
[junit4] > Throwable #1: org.junit.ComparisonFailure: expected:<...eboth((195,203)))/0.[26632088](189,289)> but was:<...eboth((195,203)))/0.[86791086](189,289)>
[junit4] > at __randomizedtesting.SeedInfo.seed([F7909ECA1C4FD1AE:94B09B8AD58716F0]:0)
[junit4] > at org.junit.Assert.assertEquals(Assert.java:125)
[junit4] > at org.junit.Assert.assertEquals(Assert.java:147)
[junit4] > at org.apache.lucene.search.vectorhighlight.WeightedFragListBuilderTest.test2WeightedFragList(WeightedFragListBuilderTest.java:32)
:
{code}
If I change the expected value from 0.26632088 to 0.86791086 in WeightedFragListBuilderTest, the test passes. Is that change OK with you?
> FastVectorHighlighter: A weighted approach for ordered fragments
> ----------------------------------------------------------------
>
> Key: LUCENE-4133
> URL: https://issues.apache.org/jira/browse/LUCENE-4133
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 4.0, 5.0
> Reporter: Sebastian Lutze
> Assignee: Koji Sekiguchi
> Priority: Minor
> Labels: FastVectorHighlighter
> Fix For: 4.0
>
> Attachments: LUCENE-4133.patch
>
>
> The FastVectorHighlighter currently disregards IDF weights for matching terms within generated fragments. In the worst case, a fragment that contains a high number of very common words is scored higher than a fragment that contains *all* of the terms used in the original query.
> This patch provides ordered fragments with IDF-weighted terms:
> *For each distinct matching term per fragment:*
> _weight = weight + IDF * boost_
> *For each fragment:*
> _weight = weight * length * 1 / sqrt( length )_
> |weight|total weight of fragment|
> |IDF|inverse document frequency for each distinct matching term|
> |boost|query boost as provided, for example _term^2_|
> |length|total number of non-distinct matching terms per fragment|
> *Method:*
> {code:java}
> public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
>
>   float totalBoost = 0;
>
>   List<SubInfo> subInfos = new ArrayList<SubInfo>();
>   HashSet<String> distinctTerms = new HashSet<String>();
>
>   int length = 0;
>   for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
>     subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
>     for ( TermInfo ti : phraseInfo.getTermsInfos() ) {
>       // IDF * boost is added only the first time a term is seen in this fragment
>       if ( distinctTerms.add( ti.getText() ) )
>         totalBoost += ti.getWeight() * phraseInfo.getBoost();
>       // every matching term, distinct or not, counts toward length
>       length++;
>     }
>   }
>   // per-fragment normalization: weight * length * 1 / sqrt( length )
>   totalBoost *= length * ( 1 / Math.sqrt( length ) );
>
>   getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
> }
> {code}
> The ranking formula should be the same as, or at least similar to, the one used in QueryTermScorer.
> *This patch contains:*
> * a changed class-member in FieldPhraseList (termInfos to termsInfos)
> * a changed local variable in SimpleFieldFragList (score to totalBoost)
> * an added missing @Override in SimpleFragListBuilder
> * class WeightedFieldFragList, an implementation of FieldFragList
> * class WeightedFragListBuilder, an implementation of BaseFragListBuilder
> * class WeightedFragListBuilderTest, a simple test-case
> * updated docs for FVH
> Last part (see also LUCENE-4091, LUCENE-4107, LUCENE-4113) of LUCENE-3440.
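
To make the weighting in the issue description easier to follow, here is a minimal standalone sketch of the same arithmetic. It does not use the Lucene classes from the patch: FragmentWeightSketch, Match and the IDF values are hypothetical and chosen only for illustration. It also assumes an empty fragment should get weight 0, which the method in the description does not handle explicitly. Note that length * 1 / sqrt( length ) is simply sqrt( length ).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FragmentWeightSketch {

    /** Hypothetical stand-in for one matching term occurrence in a fragment. */
    static final class Match {
        final String text;
        final float idf;
        final float boost;

        Match(String text, float idf, float boost) {
            this.text = text;
            this.idf = idf;
            this.boost = boost;
        }
    }

    /** Sums IDF * boost once per distinct term, then scales by sqrt(length). */
    static float weight(List<Match> matches) {
        Set<String> distinctTerms = new HashSet<String>();
        float total = 0f;
        int length = 0;
        for (Match m : matches) {
            if (distinctTerms.add(m.text)) {
                total += m.idf * m.boost;
            }
            length++; // counts every occurrence, distinct or not
        }
        if (length == 0) {
            return 0f; // assumption: avoid 0 * (1 / sqrt(0)), which is NaN
        }
        // length * (1 / sqrt(length)) simplifies to sqrt(length)
        total *= Math.sqrt(length);
        return total;
    }

    public static void main(String[] args) {
        // Four occurrences of one very common term (made-up IDF of 0.1).
        List<Match> common = Arrays.asList(
            new Match("the", 0.1f, 1f), new Match("the", 0.1f, 1f),
            new Match("the", 0.1f, 1f), new Match("the", 0.1f, 1f));
        // Two distinct, rarer terms (made-up IDFs of 2.0 and 3.0).
        List<Match> rare = Arrays.asList(
            new Match("lucene", 2.0f, 1f), new Match("highlighter", 3.0f, 1f));

        System.out.println("common fragment: " + weight(common));
        System.out.println("rare fragment:   " + weight(rare));
    }
}
```

With these made-up numbers the fragment holding two rare terms outranks the fragment holding four repeats of a common term, which is the ordering the description asks for.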