Posted to commits@lucene.apache.org by ko...@apache.org on 2012/06/12 16:06:50 UTC
svn commit: r1349364 - in /lucene/dev/branches/branch_4x: ./ dev-tools/
lucene/ lucene/analysis/ lucene/analysis/common/
lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/std31/
lucene/analysis/common/src/java/org/apache/lucene/analys...
Author: koji
Date: Tue Jun 12 14:06:48 2012
New Revision: 1349364
URL: http://svn.apache.org/viewvc?rev=1349364&view=rev
Log:
LUCENE-4133: FVH: A weighted approach for ordered fragments, part of LUCENE-3440
Added:
lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFieldFragList.java
- copied unchanged from r1349361, lucene/dev/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFieldFragList.java
lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilder.java
- copied unchanged from r1349361, lucene/dev/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilder.java
lucene/dev/branches/branch_4x/lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilderTest.java
- copied unchanged from r1349361, lucene/dev/trunk/lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilderTest.java
Modified:
lucene/dev/branches/branch_4x/ (props changed)
lucene/dev/branches/branch_4x/dev-tools/ (props changed)
lucene/dev/branches/branch_4x/lucene/ (props changed)
lucene/dev/branches/branch_4x/lucene/BUILD.txt (props changed)
lucene/dev/branches/branch_4x/lucene/CHANGES.txt (contents, props changed)
lucene/dev/branches/branch_4x/lucene/JRE_VERSION_MIGRATION.txt (props changed)
lucene/dev/branches/branch_4x/lucene/LICENSE.txt (props changed)
lucene/dev/branches/branch_4x/lucene/MIGRATE.txt (props changed)
lucene/dev/branches/branch_4x/lucene/NOTICE.txt (props changed)
lucene/dev/branches/branch_4x/lucene/README.txt (props changed)
lucene/dev/branches/branch_4x/lucene/analysis/ (props changed)
lucene/dev/branches/branch_4x/lucene/analysis/common/ (props changed)
lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/std31/package.html (props changed)
lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/std34/package.html (props changed)
lucene/dev/branches/branch_4x/lucene/backwards/ (props changed)
lucene/dev/branches/branch_4x/lucene/benchmark/ (props changed)
lucene/dev/branches/branch_4x/lucene/build.xml (props changed)
lucene/dev/branches/branch_4x/lucene/common-build.xml (props changed)
lucene/dev/branches/branch_4x/lucene/core/ (props changed)
lucene/dev/branches/branch_4x/lucene/demo/ (props changed)
lucene/dev/branches/branch_4x/lucene/facet/ (props changed)
lucene/dev/branches/branch_4x/lucene/grouping/ (props changed)
lucene/dev/branches/branch_4x/lucene/highlighter/ (props changed)
lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java
lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html
lucene/dev/branches/branch_4x/lucene/ivy-settings.xml (props changed)
lucene/dev/branches/branch_4x/lucene/join/ (props changed)
lucene/dev/branches/branch_4x/lucene/memory/ (props changed)
lucene/dev/branches/branch_4x/lucene/misc/ (props changed)
lucene/dev/branches/branch_4x/lucene/module-build.xml (props changed)
lucene/dev/branches/branch_4x/lucene/queries/ (props changed)
lucene/dev/branches/branch_4x/lucene/queryparser/ (props changed)
lucene/dev/branches/branch_4x/lucene/sandbox/ (props changed)
lucene/dev/branches/branch_4x/lucene/site/ (props changed)
lucene/dev/branches/branch_4x/lucene/spatial/ (props changed)
lucene/dev/branches/branch_4x/lucene/suggest/ (props changed)
lucene/dev/branches/branch_4x/lucene/test-framework/ (props changed)
lucene/dev/branches/branch_4x/lucene/tools/ (props changed)
lucene/dev/branches/branch_4x/solr/ (props changed)
lucene/dev/branches/branch_4x/solr/CHANGES.txt (props changed)
lucene/dev/branches/branch_4x/solr/LICENSE.txt (props changed)
lucene/dev/branches/branch_4x/solr/NOTICE.txt (props changed)
lucene/dev/branches/branch_4x/solr/README.txt (props changed)
lucene/dev/branches/branch_4x/solr/build.xml (props changed)
lucene/dev/branches/branch_4x/solr/cloud-dev/ (props changed)
lucene/dev/branches/branch_4x/solr/common-build.xml (props changed)
lucene/dev/branches/branch_4x/solr/contrib/ (props changed)
lucene/dev/branches/branch_4x/solr/core/ (props changed)
lucene/dev/branches/branch_4x/solr/dev-tools/ (props changed)
lucene/dev/branches/branch_4x/solr/example/ (props changed)
lucene/dev/branches/branch_4x/solr/lib/ (props changed)
lucene/dev/branches/branch_4x/solr/lib/httpclient-LICENSE-ASL.txt (props changed)
lucene/dev/branches/branch_4x/solr/lib/httpclient-NOTICE.txt (props changed)
lucene/dev/branches/branch_4x/solr/lib/httpcore-LICENSE-ASL.txt (props changed)
lucene/dev/branches/branch_4x/solr/lib/httpcore-NOTICE.txt (props changed)
lucene/dev/branches/branch_4x/solr/lib/httpmime-LICENSE-ASL.txt (props changed)
lucene/dev/branches/branch_4x/solr/lib/httpmime-NOTICE.txt (props changed)
lucene/dev/branches/branch_4x/solr/scripts/ (props changed)
lucene/dev/branches/branch_4x/solr/solrj/ (props changed)
lucene/dev/branches/branch_4x/solr/test-framework/ (props changed)
lucene/dev/branches/branch_4x/solr/testlogging.properties (props changed)
lucene/dev/branches/branch_4x/solr/webapp/ (props changed)
Modified: lucene/dev/branches/branch_4x/lucene/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/CHANGES.txt?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/CHANGES.txt (original)
+++ lucene/dev/branches/branch_4x/lucene/CHANGES.txt Tue Jun 12 14:06:48 2012
@@ -895,6 +895,9 @@ New features
cause a ParseException (depending on whether strict parsing is enabled).
(Luca Cavanna via Chris Male)
+* LUCENE-3440: Add ordered fragments feature with IDF-weighted terms for FVH.
+ (Sebastian Lutze via Koji Sekiguchi)
+
Optimizations
* LUCENE-2588: Don't store unnecessary suffixes when writing the terms
Modified: lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java (original)
+++ lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java Tue Jun 12 14:06:48 2012
@@ -150,7 +150,7 @@ public class FieldPhraseList {
}
/**
- * @return the termInfos
+ * @return the termInfos
*/
public List<TermInfo> getTermsInfos() {
return termsInfos;
@@ -164,7 +164,7 @@ public class FieldPhraseList {
this.boost = boost;
this.seqnum = seqnum;
- // now we keep TermInfos for further operations
+ // We keep TermInfos for further operations
termsInfos = new ArrayList<TermInfo>( terms );
termsOffsets = new ArrayList<Toffs>( terms.size() );
Modified: lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java (original)
+++ lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java Tue Jun 12 14:06:48 2012
@@ -42,12 +42,13 @@ public class SimpleFieldFragList extends
*/
@Override
public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
- float score = 0;
+ float totalBoost = 0;
List<SubInfo> subInfos = new ArrayList<SubInfo>();
for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
- score += phraseInfo.getBoost();
+ totalBoost += phraseInfo.getBoost();
}
- getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, score ) );
+ getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
}
+
}
Modified: lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html (original)
+++ lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html Tue Jun 12 14:06:48 2012
@@ -27,9 +27,9 @@ This is an another highlighter implement
<li>support multi-term (includes wildcard, range, regexp, etc) queries</li>
<li>need Java 1.5</li>
<li>highlight fields need to be stored with Positions and Offsets</li>
-<li>take into account query boost to score fragments</li>
+<li>take into account query boost and/or IDF-weight to score fragments</li>
<li>support colored highlight tags</li>
-<li>pluggable FragListBuilder</li>
+<li>pluggable FragListBuilder / FieldFragList</li>
<li>pluggable FragmentsBuilder</li>
</ul>
@@ -122,9 +122,8 @@ by reference to <code>QueryPhraseMap</co
+----------------+-----------------+---+
</pre>
<p>The type of each entry is <code>WeightedPhraseInfo</code> that consists of
-an array of terms offsets and weight. The weight (Fast Vector Highlighter uses query boost to
-calculate the weight) will be taken into account when Fast Vector Highlighter creates
-{@link org.apache.lucene.search.vectorhighlight.FieldFragList} in the next step.</p>
+an array of terms offsets and weight.
+</p>
<h3>Step 4.</h3>
<p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> by reference to
<code>FieldPhraseList</code>. In this sample case, the following
@@ -137,6 +136,59 @@ calculate the weight) will be taken into
|totalBoost=3 |
+---------------------------------+
</pre>
+
+<p>
+The calculation for each <code>FieldFragList.WeightedFragInfo.totalBoost</code> (weight)
+depends on the implementation of <code>FieldFragList.add( ... )</code>:
+<pre class="prettyprint">
+ public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
+ float totalBoost = 0;
+ List<SubInfo> subInfos = new ArrayList<SubInfo>();
+ for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
+ subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
+ totalBoost += phraseInfo.getBoost();
+ }
+ getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
+ }
+
+</pre>
+The implementation of <code>FieldFragList</code> that is used is specified in <code>BaseFragListBuilder.createFieldFragList( ... )</code>:
+<pre class="prettyprint">
+ public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
+ return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
+ }
+</pre>
+<p>
+Currently there are basically two approaches available:
+</p>
+<ul>
+<li><code>SimpleFragListBuilder using SimpleFieldFragList</code>: <i>sum-of-boosts</i>-approach. The totalBoost is calculated by summing the query boosts per term. By default, a term is boosted by 1.0.</li>
+<li><code>WeightedFragListBuilder using WeightedFieldFragList</code>: <i>sum-of-distinct-weights</i>-approach. The totalBoost is calculated by summing the IDF-weights of distinct terms.</li>
+</ul>
+<p>Comparison of the two approaches:</p>
+<table border="1">
+<caption>
+ query = das alte testament (The Old Testament)
+</caption>
+<tr><th>Terms in fragment</th><th>sum-of-distinct-weights</th><th>sum-of-boosts</th></tr>
+<tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das testament alte</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das testament</td><td>2.9455688</td><td>2.0</td></tr>
+<tr><td>das alte</td><td>2.4759595</td><td>2.0</td></tr>
+<tr><td>das das das das</td><td>1.5015357</td><td>4.0</td></tr>
+<tr><td>das das das</td><td>1.3003681</td><td>3.0</td></tr>
+<tr><td>das das</td><td>1.061746</td><td>2.0</td></tr>
+<tr><td>alte</td><td>1.0</td><td>1.0</td></tr>
+<tr><td>alte</td><td>1.0</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+</table>
+
<h3>Step 5.</h3>
<p>In Step 5, by using <code>FieldFragList</code> and the field stored data,
Fast Vector Highlighter creates highlighted snippets!</p>
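
Reader's note on the comparison table in the package.html hunk: the sketch below is a standalone illustration of how the two approaches can rank fragments differently. It is not the committed WeightedFieldFragList code, which this message does not include. The class name, method names and the per-term weight map are hypothetical. The weights for "das" and "alte" are the single-term rows of the table, and the weight for "testament" is back-computed from the "das testament" row. The square-root scaling by the number of term occurrences is an assumption inferred from the table figures, not something stated in this commit.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FragWeightSketch {

    // Hypothetical per-term weights: "das" and "alte" come from the single-term
    // rows of the table; "testament" is back-computed from the "das testament" row.
    static final Map<String, Double> TERM_WEIGHT = Map.of(
        "das", 0.7507678,
        "alte", 1.0,
        "testament", 1.332064);

    // Default query boost per term, as stated in the package documentation.
    static final double DEFAULT_BOOST = 1.0;

    // sum-of-boosts: every matched term occurrence contributes its query boost.
    static double sumOfBoosts(List<String> termsInFragment) {
        return termsInFragment.size() * DEFAULT_BOOST;
    }

    // sum-of-distinct-weights: each distinct term contributes its weight once.
    // ASSUMPTION: the sum is then scaled by sqrt(number of occurrences); this
    // scaling is inferred from the table, not taken from the committed source.
    static double sumOfDistinctWeights(List<String> termsInFragment) {
        Set<String> seen = new HashSet<>();
        double total = 0;
        for (String term : termsInFragment) {
            if (seen.add(term)) {
                total += TERM_WEIGHT.getOrDefault(term, 0.0);
            }
        }
        return total * Math.sqrt(termsInFragment.size());
    }

    public static void main(String[] args) {
        List<List<String>> fragments = List.of(
            List.of("das", "alte", "testament"),
            List.of("das", "alte"),
            List.of("das", "das", "das", "das"));
        for (List<String> fragment : fragments) {
            System.out.println(String.join(" ", fragment)
                + " | distinct-weights=" + sumOfDistinctWeights(fragment)
                + " | boosts=" + sumOfBoosts(fragment));
        }
    }
}
```

Under these assumptions, a fragment that repeats "das" four times gets the highest sum-of-boosts score of the three fragments but the lowest sum-of-distinct-weights score, which matches the ordering shown in the table.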