Posted to commits@lucene.apache.org by ko...@apache.org on 2012/06/12 16:06:50 UTC

svn commit: r1349364 - in /lucene/dev/branches/branch_4x: ./ dev-tools/ lucene/ lucene/analysis/ lucene/analysis/common/ lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/std31/ lucene/analysis/common/src/java/org/apache/lucene/analys...

Author: koji
Date: Tue Jun 12 14:06:48 2012
New Revision: 1349364

URL: http://svn.apache.org/viewvc?rev=1349364&view=rev
Log:
LUCENE-4133: FVH: A weighted approach for ordered fragments, part of LUCENE-3440

Added:
    lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFieldFragList.java
      - copied unchanged from r1349361, lucene/dev/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFieldFragList.java
    lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilder.java
      - copied unchanged from r1349361, lucene/dev/trunk/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilder.java
    lucene/dev/branches/branch_4x/lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilderTest.java
      - copied unchanged from r1349361, lucene/dev/trunk/lucene/highlighter/src/test/org/apache/lucene/search/vectorhighlight/WeightedFragListBuilderTest.java
Modified:
    lucene/dev/branches/branch_4x/   (props changed)
    lucene/dev/branches/branch_4x/dev-tools/   (props changed)
    lucene/dev/branches/branch_4x/lucene/   (props changed)
    lucene/dev/branches/branch_4x/lucene/BUILD.txt   (props changed)
    lucene/dev/branches/branch_4x/lucene/CHANGES.txt   (contents, props changed)
    lucene/dev/branches/branch_4x/lucene/JRE_VERSION_MIGRATION.txt   (props changed)
    lucene/dev/branches/branch_4x/lucene/LICENSE.txt   (props changed)
    lucene/dev/branches/branch_4x/lucene/MIGRATE.txt   (props changed)
    lucene/dev/branches/branch_4x/lucene/NOTICE.txt   (props changed)
    lucene/dev/branches/branch_4x/lucene/README.txt   (props changed)
    lucene/dev/branches/branch_4x/lucene/analysis/   (props changed)
    lucene/dev/branches/branch_4x/lucene/analysis/common/   (props changed)
    lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/std31/package.html   (props changed)
    lucene/dev/branches/branch_4x/lucene/analysis/common/src/java/org/apache/lucene/analysis/standard/std34/package.html   (props changed)
    lucene/dev/branches/branch_4x/lucene/backwards/   (props changed)
    lucene/dev/branches/branch_4x/lucene/benchmark/   (props changed)
    lucene/dev/branches/branch_4x/lucene/build.xml   (props changed)
    lucene/dev/branches/branch_4x/lucene/common-build.xml   (props changed)
    lucene/dev/branches/branch_4x/lucene/core/   (props changed)
    lucene/dev/branches/branch_4x/lucene/demo/   (props changed)
    lucene/dev/branches/branch_4x/lucene/facet/   (props changed)
    lucene/dev/branches/branch_4x/lucene/grouping/   (props changed)
    lucene/dev/branches/branch_4x/lucene/highlighter/   (props changed)
    lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
    lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java
    lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html
    lucene/dev/branches/branch_4x/lucene/ivy-settings.xml   (props changed)
    lucene/dev/branches/branch_4x/lucene/join/   (props changed)
    lucene/dev/branches/branch_4x/lucene/memory/   (props changed)
    lucene/dev/branches/branch_4x/lucene/misc/   (props changed)
    lucene/dev/branches/branch_4x/lucene/module-build.xml   (props changed)
    lucene/dev/branches/branch_4x/lucene/queries/   (props changed)
    lucene/dev/branches/branch_4x/lucene/queryparser/   (props changed)
    lucene/dev/branches/branch_4x/lucene/sandbox/   (props changed)
    lucene/dev/branches/branch_4x/lucene/site/   (props changed)
    lucene/dev/branches/branch_4x/lucene/spatial/   (props changed)
    lucene/dev/branches/branch_4x/lucene/suggest/   (props changed)
    lucene/dev/branches/branch_4x/lucene/test-framework/   (props changed)
    lucene/dev/branches/branch_4x/lucene/tools/   (props changed)
    lucene/dev/branches/branch_4x/solr/   (props changed)
    lucene/dev/branches/branch_4x/solr/CHANGES.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/LICENSE.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/NOTICE.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/README.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/build.xml   (props changed)
    lucene/dev/branches/branch_4x/solr/cloud-dev/   (props changed)
    lucene/dev/branches/branch_4x/solr/common-build.xml   (props changed)
    lucene/dev/branches/branch_4x/solr/contrib/   (props changed)
    lucene/dev/branches/branch_4x/solr/core/   (props changed)
    lucene/dev/branches/branch_4x/solr/dev-tools/   (props changed)
    lucene/dev/branches/branch_4x/solr/example/   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/httpclient-LICENSE-ASL.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/httpclient-NOTICE.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/httpcore-LICENSE-ASL.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/httpcore-NOTICE.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/httpmime-LICENSE-ASL.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/lib/httpmime-NOTICE.txt   (props changed)
    lucene/dev/branches/branch_4x/solr/scripts/   (props changed)
    lucene/dev/branches/branch_4x/solr/solrj/   (props changed)
    lucene/dev/branches/branch_4x/solr/test-framework/   (props changed)
    lucene/dev/branches/branch_4x/solr/testlogging.properties   (props changed)
    lucene/dev/branches/branch_4x/solr/webapp/   (props changed)

Modified: lucene/dev/branches/branch_4x/lucene/CHANGES.txt
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/CHANGES.txt?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/CHANGES.txt (original)
+++ lucene/dev/branches/branch_4x/lucene/CHANGES.txt Tue Jun 12 14:06:48 2012
@@ -895,6 +895,9 @@ New features
   cause a ParseException (depending on whether strict parsing is enabled).
   (Luca Cavanna via Chris Male)
    
+* LUCENE-3440: Add ordered fragments feature with IDF-weighted terms for FVH.
+  (Sebastian Lutze via Koji Sekiguchi)
+
 Optimizations
 
 * LUCENE-2588: Don't store unnecessary suffixes when writing the terms

Modified: lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java (original)
+++ lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/FieldPhraseList.java Tue Jun 12 14:06:48 2012
@@ -150,7 +150,7 @@ public class FieldPhraseList {
     }
 
     /**
-     * @return the termInfos
+     * @return the termInfos 
      */    
     public List<TermInfo> getTermsInfos() {
       return termsInfos;
@@ -164,7 +164,7 @@ public class FieldPhraseList {
       this.boost = boost;
       this.seqnum = seqnum;
       
-      // now we keep TermInfos for further operations
+      // We keep TermInfos for further operations
       termsInfos = new ArrayList<TermInfo>( terms );
       
       termsOffsets = new ArrayList<Toffs>( terms.size() );

Modified: lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java (original)
+++ lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/SimpleFieldFragList.java Tue Jun 12 14:06:48 2012
@@ -42,12 +42,13 @@ public class SimpleFieldFragList extends
    */
   @Override
   public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
-    float score = 0;
+    float totalBoost = 0;
     List<SubInfo> subInfos = new ArrayList<SubInfo>();
     for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
       subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
-      score += phraseInfo.getBoost();
+      totalBoost += phraseInfo.getBoost();
     }
-    getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, score ) );
+    getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
   }
+  
 }

Modified: lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html?rev=1349364&r1=1349363&r2=1349364&view=diff
==============================================================================
--- lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html (original)
+++ lucene/dev/branches/branch_4x/lucene/highlighter/src/java/org/apache/lucene/search/vectorhighlight/package.html Tue Jun 12 14:06:48 2012
@@ -27,9 +27,9 @@ This is an another highlighter implement
 <li>support multi-term (includes wildcard, range, regexp, etc) queries</li>
 <li>need Java 1.5</li>
 <li>highlight fields need to be stored with Positions and Offsets</li>
-<li>take into account query boost to score fragments</li>
+<li>take into account query boost and/or IDF-weight to score fragments</li>
 <li>support colored highlight tags</li>
-<li>pluggable FragListBuilder</li>
+<li>pluggable FragListBuilder / FieldFragList</li>
 <li>pluggable FragmentsBuilder</li>
 </ul>
 
@@ -122,9 +122,8 @@ by reference to <code>QueryPhraseMap</co
 +----------------+-----------------+---+
 </pre>
 <p>The type of each entry is <code>WeightedPhraseInfo</code> that consists of
-an array of terms offsets and weight. The weight (Fast Vector Highlighter uses query boost to
-calculate the weight) will be taken into account when Fast Vector Highlighter creates
-{@link org.apache.lucene.search.vectorhighlight.FieldFragList} in the next step.</p>
+an array of terms offsets and weight. 
+</p>
 <h3>Step 4.</h3>
 <p>In Step 4, Fast Vector Highlighter creates <code>FieldFragList</code> by reference to
 <code>FieldPhraseList</code>. In this sample case, the following
@@ -137,6 +136,59 @@ calculate the weight) will be taken into
 |totalBoost=3                     |
 +---------------------------------+
 </pre>
+
+<p>
+The calculation for each <code>FieldFragList.WeightedFragInfo.totalBoost</code> (weight)  
+depends on the implementation of <code>FieldFragList.add( ... )</code>:
+<pre class="prettyprint">
+  public void add( int startOffset, int endOffset, List&lt;WeightedPhraseInfo&gt; phraseInfoList ) {
+    float totalBoost = 0;
+    List&lt;SubInfo&gt; subInfos = new ArrayList&lt;SubInfo&gt;();
+    for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
+      subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
+      totalBoost += phraseInfo.getBoost();
+    }
+    getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
+  }
+  
+</pre>
+The <code>FieldFragList</code> implementation that is used is set in <code>BaseFragListBuilder.createFieldFragList( ... )</code>:
+<pre class="prettyprint">
+  public FieldFragList createFieldFragList( FieldPhraseList fieldPhraseList, int fragCharSize ){
+    return createFieldFragList( fieldPhraseList, new SimpleFieldFragList( fragCharSize ), fragCharSize );
+  }
+</pre>
+<p>
+Currently, there are two approaches available:
+</p>
+<ul>
+<li><code>SimpleFragListBuilder</code> using <code>SimpleFieldFragList</code>: <i>sum-of-boosts</i> approach. The totalBoost is calculated by summing the query boosts per term. By default, a term is boosted by 1.0.</li>
+<li><code>WeightedFragListBuilder</code> using <code>WeightedFieldFragList</code>: <i>sum-of-distinct-weights</i> approach. The totalBoost is calculated by summing the IDF weights of distinct terms.</li>
+</ul> 
+<p>Comparison of the two approaches:</p>
+<table border="1">
+<caption>
+	query = das alte testament (The Old Testament)
+</caption>
+<tr><th>Terms in fragment</th><th>sum-of-distinct-weights</th><th>sum-of-boosts</th></tr>
+<tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das testament alte</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das alte testament</td><td>5.339621</td><td>3.0</td></tr>
+<tr><td>das testament</td><td>2.9455688</td><td>2.0</td></tr>
+<tr><td>das alte</td><td>2.4759595</td><td>2.0</td></tr>
+<tr><td>das das das das</td><td>1.5015357</td><td>4.0</td></tr>
+<tr><td>das das das</td><td>1.3003681</td><td>3.0</td></tr>
+<tr><td>das das</td><td>1.061746</td><td>2.0</td></tr>
+<tr><td>alte</td><td>1.0</td><td>1.0</td></tr>
+<tr><td>alte</td><td>1.0</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+<tr><td>das</td><td>0.7507678</td><td>1.0</td></tr>
+</table>
+
 <h3>Step 5.</h3>
 <p>In Step 5, by using <code>FieldFragList</code> and the field stored data,
 Fast Vector Highlighter creates highlighted snippets!</p>
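
The comparison table in package.html can be reproduced with a small standalone sketch. It does not use Lucene and is not the committed WeightedFieldFragList code, which this mail only lists as "copied unchanged". The class name FragmentWeightSketch, its methods, and the per-term weights are hypothetical. The weights are back-derived from the table rows "das", "alte" and "das testament", and the sqrt(number of matched terms) factor is an assumption inferred from the table values, not something this diff states.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FragmentWeightSketch {

  // Hypothetical IDF-style weights, back-derived from the table (not read from an index).
  static final Map<String, Double> TERM_WEIGHTS = new HashMap<String, Double>();
  static {
    TERM_WEIGHTS.put("das", 0.7507678);
    TERM_WEIGHTS.put("alte", 1.0);
    TERM_WEIGHTS.put("testament", 1.3320639); // approximate, inferred from the "das testament" row
  }

  /** sum-of-boosts: every matched term adds its query boost, assumed to be the default 1.0. */
  static double sumOfBoosts(List<String> matchedTerms) {
    double totalBoost = 0;
    for (int i = 0; i < matchedTerms.size(); i++) {
      totalBoost += 1.0;
    }
    return totalBoost;
  }

  /**
   * sum-of-distinct-weights: each distinct term adds its weight once.
   * ASSUMPTION: the sum is then scaled by sqrt(number of matched terms);
   * this factor is inferred from the table, not taken from the diff.
   */
  static double sumOfDistinctWeights(List<String> matchedTerms) {
    Set<String> seen = new HashSet<String>();
    double total = 0;
    for (String term : matchedTerms) {
      Double weight = TERM_WEIGHTS.get(term);
      // Unknown terms are skipped; repeated terms are counted only once.
      if (weight != null && seen.add(term)) {
        total += weight;
      }
    }
    return total * Math.sqrt(matchedTerms.size());
  }

  public static void main(String[] args) {
    List<List<String>> fragments = Arrays.asList(
        Arrays.asList("das", "alte", "testament"),
        Arrays.asList("das", "alte"),
        Arrays.asList("das", "das", "das", "das"),
        Arrays.asList("das"));
    for (List<String> fragment : fragments) {
      System.out.println(fragment + " distinct-weights=" + sumOfDistinctWeights(fragment)
          + " boosts=" + sumOfBoosts(fragment));
    }
  }
}
```

Under these assumptions, the fragment "das das das das" scores 4.0 with sum-of-boosts but only about 1.5 with sum-of-distinct-weights, so it no longer outranks a fragment that contains all three query terms.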