Posted to commits@lucene.apache.org by rm...@apache.org on 2012/04/22 20:41:43 UTC
svn commit: r1328929 - in /lucene/dev/trunk/lucene: core/src/java/org/apache/lucene/search/package.html core/src/java/org/apache/lucene/search/similarities/package.html site/html/scoring.html site/xsl/index.xsl

Author: rmuir
Date: Sun Apr 22 18:41:42 2012
New Revision: 1328929

URL: http://svn.apache.org/viewvc?rev=1328929&view=rev
Log:
integrate scoring.html into scoring package, fix broken links, and update for 4.0

Removed:
    lucene/dev/trunk/lucene/site/html/scoring.html
Modified:
    lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/package.html
    lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/similarities/package.html
    lucene/dev/trunk/lucene/site/xsl/index.xsl

Modified: lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/package.html?rev=1328929&r1=1328928&r2=1328929&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/package.html (original)
+++ lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/package.html Sun Apr 22 18:41:42 2012
@@ -27,18 +27,33 @@ Code to search indices.
     <ol>
         <li><a href="#search">Search Basics</a></li>
         <li><a href="#query">The Query Classes</a></li>
-        <li><a href="#scoring">Changing the Scoring</a></li>
+        <li><a href="#scoring">Scoring: Introduction</a></li>
+        <li><a href="#scoringBasics">Scoring: Basics</a></li>
+        <li><a href="#changingScoring">Changing the Scoring</a></li>
+        <li><a href="#algorithm">Appendix: Search Algorithm</a></li>
     </ol>
 </p>
 <a name="search"></a>
-<h2>Search</h2>
+<h2>Search Basics</h2>
 <p>
-Search over indices.
-
-Applications usually call {@link
+Lucene offers a wide variety of {@link org.apache.lucene.search.Query} implementations, most of which are in
+this package, its subpackages ({@link org.apache.lucene.search.spans spans}, {@link org.apache.lucene.search.payloads payloads}),
+or the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. These implementations can be combined in a wide 
+variety of ways to provide complex querying capabilities along with information about where matches took place in the document 
+collection. The <a href="#query">Query Classes</a> section below highlights some of the more important Query classes. For details 
+on implementing your own Query class, see <a href="#customQueriesExpert">Custom Queries -- Expert Level</a> below.
+</p>
+<p>
+To perform a search, applications usually call {@link
 org.apache.lucene.search.IndexSearcher#search(Query,int)} or {@link
 org.apache.lucene.search.IndexSearcher#search(Query,Filter,int)}.
-
+</p>
+<p>
+Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
+process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
+implementation and its {@link org.apache.lucene.search.Scorer Scorer} instances. See the <a href="#algorithm">Algorithm</a> 
+section for more notes on the process.
+</p>
     <!-- FILL IN MORE HERE -->   
     <!-- TODO: this page over-links the same things too many times -->
 </p>
@@ -211,20 +226,118 @@ org.apache.lucene.search.IndexSearcher#s
     This type of query can be useful when accounting for spelling variations in the collection.
 </p>
 <a name="scoring"></a>
+<h2>Scoring &mdash; Introduction</h2>
+<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides 
+   almost all of the complexity from the user. In a nutshell, it works.  At least, that is, 
+   until it doesn't work, or doesn't work as one would expect it to work.  Then we are left 
+   digging into Lucene internals or asking for help on 
+   <a href="mailto:java-user@lucene.apache.org">java-user@lucene.apache.org</a> to figure out 
+   why a document with five of our query terms scores lower than a different document with 
+   only one of the query terms. 
+</p>
+<p>While this document won't answer your specific scoring issues, it will, hopefully, point you 
+  to the places that can help you figure out the <i>what</i> and <i>why</i> of Lucene scoring.
+</p>
+<p>Lucene scoring supports a number of pluggable information retrieval 
+   <a href="http://en.wikipedia.org/wiki/Information_retrieval#Model_types">models</a>, including:
+   <ul>
+     <li><a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM)</a></li>
+     <li><a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model">Probabilistic Models</a> such as 
+         <a href="http://en.wikipedia.org/wiki/Probabilistic_relevance_model_(BM25)">Okapi BM25</a> and
+         <a href="http://en.wikipedia.org/wiki/Divergence-from-randomness_model">DFR</a></li>
+     <li><a href="http://en.wikipedia.org/wiki/Language_model">Language models</a></li>
+   </ul>
+   These models can be plugged in via the {@link org.apache.lucene.search.similarities Similarity API},
+   and offer extension hooks and parameters for tuning. In general, Lucene first narrows down the documents
+   that need to be scored based on boolean logic in the Query specification, and then ranks this subset of
+   documents via the retrieval model. For some valuable references on VSM and IR in general refer to
+   <a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
+</p>
+<p>The rest of this document will cover <a href="#scoringBasics">Scoring basics</a> and explain how to 
+   change your {@link org.apache.lucene.search.similarities.Similarity Similarity}. Next, it will cover
+   ways you can customize the Lucene internals in 
+   <a href="#customQueriesExpert">Custom Queries -- Expert Level</a>, which gives details on 
+   implementing your own {@link org.apache.lucene.search.Query Query} class and related functionality.
+   Finally, we will finish up with some reference material in the <a href="#algorithm">Appendix</a>.
+</p>
+<a name="scoringBasics"></a>
+<h2>Scoring &mdash; Basics</h2>
+<p>Scoring is very much dependent on the way documents are indexed, so it is important to understand 
+   indexing. (See the <a href="{@docRoot}/overview-summary.html">Lucene overview</a> before continuing
+   on with this section.) It is also assumed that readers know how to use the 
+   {@link org.apache.lucene.search.IndexSearcher#explain(org.apache.lucene.search.Query, int) IndexSearcher.explain(Query, doc)}
+   functionality, which can go a long way in informing why a score is returned.
+</p>
+<h4>Fields and Documents</h4>
+<p>In Lucene, the objects we are scoring are {@link org.apache.lucene.document.Document Document}s.
+   A Document is a collection of {@link org.apache.lucene.document.Field Field}s.  Each Field has
+   {@link org.apache.lucene.document.FieldType semantics} about how it is created and stored 
+   ({@link org.apache.lucene.document.FieldType#tokenized() tokenized}, 
+   {@link org.apache.lucene.document.FieldType#stored() stored}, etc). It is important to note that 
+   Lucene scoring works on Fields and then combines the results to return Documents. This is 
+   important because two Documents with the exact same content, but one having the content in two
+   Fields and the other in one Field may return different scores for the same query due to length
+   normalization.
+</p>
+<h4>Score Boosting</h4>
+<p>Lucene allows influencing search results by "boosting" at more than one level:
+   <ul>                   
+      <li><b>Index-time boost</b> by calling
+       {@link org.apache.lucene.document.Field#setBoost(float) Field.setBoost()} before a document is 
+       added to the index.</li>
+      <li><b>Query-time boost</b> by setting a boost on a query clause, calling
+       {@link org.apache.lucene.search.Query#setBoost(float) Query.setBoost()}.</li>
+   </ul>    
+</p>
+<p>Index-time boosts are pre-processed for storage efficiency and written to
+   storage for a field as follows:
+   <ul>
+       <li>All boosts of that field (i.e. all boosts under the same field name in that doc) are 
+           multiplied.</li>
+       <li>The boost is then encoded into a normalization value by the Similarity
+           object at index-time: {@link org.apache.lucene.search.similarities.Similarity#computeNorm computeNorm()}.
+           The actual encoding depends upon the Similarity implementation, but note that most
+           use a lossy encoding (such as multiplying the boost with document length or similar, packed
+           into a single byte!).</li>
+       <li>Decoding of any index-time normalization values and integration into the document's score is also performed 
+           at search time by the Similarity.</li>
+    </ul>
+</p>
+<a name="changingScoring"></a>
 <h2>Changing Scoring &mdash; Similarity</h2>
-
+<p>
+Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to 
+influence scoring. This is done at index-time with 
+{@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(org.apache.lucene.search.similarities.Similarity)
+ IndexWriterConfig.setSimilarity(Similarity)} and at query-time with
+{@link org.apache.lucene.search.IndexSearcher#setSimilarity(org.apache.lucene.search.similarities.Similarity)
+ IndexSearcher.setSimilarity(Similarity)}.
+</p>
+<p>
+You can influence scoring by configuring a different built-in Similarity implementation, by tweaking its
+parameters, or by subclassing it to override behavior. Some implementations also offer a modular API which you can
+extend by plugging in a different component (e.g. term frequency normalizer).
+</p>
+<p>
+Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
+to implement a new retrieval model, or to use external scoring factors particular to your application. For example,
+a custom Similarity can access per-document values via {@link org.apache.lucene.search.FieldCache FieldCache} or
+{@link org.apache.lucene.index.DocValues} and integrate them into the score.
+</p>
+<p>
 See the {@link org.apache.lucene.search.similarities} package documentation for information
-on the available scoring models and extending or changing Similarity.
-
-<h2>Changing Scoring &mdash; Expert Level</h2>
+on the available built-in scoring models and extending or changing Similarity.
+</p>
+<a name="customQueriesExpert"></a>
+<h2>Custom Queries &mdash; Expert Level</h2>
 
-<p>Changing scoring is an expert level task, so tread carefully and be prepared to share your code if
+<p>Custom queries are an expert level task, so tread carefully and be prepared to share your code if
     you want help.
 </p>
 
 <p>With the warning out of the way, it is possible to change a lot more than just the Similarity
-    when it comes to scoring in Lucene. Lucene's scoring is a complex mechanism that is grounded by
-    <span >three main classes</span>:
+    when it comes to matching and scoring in Lucene. Lucene's search is a complex mechanism that is grounded by
+    <span>three main classes</span>:
     <ol>
         <li>
             {@link org.apache.lucene.search.Query Query} &mdash; The abstract object representation of the
@@ -248,13 +361,13 @@ on the available scoring models and exte
         {@link org.apache.lucene.search.Query Query} class has several methods that are important for
         derived classes:
         <ol>
-            <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher} &mdash; A
+            <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher)} &mdash; A
                 {@link org.apache.lucene.search.Weight Weight} is the internal representation of the
                 Query, so each Query implementation must
                 provide an implementation of Weight. See the subsection on <a
                     href="#weightClass">The Weight Interface</a> below for details on implementing the Weight
                 interface.</li>
-            <li>{@link org.apache.lucene.search.Query#rewrite(IndexReader) rewrite(IndexReader reader} &mdash; Rewrites queries into primitive queries. Primitive queries are:
+            <li>{@link org.apache.lucene.search.Query#rewrite(IndexReader) rewrite(IndexReader reader)} &mdash; Rewrites queries into primitive queries. Primitive queries are:
                 {@link org.apache.lucene.search.TermQuery TermQuery},
                 {@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
                     >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher) createWeight(IndexSearcher searcher)}</span></li>
@@ -363,5 +476,63 @@ on the available scoring models and exte
         back
         out of Lucene (similar to Doug adding SpanQuery functionality).</p>
 
+<!-- TODO: integrate this better, it's better served as an intro than an appendix -->
+<a name="algorithm"></a>
+<h2>Appendix: Search Algorithm</h2>
+<p>This section is mostly notes on stepping through the Scoring process and serves as
+   fertilizer for the earlier sections.</p>
+<p>In the typical search application, a {@link org.apache.lucene.search.Query Query}
+   is passed to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher},
+   beginning the scoring process.</p>
+<p>Once inside the IndexSearcher, a {@link org.apache.lucene.search.Collector Collector}
+   is used for the scoring and sorting of the search results.
+   These important objects are involved in a search:
+   <ol>                
+      <li>The {@link org.apache.lucene.search.Weight Weight} object of the Query. The
+          Weight object is an internal representation of the Query that allows the Query 
+          to be reused by the IndexSearcher.</li>
+      <li>The IndexSearcher that initiated the call.</li>     
+      <li>A {@link org.apache.lucene.search.Filter Filter} for limiting the result set.
+          Note, the Filter may be null.</li>                   
+      <li>A {@link org.apache.lucene.search.Sort Sort} object for specifying how to sort
+          the results if the standard score-based sort method is not desired.</li>                   
+  </ol>       
+</p>
+<p>Assuming we are not sorting (since sorting doesn't affect the raw Lucene score),
+   we call one of the search methods of the IndexSearcher, passing in the
+   {@link org.apache.lucene.search.Weight Weight} object created by
+   {@link org.apache.lucene.search.IndexSearcher#createNormalizedWeight(org.apache.lucene.search.Query)
+    IndexSearcher.createNormalizedWeight(Query)}, 
+   {@link org.apache.lucene.search.Filter Filter} and the number of results we want.
+   This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
+   which is an internal collection of search results. The IndexSearcher creates
+   a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
+   passes it along with the Weight and Filter to another expert search method (for
+   more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
+   see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
+   uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
+   top results for the search.
+</p> 
+<p>If a Filter is being used, some initial setup is done to determine which docs to include. 
+   Otherwise, we ask the Weight for a {@link org.apache.lucene.search.Scorer Scorer} for each
+   {@link org.apache.lucene.index.IndexReader IndexReader} segment and proceed by calling
+   {@link org.apache.lucene.search.Scorer#score(org.apache.lucene.search.Collector) Scorer.score()}.
+</p>
+<p>At last, we are actually going to score some documents. The score method takes in the Collector
+   (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here 
+   is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
+   by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
+   submitted. In most real world applications with multiple query terms, the 
+   {@link org.apache.lucene.search.Scorer Scorer} is going to be a <code>BooleanScorer2</code> created
+   from {@link org.apache.lucene.search.BooleanQuery.BooleanWeight BooleanWeight} (see the section on
+   <a href="#customQueriesExpert">custom queries</a> for info on changing this).
+</p>
+<p>Assuming a BooleanScorer2, we first initialize the Coordinator, which is used to apply the coord() 
+  factor. We then get an internal Scorer based on the required, optional and prohibited parts of the query.
+  Using this internal Scorer, the BooleanScorer2 then proceeds into a while loop based on the 
+  {@link org.apache.lucene.search.Scorer#nextDoc Scorer.nextDoc()} method. The nextDoc() method advances 
+  to the next document matching the query. This is an abstract method in the Scorer class and is thus 
+  overridden by all derived implementations. If you have a simple OR query, your internal Scorer is most 
+  likely a DisjunctionSumScorer, which essentially combines the scores of the sub-scorers of the OR'd terms.</p>
 </body>
 </html>
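
For illustration only, and not part of r1328929: the appendix added above
describes a loop driven by nextDoc(), a disjunction that combines the scores
of its sub-scorers, and a collector that keeps the top results in a priority
queue. The self-contained Java toy below mirrors that shape without using
Lucene at all. The names SearchLoopSketch, TermScorerSketch, Hit and searchOr
are hypothetical, per-term scores are assumed to be precomputed, and the
summing and tie-breaking rules are simplifying assumptions rather than what
BooleanScorer2, DisjunctionSumScorer or TopScoreDocCollector actually do.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Toy model of the OR search loop. This is NOT Lucene code; every name is hypothetical.
final class SearchLoopSketch {

    /** A (doc, score) pair, standing in for a search hit. */
    static final class Hit {
        final int doc;
        final float score;
        Hit(int doc, float score) { this.doc = doc; this.score = score; }
    }

    /** Walks one term's matching docs in increasing doc id order. */
    static final class TermScorerSketch {
        static final int NO_MORE_DOCS = Integer.MAX_VALUE;
        private final int[] docs;      // sorted doc ids that contain the term
        private final float[] scores;  // assumed precomputed per-doc score for the term
        private int pos = -1;          // -1 means "not positioned yet"

        TermScorerSketch(int[] docs, float[] scores) {
            this.docs = docs;
            this.scores = scores;
        }

        int docID() {
            if (pos < 0) return -1;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }

        int nextDoc() { pos++; return docID(); }

        float score() { return scores[pos]; }
    }

    // Orders hits worst-first: lower score first; on equal scores, higher doc id first.
    private static final Comparator<Hit> WORST_FIRST = (a, b) -> {
        int byScore = Float.compare(a.score, b.score);
        return byScore != 0 ? byScore : Integer.compare(b.doc, a.doc);
    };

    /** ORs the sub-scorers together, sums their scores per doc, and keeps the n best hits. */
    static List<Hit> searchOr(List<TermScorerSketch> subs, int n) {
        // The head of the queue is always the weakest hit currently kept.
        PriorityQueue<Hit> queue = new PriorityQueue<>(WORST_FIRST);
        for (TermScorerSketch s : subs) s.nextDoc();  // position each sub-scorer on its first doc

        while (true) {
            // The next matching doc is the smallest doc id any sub-scorer is positioned on.
            int doc = TermScorerSketch.NO_MORE_DOCS;
            for (TermScorerSketch s : subs) doc = Math.min(doc, s.docID());
            if (doc == TermScorerSketch.NO_MORE_DOCS) break;

            // Sum the scores of every sub-scorer on this doc, then advance those sub-scorers.
            float sum = 0f;
            for (TermScorerSketch s : subs) {
                if (s.docID() == doc) {
                    sum += s.score();
                    s.nextDoc();
                }
            }

            // "Collect" the hit and evict the weakest one if more than n are held.
            queue.add(new Hit(doc, sum));
            if (queue.size() > n) queue.poll();
        }

        List<Hit> top = new ArrayList<>(queue);
        top.sort(WORST_FIRST.reversed());  // best score first, lower doc id first on ties
        return top;
    }
}
```

Calling SearchLoopSketch.searchOr(subs, 10) with one TermScorerSketch per
OR'd term returns at most ten hits, best first. The scorers are stateful, so
each call needs fresh instances. Keeping the weakest hit at the head of the
queue means a full queue only ever gives up its current worst entry, which is
the usual reason to reach for a priority queue when only the top n matter.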

Modified: lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/similarities/package.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/similarities/package.html?rev=1328929&r1=1328928&r2=1328929&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/similarities/package.html (original)
+++ lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/similarities/package.html Sun Apr 22 18:41:42 2012
@@ -39,7 +39,8 @@ package.
 <h2>Summary of the Ranking Methods</h2>
 
 <p>{@link org.apache.lucene.search.similarities.DefaultSimilarity} is the original Lucene
-scoring function. It is based on a highly optimized Vector Space Model. For more
+scoring function. It is based on a highly optimized 
+<a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model</a>. For more
 information, see {@link org.apache.lucene.search.similarities.TFIDFSimilarity}.</p>
 
 <p>{@link org.apache.lucene.search.similarities.BM25Similarity} is an optimized

Modified: lucene/dev/trunk/lucene/site/xsl/index.xsl
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/xsl/index.xsl?rev=1328929&r1=1328928&r2=1328929&view=diff
==============================================================================
--- lucene/dev/trunk/lucene/site/xsl/index.xsl (original)
+++ lucene/dev/trunk/lucene/site/xsl/index.xsl Sun Apr 22 18:41:42 2012
@@ -61,7 +61,7 @@
           <ul>
             <li><a href="changes/Changes.html">Changes</a>: List of changes in this release.</li>
             <li><a href="fileformats.html">File Formats</a>: Guide to the index format used by Lucene.</li>
-            <li><a href="scoring.html">Scoring in Lucene</a>: Introduction to how Lucene scores documents.</li>
+            <li><a href="core/org/apache/lucene/search/package-summary.html#package_description">Search and Scoring in Lucene</a>: Introduction to how Lucene scores documents.</li>
             <li><a href="core/org/apache/lucene/search/similarities/TFIDFSimilarity.html">Classic Scoring Formula</a>: Formula of Lucene's classic <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space</a> implementation. (look <a href="core/org/apache/lucene/search/similarities/package-summary.html#package_description">here</a> for other models)</li>
             <li><a href="queryparser/org/apache/lucene/queryparser/classic/package-summary.html#package_description">Classic QueryParser Syntax</a>: Overview of the Classic QueryParser's syntax and features.</li>
             <li><a href="facet/org/apache/lucene/facet/doc-files/userguide.html">Facet User Guide</a>: User's Guide to implementing <a href="http://en.wikipedia.org/wiki/Faceted_search">Faceted search</a>.</li>