Posted to commits@lucene.apache.org by us...@apache.org on 2012/04/22 02:07:16 UTC
svn commit: r1328748 [2/2] - in /lucene/dev/trunk: lucene/
lucene/queryparser/src/java/org/apache/lucene/queryparser/classic/
lucene/site/ lucene/site/build/ lucene/site/html/ lucene/site/src/
lucene/site/xsl/ solr/
Added: lucene/dev/trunk/lucene/site/html/scoring.html
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/html/scoring.html?rev=1328748&view=auto
==============================================================================
--- lucene/dev/trunk/lucene/site/html/scoring.html (added)
+++ lucene/dev/trunk/lucene/site/html/scoring.html Sun Apr 22 00:07:15 2012
@@ -0,0 +1,338 @@
+<html>
+<head>
+<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<title>Apache Lucene - Scoring</title>
+</head>
+<body>
+<h1>Apache Lucene - Scoring</h1>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#Introduction">Introduction</a>
+</li>
+<li>
+<a href="#Scoring">Scoring</a>
+<ul class="minitoc">
+<li>
+<a href="#Fields and Documents">Fields and Documents</a>
+</li>
+<li>
+<a href="#Score Boosting">Score Boosting</a>
+</li>
+<li>
+<a href="#Understanding the Scoring Formula">Understanding the Scoring Formula</a>
+</li>
+<li>
+<a href="#The Big Picture">The Big Picture</a>
+</li>
+<li>
+<a href="#Query Classes">Query Classes</a>
+</li>
+<li>
+<a href="#Changing Similarity">Changing Similarity</a>
+</li>
+</ul>
+</li>
+<li>
+<a href="#Changing your Scoring -- Expert Level">Changing your Scoring -- Expert Level</a>
+</li>
+<li>
+<a href="#Appendix">Appendix</a>
+<ul class="minitoc">
+<li>
+<a href="#Algorithm">Algorithm</a>
+</li>
+</ul>
+</li>
+</ul>
+</div>
+
+
+<a name="N10013"></a><a name="Introduction"></a>
+<h2 class="boxed">Introduction</h2>
+<div class="section">
+<p>Lucene scoring is the heart of why we all love Lucene. It is blazingly fast and it hides almost all of the complexity from the user.
+ In a nutshell, it works. At least, that is, until it doesn't work, or doesn't work as one would expect it to
+ work. Then we are left digging into Lucene internals or asking for help on java-user@lucene.apache.org to figure out why a document with five of our query terms
+ scores lower than a different document with only one of the query terms. </p>
+<p>While this document won't answer your specific scoring issues, it will, hopefully, point you to the places that can
+ help you figure out the what and why of Lucene scoring.</p>
+<p>Lucene scoring uses a combination of the
+ <a href="http://en.wikipedia.org/wiki/Vector_Space_Model">Vector Space Model (VSM) of Information
+ Retrieval</a> and the <a href="http://en.wikipedia.org/wiki/Standard_Boolean_model">Boolean model</a>
+ to determine
+ how relevant a given Document is to a User's query. In general, the idea behind the VSM is that the more
+ times a query term appears in a document relative to
+ the number of times the term appears in all the documents in the collection, the more relevant that
+ document is to the query. It uses the Boolean model to first narrow down the documents that need to
+ be scored based on the use of boolean logic in the Query specification. Lucene also adds some
+ capabilities and refinements onto this model to support boolean and fuzzy searching, but it
+ essentially remains a VSM based system at the heart.
+ For some valuable references on VSM and IR in general refer to the
+ <a href="http://wiki.apache.org/lucene-java/InformationRetrieval">Lucene Wiki IR references</a>.
+ </p>
+<p>The rest of this document will cover <a href="#Scoring">Scoring</a> basics and how to change your
+ <a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>. Next it will cover ways you can
+ customize the Lucene internals in <a href="#Changing your Scoring -- Expert Level">Changing your Scoring
+ -- Expert Level</a> which gives details on implementing your own
+ <a href="core/org/apache/lucene/search/Query.html">Query</a> class and related functionality. Finally, we
+ will finish up with some reference material in the <a href="#Appendix">Appendix</a>.
+ </p>
+</div>
+
+<a name="N10045"></a><a name="Scoring"></a>
+<h2 class="boxed">Scoring</h2>
+<div class="section">
+<p>Scoring is very much dependent on the way documents are indexed,
+ so it is important to understand indexing (see
+ <a href="gettingstarted.html">Apache Lucene - Getting Started Guide</a>
+ and the Lucene
+ <a href="fileformats.html">file formats</a>)
+ before continuing on with this section. It is also assumed that readers know how to use the
+ <a href="core/org/apache/lucene/search/Searcher.html#explain(Query query, int doc)">Searcher.explain(Query query, int doc)</a> functionality,
+ which can go a long way in informing why a score is returned.
+ </p>
+<a name="N10059"></a><a name="Fields and Documents"></a>
+<h3 class="boxed">Fields and Documents</h3>
+<p>In Lucene, the objects we are scoring are
+ <a href="core/org/apache/lucene/document/Document.html">Documents</a>. A Document is a collection
+ of
+ <a href="core/org/apache/lucene/document/Field.html">Fields</a>. Each Field has semantics about how
+ it is created and stored (e.g. tokenized, untokenized, raw data, compressed). It is important to
+ note that Lucene scoring works on Fields and then combines the results to return Documents. This is
+ important because two Documents with the exact same content, but one having the content in two Fields
+ and the other in one Field will return different scores for the same query due to length normalization
+ (assuming the
+ <a href="core/org/apache/lucene/search/DefaultSimilarity.html">DefaultSimilarity</a>
+ on the Fields).
+ </p>
+<a name="N1006E"></a><a name="Score Boosting"></a>
+<h3 class="boxed">Score Boosting</h3>
+<p>Lucene allows influencing search results by "boosting" at more than one level:
+ <ul>
+
+<li>
+<b>Document level boosting</b>
+ - while indexing - by calling
+ <a href="core/org/apache/lucene/document/Document.html#setBoost(float)">document.setBoost()</a>
+ before a document is added to the index.
+ </li>
+
+<li>
+<b>Document's Field level boosting</b>
+ - while indexing - by calling
+ <a href="core/org/apache/lucene/document/Fieldable.html#setBoost(float)">field.setBoost()</a>
+ before adding a field to the document (and before adding the document to the index).
+ </li>
+
+<li>
+<b>Query level boosting</b>
+ - during search, by setting a boost on a query clause, calling
+ <a href="core/org/apache/lucene/search/Query.html#setBoost(float)">Query.setBoost()</a>.
+ </li>
+
+</ul>
+
+</p>
+<p>Indexing time boosts are preprocessed for storage efficiency and written to
+ the directory (when writing the document) in a single byte (!) as follows:
+ For each field of a document, all boosts of that field
+ (i.e. all boosts under the same field name in that doc) are multiplied.
+ The result is multiplied by the boost of the document,
+ and also multiplied by a "field length norm" value
+ that represents the length of that field in that doc
+ (so shorter fields are automatically boosted up).
+ The result is encoded as a single byte
+ (with some precision loss of course) and stored in the directory.
+ The similarity object in effect at indexing computes the length-norm of the field.
+ </p>
+<p>This composition of 1-byte representation of norms
+ (that is, the index-time multiplication of field boosts, document boost and field-length norm)
+ is nicely described in
+ <a href="core/org/apache/lucene/document/Fieldable.html#setBoost(float)">Fieldable.setBoost()</a>.
+ </p>
+<p>Encoding and decoding of the resulting float norm in a single byte are done by the
+ static methods of the class Similarity:
+ <a href="core/org/apache/lucene/search/Similarity.html#encodeNorm(float)">encodeNorm()</a> and
+ <a href="core/org/apache/lucene/search/Similarity.html#decodeNorm(byte)">decodeNorm()</a>.
+ Due to loss of precision, it is not guaranteed that decode(encode(x)) = x,
+ e.g. decode(encode(0.89)) = 0.875.
+ At scoring (search) time, this norm is brought into the score of the document
+ as <b>norm(t, d)</b>, as shown by the formula in
+ <a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>.
+ </p>
+<a name="N100B1"></a><a name="Understanding the Scoring Formula"></a>
+<h3 class="boxed">Understanding the Scoring Formula</h3>
+<p>
+ This scoring formula is described in the
+ <a href="core/org/apache/lucene/search/Similarity.html">Similarity</a> class. Please take the time to study this formula, as it contains much of the information about how the
+ basics of Lucene scoring work, especially for
+ <a href="core/org/apache/lucene/search/TermQuery.html">TermQuery</a>.
+ </p>
+<a name="N100C2"></a><a name="The Big Picture"></a>
+<h3 class="boxed">The Big Picture</h3>
+<p>OK, so the tf-idf formula and the
+ <a href="core/org/apache/lucene/search/Similarity.html">Similarity</a>
+ are great for understanding the basics of Lucene scoring, but what really drives Lucene scoring is
+ the use of, and interactions between, the
+ <a href="core/org/apache/lucene/search/Query.html">Query</a> classes, as created by each application in
+ response to a user's information need.
+ </p>
+<p>In this regard, Lucene offers a wide variety of <a href="core/org/apache/lucene/search/Query.html">Query</a> implementations, most of which are in the
+ <a href="core/org/apache/lucene/search/package-summary.html">org.apache.lucene.search</a> package.
+ These implementations can be combined in a wide variety of ways to provide complex querying
+ capabilities along with
+ information about where matches took place in the document collection. The <a href="#Query Classes">Query</a>
+ section below
+ highlights some of the more important Query classes. For information on the other ones, see the
+ <a href="core/org/apache/lucene/search/package-summary.html">package summary</a>. For details on implementing
+ your own Query class, see <a href="#Changing your Scoring -- Expert Level">Changing your Scoring --
+ Expert Level</a> below.
+ </p>
+<p>Once a Query has been created and submitted to the
+ <a href="core/org/apache/lucene/search/IndexSearcher.html">IndexSearcher</a>, the scoring process
+ begins. (See the <a href="#Appendix">Appendix</a> Algorithm section for more notes on the process.) After some infrastructure setup,
+ control finally passes to the <a href="core/org/apache/lucene/search/Weight.html">Weight</a> implementation and its
+ <a href="core/org/apache/lucene/search/Scorer.html">Scorer</a> instance. In the case of any type of
+ <a href="core/org/apache/lucene/search/BooleanQuery.html">BooleanQuery</a>, scoring is handled by the
+ <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight2</a>
+ (link goes to ViewVC BooleanQuery java code which contains the BooleanWeight2 inner class) or
+ <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanQuery.java?view=log">BooleanWeight</a>
+ (link goes to ViewVC BooleanQuery java code, which contains the BooleanWeight inner class).
+ </p>
+<p>
+ Assuming the use of the BooleanWeight2, a
+ BooleanScorer2 is created by bringing together
+ all of the
+ <a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>s from the sub-clauses of the BooleanQuery.
+ When the BooleanScorer2 is asked to score it delegates its work to an internal Scorer based on the type
+ of clauses in the Query. This internal Scorer essentially loops over the sub scorers and sums the scores
+ provided by each scorer while factoring in the coord() score.
+ <!-- Do we want to fill in the details of the counting sum scorer, disjunction scorer, etc.? -->
+ </p>
+<a name="N10112"></a><a name="Query Classes"></a>
+<h3 class="boxed">Query Classes</h3>
+<p>For information on the Query Classes, refer to the
+ <a href="core/org/apache/lucene/search/package-summary.html#query">search package javadocs</a>
+
+</p>
+<a name="N1011F"></a><a name="Changing Similarity"></a>
+<h3 class="boxed">Changing Similarity</h3>
+<p>One of the ways of changing the scoring characteristics of Lucene is to change the similarity factors. For information on
+ how to do this, see the
+ <a href="core/org/apache/lucene/search/package-summary.html#changingSimilarity">search package javadocs</a>
+</p>
+</div>
+
+<a name="N1012C"></a><a name="Changing your Scoring -- Expert Level"></a>
+<h2 class="boxed">Changing your Scoring -- Expert Level</h2>
+<div class="section">
+<p>At a much deeper level, one can affect scoring by implementing their own Query classes (and related scoring classes). To learn more
+ about how to do this, refer to the
+ <a href="core/org/apache/lucene/search/package-summary.html#scoring">search package javadocs</a>
+
+</p>
+</div>
+
+
+<a name="N10139"></a><a name="Appendix"></a>
+<h2 class="boxed">Appendix</h2>
+<div class="section">
+<a name="N1013E"></a><a name="Algorithm"></a>
+<h3 class="boxed">Algorithm</h3>
+<p>This section is mostly notes on stepping through the Scoring process and serves as
+ fertilizer for the earlier sections.</p>
+<p>In the typical search application, a
+ <a href="core/org/apache/lucene/search/Query.html">Query</a>
+ is passed to the
+ <a href="core/org/apache/lucene/search/Searcher.html">Searcher</a>
+ , beginning the scoring process.
+ </p>
+<p>Once inside the Searcher, a
+ <a href="core/org/apache/lucene/search/Collector.html">Collector</a>
+ is used for the scoring and sorting of the search results.
+ These important objects are involved in a search:
+ <ol>
+
+<li>The
+ <a href="core/org/apache/lucene/search/Weight.html">Weight</a>
+ object of the Query. The Weight object is an internal representation of the Query that
+ allows the Query to be reused by the Searcher.
+ </li>
+
+<li>The Searcher that initiated the call.</li>
+
+<li>A
+ <a href="core/org/apache/lucene/search/Filter.html">Filter</a>
+ for limiting the result set. Note, the Filter may be null.
+ </li>
+
+<li>A
+ <a href="core/org/apache/lucene/search/Sort.html">Sort</a>
+ object for specifying how to sort the results if the standard score based sort method is not
+ desired.
+ </li>
+
+</ol>
+
+</p>
+<p> Assuming we are not sorting (since sorting doesn't
+ affect the raw Lucene score),
+ we call one of the search methods of the Searcher, passing in the
+ <a href="core/org/apache/lucene/search/Weight.html">Weight</a>
+ object created by Searcher.createWeight(Query),
+ <a href="core/org/apache/lucene/search/Filter.html">Filter</a>
+ and the number of results we want. This method
+ returns a
+ <a href="core/org/apache/lucene/search/TopDocs.html">TopDocs</a>
+ object, which is an internal collection of search results.
+ The Searcher creates a
+ <a href="core/org/apache/lucene/search/TopScoreDocCollector.html">TopScoreDocCollector</a>
+ and passes it along with the Weight and Filter to another expert search method (for more on the
+ <a href="core/org/apache/lucene/search/Collector.html">Collector</a>
+ mechanism, see
+ <a href="core/org/apache/lucene/search/Searcher.html">Searcher</a>
+ .) The TopScoreDocCollector uses a
+ <a href="core/org/apache/lucene/util/PriorityQueue.html">PriorityQueue</a>
+ to collect the top results for the search.
+ </p>
+<p>If a Filter is being used, some initial setup is done to determine which docs to include. Otherwise,
+ we ask the Weight for
+ a
+ <a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
+ for the
+ <a href="core/org/apache/lucene/index/IndexReader.html">IndexReader</a>
+ of the current searcher and we proceed by
+ calling the score method on the
+ <a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
+ .
+ </p>
+<p>At last, we are actually going to score some documents. The score method takes in the Collector
+ (most likely the TopScoreDocCollector or TopFieldCollector) and does its business.
+ Of course, here is where things get involved. The
+ <a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
+ that is returned by the
+ <a href="core/org/apache/lucene/search/Weight.html">Weight</a>
+ object depends on what type of Query was submitted. In most real world applications with multiple
+ query terms,
+ the
+ <a href="core/org/apache/lucene/search/Scorer.html">Scorer</a>
+ is going to be a
+ <a href="http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/BooleanScorer2.java?view=log">BooleanScorer2</a>
+ (see the section on customizing your scoring for info on changing this.)
+
+ </p>
+<p>Assuming a BooleanScorer2 scorer, we first initialize the Coordinator, which is used to apply the
+ coord() factor. We then
+ get an internal Scorer based on the required, optional and prohibited parts of the query.
+ Using this internal Scorer, the BooleanScorer2 then proceeds
+ into a while loop based on the Scorer#next() method. The next() method advances to the next document
+ matching the query. This is an
+ abstract method in the Scorer class and is thus overridden by all derived
+ implementations. <!-- DOUBLE CHECK THIS -->If you have a simple OR query
+ your internal Scorer is most likely a DisjunctionSumScorer, which essentially combines the scorers
+ from the sub scorers of the OR'd terms.</p>
+</div>
+
+</body>
+</html>
Added: lucene/dev/trunk/lucene/site/xsl/index.xsl
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/lucene/site/xsl/index.xsl?rev=1328748&view=auto
==============================================================================
--- lucene/dev/trunk/lucene/site/xsl/index.xsl (added)
+++ lucene/dev/trunk/lucene/site/xsl/index.xsl Sun Apr 22 00:07:15 2012
@@ -0,0 +1,94 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<xsl:stylesheet version="1.0"
+ xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
+ xmlns:str="http://exslt.org/strings"
+ extension-element-prefixes="str"
+>
+ <xsl:param name="buildfiles"/>
+ <xsl:param name="version"/>
+
+ <xsl:template match="/">
+ <html>
+ <head>
+ <title><xsl:text>Apache Lucene </xsl:text><xsl:value-of select="$version"/><xsl:text> Documentation</xsl:text></title>
+ </head>
+ <body>
+ <div><img src="lucene_green_300.gif"/></div>
+ <h1><xsl:text>Apache Lucene </xsl:text><xsl:value-of select="$version"/><xsl:text> Documentation</xsl:text></h1>
+ <p>
+ This is the official documentation for <b><xsl:text>Apache Lucene </xsl:text>
+ <xsl:value-of select="$version"/></b>. Additional documentation is available in the
+ <a href="http://wiki.apache.org/lucene-java">Wiki</a>.
+ </p>
+ <h2>Index</h2>
+ <ul>
+ <li><a href="changes/Changes.html">Changes</a></li>
+ <li><a href="fileformats.html">File Formats Documentation</a></li>
+ <li><a href="scoring.html">Scoring in Lucene</a></li>
+ </ul>
+ <h2>Getting Started</h2>
+ <p>This document is intended as a "getting started" guide. It has three
+ audiences: first-time users looking to install Apache Lucene in their
+ application; developers looking to modify or base the applications they develop
+ on Lucene; and developers looking to become involved in and contribute to the
+ development of Lucene. This document is written in tutorial and walk-through
+ format. The goal is to help you "get started". It does not go into great depth
+ on some of the conceptual or inner details of Lucene.</p>
+ <p>The sections listed below build on one another. More advanced users may
+ wish to skip sections.</p>
+ <ul>
+ <li><a href="demo.html">About the command-line Lucene demo and its usage</a>.
+ This section is intended for anyone who wants to use the command-line Lucene
+ demo.</li>
+ <li><a href="demo2.html">About the sources and implementation for the
+ command-line Lucene demo</a>. This section walks through the implementation
+ details (sources) of the command-line Lucene demo. This section is intended for
+ developers.</li>
+ </ul>
+ <h2>Javadocs</h2>
+ <xsl:call-template name="modules"/>
+ </body>
+ </html>
+ </xsl:template>
+
+ <xsl:template name="modules">
+ <ul>
+ <xsl:for-each select="str:split($buildfiles,'|')">
+ <!-- hack to list "core" first, contains() returns "true" which sorts before "false" if descending: -->
+ <xsl:sort select="string(contains(text(), '/core/'))" order="descending" lang="en"/>
+ <!-- hack to list "test-framework" at the end, contains() returns "true" which sorts after "false" if ascending: -->
+ <xsl:sort select="string(contains(text(), '/test-framework/'))" order="ascending" lang="en"/>
+ <!-- sort the remaining build files by path name: -->
+ <xsl:sort select="text()" order="ascending" lang="en"/>
+
+ <xsl:variable name="buildxml" select="document(.)"/>
+ <xsl:variable name="name" select="$buildxml/*/@name"/>
+ <li>
+ <xsl:if test="$name='core'">
+ <xsl:attribute name="style">font-size:larger; margin-bottom:.5em;</xsl:attribute>
+ </xsl:if>
+ <b><a href="{$name}/index.html"><xsl:value-of select="$name"/>
+ </a><xsl:text>: </xsl:text></b>
+ <xsl:value-of select="normalize-space($buildxml/*/description)"/>
+ </li>
+ </xsl:for-each>
+ </ul>
+ </xsl:template>
+
+</xsl:stylesheet>
Modified: lucene/dev/trunk/solr/common-build.xml
URL: http://svn.apache.org/viewvc/lucene/dev/trunk/solr/common-build.xml?rev=1328748&r1=1328747&r2=1328748&view=diff
==============================================================================
--- lucene/dev/trunk/solr/common-build.xml (original)
+++ lucene/dev/trunk/solr/common-build.xml Sun Apr 22 00:07:15 2012
@@ -196,7 +196,7 @@
<copy file="${build.dir}/${fullnamever}.jar" todir="${dist}"/>
</target>
- <property name="lucenedocs" location="${common.dir}/build/docs/api"/>
+ <property name="lucenedocs" location="${common.dir}/build/docs"/>
<!-- dependency to ensure all lucene javadocs are present -->
<target name="lucene-javadocs" depends="javadocs-lucene-core,javadocs-analyzers-common,javadocs-analyzers-icu,javadocs-analyzers-kuromoji,javadocs-analyzers-phonetic,javadocs-analyzers-smartcn,javadocs-analyzers-morfologik,javadocs-analyzers-stempel,javadocs-analyzers-uima,javadocs-suggest,javadocs-grouping,javadocs-queries,javadocs-queryparser,javadocs-highlighter,javadocs-memory,javadocs-misc,javadocs-spatial"/>
@@ -252,7 +252,7 @@
<target name="define-lucene-javadoc-url-SNAPSHOT" if="version.contains.SNAPSHOT">
<property name="lucene.javadoc.url"
- value="${common.dir}/build/docs/api/"/>
+ value="${common.dir}/build/docs/"/>
</target>
<target name="define-lucene-javadoc-url-release" unless="version.contains.SNAPSHOT">
@@ -264,7 +264,7 @@
</filterchain>
</loadproperties>
<property name="lucene.javadoc.url"
- value="http://lucene.apache.org/java/${underscore.version}/api/"/>
+ value="http://lucene.apache.org/java/${underscore.version}/"/>
</target>
<target name="jar-src" depends="init">