Posted to commits@lucene.apache.org by jp...@apache.org on 2018/09/03 10:22:34 UTC
lucene-solr:master: Update javadocs for Lucene 8.
Repository: lucene-solr
Updated Branches:
refs/heads/master d93c46ea9 -> a1ec716e1
Update javadocs for Lucene 8.
This fixes a couple mistakes, puts more emphasis on BM25 compared to Classic and
gives more guidance regarding custom scores without a custom query.
Project: http://git-wip-us.apache.org/repos/asf/lucene-solr/repo
Commit: http://git-wip-us.apache.org/repos/asf/lucene-solr/commit/a1ec716e
Tree: http://git-wip-us.apache.org/repos/asf/lucene-solr/tree/a1ec716e
Diff: http://git-wip-us.apache.org/repos/asf/lucene-solr/diff/a1ec716e
Branch: refs/heads/master
Commit: a1ec716e107807f1dc24923cc7a91d0c5e64a7e1
Parents: d93c46e
Author: Adrien Grand <jp...@gmail.com>
Authored: Mon Sep 3 12:21:12 2018 +0200
Committer: Adrien Grand <jp...@gmail.com>
Committed: Mon Sep 3 12:22:26 2018 +0200
----------------------------------------------------------------------
.../org/apache/lucene/index/package-info.java | 6 +-
.../apache/lucene/search/TermRangeQuery.java | 4 +
.../org/apache/lucene/search/package-info.java | 127 +++++++++++--------
.../search/similarities/package-info.java | 52 +++-----
lucene/core/src/java/overview.html | 4 +-
5 files changed, 103 insertions(+), 90 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/a1ec716e/lucene/core/src/java/org/apache/lucene/index/package-info.java
----------------------------------------------------------------------
diff --git a/lucene/core/src/java/org/apache/lucene/index/package-info.java b/lucene/core/src/java/org/apache/lucene/index/package-info.java
index 55ee56c..1dbc400 100644
--- a/lucene/core/src/java/org/apache/lucene/index/package-info.java
+++ b/lucene/core/src/java/org/apache/lucene/index/package-info.java
@@ -110,8 +110,10 @@
* inverted index, is comprised of "postings." The postings, with their term dictionary, can be
* thought of as a map that provides efficient lookup given a {@link org.apache.lucene.index.Term}
* (roughly, a word or token), to (the ordered list of) {@link org.apache.lucene.document.Document}s
- * containing that Term. Postings do not provide any way of retrieving terms given a document,
- * short of scanning the entire index.</p>
+ * containing that Term. Codecs may additionally record
+ * {@link org.apache.lucene.index.ImpactsEnum#getImpacts impacts} alongside postings in order to be
+ * able to skip over low-scoring documents at search time. Postings do not provide any way of
+ * retrieving terms given a document, short of scanning the entire index.</p>
*
* <a name="stored-fields"></a>
* <p>Stored fields are essentially the opposite of postings, providing efficient retrieval of field
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/a1ec716e/lucene/core/src/java/org/apache/lucene/search/TermRangeQuery.java
----------------------------------------------------------------------
diff --git a/lucene/core/src/java/org/apache/lucene/search/TermRangeQuery.java b/lucene/core/src/java/org/apache/lucene/search/TermRangeQuery.java
index cd73902..02dff86 100644
--- a/lucene/core/src/java/org/apache/lucene/search/TermRangeQuery.java
+++ b/lucene/core/src/java/org/apache/lucene/search/TermRangeQuery.java
@@ -28,6 +28,10 @@ import org.apache.lucene.util.automaton.Automaton;
* <p>This query matches the documents looking for terms that fall into the
* supplied range according to {@link BytesRef#compareTo(BytesRef)}.
*
+ * <p><b>NOTE</b>: {@link TermRangeQuery} performs significantly slower than
+ * {@link PointRangeQuery point-based ranges} as it needs to visit all terms
+ * that match the range and merges their matches.
+ *
* <p>This query uses the {@link
* MultiTermQuery#CONSTANT_SCORE_REWRITE}
* rewrite method.
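Reader's note on the TermRangeQuery hunk above: the Javadoc says terms are matched according to BytesRef#compareTo, that is, in byte-wise term order rather than numeric order. The following is a minimal plain-Java sketch of what that implies for numbers indexed as text. It does not use Lucene; String.compareTo is an assumed stand-in for BytesRef#compareTo (the two orders agree for ASCII terms), and the class and method names are hypothetical.

```java
// Sketch only: String.compareTo stands in for BytesRef#compareTo (ASCII terms assumed).
class TermOrderSketch {

    /** Returns true if the term lies in [lower, upper] under lexicographic term order. */
    static boolean inTermRange(String term, String lower, String upper) {
        return term.compareTo(lower) >= 0 && term.compareTo(upper) <= 0;
    }

    /** Returns true if the value lies in [lower, upper] under numeric order. */
    static boolean inNumericRange(int value, int lower, int upper) {
        return value >= lower && value <= upper;
    }

    public static void main(String[] args) {
        // The term "10" sorts between "1" and "5", although the number 10 is not in [1, 5].
        System.out.println("term order:    " + inTermRange("10", "1", "5"));
        System.out.println("numeric order: " + inNumericRange(10, 1, 5));
    }
}
```

The mismatch between the two orders is one reason, besides the performance note in the hunk, to prefer point-based ranges for numeric data.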
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/a1ec716e/lucene/core/src/java/org/apache/lucene/search/package-info.java
----------------------------------------------------------------------
diff --git a/lucene/core/src/java/org/apache/lucene/search/package-info.java b/lucene/core/src/java/org/apache/lucene/search/package-info.java
index 21832c7..8120840 100644
--- a/lucene/core/src/java/org/apache/lucene/search/package-info.java
+++ b/lucene/core/src/java/org/apache/lucene/search/package-info.java
@@ -44,7 +44,7 @@
* <p>
* Once a Query has been created and submitted to the {@link org.apache.lucene.search.IndexSearcher IndexSearcher}, the scoring
* process begins. After some infrastructure setup, control finally passes to the {@link org.apache.lucene.search.Weight Weight}
- * implementation and its {@link org.apache.lucene.search.Scorer Scorer} or {@link org.apache.lucene.search.BulkScorer BulkScore}
+ * implementation and its {@link org.apache.lucene.search.Scorer Scorer} or {@link org.apache.lucene.search.BulkScorer BulkScorer}
* instances. See the <a href="#algorithm">Algorithm</a> section for more notes on the process.
* <!-- FILL IN MORE HERE -->
* <!-- TODO: this page over-links the same things too many times -->
@@ -95,9 +95,11 @@
* If a query is made up of all SHOULD clauses, then every document in the result
* set matches at least one of these clauses.</p></li>
*
- * <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST MUST} — Use this operator when a clause is required to occur in the result set. Every
- * document in the result set will match
- * all such clauses.</p></li>
+ * <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST MUST} — Use this operator when a clause is required to occur in the result set and should
+ * contribute to the score. Every document in the result set will match all such clauses.</p></li>
+ *
+ * <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#FILTER FILTER} — Use this operator when a clause is required to occur in the result set but
+ * should not contribute to the score. Every document in the result set will match all such clauses.</p></li>
*
* <li><p>{@link org.apache.lucene.search.BooleanClause.Occur#MUST_NOT MUST NOT} — Use this operator when a
* clause must not occur in the result set. No
@@ -113,7 +115,7 @@
* {@link org.apache.lucene.search.TermQuery TermQuery} clauses,
* for example by {@link org.apache.lucene.search.WildcardQuery WildcardQuery}.
* The default setting for the maximum number
- * of clauses 1024, but this can be changed via the
+ * of clauses is 1024, but this can be changed via the
* static method {@link org.apache.lucene.search.BooleanQuery#setMaxClauseCount(int)}.
*
* <h3>Phrases</h3>
@@ -149,23 +151,6 @@
* </ol>
*
* <h3>
- * {@link org.apache.lucene.search.TermRangeQuery TermRangeQuery}
- * </h3>
- *
- * <p>The
- * {@link org.apache.lucene.search.TermRangeQuery TermRangeQuery}
- * matches all documents that occur in the
- * exclusive range of a lower
- * {@link org.apache.lucene.index.Term Term}
- * and an upper
- * {@link org.apache.lucene.index.Term Term}
- * according to {@link org.apache.lucene.util.BytesRef#compareTo BytesRef.compareTo()}. It is not intended
- * for numerical ranges; use {@link org.apache.lucene.search.PointRangeQuery PointRangeQuery} instead.
- *
- * For example, one could find all documents
- * that have terms beginning with the letters <tt>a</tt> through <tt>c</tt>.
- *
- * <h3>
* {@link org.apache.lucene.search.PointRangeQuery PointRangeQuery}
* </h3>
*
@@ -274,6 +259,7 @@
*
* <a name="changingScoring"></a>
* <h2>Changing Scoring — Similarity</h2>
+ * <h3>Changing the scoring formula</h3>
* <p>
* Changing {@link org.apache.lucene.search.similarities.Similarity Similarity} is an easy way to
* influence scoring, this is done at index-time with
@@ -289,14 +275,54 @@
* extend by plugging in a different component (e.g. term frequency normalizer).
* <p>
* Finally, you can extend the low level {@link org.apache.lucene.search.similarities.Similarity Similarity} directly
- * to implement a new retrieval model, or to use external scoring factors particular to your application. For example,
- * a custom Similarity can access per-document values via {@link org.apache.lucene.index.NumericDocValues} and
- * integrate them into the score.
+ * to implement a new retrieval model.
* <p>
* See the {@link org.apache.lucene.search.similarities} package documentation for information
* on the built-in available scoring models and extending or changing Similarity.
- *
- *
+ *
+ * <h3>Integrating field values into the score</h3>
+ * <p>While similarities help score a document relatively to a query, it is also common for documents to hold
+ * features that measure the quality of a match. Such features are best integrated into the score by indexing
+ * a {@link org.apache.lucene.document.FeatureField FeatureField} with the document at index-time, and then
+ * combining the similarity score and the feature score using a linear combination. For instance the below
+ * query matches the same documents as {@code originalQuery} and computes scores as
+ * {@code similarityScore + 0.7 * featureScore}:
+ * <pre class="prettyprint">
+ * Query originalQuery = new BooleanQuery.Builder()
+ * .add(new TermQuery(new Term("body", "apache")), Occur.SHOULD)
+ * .add(new TermQuery(new Term("body", "lucene")), Occur.SHOULD)
+ * .build();
+ * Query featureQuery = FeatureField.newSaturationQuery("features", "pagerank");
+ * Query query = new BooleanQuery.Builder()
+ * .add(originalQuery, Occur.MUST)
+ * .add(new BoostQuery(featureQuery, 0.7f), Occur.SHOULD)
+ * .build();
+ * </pre>
+ *
+ * <p>A less efficient yet more flexible way of modifying scores is to index scoring features into
+ * doc-value fields and then combine them with the similarity score using a
+ * <a href="{@docRoot}/../queries/org/apache/lucene/queries/function/FunctionScoreQuery.html">FunctionScoreQuery</a>
+ * from the <a href="{@docRoot}/../queries/overview-summary.html">queries module</a>. For instance
+ * the below example shows how to compute scores as {@code similarityScore * Math.log(popularity)}
+ * using the <a href="{@docRoot}/../expressions/overview-summary.html">expressions module</a> and
+ * assuming that values for the {@code popularity} field have been set in a
+ * {@link org.apache.lucene.document.NumericDocValuesField NumericDocValuesField} at index time:
+ * <pre class="prettyprint">
+ * // compile an expression:
+ * Expression expr = JavascriptCompiler.compile("_score * ln(popularity)");
+ *
+ * // SimpleBindings just maps variables to SortField instances
+ * SimpleBindings bindings = new SimpleBindings();
+ * bindings.add(new SortField("_score", SortField.Type.SCORE));
+ * bindings.add(new SortField("popularity", SortField.Type.INT));
+ *
+ * // create a query that matches based on 'originalQuery' but
+ * // scores using expr
+ * Query query = new FunctionScoreQuery(
+ * originalQuery,
+ * expr.getDoubleValuesSource(bindings));
+ * </pre>
+ *
* <a name="customQueriesExpert"></a>
* <h2>Custom Queries — Expert Level</h2>
*
@@ -311,15 +337,14 @@
* {@link org.apache.lucene.search.Query Query} — The abstract object representation of the
* user's information need.</li>
* <li>
- * {@link org.apache.lucene.search.Weight Weight} — The internal interface representation of
- * the user's Query, so that Query objects may be reused.
- * This is global (across all segments of the index) and
- * generally will require global statistics (such as docFreq
- * for a given term across all segments).</li>
+ * {@link org.apache.lucene.search.Weight Weight} — A specialization of a Query for a given
+ * index. This typically associates a Query object with index statistics that are later used to
+ * compute document scores.
* <li>
- * {@link org.apache.lucene.search.Scorer Scorer} — An abstract class containing common
- * functionality for scoring. Provides both scoring and
- * explanation capabilities. This is created per-segment.</li>
+ * {@link org.apache.lucene.search.Scorer Scorer} — The core class of the scoring process:
+ * for a given segment, scorers return {@link org.apache.lucene.search.Scorer#iterator iterators}
+ * over matches and give a way to compute the {@link org.apache.lucene.search.Scorer#score score}
+ * of these matches.</li>
* <li>
* {@link org.apache.lucene.search.BulkScorer BulkScorer} — An abstract class that scores
* a range of documents. A default implementation simply iterates through the hits from
@@ -338,7 +363,7 @@
* {@link org.apache.lucene.search.Query Query} class has several methods that are important for
* derived classes:
* <ol>
- * <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, boolean needsScores, float boost)} — A
+ * <li>{@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher, ScoreMode scoreMode, float boost)} — A
* {@link org.apache.lucene.search.Weight Weight} is the internal representation of the
* Query, so each Query implementation must
* provide an implementation of Weight. See the subsection on <a
@@ -347,7 +372,7 @@
* <li>{@link org.apache.lucene.search.Query#rewrite(org.apache.lucene.index.IndexReader) rewrite(IndexReader reader)} — Rewrites queries into primitive queries. Primitive queries are:
* {@link org.apache.lucene.search.TermQuery TermQuery},
* {@link org.apache.lucene.search.BooleanQuery BooleanQuery}, <span
- * >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher,boolean needsScores, float boost)}</span></li>
+ * >and other queries that implement {@link org.apache.lucene.search.Query#createWeight(IndexSearcher,ScoreMode,float) createWeight(IndexSearcher searcher,ScoreMode scoreMode, float boost)}</span></li>
* </ol>
* <a name="weightClass"></a>
* <h3>The Weight Interface</h3>
@@ -356,23 +381,15 @@
* interface provides an internal representation of the Query so that it can be reused. Any
* {@link org.apache.lucene.search.IndexSearcher IndexSearcher}
* dependent state should be stored in the Weight implementation,
- * not in the Query class. The interface defines five methods that must be implemented:
+ * not in the Query class. The interface defines four main methods:
* <ol>
* <li>
- * {@link org.apache.lucene.search.Weight#getQuery getQuery()} — Pointer to the
- * Query that this Weight represents.</li>
- * <li>
* {@link org.apache.lucene.search.Weight#scorer scorer()} —
* Construct a new {@link org.apache.lucene.search.Scorer Scorer} for this Weight. See <a href="#scorerClass">The Scorer Class</a>
* below for help defining a Scorer. As the name implies, the Scorer is responsible for doing the actual scoring of documents
* given the Query.
* </li>
* <li>
- * {@link org.apache.lucene.search.Weight#bulkScorer bulkScorer()} —
- * Construct a new {@link org.apache.lucene.search.BulkScorer BulkScorer} for this Weight. See <a href="#bulkScorerClass">The BulkScorer Class</a>
- * below for help defining a BulkScorer. This is an optional method, and most queries do not implement it.
- * </li>
- * <li>
* {@link org.apache.lucene.search.Weight#explain(org.apache.lucene.index.LeafReaderContext, int)
* explain(LeafReaderContext context, int doc)} — Provide a means for explaining why a given document was
* scored the way it was.
@@ -380,6 +397,16 @@
* that scores via a {@link org.apache.lucene.search.similarities.Similarity Similarity} will make use of the Similarity's implementation:
* {@link org.apache.lucene.search.similarities.Similarity.SimScorer#explain(Explanation, long) SimScorer#explain(Explanation freq, long norm)}.
* </li>
+ * <li>
+ * {@link org.apache.lucene.search.Weight#extractTerms(java.util.Set) extractTerms(Set<Term> terms)} — Extract terms that
+ * this query operates on. This is typically used to support distributed search: knowing the terms that a query operates on helps
+ * merge index statistics of these terms so that scores are computed over a subset of the data like they would if all documents
+ * were in the same index.
+ * </li>
+ * <li>
+ * {@link org.apache.lucene.search.Weight#matches matches(LeafReaderContext context, int doc)} — Give information about positions
+ * and offsets of matches. This is typically useful to implement highlighting.
+ * </li>
* </ol>
* <a name="scorerClass"></a>
* <h3>The Scorer Class</h3>
@@ -458,17 +485,13 @@
* This method returns a {@link org.apache.lucene.search.TopDocs TopDocs} object,
* which is an internal collection of search results. The IndexSearcher creates
* a {@link org.apache.lucene.search.TopScoreDocCollector TopScoreDocCollector} and
- * passes it along with the Weight, Filter to another expert search method (for
+ * passes it along with the Weight to another expert search method (for
* more on the {@link org.apache.lucene.search.Collector Collector} mechanism,
* see {@link org.apache.lucene.search.IndexSearcher IndexSearcher}). The TopScoreDocCollector
* uses a {@link org.apache.lucene.util.PriorityQueue PriorityQueue} to collect the
* top results for the search.
- * <p>If a Filter is being used, some initial setup is done to determine which docs to include.
- * Otherwise, we ask the Weight for a {@link org.apache.lucene.search.Scorer Scorer} for each
- * {@link org.apache.lucene.index.IndexReader IndexReader} segment and proceed by calling
- * {@link org.apache.lucene.search.BulkScorer#score(org.apache.lucene.search.LeafCollector,org.apache.lucene.util.Bits) BulkScorer.score(LeafCollector,Bits)}.
* <p>At last, we are actually going to score some documents. The score method takes in the Collector
- * (most likely the TopScoreDocCollector or TopFieldCollector) and does its business.Of course, here
+ * (most likely the TopScoreDocCollector or TopFieldCollector) and does its business. Of course, here
* is where things get involved. The {@link org.apache.lucene.search.Scorer Scorer} that is returned
* by the {@link org.apache.lucene.search.Weight Weight} object depends on what type of Query was
* submitted. In most real world applications with multiple query terms, the
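Reader's note on the "Integrating field values into the score" hunk above: the new Javadoc describes scores of the form similarityScore + 0.7 * featureScore. Below is a minimal plain-Java sketch of that arithmetic, assuming the feature score is a saturation function S / (S + pivot). It does not call Lucene; the class name, method names and the explicit pivot parameter are hypothetical, and how Lucene picks a pivot when none is given is not modelled here.

```java
// Sketch only: illustrates the linear combination described in the Javadoc, not Lucene's code.
class FeatureScoreSketch {

    /**
     * Saturation function: grows with the feature value but stays below 1.
     * Assumes featureValue >= 0 and pivot > 0; equals 0.5 when featureValue == pivot.
     */
    static double saturation(double featureValue, double pivot) {
        return featureValue / (featureValue + pivot);
    }

    /** similarityScore + weight * featureScore, as in the BooleanQuery example. */
    static double combinedScore(double similarityScore, double weight,
                                double featureValue, double pivot) {
        return similarityScore + weight * saturation(featureValue, pivot);
    }

    public static void main(String[] args) {
        double similarityScore = 2.0; // hypothetical score of originalQuery
        double weight = 0.7;          // the boost applied to the feature query
        double pivot = 4.0;           // hypothetical pivot for the "pagerank" feature
        for (double pagerank : new double[] {0.0, 4.0, 40.0}) {
            System.out.println(pagerank + " -> "
                + combinedScore(similarityScore, weight, pagerank, pivot));
        }
    }
}
```

Because the saturation term is bounded by 1, the feature can add at most the weight (0.7 here) to the similarity score, however large the raw feature value is.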
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/a1ec716e/lucene/core/src/java/org/apache/lucene/search/similarities/package-info.java
----------------------------------------------------------------------
diff --git a/lucene/core/src/java/org/apache/lucene/search/similarities/package-info.java b/lucene/core/src/java/org/apache/lucene/search/similarities/package-info.java
index 34a014b..997d5d6 100644
--- a/lucene/core/src/java/org/apache/lucene/search/similarities/package-info.java
+++ b/lucene/core/src/java/org/apache/lucene/search/similarities/package-info.java
@@ -73,9 +73,9 @@
* your searching needs.
* However, in some applications it may be necessary to customize your <a
* href="Similarity.html">Similarity</a> implementation. For instance, some
- * applications do not need to
- * distinguish between shorter and longer documents (see <a
- * href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">a "fair" similarity</a>).
+ * applications do not need to distinguish between shorter and longer documents
+ * and could set BM25's {@link org.apache.lucene.search.similarities.BM25Similarity#BM25Similarity(float,float) b}
+ * parameter to {@code 0}.
*
* <p>To change {@link org.apache.lucene.search.similarities.Similarity}, one must do so for both indexing and
* searching, and the changes must happen before
@@ -83,15 +83,27 @@
* just isn't well-defined what is going to happen.
*
* <p>To make this change, implement your own {@link org.apache.lucene.search.similarities.Similarity} (likely
- * you'll want to simply subclass an existing method, be it
- * {@link org.apache.lucene.search.similarities.ClassicSimilarity} or a descendant of
- * {@link org.apache.lucene.search.similarities.SimilarityBase}), and
+ * you'll want to simply subclass {@link org.apache.lucene.search.similarities.SimilarityBase}), and
* then register the new class by calling
* {@link org.apache.lucene.index.IndexWriterConfig#setSimilarity(Similarity)}
* before indexing and
* {@link org.apache.lucene.search.IndexSearcher#setSimilarity(Similarity)}
* before searching.
*
+ * <h3>Tuning {@linkplain org.apache.lucene.search.similarities.BM25Similarity}</h3>
+ * <p>{@link org.apache.lucene.search.similarities.BM25Similarity} has
+ * two parameters that may be tuned:
+ * <ul>
+ * <li><tt>k1</tt>, which calibrates term frequency saturation and must be
+ * positive or null. A value of {@code 0} makes term frequency completely
+ * ignored, making documents scored only based on the value of the <tt>IDF</tt>
+ * of the matched terms. Higher values of <tt>k1</tt> increase the impact of
+ * term frequency on the final score. Default value is {@code 1.2}.</li>
+ * <li><tt>b</tt>, which controls how much document length should normalize
+ * term frequency values and must be in {@code [0, 1]}. A value of {@code 0}
+ * disables length normalization completely. Default value is {@code 0.75}.</li>
+ * </ul>
+ *
* <h3>Extending {@linkplain org.apache.lucene.search.similarities.SimilarityBase}</h3>
* <p>
* The easiest way to quickly implement a new ranking method is to extend
@@ -112,33 +124,5 @@
* subclassing the Similarity, one can simply introduce a new basic model and tell
* {@link org.apache.lucene.search.similarities.DFRSimilarity} to use it.
*
- * <h3>Changing {@linkplain org.apache.lucene.search.similarities.ClassicSimilarity}</h3>
- * <p>
- * If you are interested in use cases for changing your similarity, see the Lucene users's mailing list at <a
- * href="http://www.gossamer-threads.com/lists/lucene/java-user/39125">Overriding Similarity</a>.
- * In summary, here are a few use cases:
- * <ol>
- * <li><p>The <code>SweetSpotSimilarity</code> in
- * <code>org.apache.lucene.misc</code> gives small
- * increases as the frequency increases a small amount
- * and then greater increases when you hit the "sweet spot", i.e. where
- * you think the frequency of terms is more significant.</li>
- * <li><p>Overriding tf — In some applications, it doesn't matter what the score of a document is as long as a
- * matching term occurs. In these
- * cases people have overridden Similarity to return 1 from the tf() method.</li>
- * <li><p>Changing Length Normalization — By overriding
- * {@link org.apache.lucene.search.similarities.Similarity#computeNorm(org.apache.lucene.index.FieldInvertState state)},
- * it is possible to discount how the length of a field contributes
- * to a score. In {@link org.apache.lucene.search.similarities.ClassicSimilarity},
- * lengthNorm = 1 / (numTerms in field)^0.5, but if one changes this to be
- * 1 / (numTerms in field), all fields will be treated
- * <a href="http://www.gossamer-threads.com/lists/lucene/java-user/38967#38967">"fairly"</a>.</li>
- * </ol>
- * In general, Chris Hostetter sums it up best in saying (from <a
- * href="http://www.gossamer-threads.com/lists/lucene/java-user/39125#39125">the Lucene users's mailing list</a>):
- * <blockquote>[One would override the Similarity in] ... any situation where you know more about your data then just
- * that
- * it's "text" is a situation where it *might* make sense to to override your
- * Similarity method.</blockquote>
*/
package org.apache.lucene.search.similarities;
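Reader's note on the "Tuning BM25Similarity" hunk above: the effect of k1 and b can be checked with a small plain-Java sketch of the textbook BM25 term-frequency component, tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen)). This is an assumption-laden illustration, not Lucene's implementation: the class and method names are hypothetical, IDF and boosts are left out, and Lucene's encoded norms are replaced by exact lengths.

```java
// Sketch only: textbook BM25 tf component, without IDF, boosts or Lucene's norm encoding.
class Bm25Sketch {

    /**
     * @param tf        term frequency in the document (tf > 0)
     * @param docLen    length of the document's field, in terms
     * @param avgDocLen average field length across the index (avgDocLen > 0)
     * @param k1        term frequency saturation, k1 >= 0
     * @param b         length normalization strength, in [0, 1]
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthNorm = 1 - b + b * docLen / avgDocLen;
        return tf * (k1 + 1) / (tf + k1 * lengthNorm);
    }

    public static void main(String[] args) {
        // Defaults named in the Javadoc: k1 = 1.2, b = 0.75.
        System.out.println("short doc: " + tfComponent(3, 10, 100, 1.2, 0.75));
        System.out.println("long doc:  " + tfComponent(3, 1000, 100, 1.2, 0.75));
        // k1 = 0: term frequency no longer matters.
        System.out.println("k1 = 0:    " + tfComponent(1, 100, 100, 0, 0.75)
            + " vs " + tfComponent(5, 100, 100, 0, 0.75));
        // b = 0: document length no longer matters.
        System.out.println("b = 0:     " + tfComponent(3, 10, 100, 1.2, 0)
            + " vs " + tfComponent(3, 1000, 100, 1.2, 0));
    }
}
```

In this sketch, setting k1 to 0 leaves only the IDF (omitted here) to distinguish documents, and setting b to 0 gives short and long documents the same value, which matches the behaviour the Javadoc describes for both parameters.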
http://git-wip-us.apache.org/repos/asf/lucene-solr/blob/a1ec716e/lucene/core/src/java/overview.html
----------------------------------------------------------------------
diff --git a/lucene/core/src/java/overview.html b/lucene/core/src/java/overview.html
index b7112ac..e941744 100644
--- a/lucene/core/src/java/overview.html
+++ b/lucene/core/src/java/overview.html
@@ -35,7 +35,7 @@ to check if the results are what we expect):</p>
// Store the index in memory:
Directory directory = new RAMDirectory();
// To store an index on disk, use this instead:
- //Directory directory = FSDirectory.open("/tmp/testindex");
+ //Directory directory = FSDirectory.open(Paths.get("/tmp/testindex"));
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter iwriter = new IndexWriter(directory, config);
Document doc = new Document();
@@ -50,7 +50,7 @@ to check if the results are what we expect):</p>
// Parse a simple query that searches for "text":
QueryParser parser = new QueryParser("fieldname", analyzer);
Query query = parser.parse("text");
- ScoreDoc[] hits = isearcher.search(query, null, 1000).scoreDocs;
+ ScoreDoc[] hits = isearcher.search(query, 10).scoreDocs;
assertEquals(1, hits.length);
// Iterate through the results:
for (int i = 0; i < hits.length; i++) {