Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/04 08:53:08 UTC

[GitHub] [lucene-solr] jpountz commented on a change in pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

jpountz commented on a change in pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#discussion_r570044511



##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -64,39 +62,35 @@
  * A {@link Query} that treats multiple fields as a single stream and scores terms as if you had
  * indexed them as a single term in a single field.
  *
- * <p>For scoring purposes this query implements the BM25F's simple formula described in:
- * http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
+ * <p>The query works as follows:
  *
- * <p>The per-field similarity is ignored but to be compatible each field must use a {@link
- * Similarity} at index time that encodes norms the same way as {@link SimilarityBase#computeNorm}.
+ * <ol>
+ *   <li>Given a list of fields and weights, it pretends there is a synthetic combined field where
+ *       all terms have been indexed. It computes new term and collection statistics for this
+ *       combined field.
+ *   <li>It uses a disjunction iterator and {@link IndexSearcher#getSimilarity} to score documents.
+ * </ol>
+ *
+ * <p>In order for a similarity to be compatible, {@link Similarity#computeNorm} must be additive:
+ * the norm of the combined field is the sum of norms for each individual field. This is usually
+ * true, since norms often represent the field length. Per-field similarities are not supported.

Review comment:
       The requirement is actually stronger: we need a similarity that uses an additive normalization factor AND that encodes it in the index using `SmallFloat#intToByte4`, since the decoding of norms is hardcoded as `SmallFloat#byte4ToInt` in `MultiFieldNormValues#advanceExact`.
   
   Also maybe mention explicitly that e.g. `BM25Similarity` and `DFRSimilarity` meet this requirement?
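
To illustrate why the encoding matters, here is a minimal, self-contained sketch. It does not use Lucene at all: `encodeLinear`, `decodeLinear`, `encodeLog`, and `combinedLength` are hypothetical toy stand-ins, not the real `SmallFloat` codec or the actual `MultiFieldNormValues` code. It only shows that a reader with a hardcoded decoder recovers the summed field length when every field was written with the matching encoder, and gets a meaningless value otherwise.

```java
public class NormCodecSketch {
    // Toy codec A (hypothetical): store the field length directly, capped at 255.
    public static byte encodeLinear(int length) {
        return (byte) Math.min(length, 255);
    }

    public static int decodeLinear(byte norm) {
        return Byte.toUnsignedInt(norm);
    }

    // Toy codec B (hypothetical): store only the bit length of the field length.
    public static byte encodeLog(int length) {
        return (byte) (32 - Integer.numberOfLeadingZeros(length));
    }

    // Stand-in for a combined-field norm reader: the decoder is hardcoded to
    // codec A, regardless of how the per-field norms were actually written.
    public static float combinedLength(byte[] norms, float[] weights) {
        float sum = 0f;
        for (int i = 0; i < norms.length; i++) {
            sum += weights[i] * decodeLinear(norms[i]);
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] weights = {2f, 1f}; // e.g. title weighted twice as much as body
        byte[] matching = {encodeLinear(4), encodeLinear(100)};
        byte[] mismatched = {encodeLog(4), encodeLog(100)};

        // Field lengths 4 and 100 with weights 2 and 1: the true combined length is 108.
        System.out.println("matching codec:   " + combinedLength(matching, weights));
        System.out.println("mismatched codec: " + combinedLength(mismatched, weights));
    }
}
```

With these inputs the matching pair recovers 108, while the mismatched pair yields 13, because the reader interprets the log-encoded bytes 3 and 7 as raw lengths. The same reasoning would apply to any similarity whose norm encoding differs from what the combined-field reader assumes.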



