You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/02/03 00:14:49 UTC

[GitHub] [lucene-solr] jtibshirani opened a new pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

jtibshirani opened a new pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293


   From a high level, BM25FQuery (1) computes statistic that represent the combined
   field content and (2) passes these to a score function. This model makes sense
   for many similarities besides BM25.
   
   This PR unhardcodes BM25Similarity in BM25FQuery and instead uses the one
   configured on IndexSearcher. It also renames BM25FQuery since it's no longer
   specific to BM25.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jtibshirani commented on pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#issuecomment-773581613


   Thanks @jpountz and @jimczi for reviewing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jtibshirani merged pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jtibshirani merged pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jtibshirani merged pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jtibshirani merged pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jpountz commented on a change in pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jpountz commented on a change in pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#discussion_r570044511



##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -64,39 +62,35 @@
  * A {@link Query} that treats multiple fields as a single stream and scores terms as if you had
  * indexed them as a single term in a single field.
  *
- * <p>For scoring purposes this query implements the BM25F's simple formula described in:
- * http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
+ * <p>The query works as follows:
  *
- * <p>The per-field similarity is ignored but to be compatible each field must use a {@link
- * Similarity} at index time that encodes norms the same way as {@link SimilarityBase#computeNorm}.
+ * <ol>
+ *   <li>Given a list of fields and weights, it pretends there is a synthetic combined field where
+ *       all terms have been indexed. It computes new term and collection statistics for this
+ *       combined field.
+ *   <li>It uses a disjunction iterator and {@link IndexSearcher#getSimilarity} to score documents.
+ * </ol>
+ *
+ * <p>In order for a similarity to be compatible, {@link Similarity#computeNorm} must be additive:
+ * the norm of the combined field is the sum of norms for each individual field. This is usually
+ * true, since norms often represent the field length. Per-field similarities are not supported.

Review comment:
       The requirement is actually stronger, we need a similarity that uses an additive normalization factor AND that encodes it using `SmallFloat#intToByte4` in the index since the decoding of norms is hardcoded as `SmallFloat#byte4ToInt` in `MultiFieldNormValues#advanceExact`.
   
   Also maybe mention explicitly that e.g. `BM25Similarity` and `DFRSimilarity` meet this requirement?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jtibshirani commented on a change in pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#discussion_r570517621



##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -64,39 +62,35 @@
  * A {@link Query} that treats multiple fields as a single stream and scores terms as if you had
  * indexed them as a single term in a single field.
  *
- * <p>For scoring purposes this query implements the BM25F's simple formula described in:
- * http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
+ * <p>The query works as follows:
  *
- * <p>The per-field similarity is ignored but to be compatible each field must use a {@link
- * Similarity} at index time that encodes norms the same way as {@link SimilarityBase#computeNorm}.
+ * <ol>
+ *   <li>Given a list of fields and weights, it pretends there is a synthetic combined field where
+ *       all terms have been indexed. It computes new term and collection statistics for this
+ *       combined field.
+ *   <li>It uses a disjunction iterator and {@link IndexSearcher#getSimilarity} to score documents.
+ * </ol>
+ *
+ * <p>In order for a similarity to be compatible, {@link Similarity#computeNorm} must be additive:
+ * the norm of the combined field is the sum of norms for each individual field. This is usually
+ * true, since norms often represent the field length. Per-field similarities are not supported.

Review comment:
       Good point! I mentioned those similarities and restored the reference to `SimilarityBase#computeNorm`, which gives a clear example of what's acceptable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jtibshirani commented on pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#issuecomment-773581613


   Thanks @jpountz and @jimczi for reviewing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jpountz commented on a change in pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jpountz commented on a change in pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#discussion_r570044511



##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -64,39 +62,35 @@
  * A {@link Query} that treats multiple fields as a single stream and scores terms as if you had
  * indexed them as a single term in a single field.
  *
- * <p>For scoring purposes this query implements the BM25F's simple formula described in:
- * http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
+ * <p>The query works as follows:
  *
- * <p>The per-field similarity is ignored but to be compatible each field must use a {@link
- * Similarity} at index time that encodes norms the same way as {@link SimilarityBase#computeNorm}.
+ * <ol>
+ *   <li>Given a list of fields and weights, it pretends there is a synthetic combined field where
+ *       all terms have been indexed. It computes new term and collection statistics for this
+ *       combined field.
+ *   <li>It uses a disjunction iterator and {@link IndexSearcher#getSimilarity} to score documents.
+ * </ol>
+ *
+ * <p>In order for a similarity to be compatible, {@link Similarity#computeNorm} must be additive:
+ * the norm of the combined field is the sum of norms for each individual field. This is usually
+ * true, since norms often represent the field length. Per-field similarities are not supported.

Review comment:
       The requirement is actually stronger, we need a similarity that uses an additive normalization factor AND that encodes it using `SmallFloat#intToByte4` in the index since the decoding of norms is hardcoded as `SmallFloat#byte4ToInt` in `MultiFieldNormValues#advanceExact`.
   
   Also maybe mention explicitly that e.g. `BM25Similarity` and `DFRSimilarity` meet this requirement?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene-solr] jtibshirani commented on a change in pull request #2293: LUCENE-9725: Allow BM25FQuery to use other similarities.

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on a change in pull request #2293:
URL: https://github.com/apache/lucene-solr/pull/2293#discussion_r570517621



##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -64,39 +62,35 @@
  * A {@link Query} that treats multiple fields as a single stream and scores terms as if you had
  * indexed them as a single term in a single field.
  *
- * <p>For scoring purposes this query implements the BM25F's simple formula described in:
- * http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
+ * <p>The query works as follows:
  *
- * <p>The per-field similarity is ignored but to be compatible each field must use a {@link
- * Similarity} at index time that encodes norms the same way as {@link SimilarityBase#computeNorm}.
+ * <ol>
+ *   <li>Given a list of fields and weights, it pretends there is a synthetic combined field where
+ *       all terms have been indexed. It computes new term and collection statistics for this
+ *       combined field.
+ *   <li>It uses a disjunction iterator and {@link IndexSearcher#getSimilarity} to score documents.
+ * </ol>
+ *
+ * <p>In order for a similarity to be compatible, {@link Similarity#computeNorm} must be additive:
+ * the norm of the combined field is the sum of norms for each individual field. This is usually
+ * true, since norms often represent the field length. Per-field similarities are not supported.

Review comment:
       Good point! I mentioned those similarities and restored the reference to `SimilarityBase#computeNorm`, which gives a clear example of what's acceptable.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org