Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/20 13:32:47 UTC

[GitHub] [lucene] ywelsch opened a new pull request, #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

ywelsch opened a new pull request, #910:
URL: https://github.com/apache/lucene/pull/910

   CombinedFieldQuery does not properly combine distributed collection statistics, resulting in an IllegalArgumentException during searches.
   
   Originally surfaced in this Elasticsearch issue: https://github.com/elastic/elasticsearch/issues/82817
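
A minimal sketch of one way such an exception can arise, assuming the merged statistics pair a shard-local `maxDoc` with per-field `docCount` values that a searcher override has made global. `Stats`, `mergeWithLocalMaxDoc` and `mergeWithStatsMaxDoc` are hypothetical names used only for illustration; they are not Lucene classes and not the code in this pull request, and only a `docCount <= maxDoc` check is modelled.

```java
public class StatsInvariantSketch {

  /** Hypothetical stand-in for per-field statistics; not Lucene's class. */
  public static final class Stats {
    public final long maxDoc;
    public final long docCount;

    public Stats(long maxDoc, long docCount) {
      // Assumed invariant: a field cannot occur in more documents than exist.
      if (docCount > maxDoc) {
        throw new IllegalArgumentException(
            "docCount must not exceed maxDoc, docCount: " + docCount + ", maxDoc: " + maxDoc);
      }
      this.maxDoc = maxDoc;
      this.docCount = docCount;
    }
  }

  /** Mixes a shard-local maxDoc with per-field docCounts that may be global. */
  public static Stats mergeWithLocalMaxDoc(long localMaxDoc, Stats... perField) {
    long docCount = 0;
    for (Stats s : perField) {
      docCount = Math.max(docCount, s.docCount);
    }
    return new Stats(localMaxDoc, docCount);
  }

  /** Takes maxDoc from the same (possibly overridden) per-field statistics. */
  public static Stats mergeWithStatsMaxDoc(Stats... perField) {
    long maxDoc = 0;
    long docCount = 0;
    for (Stats s : perField) {
      maxDoc = Math.max(maxDoc, s.maxDoc);
      docCount = Math.max(docCount, s.docCount);
    }
    return new Stats(maxDoc, docCount);
  }

  public static void main(String[] args) {
    // Overridden ("global") statistics for two fields; the local shard holds 1 document.
    Stats f = new Stats(8, 5);
    Stats g = new Stats(8, 6);
    try {
      mergeWithLocalMaxDoc(1, f, g);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
    Stats merged = mergeWithStatsMaxDoc(f, g);
    System.out.println("merged maxDoc=" + merged.maxDoc + ", docCount=" + merged.docCount);
  }
}
```

Reading every value from the same source keeps the merged statistics internally consistent, whichever source that is.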


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org

[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r878568768


##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setSimilarity(randomCompatibleSimilarity());
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    String queryString = "foo";
+
+    Document doc0 = new Document();
+    doc0.add(new TextField("f", "foo", Store.NO));
+    doc0.add(new TextField("g", "foo baz", Store.NO));
+    w.addDocument(doc0);
+
+    IndexReader reader = w.getReader();
+    IndexSearcher searcher =
+        new IndexSearcher(reader) {
+          @Override
+          public CollectionStatistics collectionStatistics(String field) throws IOException {
+            CollectionStatistics shardStatistics = super.collectionStatistics(field);
+            int extraMaxDoc = randomIntBetween(0, 10);
+            int extraDocCount = randomIntBetween(0, extraMaxDoc);
+            int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+            int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+            CollectionStatistics globalStatistics =
+                new CollectionStatistics(
+                    field,
+                    shardStatistics.maxDoc() + extraMaxDoc,
+                    shardStatistics.docCount() + extraDocCount,
+                    shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+                    shardStatistics.sumDocFreq() + extraSumDocFreq);
+            return globalStatistics;
+          }
+        };
+    searcher.setSimilarity(new BM25Similarity());
+    CombinedFieldQuery query =
+        new CombinedFieldQuery.Builder()
+            .addField("f")
+            .addField("g")
+            .addTerm(new BytesRef(queryString))
+            .build();
+    // just check that search does not fail
+    searcher.search(query, 10);

Review Comment:
   It'd be nice to assert something stronger here, to check that `CombinedFieldQuery` still works as expected when collection stats are overridden. Maybe we could compare the output of two query strategies like we do in `testCopyField`.



##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setSimilarity(randomCompatibleSimilarity());
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    String queryString = "foo";
+
+    Document doc0 = new Document();
+    doc0.add(new TextField("f", "foo", Store.NO));
+    doc0.add(new TextField("g", "foo baz", Store.NO));
+    w.addDocument(doc0);
+
+    IndexReader reader = w.getReader();
+    IndexSearcher searcher =
+        new IndexSearcher(reader) {
+          @Override
+          public CollectionStatistics collectionStatistics(String field) throws IOException {
+            CollectionStatistics shardStatistics = super.collectionStatistics(field);
+            int extraMaxDoc = randomIntBetween(0, 10);
+            int extraDocCount = randomIntBetween(0, extraMaxDoc);
+            int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+            int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+            CollectionStatistics globalStatistics =
+                new CollectionStatistics(
+                    field,
+                    shardStatistics.maxDoc() + extraMaxDoc,
+                    shardStatistics.docCount() + extraDocCount,
+                    shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+                    shardStatistics.sumDocFreq() + extraSumDocFreq);
+            return globalStatistics;
+          }
+        };
+    searcher.setSimilarity(new BM25Similarity());

Review Comment:
   It's unusual to search with a different similarity than was used during indexing -- I think we could remove this line.



##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {

Review Comment:
   Small comment: maybe we could call this `testOverrideCollectionStatistics`? Lucene doesn't really have a native concept of "distributed collection statistics" (as far as I'm aware), and this test doesn't really use that concept anyway.




[GitHub] [lucene] ywelsch commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

Posted by GitBox <gi...@apache.org>.

ywelsch commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r880463557


##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setSimilarity(randomCompatibleSimilarity());
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    String queryString = "foo";
+
+    Document doc0 = new Document();
+    doc0.add(new TextField("f", "foo", Store.NO));
+    doc0.add(new TextField("g", "foo baz", Store.NO));
+    w.addDocument(doc0);
+
+    IndexReader reader = w.getReader();
+    IndexSearcher searcher =
+        new IndexSearcher(reader) {
+          @Override
+          public CollectionStatistics collectionStatistics(String field) throws IOException {
+            CollectionStatistics shardStatistics = super.collectionStatistics(field);
+            int extraMaxDoc = randomIntBetween(0, 10);
+            int extraDocCount = randomIntBetween(0, extraMaxDoc);
+            int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+            int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+            CollectionStatistics globalStatistics =
+                new CollectionStatistics(
+                    field,
+                    shardStatistics.maxDoc() + extraMaxDoc,
+                    shardStatistics.docCount() + extraDocCount,
+                    shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+                    shardStatistics.sumDocFreq() + extraSumDocFreq);
+            return globalStatistics;
+          }
+        };
+    searcher.setSimilarity(new BM25Similarity());

Review Comment:
   Fixed in [88b7f2c](https://github.com/apache/lucene/pull/910/commits/88b7f2ca8e44e554878a0c10f8ee6bfeb19e57d7).




[GitHub] [lucene] ywelsch commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

Posted by GitBox <gi...@apache.org>.

ywelsch commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r880465199


##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {

Review Comment:
   I used the term "distributed" because that is the use case mentioned in the Javadocs of the `collectionStatistics` method. Fine to rename it here ([88b7f2c](https://github.com/apache/lucene/pull/910/commits/88b7f2ca8e44e554878a0c10f8ee6bfeb19e57d7)).




[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

Posted by GitBox <gi...@apache.org>.

jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r881050412


##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,97 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testOverrideCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    Similarity similarity = randomCompatibleSimilarity();
+    iwc.setSimilarity(similarity);
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    int numMatch = atLeast(10);
+    for (int i = 0; i < numMatch; i++) {
+      Document doc = new Document();
+      if (random().nextBoolean()) {
+        doc.add(new TextField("a", "baz", Store.NO));
+        doc.add(new TextField("b", "baz", Store.NO));
+        for (int k = 0; k < 2; k++) {
+          doc.add(new TextField("ab", "baz", Store.NO));
+        }
+        w.addDocument(doc);
+        doc.clear();
+      }
+      int freqA = random().nextInt(5) + 1;
+      for (int j = 0; j < freqA; j++) {
+        doc.add(new TextField("a", "foo", Store.NO));
+      }
+      int freqB = random().nextInt(5) + 1;
+      for (int j = 0; j < freqB; j++) {
+        doc.add(new TextField("b", "foo", Store.NO));
+      }
+      int freqAB = freqA + freqB;
+      for (int j = 0; j < freqAB; j++) {
+        doc.add(new TextField("ab", "foo", Store.NO));
+      }
+      w.addDocument(doc);
+    }
+
+    IndexReader reader = w.getReader();
+
+    int extraMaxDoc = randomIntBetween(0, 10);
+    int extraDocCount = randomIntBetween(0, extraMaxDoc);
+
+    int extraSumDocFreqA = extraDocCount + randomIntBetween(0, 10);

Review Comment:
   I think it'd make more sense to have a single `sumDocFreq` here. This represents the number of unique term-document pairs, and we can't just add the values across different fields. In fact, `CombinedFieldQuery` takes the maximum `sumDocFreq` across its fields.
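
To illustrate why the per-field values cannot simply be added, here is a small self-contained sketch. It counts unique (term, document) pairs for two toy fields and for their concatenation. The class `SumDocFreqSketch`, its helpers and the toy data are hypothetical and use no Lucene API. In this example the count for the combined field falls between the per-field maximum and the per-field sum.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SumDocFreqSketch {

  /** Counts unique (term, docId) pairs; fieldDocs.get(docId) holds that document's terms. */
  public static long sumDocFreq(List<List<String>> fieldDocs) {
    Set<String> pairs = new HashSet<>();
    for (int docId = 0; docId < fieldDocs.size(); docId++) {
      for (String term : fieldDocs.get(docId)) {
        pairs.add(term + "\u0000" + docId);
      }
    }
    return pairs.size();
  }

  /** Builds a pseudo "combined" field by concatenating the terms of both fields per document. */
  public static List<List<String>> combine(List<List<String>> a, List<List<String>> b) {
    List<List<String>> combined = new ArrayList<>();
    for (int docId = 0; docId < Math.max(a.size(), b.size()); docId++) {
      List<String> terms = new ArrayList<>();
      if (docId < a.size()) {
        terms.addAll(a.get(docId));
      }
      if (docId < b.size()) {
        terms.addAll(b.get(docId));
      }
      combined.add(terms);
    }
    return combined;
  }

  public static void main(String[] args) {
    List<List<String>> a = Arrays.asList(Arrays.asList("foo", "baz"), Arrays.asList("foo"));
    List<List<String>> b = Arrays.asList(Arrays.asList("foo"), Arrays.asList("bar"));

    long sdfA = sumDocFreq(a);
    long sdfB = sumDocFreq(b);
    long sdfCombined = sumDocFreq(combine(a, b));

    System.out.println("a=" + sdfA + ", b=" + sdfB + ", combined=" + sdfCombined);
    System.out.println("max=" + Math.max(sdfA, sdfB) + ", sum=" + (sdfA + sdfB));
  }
}
```

The sum counts the pair ("foo", document 0) twice because the term occurs in both fields of that document, so adding the values overstates the combined statistic.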




[GitHub] [lucene] jtibshirani merged pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

Posted by GitBox <gi...@apache.org>.

jtibshirani merged PR #910:
URL: https://github.com/apache/lucene/pull/910



[GitHub] [lucene] ywelsch commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

Posted by GitBox <gi...@apache.org>.

ywelsch commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r880465450


##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setSimilarity(randomCompatibleSimilarity());
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    String queryString = "foo";
+
+    Document doc0 = new Document();
+    doc0.add(new TextField("f", "foo", Store.NO));
+    doc0.add(new TextField("g", "foo baz", Store.NO));
+    w.addDocument(doc0);
+
+    IndexReader reader = w.getReader();
+    IndexSearcher searcher =
+        new IndexSearcher(reader) {
+          @Override
+          public CollectionStatistics collectionStatistics(String field) throws IOException {
+            CollectionStatistics shardStatistics = super.collectionStatistics(field);
+            int extraMaxDoc = randomIntBetween(0, 10);
+            int extraDocCount = randomIntBetween(0, extraMaxDoc);
+            int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+            int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+            CollectionStatistics globalStatistics =
+                new CollectionStatistics(
+                    field,
+                    shardStatistics.maxDoc() + extraMaxDoc,
+                    shardStatistics.docCount() + extraDocCount,
+                    shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+                    shardStatistics.sumDocFreq() + extraSumDocFreq);
+            return globalStatistics;
+          }
+        };
+    searcher.setSimilarity(new BM25Similarity());
+    CombinedFieldQuery query =
+        new CombinedFieldQuery.Builder()
+            .addField("f")
+            .addField("g")
+            .addTerm(new BytesRef(queryString))
+            .build();
+    // just check that search does not fail
+    searcher.search(query, 10);

Review Comment:
   I gave that a try in [88b7f2c](https://github.com/apache/lucene/pull/910/commits/88b7f2ca8e44e554878a0c10f8ee6bfeb19e57d7). Let me know what you think.


