Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2022/05/20 22:03:04 UTC

[GitHub] [lucene] jtibshirani commented on a diff in pull request #910: LUCENE-10582: Fix merging of CollectionStatistics in CombinedFieldQuery

jtibshirani commented on code in PR #910:
URL: https://github.com/apache/lucene/pull/910#discussion_r878568768


##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setSimilarity(randomCompatibleSimilarity());
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    String queryString = "foo";
+
+    Document doc0 = new Document();
+    doc0.add(new TextField("f", "foo", Store.NO));
+    doc0.add(new TextField("g", "foo baz", Store.NO));
+    w.addDocument(doc0);
+
+    IndexReader reader = w.getReader();
+    IndexSearcher searcher =
+        new IndexSearcher(reader) {
+          @Override
+          public CollectionStatistics collectionStatistics(String field) throws IOException {
+            CollectionStatistics shardStatistics = super.collectionStatistics(field);
+            int extraMaxDoc = randomIntBetween(0, 10);
+            int extraDocCount = randomIntBetween(0, extraMaxDoc);
+            int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+            int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+            CollectionStatistics globalStatistics =
+                new CollectionStatistics(
+                    field,
+                    shardStatistics.maxDoc() + extraMaxDoc,
+                    shardStatistics.docCount() + extraDocCount,
+                    shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+                    shardStatistics.sumDocFreq() + extraSumDocFreq);
+            return globalStatistics;
+          }
+        };
+    searcher.setSimilarity(new BM25Similarity());
+    CombinedFieldQuery query =
+        new CombinedFieldQuery.Builder()
+            .addField("f")
+            .addField("g")
+            .addTerm(new BytesRef(queryString))
+            .build();
+    // just check that search does not fail
+    searcher.search(query, 10);

Review Comment:
   It'd be nice to assert something stronger here, to check that `CombinedFieldQuery` still works as expected when collection stats are overridden. Maybe we could compare the output of two query strategies like we do in `testCopyField`.
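
For context on what "overridden" means here: the override in the test above appears to keep the fabricated statistics valid because each extra value is built on top of the previous one. Below is a minimal plain-Java sketch of that idea. `Stats` and `withExtras` are hypothetical stand-ins, not Lucene classes, and the checks are an assumption about what `CollectionStatistics` enforces (positive values, docCount <= maxDoc, sumDocFreq >= docCount, sumTotalTermFreq >= sumDocFreq).

```java
import java.util.Random;

class OverrideStatsSketch {

  /** Hypothetical stand-in for Lucene's CollectionStatistics; not the real class. */
  static final class Stats {
    final long maxDoc;
    final long docCount;
    final long sumTotalTermFreq;
    final long sumDocFreq;

    Stats(long maxDoc, long docCount, long sumTotalTermFreq, long sumDocFreq) {
      // Assumed invariants, modeled on the argument checks in CollectionStatistics.
      if (maxDoc <= 0 || docCount <= 0 || sumDocFreq <= 0 || sumTotalTermFreq <= 0) {
        throw new IllegalArgumentException("all statistics must be positive");
      }
      if (docCount > maxDoc) {
        throw new IllegalArgumentException("docCount must be <= maxDoc");
      }
      if (sumDocFreq < docCount) {
        throw new IllegalArgumentException("sumDocFreq must be >= docCount");
      }
      if (sumTotalTermFreq < sumDocFreq) {
        throw new IllegalArgumentException("sumTotalTermFreq must be >= sumDocFreq");
      }
      this.maxDoc = maxDoc;
      this.docCount = docCount;
      this.sumTotalTermFreq = sumTotalTermFreq;
      this.sumDocFreq = sumDocFreq;
    }
  }

  /** Adds random "other shard" contributions in the same chained way as the test. */
  static Stats withExtras(Stats shard, Random random) {
    int extraMaxDoc = random.nextInt(11); // 0..10
    int extraDocCount = random.nextInt(extraMaxDoc + 1); // 0..extraMaxDoc
    int extraSumDocFreq = extraDocCount + random.nextInt(11);
    int extraSumTotalTermFreq = extraSumDocFreq + random.nextInt(11);
    return new Stats(
        shard.maxDoc + extraMaxDoc,
        shard.docCount + extraDocCount,
        shard.sumTotalTermFreq + extraSumTotalTermFreq,
        shard.sumDocFreq + extraSumDocFreq);
  }

  public static void main(String[] args) {
    // Field "g" of the single test document: 1 doc, 2 tokens, 2 distinct terms.
    Stats shard = new Stats(1, 1, 2, 2);
    Random random = new Random(42);
    for (int i = 0; i < 1000; i++) {
      withExtras(shard, random); // the constructor throws if an invariant is violated
    }
  }
}
```

Running `main` should finish without an exception, since every generated combination respects the assumed ordering of the four values.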



##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {
+    Directory dir = newDirectory();
+    IndexWriterConfig iwc = new IndexWriterConfig();
+    iwc.setSimilarity(randomCompatibleSimilarity());
+    RandomIndexWriter w = new RandomIndexWriter(random(), dir, iwc);
+
+    String queryString = "foo";
+
+    Document doc0 = new Document();
+    doc0.add(new TextField("f", "foo", Store.NO));
+    doc0.add(new TextField("g", "foo baz", Store.NO));
+    w.addDocument(doc0);
+
+    IndexReader reader = w.getReader();
+    IndexSearcher searcher =
+        new IndexSearcher(reader) {
+          @Override
+          public CollectionStatistics collectionStatistics(String field) throws IOException {
+            CollectionStatistics shardStatistics = super.collectionStatistics(field);
+            int extraMaxDoc = randomIntBetween(0, 10);
+            int extraDocCount = randomIntBetween(0, extraMaxDoc);
+            int extraSumDocFreq = extraDocCount + randomIntBetween(0, 10);
+            int extraSumTotalTermFreq = extraSumDocFreq + randomIntBetween(0, 10);
+            CollectionStatistics globalStatistics =
+                new CollectionStatistics(
+                    field,
+                    shardStatistics.maxDoc() + extraMaxDoc,
+                    shardStatistics.docCount() + extraDocCount,
+                    shardStatistics.sumTotalTermFreq() + extraSumTotalTermFreq,
+                    shardStatistics.sumDocFreq() + extraSumDocFreq);
+            return globalStatistics;
+          }
+        };
+    searcher.setSimilarity(new BM25Similarity());

Review Comment:
   It's unusual to search with a different similarity than was used during indexing -- I think we could remove this line.



##########
lucene/sandbox/src/test/org/apache/lucene/sandbox/search/TestCombinedFieldQuery.java:
##########
@@ -589,4 +589,52 @@ public SimScorer scorer(
       return new BM25Similarity().scorer(boost, collectionStats, termStats);
     }
   }
+
+  public void testDistributedCollectionStatistics() throws IOException {

Review Comment:
   Small comment: maybe we could call this `testOverrideCollectionStatistics`? As far as I'm aware, Lucene doesn't really have a native concept of "distributed collection statistics", and this test doesn't really use that concept anyway.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
For additional commands, e-mail: issues-help@lucene.apache.org