You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/17 08:47:52 UTC
[GitHub] [lucene] zacharymorn commented on a change in pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery

zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r751018772



##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
     }
   }
 
+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }

Review comment:
       Thanks for the suggestion! I assume by "highest weight" here, you meant term that has lower doc frequencies, as opposed to field weight? 
   
   I also did a quick test with the following updated numLevels / getDocIdupTo implementations to approximate using lower doc frequencies term's impact
   
   ```
   @Override
   public int numLevels() {
     // this is changed from Integer.MIN_VALUE
     int result = Integer.MAX_VALUE;
    
     // this is changed from Math.max
     for (Impacts impacts : leadingImpactsPerField.values()) {
       result = Math.min(result, impacts.numLevels());
     }
   
     return result;
   }
   
   @Override
   public int getDocIdUpTo(int level) {
     // this is changed from Integer.MAX_VALUE
     int result = Integer.MIN_VALUE;
   
     for (Impacts impacts : leadingImpactsPerField.values()) {
       if (impacts.numLevels() > level) {
         // this is changed from Math.min
         result = Math.max(result, impacts.getDocIdUpTo(level));
       }
     }
   
     return result;
   }
   ```
   
   For the slow query `CFQHighHigh: at united +combinedFields=titleTokenized^4.0,body^2.0 # freq=2834104 freq=1185528` above, it did improve the from `-42%` to `-30%` ~ `-35%`, with the following JFR CPU result: 
   
   ```
   PERCENT       CPU SAMPLES   STACK
   19.41%        11866         org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
   7.35%         4491          org.apache.lucene.search.DisiPriorityQueue#downHeap()
   6.07%         3713          org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
   3.59%         2193          org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
   3.50%         2143          org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
   3.24%         1983          org.apache.lucene.search.DisjunctionDISIApproximation#advance()
   3.09%         1889          org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
   2.87%         1752          org.apache.lucene.search.DisiPriorityQueue#top()
   2.75%         1681          java.lang.Math#round()
   2.71%         1657          org.apache.lucene.search.DisiPriorityQueue#topList()
   2.54%         1555          org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
   2.36%         1441          org.apache.lucene.store.ByteBufferGuard#ensureValid()
   2.11%         1292          org.apache.lucene.util.SmallFloat#longToInt4()
   1.87%         1142          org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
   1.73%         1058          org.apache.lucene.search.DisiPriorityQueue#updateTop()
   1.71%         1045          org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
   1.68%         1030          org.apache.lucene.store.ByteBufferGuard#getByte()
   1.59%         970           jdk.internal.misc.Unsafe#getByte()
   1.57%         959           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
   1.55%         948           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
   1.18%         720           org.apache.lucene.search.ImpactsDISI#docID()
   0.84%         515           org.apache.lucene.sandbox.search.CombinedFieldQuery$1$1#doMergeImpactsPerField()
   0.79%         485           org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
   0.76%         462           org.apache.lucene.search.DisiPriorityQueue#prepend()
   0.73%         444           java.lang.Math#toIntExact()
   0.60%         367           org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#freq()
   0.53%         324           org.apache.lucene.search.ImpactsDISI#advanceTarget()
   0.51%         314           org.apache.lucene.codecs.MultiLevelSkipListReader#skipTo()
   0.50%         306           org.apache.lucene.util.SmallFloat#intToByte4()
   0.49%         299           org.apache.lucene.search.ImpactsDISI#nextDoc()
   ```
   
   This CPU profiling result looks very similar to that of baseline. I'll do more testings to understand why.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org