Posted to issues@lucene.apache.org by GitBox <gi...@apache.org> on 2021/11/17 08:47:52 UTC
[GitHub] [lucene] zacharymorn commented on a change in pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery
zacharymorn commented on a change in pull request #418:
URL: https://github.com/apache/lucene/pull/418#discussion_r751018772
##########
File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java
##########
@@ -441,6 +491,273 @@ public boolean isCacheable(LeafReaderContext ctx) {
}
}
+  /** Merge impacts for combined field. */
+  static ImpactsSource mergeImpacts(
+      Map<String, List<ImpactsEnum>> fieldsWithImpactsEnums,
+      Map<String, List<Impacts>> fieldsWithImpacts,
+      Map<String, Float> fieldWeights) {
+    return new ImpactsSource() {
+
+      class SubIterator {
+        final Iterator<Impact> iterator;
+        int previousFreq;
+        Impact current;
+
+        SubIterator(Iterator<Impact> iterator) {
+          this.iterator = iterator;
+          this.current = iterator.next();
+        }
+
+        void next() {
+          previousFreq = current.freq;
+          if (iterator.hasNext() == false) {
+            current = null;
+          } else {
+            current = iterator.next();
+          }
+        }
+      }
+
+      @Override
+      public Impacts getImpacts() throws IOException {
+        // Use the impacts that have the lower next boundary (doc id in skip entry) as a lead for
+        // each field
+        // They collectively will decide on the number of levels and the block boundaries.
+        Map<String, Impacts> leadingImpactsPerField = new HashMap<>(fieldsWithImpactsEnums.size());
+
+        for (Map.Entry<String, List<ImpactsEnum>> fieldImpacts :
+            fieldsWithImpactsEnums.entrySet()) {
+          String field = fieldImpacts.getKey();
+          List<ImpactsEnum> impactsEnums = fieldImpacts.getValue();
+          fieldsWithImpacts.put(field, new ArrayList<>(impactsEnums.size()));
+
+          Impacts tmpLead = null;
+          // find the impact that has the lowest next boundary for this field
+          for (int i = 0; i < impactsEnums.size(); ++i) {
+            Impacts impacts = impactsEnums.get(i).getImpacts();
+            fieldsWithImpacts.get(field).add(impacts);
+
+            if (tmpLead == null || impacts.getDocIdUpTo(0) < tmpLead.getDocIdUpTo(0)) {
+              tmpLead = impacts;
+            }
+          }
+
+          leadingImpactsPerField.put(field, tmpLead);
+        }
+
+        return new Impacts() {
+
+          @Override
+          public int numLevels() {
+            // max of levels across fields' impactEnums
+            int result = 0;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              result = Math.max(result, impacts.numLevels());
+            }
+
+            return result;
+          }
+
+          @Override
+          public int getDocIdUpTo(int level) {
+            // min of docIdUpTo across fields' impactEnums
+            int result = Integer.MAX_VALUE;
+
+            for (Impacts impacts : leadingImpactsPerField.values()) {
+              if (impacts.numLevels() > level) {
+                result = Math.min(result, impacts.getDocIdUpTo(level));
+              }
+            }
+
+            return result;
+          }
Review comment:
Thanks for the suggestion! I assume that by "highest weight" here you meant the term with the lower doc frequency, as opposed to the field weight?
I also ran a quick test with the following updated `numLevels` / `getDocIdUpTo` implementations, to approximate using the impacts of the term with the lower doc frequency:
```
@Override
public int numLevels() {
  // this is changed from 0
  int result = Integer.MAX_VALUE;
  for (Impacts impacts : leadingImpactsPerField.values()) {
    // this is changed from Math.max
    result = Math.min(result, impacts.numLevels());
  }
  return result;
}

@Override
public int getDocIdUpTo(int level) {
  // this is changed from Integer.MAX_VALUE
  int result = Integer.MIN_VALUE;
  for (Impacts impacts : leadingImpactsPerField.values()) {
    if (impacts.numLevels() > level) {
      // this is changed from Math.min
      result = Math.max(result, impacts.getDocIdUpTo(level));
    }
  }
  return result;
}
```
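To make the difference between the two aggregation strategies concrete, here is a minimal standalone sketch. It does not use the Lucene API: each `int[]` is a hypothetical stand-in for one field's leading `Impacts`, with `boundaries[level]` playing the role of `getDocIdUpTo(level)`, and the class name, method names, and boundary values are made up for illustration. It assumes at least one field is present.

```java
// Illustrative only: plain arrays stand in for per-field Impacts objects.
// boundaries[level] plays the role of Impacts#getDocIdUpTo(level), and
// boundaries.length plays the role of Impacts#numLevels().
final class LevelBoundaryMerge {

  /** Strategy in the PR diff: expose as many levels as the deepest field. */
  static int numLevelsMax(int[][] fields) {
    int result = 0;
    for (int[] boundaries : fields) {
      result = Math.max(result, boundaries.length);
    }
    return result;
  }

  /** Experimental strategy: expose only the levels that every field has. */
  static int numLevelsMin(int[][] fields) {
    int result = Integer.MAX_VALUE;
    for (int[] boundaries : fields) {
      result = Math.min(result, boundaries.length);
    }
    return result;
  }

  /** Strategy in the PR diff: the smallest boundary among fields that have this level. */
  static int docIdUpToMin(int[][] fields, int level) {
    int result = Integer.MAX_VALUE;
    for (int[] boundaries : fields) {
      if (boundaries.length > level) {
        result = Math.min(result, boundaries[level]);
      }
    }
    return result;
  }

  /** Experimental strategy: the largest boundary among fields that have this level. */
  static int docIdUpToMax(int[][] fields, int level) {
    int result = Integer.MIN_VALUE;
    for (int[] boundaries : fields) {
      if (boundaries.length > level) {
        result = Math.max(result, boundaries[level]);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical boundaries: a "title" field with 2 levels, a "body" field with 3.
    int[][] fields = {{127, 1023}, {63, 511, 4095}};

    System.out.println("max/min strategy: levels=" + numLevelsMax(fields)
        + " level0=" + docIdUpToMin(fields, 0));
    System.out.println("min/max strategy: levels=" + numLevelsMin(fields)
        + " level0=" + docIdUpToMax(fields, 0));
  }
}
```

With these made-up boundaries, the max/min strategy exposes 3 levels and a level-0 boundary of 63, while the min/max strategy exposes 2 levels and a level-0 boundary of 127. In this toy model the second variant yields fewer, wider blocks.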
For the slow query `CFQHighHigh: at united +combinedFields=titleTokenized^4.0,body^2.0 # freq=2834104 freq=1185528` above, the updated implementation did improve the result from `-42%` to between `-30%` and `-35%`, with the following JFR CPU result:
```
PERCENT CPU SAMPLES STACK
19.41% 11866 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer$MultiFieldNormValues#advanceExact()
7.35% 4491 org.apache.lucene.search.DisiPriorityQueue#downHeap()
6.07% 3713 org.apache.lucene.search.similarities.BM25Similarity$BM25Scorer#score()
3.59% 2193 org.apache.lucene.search.TopScoreDocCollector$SimpleTopScoreDocCollector$1#collect()
3.50% 2143 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#freq()
3.24% 1983 org.apache.lucene.search.DisjunctionDISIApproximation#advance()
3.09% 1889 org.apache.lucene.sandbox.search.CombinedFieldQuery$WeightedDisiWrapper#freq()
2.87% 1752 org.apache.lucene.search.DisiPriorityQueue#top()
2.75% 1681 java.lang.Math#round()
2.71% 1657 org.apache.lucene.search.DisiPriorityQueue#topList()
2.54% 1555 org.apache.lucene.codecs.lucene90.Lucene90NormsProducer$3#longValue()
2.36% 1441 org.apache.lucene.store.ByteBufferGuard#ensureValid()
2.11% 1292 org.apache.lucene.util.SmallFloat#longToInt4()
1.87% 1142 org.apache.lucene.sandbox.search.CombinedFieldQuery$CombinedFieldScorer#score()
1.73% 1058 org.apache.lucene.search.DisiPriorityQueue#updateTop()
1.71% 1045 org.apache.lucene.sandbox.search.MultiNormsLeafSimScorer#getNormValue()
1.68% 1030 org.apache.lucene.store.ByteBufferGuard#getByte()
1.59% 970 jdk.internal.misc.Unsafe#getByte()
1.57% 959 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader#findFirstGreater()
1.55% 948 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#advance()
1.18% 720 org.apache.lucene.search.ImpactsDISI#docID()
0.84% 515 org.apache.lucene.sandbox.search.CombinedFieldQuery$1$1#doMergeImpactsPerField()
0.79% 485 org.apache.lucene.search.Weight$DefaultBulkScorer#scoreAll()
0.76% 462 org.apache.lucene.search.DisiPriorityQueue#prepend()
0.73% 444 java.lang.Math#toIntExact()
0.60% 367 org.apache.lucene.codecs.lucene90.Lucene90PostingsReader$BlockImpactsDocsEnum#freq()
0.53% 324 org.apache.lucene.search.ImpactsDISI#advanceTarget()
0.51% 314 org.apache.lucene.codecs.MultiLevelSkipListReader#skipTo()
0.50% 306 org.apache.lucene.util.SmallFloat#intToByte4()
0.49% 299 org.apache.lucene.search.ImpactsDISI#nextDoc()
```
This CPU profiling result looks very similar to that of the baseline. I'll do more testing to understand why.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org