You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@lucene.apache.org by "Kevin Risden (Jira)" <ji...@apache.org> on 2022/04/30 02:44:00 UTC
[jira] [Updated] (LUCENE-10542) FieldSource exists implementations can avoid value retrieval
[ https://issues.apache.org/jira/browse/LUCENE-10542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Kevin Risden updated LUCENE-10542:
----------------------------------
Summary: FieldSource exists implementations can avoid value retrieval (was: FieldSource exists implementation can avoid value retrieval)
> FieldSource exists implementations can avoid value retrieval
> ------------------------------------------------------------
>
> Key: LUCENE-10542
> URL: https://issues.apache.org/jira/browse/LUCENE-10542
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Minor
> Attachments: flamegraph_getValueForDoc.png
>
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists.
> Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call
> Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original)
> !flamegraph_getValueForDoc.png|height=410,width=1000!
> LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:
> * FilterNumericDocValues#longValue()
> * advance()
> This makes sense based on looking at the code (copied below to make it easier to see at once)
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72
> {code:java}
> private long getValueForDoc(int doc) throws IOException {
> if (doc < lastDocID) {
> throw new IllegalArgumentException(
> "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
> }
> lastDocID = doc;
> int curDocID = arr.docID();
> if (doc > curDocID) {
> curDocID = arr.advance(doc);
> }
> if (doc == curDocID) {
> return arr.longValue();
> } else {
> return 0;
> }
> }
> {code}
> LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values.
> https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95
> {code:java}
> @Override
> public boolean exists(int doc) throws IOException {
> getValueForDoc(doc);
> return arr.docID() == doc;
> }
> {code}
> So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically.
> This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.
> I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like:
> ----
> Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.
>
> | Benchmark | Mode | Cnt | Score and Error | Units |
> |-----------------------------------------------------------------|-------|-----|------------------|-------|
> | MyBenchmark.testMaxFloatFunction | thrpt | 25 | 64.159 ± 2.031 | ops/s |
> | MyBenchmark.testNewMaxFloatFunction | thrpt | 25 | 94.997 ± 2.365 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.191 ± 9.291 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource | thrpt | 25 | 123.817 ± 6.191 | ops/s |
> | MyBenchmark.testMaxFloatFunctionRareField | thrpt | 25 | 244.921 ± 6.439 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionRareField | thrpt | 25 | 239.288 ± 5.136 | ops/s |
> | MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 271.521 ± 3.870 | ops/s |
> | MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField | thrpt | 25 | 279.334 ± 10.511 | ops/s |
> Source: https://github.com/risdenk/lucene-jmh
--
This message was sent by Atlassian Jira
(v8.20.7#820007)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@lucene.apache.org
For additional commands, e-mail: issues-help@lucene.apache.org