You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2020/02/06 13:13:17 UTC

[GitHub] [druid] sascha-coenen opened a new issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

sascha-coenen opened a new issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321
 
 
   ### Affected Version
   0.16.0
   
   ### Description
   
   Given a Druid v0.16.0 cluster configured with SQL-compatible null handling enabled,
   the intial performance we measured was inconspicuous, but after a while, there would
   be a drastic performance degradation for topN queries.
   
   After much testing we found out that initial performance of a freshly started Druid 
   cluster would be consistently fast UNTIL a groupbyV2 query gets exececuted for the 
   first time. 
   After that the performance of topN queries would degrade by 70% or more.
   This degradation is specific to topN queries and also seems to apply
   only to heavy topN queries (8 aggregations, several sequential passes).
   
   We looked at any operational metric we have but could not find a root cause for the degradation.
   The degradation would not fade out with time. Also, a forced full garbage collection would not 
   recover any performance.
   Furthermore the execution of a single groupbV2 query, any groupbyV2 query seems to trigger the
   degradation.
   
   We have a performance testsuite and a metrics dashboard.
   In the screenshots from the perf testsuite below you can see the degradation
   in topn queries after the execution of the first groupbyV2 query in the before/after
   view.
   
   Furthermore, the dashboard shows a different test we performed to illustrate the 
   performance degradation: we initially sent a sequential stream of topN queries to a
   freshly started Druid cluster for a long time. Then we issued a single groupbyV2 query
   while the stream of topN queries would continue. One can clearly see how performance
   degrades immediately and is constant before and after. 
   The dashboard shows the segment-scan-time metrics to illustrate that the degradation
   happens on the historicals by way of decreased scan times.
   
   In the attempt to hone in on root-causes, we ran further tests that had subsystems of Druid disabled:
   * disabling metric emission
   * disabling log emission
   * disabling all caches
   However, in all these cases the performance degradation remained.
   
   As we keep sending the same query many times, we can also rule out effects caused by disk access because the segments needed for serving the query would be paged into memory.
   
   Then we turned off the SQL-compatible null handling and the performance issue was gone.
   
   <img width="970" alt="perf-suite-before-after" src="https://user-images.githubusercontent.com/1635350/73939413-af121880-48e9-11ea-8223-5789d7582112.png">
   
   <img width="1020" alt="performance-issue-topn-groupby-sqlcompatiblenullhandling" src="https://user-images.githubusercontent.com/1635350/73939765-70c92900-48ea-11ea-963b-748438c1374a.png">
   
   We haven't tested yet whether the issue still remains with Druid 0.17.0 because it will take us a while to upgrade, but I meant to report the issue as early as possible. We have no idea what could be causing this.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591691480
 
 
   Hi.
   we tried the vectorization engine extensively and it works VERY well for us. Most of the times throughout the years when you guys reported significant performance gains, we could usually not reproduce them in our clusters very well. 
   But with this thing we are right in the center:  we are using local filter conditions a LOT and prior to vectorization, to my understanding, Druid had to scan over those, not applying bitmap filters which with vectorization would already work with bitmaps. 
   I don't know whether this assumption is true, but if feels to me that this is where the speedups might be the biggest. We do see speedups to 3x for timeseries across the board and 2x for groupby queries.
   The only exception is that we have this very often used class of queries, namely groupby's with an order by clause on a measure which don't profit from vectorization at all because they seem to be bound by the data volume streamed back from the historicals.
   To speed up this important class of queries we very much depend on the "forceLimitPushDown' feature which is unfortunately hasn't been working for SQL queries for a while (since 0.11.0).
   This in combination with the vectorization engine would be blow away amazing.    
   
   wrt the above perf issue: 
   what you describe sounds reasonable.
   Let me add that it is very easy for us to reproduce this performance degradation. It does not matter which kind of groupbyV2 query we use. We basically saw it happen with pretty much any query. With topn it seems that the query needs to have a certain weight, like a couple of measures and I more than one sequential pass.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
suneet-s edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591551174
 
 
   Thanks for the flamegraphs! I read the code last night and based on the flamegraphs, the best thing I can think of that might be happening is maybe something in the hotspot is less optimal after the group by query is run. The buffers are shared across the TopNQueryEngine and the GroupByQueryEngine, so somehow running a group by query is impacting the group by performance.
   
   Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs.
   
   **My guess at this point is this an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario**
   
   I'm also attaching screenshots from your flamegraphs that highlight what you pointed out, so others with more knowledge than me can either confirm these suspicions or suggest what might actually be happening.
   When only topN queries were run - the time spent in NullableNumericBuffer#aggregate was mainly spent in reading the bytes (LZ4...)
   <img width="1423" alt="topN-after-groupBy" src="https://user-images.githubusercontent.com/44787917/75371559-d2423f00-587b-11ea-819b-96502be7012f.png">
   
   After a groupBy query was run, and topN queries were profiled - NullableNumericBuffer#aggregate has a bunch of time spent where it isn't reading the bytes, but doing something else
   <img width="1428" alt="topN-before-groupBy" src="https://user-images.githubusercontent.com/44787917/75371555-cfdfe500-587b-11ea-98ca-d0ef75d6d7ee.png">
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591691480
 
 
   Hi.
   we tried the vectorization engine extensively and it works VERY well for us. Most of the times throughout the years when you guys reported significant performance gains, we could usually not reproduce them in our clusters very well. 
   But with this thing are right in the center:  we are using local filter conditions a LOT and prior to vectorization, to my understanding, Druid had to scan over those, not applying bitmap filters which with vectorization would already work with bitmaps. 
   I don't know whether this assumption is true, but if feels to me that this is where the speedups might be the biggest. We do see speedups to 3x for timeseries across the board and 2x for groupby queries.
   The only exception is that we have this very often used class of queries, namely groupby's with an order by clause on a measure which don't profit from vectorization at all because they seem to be bound by the data volume streamed back from the historicals.
   To speed up this important class of queries we very much depend on the "forceLimitPushDown' feature which is unfortunately hasn't been working for SQL queries for a while (since 0.11.0).
   This in combination with the vectorization engine would be blow away amazing.    
   
   wrt the above perf issue: 
   what you describe sounds reasonable.
   Let me add that it is very easy for us to reproduce this performance degradation. It does not matter which kind of groupbyV2 query we use. We basically saw it happen with pretty much any query. With topn it seems that the query needs to have a certain weight, like a couple of measures and I more than one sequential pass.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591691480
 
 
   Hi.
   we tried the vectorization engine extensively and it works VERY well for us. Most of the times throughout the years when you guys reported significant performance gains, we could usually not reproduce them in our clusters very well. 
   But with this thing we are right in the center:  we are using local filter conditions a LOT and prior to vectorization, to my understanding, Druid had to scan over those, not applying bitmap filters whereas vectorization would already work with bitmaps. 
   I don't know whether this assumption is true, but if feels to me that this is where the speedups might be the biggest. We do see speedups to 3x for timeseries across the board and 2x for groupby queries.
   The only exception is that we have this very often used class of queries, namely groupby's with an order by clause on a measure which don't profit from vectorization at all because they seem to be bound by the data volume streamed back from the historicals.
   To speed up this important class of queries we very much depend on the "forceLimitPushDown' feature which is unfortunately hasn't been working for SQL queries for a while (since 0.11.0).
   This in combination with the vectorization engine would be blow away amazing.    
   
   wrt the above perf issue: 
   what you describe sounds reasonable.
   Let me add that it is very easy for us to reproduce this performance degradation. It does not matter which kind of groupbyV2 query we use. We basically saw it happen with pretty much any query. With topn it seems that the query needs to have a certain weight, like a couple of measures and I more than one sequential pass.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
suneet-s commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591551174
 
 
   Thanks for the flamegraphs! I read the code last night and based on the flamegraphs, the best thing I can think of that might be happening is maybe something in the hotspot is less optimal after the group by query is run. The buffers are shared across the TopNQueryEngine and the GroupByQueryEngine, so somehow running a group by query is impacting the group by performance.
   
   Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs.
   
   **My guess at this point is this an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario**
   
   I'm also attaching screenshots from your flamegraphs that highlight what you pointed out, so others with more knowledge than me can either confirm these suspicions or suggest what might actually be happening.
   
   <img width="1428" alt="topN-before-groupBy" src="https://user-images.githubusercontent.com/44787917/75371555-cfdfe500-587b-11ea-98ca-d0ef75d6d7ee.png">
   <img width="1423" alt="topN-after-groupBy" src="https://user-images.githubusercontent.com/44787917/75371559-d2423f00-587b-11ea-819b-96502be7012f.png">
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
suneet-s commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-590849778
 
 
   interesting finding... Can you profile the setup to see if there are any clues about what is different between before / after running the first group by v2 query.
   
   Some instructions on how to do that - https://support.imply.io/hc/en-us/articles/360033747953-Profiling-Druid-queries-using-flame-graphs

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591252082
 
 
   Hi, thanks for the pointer. sjk looks like a supernice tool. really easy to use.
   
   I attached a zip archive with the "before and after" flamegraph reports.
   I'm not an expert at reading flamegraphs, but to me the biggest difference seems to be the 
   org.apache.druid.query.aggregation.NullableNumericBufferAggregator.aggegate method.
   
   [flamegraphs.zip](https://github.com/apache/druid/files/4253564/flamegraphs.zip)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591672502
 
 
   Btw, have you tried the vectorized groupBy engine? It doesn't apply in every case yet (see the docs at https://druid.apache.org/docs/latest/querying/query-context.html#vectorizable-queries) but I'm curious how well it works for you. We'd like to make this the default in a future release.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
suneet-s edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591551174
 
 
   Thanks for the flamegraphs! I read the code last night and based on the flamegraphs, the best thing I can think of that might be happening is maybe something in the hotspot is less optimal after the group by query is run. The buffers are shared across the TopNQueryEngine and the GroupByQueryEngine, so somehow running a group by query is impacting the group by performance.
   
   Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs.
   
   **My guess at this point is this an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario**
   
   I'm also attaching screenshots from your flamegraphs that highlight what you pointed out, so others with more knowledge than me can either confirm these suspicions or suggest what might actually be happening.
   
   When only topN queries were run - the time spent in NullableNumericBuffer#aggregate was mainly spent in reading the bytes (LZ4...)
   <img width="1428" alt="topN-before-groupBy" src="https://user-images.githubusercontent.com/44787917/75371555-cfdfe500-587b-11ea-98ca-d0ef75d6d7ee.png">
   
   After a groupBy query was run, and topN queries were profiled - NullableNumericBuffer#aggregate has a bunch of time spent where it isn't reading the bytes, but doing something else
   <img width="1423" alt="topN-after-groupBy" src="https://user-images.githubusercontent.com/44787917/75371559-d2423f00-587b-11ea-819b-96502be7012f.png">
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-590844294
 
 
   I updated the ticket description.
   We have now tested this with druid 0.17.0 and can report that the issue is still present in this latest release.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591691480
 
 
   Hi.
   we tried the vectorization engine extensively and it works VERY well for us. Most of the times throughout the years when you guys reported significant performance gains, we could usually not reproduce them in our clusters very well. 
   But with this thing we are right in the center:  we are using local filter conditions a LOT and prior to vectorization, to my understanding, Druid had to scan over those, not applying bitmap filters whereas vectorization would already work with bitmaps. 
   I don't know whether this assumption is true, but if feels to me that this is where the speedups might be the biggest. We do see speedups to 3x for timeseries across the board and 2x for groupby queries.
   The only exception is that we have this very often used class of queries, namely groupby's with an order by clause on a measure which don't profit from vectorization at all because they seem to be bound by the data volume streamed back from the historicals.
   To speed up this important class of queries we very much depend on the "forceLimitPushDown' feature which unfortunately hasn't been working for SQL queries for a while (since 0.11.0).
   This in combination with the vectorization engine would be blow away amazing.    
   
   wrt the above perf issue: 
   what you describe sounds reasonable.
   Let me add that it is very easy for us to reproduce this performance degradation. It does not matter which kind of groupbyV2 query we use. We basically saw it happen with pretty much any query. With topn it seems that the query needs to have a certain weight, like a couple of measures and I more than one sequential pass.
   You should be able to reproduce this on your end easily. Do you not experience any performance penalty in topNs at all when using SQL compatible null handling?
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] a2l007 commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
a2l007 commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-585418475
 
 
   This might be related: https://github.com/apache/druid/pull/8822

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
gianm edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591671621
 
 
   It could be that something about your sequence of queries defeats de-virtualization or inlining as described on https://shipilev.net/blog/2015/black-magic-method-dispatch/. Maybe something that was monomorphic or bimorphic with only topNs in the mix becomes megamorphic once groupBys are in the mix as well (due to different usage patterns of some shared component). I don't think I can speculate about what component that might be, since there are a lot of shared components and some of them are pretty complex. It might not even have anything to do with groupBy — it's possible that your groupBy queries are using some features (specific aggregators, filters, etc) that your topNs aren't using.
   
   But if this is what's going on, then vectorization should fix it, since it makes function call overhead less of an issue. So we should continue with our program of vectorizing everything (topN isn't yet).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
sascha-coenen commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591636541
 
 
   > Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs
   > My guess at this point is this an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario
   
   Let me explain again, perhaps I can be more clear:
   From the point that a Druid historical executes a single groupbyV2 query for the first time, the execution of any later topN queries is much slower than before and remains at that deteriorated performance forever. Prior to the execution of a groupbyV2 query, the topN queries are consistently fast and remain so forever. These are steady-states. Within a mixed workload before or after, the topN queries will show the same degradation. It's not a dynamic temporary effect.
   
   If you look at the segment scan time graph, you see that scan times for topN queries are very consistent before and after the single execution of a groupBy query, just much higher afterwards. 
   This graph would continue in this static state forever. 
   Also, the topN peformance prior to executing a first groupBy query is thee same as in a Druid cluster configured without SQL-compatible null handling.
   
   So this is not an edge case by any means: If I were to switch the SQL-compatible null handling on in our production cluster which is running mixed workloads, the moment a groupby query would get executed for the first time, query performance for topNs would plummet considerably and not recover.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] suneet-s edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
suneet-s edited a comment on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591551174
 
 
   Thanks for the flamegraphs! I read the code last night and based on the flamegraphs, the best thing I can think of that might be happening is maybe something in the hotspot is less optimal after the group by query is run. The buffers are shared across the TopNQueryEngine and the GroupByQueryEngine, so somehow running a group by query is impacting the topN performance.
   
   Since it's this low level, I'd also question the reality of the scenario. It might be less visible if the performance tests were running a mixture of queries in both runs.
   
   **My guess at this point is this an edge case, and I don't think we'd see a huge difference if this were a mixed load scenario**
   
   I'm also attaching screenshots from your flamegraphs that highlight what you pointed out, so others with more knowledge than me can either confirm these suspicions or suggest what might actually be happening.
   
   When only topN queries were run - the time spent in NullableNumericBuffer#aggregate was mainly spent in reading the bytes (LZ4...)
   <img width="1428" alt="topN-before-groupBy" src="https://user-images.githubusercontent.com/44787917/75371555-cfdfe500-587b-11ea-98ca-d0ef75d6d7ee.png">
   
   After a groupBy query was run, and topN queries were profiled - NullableNumericBuffer#aggregate has a bunch of time spent where it isn't reading the bytes, but doing something else
   <img width="1423" alt="topN-after-groupBy" src="https://user-images.githubusercontent.com/44787917/75371559-d2423f00-587b-11ea-819b-96502be7012f.png">
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org


[GitHub] [druid] gianm commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled

Posted by GitBox <gi...@apache.org>.
gianm commented on issue #9321: Performance degradation in topN queries when SQL-compatible null handling is enabled
URL: https://github.com/apache/druid/issues/9321#issuecomment-591671621
 
 
   It could be that something about calling groupBy defeats de-virtualization or inlining as described on https://shipilev.net/blog/2015/black-magic-method-dispatch/. Maybe something that was monomorphic or bimorphic with only topNs in the mix becomes megamorphic once groupBys are in the mix as well (due to different usage patterns of some shared component). I don't think I can speculate about what component that might be, since there are a lot of shared components and some of them are pretty complex. It might not even have anything to do with groupBy — it's possible that your groupBy queries are using some features (specific aggregators, filters, etc) that your topNs aren't using.
   
   But if this is what's going on, then vectorization should fix it, since it makes function call overhead less of an issue. So we should continue with our program of vectorizing everything (topN isn't yet).

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org