You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2022/02/07 07:43:08 UTC

[GitHub] [druid] paul-rogers commented on issue #11933: Discussion: operator structure for Druid queries?

paul-rogers commented on issue #11933:
URL: https://github.com/apache/druid/issues/11933#issuecomment-1031161332

Did a bit of performance testing. Created a mock storage adapter and mock cursor that simulates a Wikipedia segment, but just makes up 5M rows of data. Compared performance of operators and yielders. In the default mode, the scan query uses the list-of-maps format, and map creation takes a long time (about 1/4 of the run time), so switched to a compact list.

As it turns out, under this scenario, with 245 batches, the operators and yielders take about the same amount of time: about 490 ms average over 100 runs, after 100 runs of warm-up.

The 245 batches does not place much emphasis on the parts that differ: the per-batch overhead. The default batch size is 5 * 4K rows, so reduced batch size to 1000 rows, so we get 5000 batches.

With this change, the difference in the two approaches becomes clearer. Again, times averaged over 100 queries:

Operators: 499 ms.
Yielders: 607 ms.

This shows that the reduced number of function call levels in the operator solution takes 82% of the time of stack of sequences and yielders doing the same work.

To be fair, in a normal query, testing suggests that the call yielder stack overhead is lost in the noise of the actual work of the query. So, while operators might be a bit faster, the real win is in the code simplification, and the ability to compose operators beyond what the query runners can do today.

FWIW, the setup is an ancient Intel I7 with 8 cores, running Linux Mint (Ubuntu), Java 14. A python client sent a native scan query to a historical running in the IDE. The query itself selects three columns from one segment with no filters. This is not the optimal performance setup, but all we're interested in is relative costs.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org