You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/20 06:35:17 UTC

[GitHub] [druid] paul-rogers edited a comment on issue #11933: Discussion: operator structure for Druid queries?

paul-rogers edited a comment on issue #11933:
URL: https://github.com/apache/druid/issues/11933#issuecomment-974603066


   ### Prototype Based on Sequences
   
   Per a suggestion by @cheddar, did an experiment using the `Sequence`/`Yielder` pair. The question is: if we used straightforward classes, could the code based on `Sequence` be as simple as the prototype based on the operator protocol? Is the reason for stack clutter and extra code due more to excessive use of inner classes and lambdas than to the `Sequence` abstraction itself?
   
   The result is a [limit sequence](https://github.com/paul-rogers/druid/blob/ad5147a72ee77fd8e86655ea4e03c7e64e88f9b5/processing/src/main/java/org/apache/druid/query/scan/ScanQueryLimitSequence.java), and a [base transform sequence](https://github.com/paul-rogers/druid/blob/ad5147a72ee77fd8e86655ea4e03c7e64e88f9b5/processing/src/main/java/org/apache/druid/query/scan/TransformSequence.java). Didn't refactor all of the scan query, just did the scan query limit operation. 
   
   This one little experiment suggests that, in fact, it probably is the `Sequence`/`Yielder` pattern that leads to complex code. In fact, the current limit implementation might actually be about as simple as we an make it, if we are required to use the `Sequence`/`Yielder` pattern.
   
   First, let's describe the new prototype, then later comments will outline what was learned.
   
   After fiddling about, the simplest implementation was to use the previous [operator](https://github.com/paul-rogers/druid/blob/pipe/processing/src/main/java/org/apache/druid/query/pipeline/ScanResultLimitOperator.java) version as the basis for the new prototype. The reason is that the operator code is about as simple as one can make a limit operation. That code was refactored from the original [limit iterator](https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryLimitRowIterator.java) by replacing the original  `Sequence`/`Yielder` input with an input operator.
   
   Then, to turn the prototype limit into a `Sequence`, the code added the sequence methods `accumulate()` and `toYielder()`. Of these, `toYielder()` seems to especially complex. The easiest path seemed to be to adapt the code from `BaseSequence` to implement both `accumulate()` and `toYielder()` in terms of the iteration methods from the prior code.
   
   The result is that our limit sequence is a simple iterator over batches (doing the needed counting, and truncating the last batch when needed), along with a bunch of extra overhead to implement the `Sequence`/`Yielder` protocol.
   
   In fact, it is probably simpler to split these two concepts, as in the first prototype: an operator is fancy iterator: super simple. If you want a sequence, just wrap the operator in a standard sequence wrapper class. The result would probably be simpler than the current prototype: we'd use composition rather than the inheritance used to convert the limit into a sequence.
   
   We can now see why the current Druid limit code is already close to optimal, given the `Sequence`/`Yielder` requirement. It is an iterator wrapped in a reusable sequence class that provides a standard yielder.
   
   This also suggests why Druid uses so many extra knick-knacks at runtime. One often needs an iterator-like thing that does the work, a sequence wrapper around the iterator, and a yielder/yielding accumulator wrapper around the sequence. For. Every. Transform.
   
   Bottom line: the core limit operation is simple. It becomes bogged down, however, when it has to implement the `Sequence`/`Yielder` protocol. The operator prototype shows that code is actually not needed. Druid would be simpler if we just removed it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@druid.apache.org
For additional commands, e-mail: commits-help@druid.apache.org