Posted to commits@druid.apache.org by GitBox <gi...@apache.org> on 2021/11/17 21:08:51 UTC

[GitHub] [druid] paul-rogers commented on issue #11933: Discussion: operator structure for Druid queries?

URL: https://github.com/apache/druid/issues/11933#issuecomment-972029941


   @cheddar, thanks for the comments. I'm impressed that you dug deep into the code! The posted description is kind of high level because I didn't really expect folks to dive in as far as you did. So, let's go into the details. I suppose what we really need is a table that maps the existing concepts to the corresponding operator concepts. I'll add that at some point.
   
   First, let's discuss some of the prototype bits you mentioned to give context for anyone else reading along.
   
   ### Prototype Overview
   
   You did find the key abstraction: the [`Operator`](https://github.com/paul-rogers/druid/blob/pipe/processing/src/main/java/org/apache/druid/query/pipeline/Operator.java) interface. This is just an iterator with `start()` and `close()` methods. In that sense, `Operator` is like a `Sequence`/`Yielder` pair. The difference is that the code to process a batch of records is located entirely in the `Operator.next()` method, meaning that the `Operator` implementation can hold any required state as member variables.
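
   To make this concrete for anyone reading along, here is roughly the shape being described. This is a hedged sketch, not the prototype's exact interface (the linked file has the real thing):

   ```java
   import java.util.Iterator;

   // Sketch of an operator: just an iterator with explicit lifecycle methods.
   // Any state an implementation needs lives in plain member variables.
   public interface Operator<T> extends Iterator<T>
   {
     void start();  // acquire resources, start children
     void close();  // release resources, close children
   }
   ```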
   
   Once we have an `Operator` construct, we can pretty easily [churn out those needed by the Scan Query](https://github.com/paul-rogers/druid/tree/pipe/processing/src/main/java/org/apache/druid/query/pipeline). In this case, the list of operators was created by refactoring the existing code to move the execution bits into an operator class. Basic [unit tests](https://github.com/paul-rogers/druid/tree/pipe/processing/src/test/java/org/apache/druid/query/pipeline) exist for each operator. (One of the nice things about the operator model is that they are easy to test.)
   
   Operators are entirely run-time concepts: they just start, process batches, and close. `QueryRunner`s, as noted, actually do quite a bit of planning. So, the plan-time component moves into a planner abstraction. The planner here is a quick refactor of the scan query `QueryRunner`s into the [HistoricalQueryPlannerStub](https://github.com/paul-rogers/druid/blob/pipe/server/src/main/java/org/apache/druid/server/pipeline/HistoricalQueryPlannerStub.java). This is mostly a proof of concept, created in parallel with the operators: the run-time stuff went into operators, the plan-time stuff into the planner.
   
   The planner works roughly like the `QueryRunner`s do: it is a recursive set of functions, each of which builds its operator given its child and returns it, much as each `QueryRunner` calls its child's `run()` and then wraps another `Sequence` around the result. Here, however, we build the operators at plan time and run them at run time (both of which are just two steps within the overall "run a scan query" task).
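
   As a sketch, the plan-time recursion looks something like the following. All names here (`ScanPlan`, `planReader()`, `Row`) are hypothetical stand-ins, not the prototype's classes:

   ```java
   // Illustrative only: plan the child, wrap it in the matching operator,
   // return the result. Run time then simply iterates the root operator.
   Operator<Row> planScan(ScanPlan plan)
   {
     Operator<Row> child = planReader(plan);  // leaf: reads segment batches
     if (plan.hasLimit()) {
       child = new LimitOperator<>(child, plan.limit());  // sketched further below
     }
     return child;
   }
   ```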
   
   There are two more minor bits. The first is the fragment runner, which ensures operators are closed whether the query succeeds or fails. The second is the `FragmentContext`, which is just a holder for all the stuff that an operator might need to do its work. Since the current code is a refactoring, the context holds only the query ID and the response context, which is all the current code uses. The fragment context is simply a way to avoid passing common state into every operator constructor.
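
   A hedged sketch of the fragment runner's contract (names illustrative, not the prototype's): run the root operator, and close every operator no matter what happens.

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.function.Consumer;

   // Illustrative sketch: registers operators and guarantees close() on all
   // of them, success or failure.
   public class FragmentRunnerSketch
   {
     private final List<Operator<?>> operators = new ArrayList<>();

     public <T> Operator<T> add(Operator<T> op)
     {
       operators.add(op);
       return op;
     }

     public <T> void run(Operator<T> root, Consumer<T> rowConsumer)
     {
       try {
         root.start();
         while (root.hasNext()) {
           rowConsumer.accept(root.next());
         }
       }
       finally {
         // Runs even if start() or next() threw.
         for (Operator<?> op : operators) {
           op.close();
         }
       }
     }
   }
   ```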
   
   ### Why are we Considering This?
   
   Let's tackle the main question first: why are we bothering with this prototype? This motivates everything else. There are several reasons, as outlined in the proposal.
   
   The first thing to note is that the operator concept is not a new idea: it is the standard approach in a bunch of other tools (Spark, Drill, Presto/Trino, Impala, and many others). So, we're just trying to draft off of what others have learned.
   
   * Operators are independent of one another. An operator accepts inputs, does its thing, and returns results. It does not care what the input is. By contrast, a `QueryRunner` has to know what child to create (often via a helper such as a toolchest.)
   * Separation of planning and execution. Both occur within a single query request. In the current code, the two operations are mixed. With the operator model, a planner decides which operators are needed, then the operators go off and run the query. Classic separation of concerns.
   * Operators are stateful. Almost all query operators, in any tool, are stateful. Limits need a row count. Merges need to keep track of the next row from each input. And so on. Operators provide a simple, clean way to hold that necessary state. The current implementation has state, of course, but it is added via closures, lambdas and all manner of cleverness.
   * Operators are simple. The operator model has evolved over many years and many projects. The result is the simplest possible way to get the job done, often with very few classes or other constructs. Simple code is faster, easier to understand, and easier to test. Again, a classic observation.
   * Simpler to debug. The stack trace shows that the runtime code is simpler: the actual operations are front and center.
   * Perhaps more efficient: Smaller amounts of code and a smaller runtime stack can mean less overhead. A good next step would be to run an actual performance test to determine if there is any actual benefit. (It could be that most of the time is spent in I/O, so reducing CPU won't help.)
   * Easier to test. Each operator is independent of the others. We can test a limit operator, say, by using a mock data source ("reader"): the limit has no way to know (or care) where its data comes from. This is what allows other query engines to have robust, detailed unit tests for each operator; see the test sketch just after this list.
   * Reusable. Because operators are agnostic about their inputs, they are reusable. Of course, the current implementation could also be made more reusable, but operators are inherently so because they are designed to limit dependencies.
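
   To illustrate the testing point, here is a hedged sketch of what such a unit test can look like. The mock reader is defined inline; `LimitOperator` is hypothetical here (a minimal version is sketched later in this comment):

   ```java
   import static org.junit.Assert.assertEquals;

   import org.junit.Test;

   // Sketch: the limit operator cannot tell that its input is a mock.
   public class LimitOperatorTest
   {
     @Test
     public void limitStopsAtTen()
     {
       // Mock "reader": emits the integers 0..99.
       Operator<Integer> reader = new Operator<Integer>()
       {
         private int row;

         @Override public void start() { row = 0; }
         @Override public void close() { }
         @Override public boolean hasNext() { return row < 100; }
         @Override public Integer next() { return row++; }
       };
       Operator<Integer> limit = new LimitOperator<>(reader, 10);
       limit.start();
       int count = 0;
       while (limit.hasNext()) {
         limit.next();
         count++;
       }
       limit.close();
       assertEquals(10, count);
     }
   }
   ```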
   
   The point of this discussion is: can we get some of this operator goodness for Druid?
   
   ### Answers
   
   With that background, let's answer the questions you raised.
   
   *Different implementation*: The prototype is a refactoring of the existing scan query code, as described above. Since the planner works differently than the toolchest and friends, it was not possible to use them directly in the prototype. Instead, the prototype uses the parts that work as-is and makes a refactored copy of the bits that need to change to work in the context of the new planner. Basically, the parts that build `QueryRunner`s, such as `mergeResults()`, are mapped into a different, but parallel, implementation in the prototype's operator planner. If we were to proceed, that code should move into the toolchest, since the toolchest represents the query-type behavior. Elsewhere, since this is a refactoring, I kept whatever I could, even if that means the prototype is somewhat more scattered than a final implementation would be.
   
   *Operator vs. Sequence*: Clearly, the `Sequence` idea was meant to represent the result set, without the caller actually having to work with the rows in the result set. The accumulate method would handle any per-row operations. Hence, `Sequence` reifies (represents as an object) the result set, but not the operations on that result set.
   
   By contrast, the `Operator` model reifies (represents as an object) operations, while leaving the result set as a non-reified abstraction. This is important when working with billions of rows: one can never hope to hold the whole result set; instead, one builds a pipeline that the rows flow through. Hence, the focus on the "pipe fittings", not the water. (Storm made this idea very explicit back in the day.)
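
   The contrast shows up directly in code. The following is a sketch with simplified signatures (`Row` is a stand-in type): the `Sequence` style pushes a per-row function into the reified result set, while the operator style pulls rows through a reified operation.

   ```java
   // Sequence style: the result set is the object; hand it your per-row logic.
   static long countViaSequence(Sequence<Row> rows)
   {
     return rows.accumulate(0L, (count, row) -> count + 1);
   }

   // Operator style: the operation is the object; pull rows through it.
   static long countViaOperator(Operator<Row> op)
   {
     op.start();
     long count = 0;
     try {
       while (op.hasNext()) {
         op.next();
         count++;
       }
     }
     finally {
       op.close();
     }
     return count;
   }
   ```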
   
   Which is "right"? Neither. The question is, which is fastest and simplest. Hence this discussion.
   
   *Semantic difference*: To see the difference in the two approaches, one has to pop up from the code to see the design. We've discussed the design above. The proposal separates planning and execution. It also combines a pile of sequences, yielders, lambdas, closures and other knick-knacks into a simple, clean operator structure. Why? For the reasons above: simpler to understand, easier to test, easier to debug, easier to reuse.
   
   Said another way, both the current design and the prototype run a scan query. In that sense they are the "same." The proposal suggests how we can do the same work with far less code and complexity by accepting that a query is a pipeline of stateful operators.
   
   The current code tries to follow the functional programming model to be stateless. But, since queries are, in fact, stateful, the code becomes very complex: state is often added via the "back door" of closures. The operator model accepts that queries have state and manages that state cleanly.
   
   *Stack trace is so much cleaner*: you are on the right track. The reason the stack trace is simpler is that, at runtime, we do only the required work: no overhead, no unnecessary abstractions. A limit operator counts its rows and says it's done when it's done. That's all, and it can be implemented right in the operator, as the sketch below shows. Since operators are very lightweight, there is no reason to have multiple layers of abstraction. The stack trace is smaller because operators are the KISS solution for queries.
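
   For instance, a minimal limit operator under the interface sketched earlier might look like this (an illustration, not the prototype's actual class):

   ```java
   // All of the limit's state is one member variable; no closures or
   // wrappers needed.
   public class LimitOperator<T> implements Operator<T>
   {
     private final Operator<T> child;
     private final long limit;
     private long rowCount;

     public LimitOperator(Operator<T> child, long limit)
     {
       this.child = child;
       this.limit = limit;
     }

     @Override public void start() { child.start(); }
     @Override public boolean hasNext() { return rowCount < limit && child.hasNext(); }
     @Override public T next() { rowCount++; return child.next(); }
     @Override public void close() { child.close(); }
   }
   ```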
   
   Yes, if someone were to implement operators by adding multiple layers of abstraction, the result would be ugly. The point is, those layers are not necessary. Every operator has one very simple task: get some input, do a transform, produce an output. If you want to do two different things, create two operators. We can do this because the planner chooses which operators to include: we don't include operators we don't need, and no operator has to be two things at once. The result is a very simple runtime stack. And, as we've all learned, the name of the game in performance is to minimize the amount of code in the per-row path.
   
   *WrappingSequence, etc.*: You point out the many uses of wrapping sequences. Yes, those are a pain. Many exist only to do something at sequence start or end, but they end up having to insert themselves into the per-batch stream to do it; see the sketch below. (We're really in the weeds now.)
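
   For contrast, here is a hedged sketch of how such start/end work can live in an operator's lifecycle methods, leaving the per-row path untouched. `MetricsOperator` is a hypothetical example, not Druid code:

   ```java
   // Start-of-query and end-of-query hooks live in start()/close();
   // next() stays a plain pass-through.
   public class MetricsOperator<T> implements Operator<T>
   {
     private final Operator<T> child;
     private long startNs;

     public MetricsOperator(Operator<T> child)
     {
       this.child = child;
     }

     @Override public void start()
     {
       startNs = System.nanoTime();  // start-of-query hook
       child.start();
     }

     @Override public boolean hasNext() { return child.hasNext(); }

     @Override public T next() { return child.next(); }  // per-row path untouched

     @Override public void close()
     {
       child.close();
       // End-of-query hook: report elapsed time.
       System.out.println("elapsed ns: " + (System.nanoTime() - startNs));
     }
   }
   ```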
   

