Posted to dev@druid.apache.org by Jason Koch <jk...@netflix.com.INVALID> on 2021/11/22 17:18:32 UTC

Re: Push-down of operations for SystemSchema tables

Hi Gian

It looks like I have a solution that works, and I'd like to present it for
your thoughts.

https://github.com/apache/druid/compare/master...jasonk000:segment-table-interim?diff=unified

At runtime, the DruidQuery attempts to construct a ScanQuery. To do so, it
checks with the underlying datasource whether a scan is supported, and if
it is, it constructs a ScanQuery to match and passes it on. Later,
ScanQueryEngine will look at whether the adapter is a SortedCursorFactory
and if it is, it will request the cursors to be constructed and pass
through all relevant sort/limit/filter operations to the underlying table logic.
The VirtualTable implementation can then perform its magic and return a result!
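Sketched as minimal, self-contained code (the types below are simplified stand-ins for the branch's classes, not the real Druid interfaces):

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Stand-in for a plain cursor factory: rows come back in storage order.
interface CursorFactory {
    List<String> makeCursors();
}

// Stand-in for the proposed SortedCursorFactory: the table itself applies
// filter, sort, and limit before handing rows back.
interface SortedCursorFactory extends CursorFactory {
    List<String> makeCursors(Predicate<String> filter, Comparator<String> order, long limit);
}

final class ScanEngine {
    // Mirrors the described check: push the work down when the adapter
    // supports it, otherwise apply filter/sort/limit on top of a plain scan.
    static List<String> run(CursorFactory factory, Predicate<String> filter,
                            Comparator<String> order, long limit) {
        if (factory instanceof SortedCursorFactory sorted) {
            return sorted.makeCursors(filter, order, limit);
        }
        return factory.makeCursors().stream()
                .filter(filter).sorted(order).limit(limit)
                .collect(Collectors.toList());
    }
}
```

The real branch works with Druid cursors and ScanQuery specs rather than plain strings, but the dispatch shape is the same.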

The solution is:
- Create a new VirtualDataSource, a new VirtualSegment and
VirtualStorageAdapter, as well as a new SortedCursorFactory to support the
VirtualDataSource.
- System tables have a DruidVirtualTableRule that will convert a
LogicalTableScan of a VirtualTable into a PartialDruidQuery whilst passing
the query authentication information.
- DataSource gets a function canScanOrdered so that it can advise the Query
parser whether it can handle a given sort function; the default is that only
the existing time-ordering is supported. Implementations (i.e.
VirtualDataSource) can decide to accept an ordered scan or not.
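The canScanOrdered idea in a simplified, self-contained form (the interface and method shapes here mirror the description above but are stand-ins, not Druid's actual DataSource API):

```java
import java.util.List;

// Stand-in for Druid's DataSource interface.
interface DataSource {
    // Default: only no ordering or the existing time-ordering is supported.
    default boolean canScanOrdered(List<String> orderByColumns) {
        return orderByColumns.isEmpty()
                || (orderByColumns.size() == 1 && "__time".equals(orderByColumns.get(0)));
    }
}

// A virtual (system-table-backed) datasource can accept any ordering,
// because its table logic performs the sort itself.
class VirtualDataSource implements DataSource {
    @Override
    public boolean canScanOrdered(List<String> orderByColumns) {
        return true;
    }
}
```

The query planner would consult this before emitting a ScanQuery with a non-time ordering, falling back to rejecting the plan otherwise.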

Overall, I think this provides a viable solution and meets your goals. In
rebasing this morning I saw you're on the same pathway with the ordered scan
query, so I could rebase on top of that and break this into a smaller set of
PRs; nonetheless, the conceptual approach and direction is something that I
think will work.

Thanks!
Jason






On Wed, May 19, 2021 at 9:54 PM Gian Merlino <gi...@apache.org> wrote:

> Hey Jason,
>
> It sounds like we have two different, but related goals:
>
> 1) Your goal is to improve the performance of system tables.
>
> 2) My goal with the branch Clint linked is to enable using Druid's native
> query engine for system tables, in order to achieve consistency in how SQL
> queries are executed and also so all of Druid's special functions,
> aggregations, extensions, etc, are available to use in system tables.
>
> Two notes about what I'm trying to do, btw, in response to things you
> raised. First, I'm interested in using Druid's native query engine under
> the hood, but I'm not necessarily driving towards being able to actually
> query system tables using native Druid queries. It still achieves my goals
> if these tables are only available in SQL queries. Second, you're correct
> that for this to work for queries with 'order by' but no 'group by', we'd
> need to add ordering support to the Scan query.
>
> That's actually the main reason I stopped working on this branch: I started
> looking into Scan ordering instead. Then I got distracted with other stuff,
> and now I'm working on neither of those things 🙂. Anyway, I think it'll be
> somewhat involved to implement Scan ordering in a scalable way for any
> possible query on any possible datasource, but if we're focused on sys
> tables, we can take a shortcut that is less-scalable. It wouldn't be tough
> to make something that works for anything that works today, since the
> bindable convention we use today simply does the sort in memory (see
> org.apache.calcite.interpreter.SortNode). That might be a good way to
> unblock the sys-table-via-native-engine work. We'd just need some safeguard
> to prevent that code from executing on real datasources that are too big to
> materialize. Perhaps a row limit, or perhaps enabling in-memory ordering
> using an undocumented Scan context parameter set by the SQL layer only for
> sys tables.
>
> > I am interested. For my current work, I do want to keep focus on the
> > sys.* performance work. If there's a way to do it and lay the
> > groundwork or even get all the work done, then I am 100% for that.
> >  Looking at what you want to do to convert these sys.* to native
> > tables, if we have a viable solution or are comfortable with my
> > suggestions above I'd be happy to build it out.
>
> I think the plan you laid out makes sense, especially part (3). Using a
> VirtualDruidTable that 'represents' a system table, but hasn't yet actually
> built an Iterable, would allow us to defer creation of the
> VirtualDataSource until later in the planning process. That'll enable more
> kinds of pushdown via rules, like you mentioned. Coupling that with an
> in-memory sort for the Scan query — appropriately guarded — I think would
> get us all the way there. Later on, as a separate project, we'll want to
> extend the Scan ordering to scale beyond what can be done in memory, but
> that wouldn't be necessary for the system tables.
>
> Btw, I would suggest not removing the bindable stuff completely. I did that
> in my branch, but I regret it, since I think the change is risky enough
> that it should be an option that is opt-in for a while. We could rip out
> the old stuff after a few releases, once we're happy with the stability of
> the new mechanism.
>
> What do you think?
>
> On Fri, May 14, 2021 at 2:51 PM Jason Koch <jk...@netflix.com.invalid>
> wrote:
>
> > @Julian - thank you for review & confirming.
> >
> > Hi Clint
> >
> > Thank you, I appreciate the response. I have responded inline with some
> > questions; I've also restated things in my own words to confirm that I understand
> > ...
> >
> > > In the mid term, I think that some of us have been thinking that moving
> > > system tables into the Druid native query engine is the way to go, and have
> > > been working on resolving a number of hurdles that are required to make
> > > this happen. One of the main motivators to do this is so that we have just
> > > the Druid query path in the planner in the Calcite layer, and deprecating
> > > and eventually dropping the "bindable" path completely, described in
> > > https://github.com/apache/druid/issues/9896. System tables would be pushed
> > > into Druid Datasource implementations, and queries would be handled in the
> > > native engine. Gian has even made a prototype of what this might look like,
> > > https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native
> > > since much of the ground work is now in place, though it takes a hard-line
> > > approach of completely removing bindable instead of hiding it behind a
> > > flag, and doesn't implement all of the system tables yet, at least last
> > > time I looked at it.
> >
> > Looking over the changes it seems that:
> > - a new VirtualDataSource is introduced, which the Druid non-SQL
> > processing engine can process, that can wrap an Iterable. This exposes a
> > lazy segment & iterable using InlineDataSource.
> > - the SegmentsTable has been converted from a ScannableTable to a
> > DruidTable, and a ScannableTableIterator is introduced to generate an
> > iterable containing the rows; the new VirtualDataSource can be used to
> > access the rows of this table.
> > - finally, the Bindable convention is discarded from DruidPlanner and
> > Rules.
> >
> > > I think there are a couple of remaining parts to resolve that would make
> > > this feasible. The first is that native scan queries need support for
> > > ordering by arbitrary columns, instead of just time, so that we can retain
> > > the capabilities of the existing system tables.
> >
> > It seems you want to use the native queries to support ordering; do
> > you mean here the underlying SegmentsTable, or something in the Druid
> > engine? Currently, the SegmentsTable etc relies on, as you say, the
> > bindable convention to provide the sort. If it were a DruidTable, then it
> > seems that sorting gets pushed into PartialDruidQuery->DruidQuery,
> > which conceptually is able to do a sort, but as described in [1] [2]
> > the ordering is not supported by the underlying Druid engine [3].
> >
> > This would mean that an order by, sort, limit query would not be
> > supported on any of the migrated sys.* tables until Druid has a way to
> > perform the sort on a ScanQuery.
> >
> > [1] https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> > [2] https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> > [3] https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
> >
> > > This isn't actually a blocker
> > > for adding native system table queries, but rather a blocker for replacing
> > > the bindable convention by default, so that there isn't a loss (or rather
> > > trade) of functionality. Additionally, I think there are some matters
> > > regarding authorization of system tables when handled by the native engine
> > > that will need to be resolved, but this can be done while adding the native
> > > implementations.
> >
> > It looks like the port of the tables from classic ScannableTable to a
> > DruidTable itself is straightforward. However, it seems this PR
> > doesn't bring them across from SQL domain to be available in any
> > native queries. I'm not sure if this is expected or an interim step or
> > if I have misunderstood the goal.
> >
> > > I think there are various ideas and experiments underway on how to do
> > > sorting on scan queries at normal Druid datasource scale, which is sort of
> > > a big project, but in the short term we might be able to do something less
> > > ambitious that works well enough at system-table scale to allow this plan
> > > to fully proceed.
> >
> > One possible way, that I think leads in the correct direction:
> > 1) We have an existing rule for LogicalTable with DruidTable to
> > DruidQueryRel which can eventually construct a DruidQuery.
> > 2) The VirtualDataSource, created during SQL parsing, takes an
> > already-constructed Iterable; so, we need to have already performed
> > the filter/sort before creating the VirtualDataSource (and
> > DruidQuery). This means the push-down filter logic has to happen
> > during sql/ stage setup and before handoff to processing/ engine.
> > 3) Perhaps a new VirtualDruidTable subclassing DruidTable w/ a
> > RelOptRule that can identify LogicalXxx above a VirtualDruidTable and
> > push down? Then, our SegmentTable and friends can expose the correct
> > Iterable. This should allow us to solve the perf concerns, and would
> > allow us to present a correctly constructed VirtualDataSource. Sort
> > from SQL _should_ be supported (I think) as the planner can push the
> > sort etc down to these nodes directly.
> >
> > In this, the majority of the work would have to happen
> > prior to the Druid engine, in sql/, before reaching Druid, and so Druid
> > core doesn't actually need to know anything about these changes.
> >
> > On the other hand, whilst it keeps the pathway open, I'm not sure this
> > does any of the actual work to make the sys.* tables available as
> > native tables. If we are to try and make these into truly native
> > tables, without a native sort, and remove their implementation from
> > sql/, the DruidQuery in the planner would need to be configured to
> > pass the ScanQuery sort to the processing engine _but only for sys.*
> > tables_ and then processing engine would need to know how to find
> > these tables. (I haven't explored this). As you mention, implementing
> > native sort across multiple data sources seems like a more ambitious
> > piece of work.
> >
> > As another idea, we could consider creating a bridge
> > Bindable/EnumerableToDruid rule that would allow Druid to embed these
> > tables, and move them out of sql/ into processing/, exposed as
> > Iterable/Enumerable, and make them available in queries if that is a
> > goal. I'm not really sure that adds anything to the overall goals
> > though.
> >
> > > Does this approach make sense? I don't believe Gian is actively working
> > on
> > > this at the moment, so I think if you're interested in moving along
> this
> > > approach and want to start laying the groundwork I'm happy to provide
> > > guidance and help out.
> > >
> >
> > I am interested. For my current work, I do want to keep focus on the
> > sys.* performance work. If there's a way to do it and lay the
> > groundwork or even get all the work done, then I am 100% for that.
> > Looking at what you want to do to convert these sys.* to native
> > tables, if we have a viable solution or are comfortable with my
> > suggestions above I'd be happy to build it out.
> >
> > Thanks
> > Jason
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@druid.apache.org
> > For additional commands, e-mail: dev-help@druid.apache.org
> >
> >
>

Re: Push-down of operations for SystemSchema tables

Posted by Paul Rogers <pa...@imply.io>.
Hi All,

Cool feature! I got bit by the “system tables don’t fully support SQL” thing myself. Thanks for helping to improve this area.

Jumping in here because I’m toying with an idea that might help with this feature. See https://github.com/apache/druid/issues/11933. Basically, the idea is to refactor the current query code to use the standard operator concept found in many other tools. A prototype of the idea exists which, it turns out, converts the Scan Query to operator form, hence my interest in this thread.

How would that help here? Rather than implement one-off solutions for sorting and filtering in system tables, we could simply insert standard filter and sort operators on top of the cursor scan for operations which the storage adapter does not itself support. This is a very common trick in other tools.

In practice, this means that the operator-equivalent of the ScanQueryEngine would check which features the data source supports. The engine would delegate all that it can, as with the segment storage adapter today. Any remaining unmet needs would be handled by inserting the required operator.
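A minimal sketch of that pattern, with simplified stand-in types (not the prototype's real operator API): the engine delegates what the adapter advertises it can do, and inserts generic operators for the rest.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Stand-in for a storage adapter that advertises its capabilities.
interface StorageAdapter {
    Stream<String> scan();          // rows, possibly pre-filtered/pre-sorted
    boolean canFilter();            // adapter applies the filter itself?
    boolean canSort();              // adapter returns rows already ordered?
}

final class ScanPlanner {
    // Delegate what the adapter supports; insert operators for unmet needs.
    static List<String> plan(StorageAdapter adapter, Predicate<String> filter,
                             Comparator<String> order) {
        Stream<String> rows = adapter.scan();
        if (!adapter.canFilter()) {
            rows = rows.filter(filter); // insert a filter operator on top
        }
        if (!adapter.canSort()) {
            rows = rows.sorted(order);  // insert a sort operator on top
        }
        return rows.collect(Collectors.toList());
    }
}
```

The segment storage adapter would report filtering as supported, while a simple system-table adapter would get both operators inserted above its cursor scan.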

It would be cool, in the long term, for Calcite to plan out the details of the query DAG, as it does in other engines. In that world, Calcite would do the insertion of needed operators. (Actually, in Calcite, it would start with the needed logical operators, then push down those that the adapter can handle.) Until we get there, the Calcite plan is converted to a native query, and adding full support to the scan query for system tables might be a quicker short-term solution.

In this model, the SQL engine can be less aware of the differences between storage adapters: we’d rely on Druid’s “just in time planning” (i.e. QueryRunner) mechanism in native queries to handle those details. To SQL, a scan is a scan regardless of the storage adapter.

I’ve not looked at non-segment storage adapters. Perhaps some already have the code needed to implement sorting and filtering. If so, we simply refactor that into operator form to make it reusable. It also seems that there is already a merge sort in the scan query “stack” that, once made into a reusable operator (already done in the prototype mentioned in that issue), could be applied to this problem. (As noted, system tables are small, so an in-memory sort is probably OK to start with.)
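In spirit, such a reusable merge might look like the sketch below (a self-contained k-way merge of already-sorted inputs, written as a pull-style operator; this is an illustration, not the prototype's actual code):

```java
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.PriorityQueue;

// K-way merge of already-sorted inputs, as a pull-style operator.
final class MergeOperator<T> implements Iterator<T> {
    // One heap entry per input: its buffered head element plus the remainder.
    private static final class Head<T> {
        T value;
        final Iterator<T> rest;
        Head(T value, Iterator<T> rest) { this.value = value; this.rest = rest; }
    }

    private final PriorityQueue<Head<T>> heap;

    MergeOperator(List<Iterator<T>> inputs, Comparator<T> order) {
        heap = new PriorityQueue<>((a, b) -> order.compare(a.value, b.value));
        for (Iterator<T> it : inputs) {
            if (it.hasNext()) heap.add(new Head<>(it.next(), it));
        }
    }

    @Override public boolean hasNext() { return !heap.isEmpty(); }

    @Override public T next() {
        if (heap.isEmpty()) throw new NoSuchElementException();
        Head<T> h = heap.poll();      // smallest buffered head wins
        T out = h.value;
        if (h.rest.hasNext()) {       // refill from the same input, if any
            h.value = h.rest.next();
            heap.add(h);
        }
        return out;
    }
}
```

Since each input is consumed lazily, only one row per input is buffered at a time, which is what makes the operator reusable at both segment scale and system-table scale.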

This might be an interesting way to both prove the operator concept and help solve the system tables problem.

Sounds like Gian’s got some thoughts about where he wants to take this area. I think the operator structure could be orthogonal to his work, but I can’t quite tell from this e-mail thread.

Just tossing out the idea to see if it helps this discussion.

Thanks,

 - Paul


Re: Push-down of operations for SystemSchema tables

Posted by Gian Merlino <gi...@apache.org>.
Hey Jason,

The concept you sketched out looks promising! I hope to have a chance to
look into the code soon, but for now I just had a couple of questions based
on your email.

First: what do you think of the relative merits of having the CursorFactory
do the sort vs. having the Scan engine do the sort?

Second: I recall you were originally trying to speed up system tables and
help them scale better. Have you tried this patch to see if it helps in
that regard?

On Mon, Nov 22, 2021 at 9:19 AM Jason Koch <jk...@netflix.com.invalid>
wrote:

> Hi Gian
>
> It looks like I have a solution that works, and I'd like to present it for
> your thoughts.
>
>
> https://github.com/apache/druid/compare/master...jasonk000:segment-table-interim?diff=unified
>
> At runtime, the DruidQuery attempts to construct a ScanQuery. To do so, it
> checks with the underlying datasource whether a scan is supported, and if
> it is, it constructs a ScanQuery to match and passes it on. Later,
> ScanQueryEngine will look at whether the adapter is a SortedCursorFactory
> and if it is, it will request the cursors to be constructed and pass
> through all relevent sort/limit/filter to the underlying table logic.
> VirtualTable implementation can then perform its magic and return a result!
>
> The solution is:
> - Create a new VirtualDataSource, a new VirtualSegment and
> VirtualStorageAdapter, as well as a new SortedCursorFactory to support the
> VirtualDataSource.
> - System tables have a DruidVirtualTableRule that will convert a
> LogicalTableScan of a VirtualTable into a PartialDruidQuery whilst passing
> the query authentication information.
> - DataSource gets a function canScanOrdered so that it can advise the Query
> parser whether it can handle a given sort function - default is that only
> existing time-ordering is supported. Implementations (ie:
> VirtualDataSource) can decide to accept an ordered scan or not.
>
> Overall, I think this provides a viable solution and meets your goals. In
> rebasing this morning I see you're on the same pathway with ordered scan
> query, so I could rebase on top of that and break into a smaller set of
> PRs, nonetheless the conceptual approach and direction is something that I
> think will work.
>
> Thanks!
> Jason
>
>
>
>
>
>
> On Wed, May 19, 2021 at 9:54 PM Gian Merlino <gi...@apache.org> wrote:
>
> > Hey Jason,
> >
> > It sounds like we have two different, but related goals:
> >
> > 1) Your goal is to improve the performance of system tables.
> >
> > 2) My goal with the branch Clint linked is to enable using Druid's native
> > query engine for system tables, in order to achieve consistency in how
> SQL
> > queries are executed and also so all of Druid's special functions,
> > aggregations, extensions, etc, are available to use in system tables.
> >
> > Two notes about what I'm trying to do, btw, in response to things you
> > raised. First, I'm interested in using Druid's native query engine under
> > the hood, but I'm not necessarily driving towards being able to actually
> > query system tables using native Druid queries. It still achieves my
> goals
> > if these tables are only available in SQL queries. Second, you're correct
> > that for this to work for queries with 'order by' but no 'group by', we'd
> > need to add ordering support to the Scan query.
> >
> > That's actually the main reason I stopped working on this branch: I
> started
> > looking into Scan ordering instead. Then I got distracted with other
> stuff,
> > and now I'm working on neither of those things 🙂. Anyway, I think it'll
> be
> > somewhat involved to implement Scan ordering in a scalable way for any
> > possible query on any possible datasource, but if we're focused on sys
> > tables, we can take a shortcut that is less-scalable. It wouldn't be
> tough
> > to make something that works for anything that works today, since the
> > bindable convention we use today simply does the sort in memory (see
> > org.apache.calcite.interpreter.SortNode). That might be a good way to
> > unblock the sys-table-via-native-engine work. We'd just need some
> safeguard
> > to prevent that code from executing on real datasources that are too big
> to
> > materialize. Perhaps a row limit, or perhaps enabling in-memory ordering
> > using an undocumented Scan context parameter set by the SQL layer only
> for
> > sys tables.
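The "less-scalable shortcut" described above can be sketched as a guarded in-memory sort. The class and method names below are hypothetical, not part of Druid: materialize the scanned rows, refuse to continue past a safeguard row limit, then sort the buffer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper: sorts scanned rows in memory, guarded by a row limit
// so it is never applied to a datasource too large to materialize.
class InMemoryScanSorter
{
  static <T> List<T> sortWithGuard(Iterator<T> rows, Comparator<T> ordering, int maxRows)
  {
    final List<T> buffer = new ArrayList<>();
    while (rows.hasNext()) {
      if (buffer.size() >= maxRows) {
        throw new IllegalStateException("Cannot sort in memory: more than " + maxRows + " rows");
      }
      buffer.add(rows.next());
    }
    buffer.sort(ordering);
    return buffer;
  }
}
```

The guard is what makes this safe to wire up behind an undocumented context parameter or row limit: system tables stay well under any sensible maxRows, while a real datasource would trip the limit instead of exhausting memory.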
> >
> > > I am interested. For my current work, I do want to keep focus on the
> > > sys.* performance work. If there's a way to do it and lay the
> > > groundwork or even get all the work done, then I am 100% for that.
> > >  Looking at what you want to do to convert these sys.* to native
> > > tables, if we have a viable solution or are comfortable with my
> > > suggestions above I'd be happy to build it out.
> >
> > I think the plan you laid out makes sense, especially part (3). Using a
> > VirtualDruidTable that 'represents' a system table, but hasn't yet
> actually
> > built an Iterable, would allow us to defer creation of the
> > VirtualDataSource until later in the planning process. That'll enable more
> > kinds of pushdown via rules, like you mentioned. Coupling that with an
> > in-memory sort for the Scan query — appropriately guarded — I think would
> > get us all the way there. Later on, as a separate project, we'll want to
> > extend the Scan ordering to scale beyond what can be done in memory, but
> > that wouldn't be necessary for the system tables.
> >
> > Btw, I would suggest not removing the bindable stuff completely. I did
> that
> > in my branch, but I regret it, since I think the change is risky enough
> > that it should be an option that is opt-in for a while. We could rip out
> > the old stuff after a few releases, once we're happy with the stability
> of
> > the new mechanism.
> >
> > What do you think?
> >
> > On Fri, May 14, 2021 at 2:51 PM Jason Koch <jk...@netflix.com.invalid>
> > wrote:
> >
> > > @Julian - thank you for review & confirming.
> > >
> > > Hi Clint
> > >
> > > Thank you, I appreciate the response. I have responded inline with some
> > > questions; I've also restated things in my own words to confirm that I
> > > understand ...
> > >
> > > > In the mid term, I think that some of us have been thinking that
> moving
> > > > system tables into the Druid native query engine is the way to go,
> and
> > > have
> > > > been working on resolving a number of hurdles that are required to
> make
> > > > this happen. One of the main motivators to do this is so that we have
> > > just
> > > > the Druid query path in the planner in the Calcite layer, and
> > deprecating
> > > > and eventually dropping the "bindable" path completely, described in
> > > > https://github.com/apache/druid/issues/9896. System tables would be
> > > pushed
> > > > into Druid Datasource implementations, and queries would be handled
> in
> > > the
> > > > native engine. Gian has even made a prototype of what this might look
> > > like,
> > > >
> > >
> >
> https://github.com/apache/druid/compare/master...gianm:sql-sys-table-native
> > > > since much of the ground work is now in place, though it takes a
> > > hard-line
> > > > approach of completely removing bindable instead of hiding it behind
> a
> > > > flag, and doesn't implement all of the system tables yet, at least
> last
> > > > time I looked at it.
> > >
> > > Looking over the changes it seems that:
> > > - a new VirtualDataSource is introduced, which the Druid non-SQL
> > > processing engine can process and which can wrap an Iterable. It exposes
> > > a lazy segment & iterable using InlineDataSource.
> > > - the SegmentsTable has been converted from a ScannableTable to a
> > > DruidTable, and a ScannableTableIterator is introduced to generate an
> > > iterable containing the rows; the new VirtualDataSource can be used to
> > > access the rows of this table.
> > > - finally, the Bindable convention is discarded from DruidPlanner and
> > > Rules.
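The lazy wrapping mentioned above can be sketched in isolation. The class below is a toy, not Druid's InlineDataSource: the source holds a Supplier and only materializes rows when something actually iterates.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Toy sketch of a lazily-evaluated row source: rows are not generated
// until iteration actually begins.
class LazyRowSource implements Iterable<Object[]>
{
  private final Supplier<List<Object[]>> rowSupplier;

  LazyRowSource(Supplier<List<Object[]>> rowSupplier)
  {
    this.rowSupplier = rowSupplier;
  }

  @Override
  public Iterator<Object[]> iterator()
  {
    // The supplier runs here, at iteration time, not at construction time.
    return rowSupplier.get().iterator();
  }
}
```

Deferring row generation this way matters for system tables, where building the rows (e.g. collecting segment metadata) can be expensive and should not happen during planning.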
> > >
> > > > I think there are a couple of remaining parts to resolve that would
> > make
> > > > this feasible. The first is native scan queries need support for
> > ordering
> > > > by arbitrary columns, instead of just time, so that we can retain
> > > > capabilities of the existing system tables.
> > >
> > > It seems you want to use the native queries to support ordering; do
> > > you mean here the underlying SegmentsTable, or something in the Druid
> > > engine? Currently, the SegmentsTable etc relies on, as you say, the
> > > bindable convention to provide sorting. If it were a DruidTable, then it
> > > seems that sorting gets pushed into PartialDruidQuery->DruidQuery,
> > > which conceptually is able to do a sort, but as described in [1] [2]
> > > the ordering is not supported by the underlying Druid engine [3].
> > >
> > > This would mean that an order by, sort, limit query would not be
> > > supported on any of the migrated sys.* tables until Druid has a way to
> > > perform the sort on a ScanQuery.
> > >
> > > [1]
> > >
> >
> https://druid.apache.org/docs/latest/querying/scan-query.html#time-ordering
> > > [2]
> > >
> >
> https://github.com/apache/druid/blob/master/sql/src/main/java/org/apache/druid/sql/calcite/rel/DruidQuery.java#L1075-L1078
> > > [3]
> > >
> >
> https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/scan/ScanQueryEngine.java
> > >
> > > > This isn't actually a blocker
> > > > for adding native system table queries, but rather a blocker for
> > > replacing
> > > > the bindable convention by default so that there isn't a loss (or
> > rather
> > > > trade) of functionality. Additionally, I think there are maybe some
> > > > matters regarding authorization of system tables when handled by the
> > > > native engine that will need to be resolved, but this can be done while
> > > > adding the native implementations.
> > >
> > > It looks like the port of the tables from classic ScannableTable to a
> > > DruidTable itself is straightforward. However, it seems this PR
> > > doesn't bring them across from the SQL domain to be available in any
> > > native queries. I'm not sure if this is expected or an interim step or
> > > if I have misunderstood the goal.
> > >
> > > > I think there are some various ideas and experiments underway of how
> to
> > > do
> > > > sorting on scan queries at normal Druid datasource scale, which is
> sort
> > > of
> > > > a big project, but in the short term we might be able to do something
> > > less
> > > > ambitious that works well enough at system tables scale to allow this
> > > plan
> > > > to fully proceed.
> > >
> > > One possible way, that I think leads in the correct direction:
> > > 1) We have an existing rule for LogicalTable with DruidTable to
> > > DruidQueryRel which can eventually construct a DruidQuery.
> > > 2) The VirtualDataSource, created during SQL parsing, takes an
> > > already-constructed Iterable; so we need to have already performed
> > > the filter/sort before creating the VirtualDataSource (and
> > > DruidQuery). This means the push-down filter logic has to happen
> > > during sql/ stage setup, before handoff to the processing/ engine.
> > > 3) Perhaps a new VirtualDruidTable subclassing DruidTable w/ a
> > > RelOptRule that can identify LogicalXxx above a VirtualDruidTable and
> > > push down? Then, our SegmentTable and friends can expose the correct
> > > Iterable. This should allow us to solve the perf concerns, and would
> > > allow us to present a correctly constructed VirtualDataSource. Sort
> > > from SQL _should_ be supported (I think) as the planner can push the
> > > sort etc down to these nodes directly.
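The pushdown in step (3) can be illustrated with a toy stand-in. All names below are invented for illustration and only mirror the shape of a SortedCursorFactory-style API: the table builds its result only after the planner hands it the filter, ordering, and limit.

```java
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Toy stand-in for a system table row.
record SegmentRow(String segmentId, long size) {}

// Hypothetical virtual table: filter/sort/limit are pushed down, so the
// result is constructed only once the planner knows the query's full shape.
class VirtualSegmentsTableSketch
{
  private final List<SegmentRow> allRows;

  VirtualSegmentsTableSketch(List<SegmentRow> allRows)
  {
    this.allRows = allRows;
  }

  List<SegmentRow> scan(Predicate<SegmentRow> filter, Comparator<SegmentRow> ordering, long limit)
  {
    Stream<SegmentRow> s = allRows.stream().filter(filter);
    if (ordering != null) {
      s = s.sorted(ordering);
    }
    return s.limit(limit).collect(Collectors.toList());
  }
}
```

Because the scan method receives all three operations at once, the table can apply them in a single pass over its own data, which is exactly the performance win being sought for sys.* tables.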
> > >
> > > In this approach, the majority of the work would happen in sql/,
> > > before reaching the Druid engine, and so Druid core doesn't actually
> > > need to know anything about these changes.
> > >
> > > On the other hand, whilst it keeps the pathway open, I'm not sure this
> > > does any of the actual work to make the sys.* tables available as
> > > native tables. If we are to try and make these into truly native
> > > tables, without a native sort, and remove their implementation from
> > > sql/, the DruidQuery in the planner would need to be configured to
> > > pass the ScanQuery sort to the processing engine _but only for sys.*
> > > tables_ and then processing engine would need to know how to find
> > > these tables. (I haven't explored this). As you mention, implementing
> > > native sort across multiple data sources seems like a more ambitious
> > > piece of work.
> > >
> > > As another idea, we could consider creating a bridge
> > > Bindable/EnumerableToDruid rule that would allow Druid to embed these
> > > tables, and move them out of sql/ into processing/, exposed as
> > > Iterable/Enumerable, and make them available in queries if that is a
> > > goal. I'm not really sure that adds anything to the overall goals
> > > though.
> > >
> > > > Does this approach make sense? I don't believe Gian is actively
> working
> > > on
> > > > this at the moment, so I think if you're interested in moving along
> > this
> > > > approach and want to start laying the groundwork I'm happy to provide
> > > > guidance and help out.
> > > >
> > >
> > > I am interested. For my current work, I do want to keep focus on the
> > > sys.* performance work. If there's a way to do it and lay the
> > > groundwork or even get all the work done, then I am 100% for that.
> > > Looking at what you want to do to convert these sys.* to native
> > > tables, if we have a viable solution or are comfortable with my
> > > suggestions above I'd be happy to build it out.
> > >
> > > Thanks
> > > Jason
> > >
> > >
> > >
> >
>