Posted to user@drill.apache.org by Andy Grove <an...@gmail.com> on 2020/01/11 22:57:43 UTC

Looking for advice on integrating with a custom data source

Hi,

I'd like to use Apache Drill with a custom data source that supports a
subset of SQL.

My goal is to have Drill push selection and predicates down to my data
source but the rest of the query processing should take place in Drill.

I started out by writing a JDBC driver for the data source and registering
it with Drill using the JDBC Storage Plugin, but it seems to just pass the
whole query through to my data source, so that approach isn't going to work
unless I'm missing something?

Is there any way to configure the JDBC storage plugin to only push certain
parts of the query to the data source?

If this isn't a good approach, do I need to write a custom storage plugin?
Can these be added on the classpath or would that require me maintaining a
fork of the project?

I appreciate any pointers anyone can give me.

Thanks,

Andy.

Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
Hi Charles,

I would like to be able to contribute something out of this effort. The PoC
I am working on is quite fluid at the moment; one possible outcome is that
this storage engine ends up supporting Arrow Flight, but I'm not sure yet.

Andy.


Re: Looking for advice on integrating with a custom data source

Posted by Charles Givre <cg...@gmail.com>.
Andy, 
Glad to hear you got it working!!   Can you share what data source you are working with?  Is it completely custom to your organization?  If not, would you consider submitting this as a pull request?
Best,
-- C





Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
And boom! With just 3 extra lines of code to adjust the CBO to make the row
count inversely proportional to the number of predicates, my little PoC
works :-)
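
For anyone curious, that kind of CBO tweak might look roughly like this inside a GroupScan. This is only a sketch: BASE_ROW_COUNT and the predicates field are invented, and the ScanStats constructor should be checked against your Drill version. The exact scaling matters less than the monotonic cost decrease Paul describes in his reply.

    @Override
    public ScanStats getScanStats() {
      // Make the estimated row count inversely proportional to the number of
      // pushed-down predicates so Calcite sees the filtered scan as cheaper.
      long rows = BASE_ROW_COUNT / (1 + predicates.size());
      return new ScanStats(ScanStats.GroupScanProperty.NO_EXACT_ROW_COUNT,
          rows, 1, rows);
    }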

Now that I've achieved the instant gratification (relatively speaking!) of
making something work, I think it's time to step back and start doing this
the right way with the PR you mentioned.

I would not have been able to get this working at all without all the
fantastic support!

Thanks,

Andy.




Re: Looking for advice on integrating with a custom data source

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Andy,

Congratulations on making such fast progress!

The code to do filter pushdowns is rather complex and, it seems, most plugins copy/paste the same wad of code (with the same bugs). PR 1914 provides a layer that converts the messy Drill logical plan into a nice, simple set of predicates. You can then pick and choose which to push down, allowing the framework to do the rest.

Note that most of the plugins do push-down as part of physical planning. While this works in most cases, it WILL NOT work if you are doing push-down in order to shard the scan, for example to divide a time range up into pieces for a time series scan. The PR thus does push-down in the logical phase so that we can "do the right thing."

When you say that getNewWithChildren() is for an earlier instance, it is very likely because Calcite gave up on your filter-push-down version because there was no cost reduction.


The Wiki page mentioned earlier explains all the copies a bit. Basically, Drill creates many copies of your GroupScan as it proceeds. First a "blank" one, then another with projected columns, then another full copy as Calcite explores planning options, and so on.

One key trick is that if you implement filter push down, you MUST return a lower cost estimate after the push-down than before. Else, Calcite decides that it is not worth the hassle of doing the push-down if the costs remain the same. See the Wiki for details. This is what getScanStats() does: report stats that must get lower as you improve the scan.

That is, one cost at the start, a lower cost after projection push-down (reflecting the fact that we presumably now read less data per row), and a lower cost again after filter push-down (because we read fewer rows). There is a "Dummy" storage plugin in PR 1914 that illustrates all of this.

Don't worry about getDigest(); it is just Calcite trying to get a label to use for its internal objects. You will need to implement getString(), using Drill's "EXPLAIN PLAN" format, so your scan can appear in the text plan output. EXPLAIN PLAN output is:

ClassName [field1=x, field2=y]

There is a little builder in PR 1914 to do this for you.
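
As an illustration of that format only (hand-rolled rather than using the PR 1914 builder, with hypothetical field names, and shown as a plain toString(); check your GroupScan base class for the exact hook Paul mentions):

    @Override
    public String toString() {
      // Rendered in the text plan as: MyDbGroupScan [columns=..., predicates=...]
      return "MyDbGroupScan [columns=" + columns
          + ", predicates=" + predicates + "]";
    }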

Thanks,
- Paul

 


Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
With some extra debugging I can see that the getNewWithChildren call is
made to an earlier instance of GroupScan and not the instance created by
the filter push-down rule. I'm wondering if this is some kind of
hashCode/equals/toString/getDigest issue?
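
For context, getNewWithChildren() in existing plugins is usually just a call to a copy constructor. A sketch with hypothetical MyDb names, loosely modeled on the Kudu plugin:

    @Override
    public PhysicalOperator getNewWithChildren(List<PhysicalOperator> children) {
      Preconditions.checkArgument(children.isEmpty());
      // The copy constructor must carry over ALL state, including any
      // pushed-down predicates, or later planner copies will lose them.
      return new MyDbGroupScan(this);
    }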


Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
I'm now working on predicate push down ... I have a filter rule that is
correctly extracting the predicates that the backend database supports and
I am creating a new GroupScan containing these predicates, using the Kafka
plugin as a reference. I see the GroupScan constructor being called after
this, with the predicates populated. So far so good ... but then I see calls
to getDigest, getScanStats, and getNewWithChildren, and then I see calls to
the GroupScan constructor with the predicates missing.

Any pointers on what I might be missing? Is there more magic I need to know?

Thanks!


Re: Looking for advice on integrating with a custom data source

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Andy,

Congrats! You are making good progress. Yes, the BatchCreator is a bit of magic: Drill looks for a subclass that has your SubScan subclass as the second parameter. Looks like you figured that out.
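
A minimal sketch of that shape, with hypothetical MyDb names (Drill discovers the class by scanning for BatchCreator implementations and keying on the SubScan type parameter; signatures should be checked against your Drill version):

    public class MyDbScanBatchCreator implements BatchCreator<MyDbSubScan> {
      @Override
      public CloseableRecordBatch getBatch(ExecutorFragmentContext context,
          MyDbSubScan subScan, List<RecordBatch> children)
          throws ExecutionSetupException {
        Preconditions.checkArgument(children.isEmpty());
        // One RecordReader per unit of work described by the sub-scan.
        List<RecordReader> readers = new ArrayList<>();
        readers.add(new MyDbRecordReader(subScan));
        return new ScanBatch(subScan, context, readers);
      }
    }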

Thanks,
- Paul

 


Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
Actually I managed to get past that error with an educated guess that if I
created a BatchCreator class, it would automagically be picked up somehow.
I'm now at the point where my RecordReader is being invoked!
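
For readers following along, the reader being invoked here is typically a subclass of AbstractRecordReader; a bare-bones sketch with hypothetical MyDb names:

    public class MyDbRecordReader extends AbstractRecordReader {
      @Override
      public void setup(OperatorContext context, OutputMutator output) {
        // Create value vectors for the projected columns here.
      }

      @Override
      public int next() {
        // Fill the vectors with up to one batch of rows; return the number
        // of rows written, or 0 when the source is exhausted.
        return 0;
      }

      @Override
      public void close() {
      }
    }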


Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
Between reading the tutorial and copying and pasting code from the Kudu
storage plugin, I've been making reasonable progress with this, but am a bit
confused by one error I'm now hitting.
ExecutionSetupException: Failure finding OperatorCreator constructor for
config com.mydb.MyDbSubScan
Prior to this, Drill had called getSpecificScan and then called a few of
the methods on my subscan object. I wasn't sure what to return for
getOperatorType so just returned the Kudu subscan operator type and I'm
wondering if the issue is related to that somehow?

Thanks.



Re: Looking for advice on integrating with a custom data source

Posted by Andy Grove <an...@gmail.com>.
Thank you both for those responses. This is very helpful. I have
ordered a copy of the book too. I'm using Drill 1.17.0.

I'll take a look at the Jdbc Storage Plugin code and see if it would be
feasible to add the logic I need there. In parallel, I've started
implementing a new storage plugin. I'll be working on this more tomorrow
and I'm sure I'll be back with more questions soon.

Thanks again for your help!

Andy.








Re: Looking for advice on integrating with a custom data source

Posted by Charles Givre <cg...@gmail.com>.
Hi Andy,
Thanks for your interest in Drill.  I'm glad to see that Paul wrote you back as well.  I was going to say I thought the JDBC storage plugin did in fact push down columns and filters to the source system. 

Also, what version of Drill are you using?

Writing a storage plugin for Drill is not trivial and I'd definitely recommend using the code from Paul's PR as that greatly simplifies things.  Here is a tutorial as well: https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin

If you need additional help, please let us know. 
-- C




Re: Looking for advice on integrating with a custom data source

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Andy,

There are likely multiple approaches; here are two. Some bit of code has to decide what can be pushed to your data source and what must remain in Drill. At present, there is no declarative way to say, "OK to push such-and-so expression, but keep this-and-that."

Instead, the current approach is for your plugin to tie into Drill's Calcite-based query planner. You define Calcite rules that fire to perform the push operations you want to support. The code in this area is somewhat obscure, but multiple examples exist in the Kafka and other plugins.
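
To make that concrete, here is a skeleton of such a rule, loosely modeled on the Kafka plugin's filter push-down rule. The MyDb names and the translate()/cloneWithPredicates() helpers are hypothetical, and the exact classes and constructors (StoragePluginOptimizerRule, RelOptHelper, ScanPrel.create) vary across Drill versions, so check them against the plugin you copy from:

    public class MyDbPushFilterIntoScan extends StoragePluginOptimizerRule {
      public static final MyDbPushFilterIntoScan INSTANCE =
          new MyDbPushFilterIntoScan();

      private MyDbPushFilterIntoScan() {
        // Fire whenever a Filter sits directly on top of one of our scans.
        super(RelOptHelper.some(FilterPrel.class, RelOptHelper.any(ScanPrel.class)),
            "MyDbPushFilterIntoScan:Filter_On_Scan");
      }

      @Override
      public boolean matches(RelOptRuleCall call) {
        ScanPrel scan = call.rel(1);
        return scan.getGroupScan() instanceof MyDbGroupScan;
      }

      @Override
      public void onMatch(RelOptRuleCall call) {
        FilterPrel filter = call.rel(0);
        ScanPrel scan = call.rel(1);
        MyDbGroupScan groupScan = (MyDbGroupScan) scan.getGroupScan();
        // Decide which predicates the source can evaluate; existing plugins
        // use DrillOptiq.toDrill(...) to translate the Calcite condition.
        List<String> pushable = translate(filter.getCondition());
        if (pushable.isEmpty()) {
          return; // nothing we can push; leave the plan unchanged
        }
        MyDbGroupScan newScan = groupScan.cloneWithPredicates(pushable);
        ScanPrel newScanPrel = ScanPrel.create(scan, filter.getTraitSet(),
            newScan, scan.getRowType());
        // Keep the Filter on top so Drill re-applies anything the source
        // only partially evaluates.
        call.transformTo(filter.copy(filter.getTraitSet(),
            ImmutableList.of((RelNode) newScanPrel)));
      }

      // Hypothetical helper: map the Calcite condition to source predicates.
      private List<String> translate(RexNode condition) {
        return Collections.emptyList();
      }
    }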

Also, at present, storage "plugins" are not really plugins at compile time: they pretty much need to be built within the Drill source tree. This is especially true if you want to run unit tests. (We'd like to improve this area of the project; suggestions welcome.) Generally, folks put their plugin in the "contrib" directory within Drill. Yes, you must maintain your own branch. However, as long as you do not modify Drill code (you shouldn't need to), it is not too hard to simply occasionally rebase your branch on top of a new Drill release.

At runtime, however, plugins are true plugins: you can take the plugin jar you create using the above process and drop it into an "official" release directory. We talk a bit about this process in the book Learning Apache Drill from O'Reilly.


We recently tried to clean up the plugin structure just a bit in PR 1914 (DRILL-7458) [1]. The PR provides just a few baby steps and suggestions are encouraged. The key new feature in this PR is a standardized way to handle filter push-downs to avoid the large amount of copy-and-paste previously required.


The PR is the result of a recent project to create a storage plugin that included filter push-down. Notes on that process are in [2].

You mentioned that your data source is similar to JDBC. So, another approach is to modify the existing storage plugin to provide storage plugin config options to control what gets pushed down (assuming that the decision is simple enough to express as a few options.) In this case, you could offer your changes as a PR which the Drill project would maintain as part of the source base, saving you from creating your own fork.
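
If the decision really is simple enough for a few options, the config side might be as small as a Jackson-mapped boolean on the plugin config. This is a sketch only, with a hypothetical class name; the real JdbcStorageConfig also carries the driver, URL, and credentials:

    @JsonTypeName("myjdbc")
    public class MyJdbcStorageConfig extends StoragePluginConfig {
      // Hypothetical switch: when false, the planner rules would skip pushing
      // filters down and let Drill evaluate them instead.
      private final boolean pushDownFilters;

      @JsonCreator
      public MyJdbcStorageConfig(
          @JsonProperty("pushDownFilters") boolean pushDownFilters) {
        this.pushDownFilters = pushDownFilters;
      }

      public boolean isPushDownFilters() {
        return pushDownFilters;
      }

      @Override
      public boolean equals(Object o) {
        return o instanceof MyJdbcStorageConfig
            && ((MyJdbcStorageConfig) o).pushDownFilters == pushDownFilters;
      }

      @Override
      public int hashCode() {
        return Boolean.hashCode(pushDownFilters);
      }
    }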

Thanks,
- Paul


[1] https://github.com/apache/drill/pull/1914
 
[2] https://github.com/paul-rogers/drill/wiki/Create-a-Storage-Plugin


