You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by Sarnath K <st...@gmail.com> on 2019/04/09 15:59:30 UTC

Drill for Data Virtualization

Hi,

I have a requirement where I need to split data between a fast RDBMS system
(A) that will have HOT data and a slower cold storage (B)

Both A and B provide JDBC drivers

I am looking to see if Drill will help me in coming with a JDBC URL (C)
which will hide the fact that data is split between A and B. i.e. Can Drill
be used to implement Data Virtualization?

As much as I can read about Drill, I can definitely create 2 tables in
Drill one pointing to A and another to B.
However when I do GROUP BY queries or FILTER queries -- Does Drill take
advantage of the existing JDBC systems by actually sending a part of the
GROUP BY to A and another to B and then reduce the result again? i.e. Some
kind of smart predicate push-down for Analytical queries?

Hope I sound clear to you. Appreciate your response much.

Thank you,

Best,
Sarnath

Re: Drill for Data Virtualization

Posted by Kunal Khatua <ku...@apache.org>.

I was expecting that #1 would not do any push down at all, because it involves Calcite to get smart about doing query re-writes and pushing down aggregation as part of separate queries into the JDBC sources A & B, before reapplying the aggregation again within the UNIONized data.

If #2 partially works, I'd recommend you pick Case 1 as your candidate query (assuming that pushing down GROUP BY on MySQL is more beneficial than pushing it only on SQLLite).
Your snippet is truncated, but my suspicion is that it might have to do with the row-count estimate for Source B being considered much greater than Source A; causing Calcite to recommend pushing down the aggregation to only one of the 2 sources.

Unfortunately, there is no flag to push it down to both and for something like this, one would need to dig into Calcite to understand why it didn't apply the push-down to both sides before a code change can be suggested.

Glad you've liked the WebUI. Apache Drill 1.16 will get further improvements in the WebUI, though nothing that will specifically address your current problem.

~ Kunal  
On 4/17/2019 7:20:38 AM, Sarnath K <st...@gmail.com> wrote:
Hi Kunal,

Few experiments that we did and the results:
1)
UNION from A and B and then have a Single GROUP BY
select GK, count(*) from ((select * from A) UNION ALL (select * from B))
GROUP BY GK
This one did the HashAggregate only in Drill. There was absolutely NO
PUSHDOWN to JDBC sources.

2)
GROUP BY from A UNION GROUP BY from B and then GROUP AGAIN
select GK, count(*) from ((select GK, count(*) as cnt from A GROUP BY GK)
UNION ALL (select GK, count(*) as cnt from B GROUP BY GK)) GROUP BY GK
For this one, the Predicate Push Down is happening ONLY for 'A' (regardless
of which data source it is).
There is no Predicate Push Down for 'B' (no matter which Data source it is
in)
Case1) We tested with A as MySQL and B as SQLite
Case2) We tested with A as SQLite and B as MySQL
In the first case the Predicate Pushdown was for 'A' alone i.e. the first
table which is Grouped.

Here is a physical plan of such an execution:
00-00 Screen : rowType = RecordType(VARCHAR(65536) empname, BIGINT cnt):
rowcount = 2.0, cumula
00-01 Project(empname=[$0], cnt=[$1]) : rowType =
RecordType(VARCHAR(65536) empname, BIGINT c
00-02 HashAgg(group=[{0}], cnt=[COUNT()]) : rowType =
RecordType(VARCHAR(65536) empname, BI
00-03 UnionAll(all=[true]) : rowType = RecordType(VARCHAR(65536)
empname): rowcount = 20.
00-05 Jdbc(sql=[SELECT `EmpName` AS `empname` FROM
`test`.`emp` GROUP BY `EmpName` ]) : rowType = RecordType(VARCHAR(255)
empname): rowcount = 10.0, cumulative c
00-04 Project(empname=[$0]) : rowType =
RecordType(VARCHAR(65536) empname): rowcount =
00-06 HashAgg(group=[{0}], cnt=[COUNT()]) : rowType =
RecordType(VARCHAR(65536) empna
00-07 Project(empname=[$1]) : rowType =
RecordType(VARCHAR(65536) empname): rowcoun
00-08 Jdbc(sql=[SELECT * FROM "Emp" ]) : rowType =
RecordType(INTEGER EmpId, VARCHAR(65536) EmpName, VARCHAR(65536) Phone,

One can see that the JDBC at 00-05 has GROUP BY pushed down
Whereas the JDBC at 00-08 does not have the GROUP BY pushed down. It is
rather handled after Union at 00-02

Is there any option that we could enable to make them GROUP BY at both
places?

btw... I am very impressed with Drill. It has a very simple and elegant
interface. A place where we can see all Queries that are getting executed,
see logs from web-app itself and beautifully vertically rendered Plans that
are so easily understandable.

This is exactly how intuitive and easy a user-interface has to be. Well
done, Drill Team!
Of course, we are impressed with reading multiple sources and joining them.
Looks like Magic!
Great job!

Thanks,
Best,
Sarnath





On Sat, Apr 13, 2019 at 2:59 PM Sarnath K wrote:

> Hi Kunal,
> I tried examining the plan for a simple group by. I see that the group by
> is pushed to JDBC step whose output goes to the project ... Which seems
> like pushdown is working fine ...
> We are trying other cases. I will keep posted.
> Thank you for your help and time.
> I understand Calcite is a great community effort...I have been following
> it for quite some time. Thanks!!
> Best,
> Sarnath
>
> On Fri, Apr 12, 2019, 02:28 Kunal Khatua wrote:
>
>> On 4/11/2019 12:39:24 PM, Sarnath K wrote:
>> Thank you Kunal.
>>
>> >>>You could try creating views for each source and then doing a group by
>> on the union of those views... that *might* get you the results you want
>>
>> When you mention views, do you mean to say each view will be a group by
>> statement for that particular source....And we try to union them and group
>> again...This way, explicitly making up the Query do the pushdown....
>> That's
>> the idea you are referring to. Right!??
>> Kunal Khatua: That is correct. Worth a try. Start with querying the views
>> individually to see if the pushdown occurs in the first place.
>>
>>
>>
>> Btw....Calcite (possibly) not recognizing the pushdown opportunity would
>> be
>> a let down ... especially for flexible frameworks like Drill... In my
>> opinion...
>> Kunal Khatua: Yes, but again... Calcite is an independent open-source
>> project in use by many other OSS and commercial vendors. Considering many
>> such projects are driven by volunteer contributions, it's a miracle in my
>> opinion that the open-source software is able to achieve so much (and,
>> sometimes, putting commercial offerings to shame) without charging a penny
>> to the end users.
>>
>> Developers behind Drill have made contributions to Calcite in their
>> limited capacity, as have developers from other projects.. so, in many
>> ways, Drill has actually benefited from Calcite in more ways than it could
>> have by implementing its own Calcite substitute. Hopefully, someone in the
>> community can take a look at enhancing this feature as well.
>>
>> Thanks for your time. Appreciate much. I will keep posted.
>>
>> Best,
>> Satnath
>>
>> On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:
>>
>> > Hi Sarnath
>> >
>> > I haven't tried your specific requirement, and it is possible that if
>> you
>> > are querying only A or only B, Drill would be able to push it down to
>> the
>> > source.
>> >
>> > However, it gets tricky when you are querying 2 or more sources in the
>> > same query, because (from my limited knowledge of Calcite) the Calcite
>> > parser needs to be aware that it can push filters down to both sources.
>> > With GROUP BY, multiple groupings across a single source versus across
>> > multiple sources are not semantically the same.
>> >
>> > You could try creating views for each source and then doing a group by
>> on
>> > the union of those views... that *might* get you the results you want.
>> >
>> > You can give it a shot, but I suspect it won't be as performant. Let us
>> > know if you find it otherwise.
>> >
>> > ~ Kunal
>> >
>> > On 4/10/2019 9:02:24 PM, Sarnath K wrote:
>> > Hi Kunal,
>> >
>> > Thank you for your response. But what I read in this URL says it can be
>> > done (though my own interpretation is muddled)
>> > https://drill.apache.org/docs/rdbms-storage-plugin/
>> >
>> > There is a statement in the documentation that says:
>> >
>> > As with any source, Drill supports joins within and between all systems.
>> > Drill additionally has powerful pushdown capabilities with RDBMS
>> sources.
>> > This includes support to push down join, where, group by, intersect and
>> > other SQL operations into a particular RDBMS source (as appropriate).
>> >
>> >
>> > >> That said, even if the feature existed, by design, only one fragment
>> can
>> > read from a JDBC storage plugin, as it uses a single connection to
>> stream
>> > out the resultset.
>> >
>> > I did not understand this. Say, I GROUP BY a particular column and
>> perform
>> > "max", "min" and "sum" aggregation. These are all associative group
>> summary
>> > operations. So, I have send MAX Query to A and then MAX query to B. Get
>> the
>> > results from both into Drill cluster and then perform a MAX on the
>> > partially reduced result. This will be cheaper than loading all data
>> from A
>> > and B into Drill and then performing the GROUP BY operation.
>> >
>> > Can Drill do these smart group-by operations as on today? The
>> documentation
>> > I read above is encouraging (its pretty recent - Dec 2018).
>> >
>> > Thanks for your time,
>> > Best,
>> > Sarnath
>> >
>> >
>> >
>> > On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
>> >
>> > > Hi Sarnath
>> > >
>> > > From what I understand by your description, you are looking to see if
>> > > Drill can push down the GROUP BY clause to the underlying JDBC
>> sources A
>> > > and B.
>> > >
>> > > Unfortunately, Drill does not support pushdown for the JDBC storage
>> > plugin
>> > > as yet. That said, even if the feature existed, by design, only one
>> > > fragment can read from a JDBC storage plugin, as it uses a single
>> > > connection to stream out the resultset.
>> > >
>> > > ~ Kunal
>> > >
>> > > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
>> > > Hi,
>> > >
>> > > I have a requirement where I need to split data between a fast RDBMS
>> > system
>> > > (A) that will have HOT data and a slower cold storage (B)
>> > >
>> > > Both A and B provide JDBC drivers
>> > >
>> > > I am looking to see if Drill will help me in coming with a JDBC URL
>> (C)
>> > > which will hide the fact that data is split between A and B. i.e. Can
>> > Drill
>> > > be used to implement Data Virtualization?
>> > >
>> > > As much as I can read about Drill, I can definitely create 2 tables in
>> > > Drill one pointing to A and another to B.
>> > > However when I do GROUP BY queries or FILTER queries -- Does Drill
>> take
>> > > advantage of the existing JDBC systems by actually sending a part of
>> the
>> > > GROUP BY to A and another to B and then reduce the result again? i.e.
>> > Some
>> > > kind of smart predicate push-down for Analytical queries?
>> > >
>> > > Hope I sound clear to you. Appreciate your response much.
>> > >
>> > > Thank you,
>> > >
>> > > Best,
>> > > Sarnath
>> > >
>> >
>>
>

Re: Drill for Data Virtualization

Posted by Sarnath K <st...@gmail.com>.

Hi Kunal,

Few experiments that we did and the results:
1)
UNION from A and B and then have a Single GROUP BY
select GK, count(*) from ((select * from A) UNION ALL (select * from B))
GROUP BY GK
This one did the HashAggregate only in Drill. There was absolutely NO
PUSHDOWN to JDBC sources.

2)
GROUP BY from A UNION GROUP BY from B and then GROUP AGAIN
select GK, count(*) from ((select GK, count(*) as cnt from A GROUP BY GK)
UNION ALL (select GK, count(*) as cnt from B GROUP BY GK)) GROUP BY GK
For this one, the Predicate Push Down is happening ONLY for 'A' (regardless
of which data source it is).
There is no Predicate Push Down for 'B' (no matter which Data source it is
in)
Case1) We tested with A as MySQL and B as SQLite
Case2) We tested with A as SQLite and B as MySQL
In the first case the Predicate Pushdown was for 'A' alone i.e. the first
table which is Grouped.

Here is a physical plan of such an execution:
00-00    Screen : rowType = RecordType(VARCHAR(65536) empname, BIGINT cnt):
rowcount = 2.0, cumula
00-01      Project(empname=[$0], cnt=[$1]) : rowType =
RecordType(VARCHAR(65536) empname, BIGINT c
00-02        HashAgg(group=[{0}], cnt=[COUNT()]) : rowType =
RecordType(VARCHAR(65536) empname, BI
00-03          UnionAll(all=[true]) : rowType = RecordType(VARCHAR(65536)
empname): rowcount = 20.
00-05            Jdbc(sql=[SELECT `EmpName` AS `empname`  FROM
`test`.`emp`  GROUP BY `EmpName` ]) : rowType = RecordType(VARCHAR(255)
empname): rowcount = 10.0, cumulative c
00-04            Project(empname=[$0]) : rowType =
RecordType(VARCHAR(65536) empname): rowcount =
00-06              HashAgg(group=[{0}], cnt=[COUNT()]) : rowType =
RecordType(VARCHAR(65536) empna
00-07                Project(empname=[$1]) : rowType =
RecordType(VARCHAR(65536) empname): rowcoun
00-08                  Jdbc(sql=[SELECT *  FROM "Emp" ]) : rowType =
RecordType(INTEGER EmpId, VARCHAR(65536) EmpName, VARCHAR(65536) Phone,

One can see that the JDBC at 00-05 has GROUP BY pushed down
Whereas the JDBC at 00-08 does not have the GROUP BY pushed down. It is
rather handled after Union at 00-02

Is there any option that we could enable to make them GROUP BY at both
places?

btw... I am very impressed with Drill. It has a very simple and elegant
interface. A place where we can see all Queries that are getting executed,
see logs from web-app itself and beautifully vertically rendered Plans that
are so easily understandable.

This is exactly how intuitive and easy a user-interface has to be. Well
done, Drill Team!
Of course, we are impressed with reading multiple sources and joining them.
Looks like Magic!
Great job!

Thanks,
Best,
Sarnath





On Sat, Apr 13, 2019 at 2:59 PM Sarnath K <st...@gmail.com> wrote:

> Hi Kunal,
> I tried examining the plan for a simple group by.  I see that the group by
> is pushed to JDBC step whose output goes to the project ... Which seems
> like pushdown is working fine ...
> We are trying other cases. I will keep posted.
> Thank you for your help and time.
> I understand Calcite is a great community effort...I have been following
> it for quite some time. Thanks!!
> Best,
> Sarnath
>
> On Fri, Apr 12, 2019, 02:28 Kunal Khatua <ku...@apache.org> wrote:
>
>> On 4/11/2019 12:39:24 PM, Sarnath K <st...@gmail.com> wrote:
>> Thank you Kunal.
>>
>> >>>You could try creating views for each source and then doing a group by
>> on the union of those views... that *might* get you the results you want
>>
>> When you mention views, do you mean to say each view will be a group by
>> statement for that particular source....And we try to union them and group
>> again...This way, explicitly making up the Query do the pushdown....
>> That's
>> the idea you are referring to. Right!??
>> Kunal Khatua: That is correct. Worth a try. Start with querying the views
>> individually to see if the pushdown occurs in the first place.
>>
>>
>>
>> Btw....Calcite (possibly) not recognizing the pushdown opportunity would
>> be
>> a let down ... especially for flexible frameworks like Drill... In my
>> opinion...
>> Kunal Khatua: Yes, but again... Calcite is an independent open-source
>> project in use by many other OSS and commercial vendors. Considering many
>> such projects are driven by volunteer contributions, it's a miracle in my
>> opinion that the open-source software is able to achieve so much (and,
>> sometimes, putting commercial offerings to shame) without charging a penny
>> to the end users.
>>
>> Developers behind Drill have made contributions to Calcite in their
>> limited capacity, as have developers from other projects.. so, in many
>> ways, Drill has actually benefited from Calcite in more ways than it could
>> have by implementing its own Calcite substitute. Hopefully, someone in the
>> community can take a look at enhancing this feature as well.
>>
>> Thanks for your time. Appreciate much. I will keep posted.
>>
>> Best,
>> Satnath
>>
>> On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:
>>
>> > Hi Sarnath
>> >
>> > I haven't tried your specific requirement, and it is possible that if
>> you
>> > are querying only A or only B, Drill would be able to push it down to
>> the
>> > source.
>> >
>> > However, it gets tricky when you are querying 2 or more sources in the
>> > same query, because (from my limited knowledge of Calcite) the Calcite
>> > parser needs to be aware that it can push filters down to both sources.
>> > With GROUP BY, multiple groupings across a single source versus across
>> > multiple sources are not semantically the same.
>> >
>> > You could try creating views for each source and then doing a group by
>> on
>> > the union of those views... that *might* get you the results you want.
>> >
>> > You can give it a shot, but I suspect it won't be as performant. Let us
>> > know if you find it otherwise.
>> >
>> > ~ Kunal
>> >
>> > On 4/10/2019 9:02:24 PM, Sarnath K wrote:
>> > Hi Kunal,
>> >
>> > Thank you for your response. But what I read in this URL says it can be
>> > done (though my own interpretation is muddled)
>> > https://drill.apache.org/docs/rdbms-storage-plugin/
>> >
>> > There is a statement in the documentation that says:
>> >
>> > As with any source, Drill supports joins within and between all systems.
>> > Drill additionally has powerful pushdown capabilities with RDBMS
>> sources.
>> > This includes support to push down join, where, group by, intersect and
>> > other SQL operations into a particular RDBMS source (as appropriate).
>> >
>> >
>> > >> That said, even if the feature existed, by design, only one fragment
>> can
>> > read from a JDBC storage plugin, as it uses a single connection to
>> stream
>> > out the resultset.
>> >
>> > I did not understand this. Say, I GROUP BY a particular column and
>> perform
>> > "max", "min" and "sum" aggregation. These are all associative group
>> summary
>> > operations. So, I have send MAX Query to A and then MAX query to B. Get
>> the
>> > results from both into Drill cluster and then perform a MAX on the
>> > partially reduced result. This will be cheaper than loading all data
>> from A
>> > and B into Drill and then performing the GROUP BY operation.
>> >
>> > Can Drill do these smart group-by operations as on today? The
>> documentation
>> > I read above is encouraging (its pretty recent - Dec 2018).
>> >
>> > Thanks for your time,
>> > Best,
>> > Sarnath
>> >
>> >
>> >
>> > On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
>> >
>> > > Hi Sarnath
>> > >
>> > > From what I understand by your description, you are looking to see if
>> > > Drill can push down the GROUP BY clause to the underlying JDBC
>> sources A
>> > > and B.
>> > >
>> > > Unfortunately, Drill does not support pushdown for the JDBC storage
>> > plugin
>> > > as yet. That said, even if the feature existed, by design, only one
>> > > fragment can read from a JDBC storage plugin, as it uses a single
>> > > connection to stream out the resultset.
>> > >
>> > > ~ Kunal
>> > >
>> > > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
>> > > Hi,
>> > >
>> > > I have a requirement where I need to split data between a fast RDBMS
>> > system
>> > > (A) that will have HOT data and a slower cold storage (B)
>> > >
>> > > Both A and B provide JDBC drivers
>> > >
>> > > I am looking to see if Drill will help me in coming with a JDBC URL
>> (C)
>> > > which will hide the fact that data is split between A and B. i.e. Can
>> > Drill
>> > > be used to implement Data Virtualization?
>> > >
>> > > As much as I can read about Drill, I can definitely create 2 tables in
>> > > Drill one pointing to A and another to B.
>> > > However when I do GROUP BY queries or FILTER queries -- Does Drill
>> take
>> > > advantage of the existing JDBC systems by actually sending a part of
>> the
>> > > GROUP BY to A and another to B and then reduce the result again? i.e.
>> > Some
>> > > kind of smart predicate push-down for Analytical queries?
>> > >
>> > > Hope I sound clear to you. Appreciate your response much.
>> > >
>> > > Thank you,
>> > >
>> > > Best,
>> > > Sarnath
>> > >
>> >
>>
>

Re: Drill for Data Virtualization

Posted by Sarnath K <st...@gmail.com>.

Hi Kunal,
I tried examining the plan for a simple group by.  I see that the group by
is pushed to JDBC step whose output goes to the project ... Which seems
like pushdown is working fine ...
We are trying other cases. I will keep posted.
Thank you for your help and time.
I understand Calcite is a great community effort...I have been following it
for quite some time. Thanks!!
Best,
Sarnath

On Fri, Apr 12, 2019, 02:28 Kunal Khatua <ku...@apache.org> wrote:

> On 4/11/2019 12:39:24 PM, Sarnath K <st...@gmail.com> wrote:
> Thank you Kunal.
>
> >>>You could try creating views for each source and then doing a group by
> on the union of those views... that *might* get you the results you want
>
> When you mention views, do you mean to say each view will be a group by
> statement for that particular source....And we try to union them and group
> again...This way, explicitly making up the Query do the pushdown.... That's
> the idea you are referring to. Right!??
> Kunal Khatua: That is correct. Worth a try. Start with querying the views
> individually to see if the pushdown occurs in the first place.
>
>
>
> Btw....Calcite (possibly) not recognizing the pushdown opportunity would be
> a let down ... especially for flexible frameworks like Drill... In my
> opinion...
> Kunal Khatua: Yes, but again... Calcite is an independent open-source
> project in use by many other OSS and commercial vendors. Considering many
> such projects are driven by volunteer contributions, it's a miracle in my
> opinion that the open-source software is able to achieve so much (and,
> sometimes, putting commercial offerings to shame) without charging a penny
> to the end users.
>
> Developers behind Drill have made contributions to Calcite in their
> limited capacity, as have developers from other projects.. so, in many
> ways, Drill has actually benefited from Calcite in more ways than it could
> have by implementing its own Calcite substitute. Hopefully, someone in the
> community can take a look at enhancing this feature as well.
>
> Thanks for your time. Appreciate much. I will keep posted.
>
> Best,
> Satnath
>
> On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:
>
> > Hi Sarnath
> >
> > I haven't tried your specific requirement, and it is possible that if you
> > are querying only A or only B, Drill would be able to push it down to the
> > source.
> >
> > However, it gets tricky when you are querying 2 or more sources in the
> > same query, because (from my limited knowledge of Calcite) the Calcite
> > parser needs to be aware that it can push filters down to both sources.
> > With GROUP BY, multiple groupings across a single source versus across
> > multiple sources are not semantically the same.
> >
> > You could try creating views for each source and then doing a group by on
> > the union of those views... that *might* get you the results you want.
> >
> > You can give it a shot, but I suspect it won't be as performant. Let us
> > know if you find it otherwise.
> >
> > ~ Kunal
> >
> > On 4/10/2019 9:02:24 PM, Sarnath K wrote:
> > Hi Kunal,
> >
> > Thank you for your response. But what I read in this URL says it can be
> > done (though my own interpretation is muddled)
> > https://drill.apache.org/docs/rdbms-storage-plugin/
> >
> > There is a statement in the documentation that says:
> >
> > As with any source, Drill supports joins within and between all systems.
> > Drill additionally has powerful pushdown capabilities with RDBMS sources.
> > This includes support to push down join, where, group by, intersect and
> > other SQL operations into a particular RDBMS source (as appropriate).
> >
> >
> > >> That said, even if the feature existed, by design, only one fragment
> can
> > read from a JDBC storage plugin, as it uses a single connection to stream
> > out the resultset.
> >
> > I did not understand this. Say, I GROUP BY a particular column and
> perform
> > "max", "min" and "sum" aggregation. These are all associative group
> summary
> > operations. So, I have send MAX Query to A and then MAX query to B. Get
> the
> > results from both into Drill cluster and then perform a MAX on the
> > partially reduced result. This will be cheaper than loading all data
> from A
> > and B into Drill and then performing the GROUP BY operation.
> >
> > Can Drill do these smart group-by operations as on today? The
> documentation
> > I read above is encouraging (its pretty recent - Dec 2018).
> >
> > Thanks for your time,
> > Best,
> > Sarnath
> >
> >
> >
> > On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
> >
> > > Hi Sarnath
> > >
> > > From what I understand by your description, you are looking to see if
> > > Drill can push down the GROUP BY clause to the underlying JDBC sources
> A
> > > and B.
> > >
> > > Unfortunately, Drill does not support pushdown for the JDBC storage
> > plugin
> > > as yet. That said, even if the feature existed, by design, only one
> > > fragment can read from a JDBC storage plugin, as it uses a single
> > > connection to stream out the resultset.
> > >
> > > ~ Kunal
> > >
> > > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
> > > Hi,
> > >
> > > I have a requirement where I need to split data between a fast RDBMS
> > system
> > > (A) that will have HOT data and a slower cold storage (B)
> > >
> > > Both A and B provide JDBC drivers
> > >
> > > I am looking to see if Drill will help me in coming with a JDBC URL (C)
> > > which will hide the fact that data is split between A and B. i.e. Can
> > Drill
> > > be used to implement Data Virtualization?
> > >
> > > As much as I can read about Drill, I can definitely create 2 tables in
> > > Drill one pointing to A and another to B.
> > > However when I do GROUP BY queries or FILTER queries -- Does Drill take
> > > advantage of the existing JDBC systems by actually sending a part of
> the
> > > GROUP BY to A and another to B and then reduce the result again? i.e.
> > Some
> > > kind of smart predicate push-down for Analytical queries?
> > >
> > > Hope I sound clear to you. Appreciate your response much.
> > >
> > > Thank you,
> > >
> > > Best,
> > > Sarnath
> > >
> >
>

Re: Drill for Data Virtualization

Posted by Kunal Khatua <ku...@apache.org>.

On 4/11/2019 12:39:24 PM, Sarnath K <st...@gmail.com> wrote:
Thank you Kunal.

>>>You could try creating views for each source and then doing a group by
on the union of those views... that *might* get you the results you want

When you mention views, do you mean to say each view will be a group by
statement for that particular source....And we try to union them and group
again...This way, explicitly making up the Query do the pushdown.... That's
the idea you are referring to. Right!??
Kunal Khatua: That is correct. Worth a try. Start with querying the views individually to see if the pushdown occurs in the first place. 



Btw....Calcite (possibly) not recognizing the pushdown opportunity would be
a let down ... especially for flexible frameworks like Drill... In my
opinion...
Kunal Khatua: Yes, but again... Calcite is an independent open-source project in use by many other OSS and commercial vendors. Considering many such projects are driven by volunteer contributions, it's a miracle in my opinion that the open-source software is able to achieve so much (and, sometimes, putting commercial offerings to shame) without charging a penny to the end users. 

Developers behind Drill have made contributions to Calcite in their limited capacity, as have developers from other projects.. so, in many ways, Drill has actually benefited from Calcite in more ways than it could have by implementing its own Calcite substitute. Hopefully, someone in the community can take a look at enhancing this feature as well.

Thanks for your time. Appreciate much. I will keep posted.

Best,
Satnath

On Thu, Apr 11, 2019, 23:25 Kunal Khatua wrote:

> Hi Sarnath
>
> I haven't tried your specific requirement, and it is possible that if you
> are querying only A or only B, Drill would be able to push it down to the
> source.
>
> However, it gets tricky when you are querying 2 or more sources in the
> same query, because (from my limited knowledge of Calcite) the Calcite
> parser needs to be aware that it can push filters down to both sources.
> With GROUP BY, multiple groupings across a single source versus across
> multiple sources are not semantically the same.
>
> You could try creating views for each source and then doing a group by on
> the union of those views... that *might* get you the results you want.
>
> You can give it a shot, but I suspect it won't be as performant. Let us
> know if you find it otherwise.
>
> ~ Kunal
>
> On 4/10/2019 9:02:24 PM, Sarnath K wrote:
> Hi Kunal,
>
> Thank you for your response. But what I read in this URL says it can be
> done (though my own interpretation is muddled)
> https://drill.apache.org/docs/rdbms-storage-plugin/
>
> There is a statement in the documentation that says:
>
> As with any source, Drill supports joins within and between all systems.
> Drill additionally has powerful pushdown capabilities with RDBMS sources.
> This includes support to push down join, where, group by, intersect and
> other SQL operations into a particular RDBMS source (as appropriate).
>
>
> >> That said, even if the feature existed, by design, only one fragment can
> read from a JDBC storage plugin, as it uses a single connection to stream
> out the resultset.
>
> I did not understand this. Say, I GROUP BY a particular column and perform
> "max", "min" and "sum" aggregation. These are all associative group summary
> operations. So, I have send MAX Query to A and then MAX query to B. Get the
> results from both into Drill cluster and then perform a MAX on the
> partially reduced result. This will be cheaper than loading all data from A
> and B into Drill and then performing the GROUP BY operation.
>
> Can Drill do these smart group-by operations as on today? The documentation
> I read above is encouraging (its pretty recent - Dec 2018).
>
> Thanks for your time,
> Best,
> Sarnath
>
>
>
> On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
>
> > Hi Sarnath
> >
> > From what I understand by your description, you are looking to see if
> > Drill can push down the GROUP BY clause to the underlying JDBC sources A
> > and B.
> >
> > Unfortunately, Drill does not support pushdown for the JDBC storage
> plugin
> > as yet. That said, even if the feature existed, by design, only one
> > fragment can read from a JDBC storage plugin, as it uses a single
> > connection to stream out the resultset.
> >
> > ~ Kunal
> >
> > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
> > Hi,
> >
> > I have a requirement where I need to split data between a fast RDBMS
> system
> > (A) that will have HOT data and a slower cold storage (B)
> >
> > Both A and B provide JDBC drivers
> >
> > I am looking to see if Drill will help me in coming with a JDBC URL (C)
> > which will hide the fact that data is split between A and B. i.e. Can
> Drill
> > be used to implement Data Virtualization?
> >
> > As much as I can read about Drill, I can definitely create 2 tables in
> > Drill one pointing to A and another to B.
> > However when I do GROUP BY queries or FILTER queries -- Does Drill take
> > advantage of the existing JDBC systems by actually sending a part of the
> > GROUP BY to A and another to B and then reduce the result again? i.e.
> Some
> > kind of smart predicate push-down for Analytical queries?
> >
> > Hope I sound clear to you. Appreciate your response much.
> >
> > Thank you,
> >
> > Best,
> > Sarnath
> >
>

Re: Drill for Data Virtualization

Posted by Sarnath K <st...@gmail.com>.

Thank you Kunal.

>>>You could try creating views for each source and then doing a group by
on the union of those views... that *might* get you the results you want

When you mention views, do you mean to say each view will be a group by
statement for that particular source....And we try to union them and group
again...This way, explicitly making up the Query do the pushdown.... That's
the idea you are referring to. Right!??

Btw....Calcite (possibly) not recognizing the pushdown opportunity would be
a let down ... especially for flexible frameworks like Drill... In my
opinion...

Thanks for your time. Appreciate much. I will keep posted.

Best,
Satnath

On Thu, Apr 11, 2019, 23:25 Kunal Khatua <ku...@apache.org> wrote:

> Hi Sarnath
>
> I haven't tried your specific requirement, and it is possible that if you
> are querying only A or only B, Drill would be able to push it down to the
> source.
>
> However, it gets tricky when you are querying 2 or more sources in the
> same query, because (from my limited knowledge of Calcite) the Calcite
> parser needs to be aware that it can push filters down to both sources.
> With GROUP BY, multiple groupings across a single source versus across
> multiple sources are not semantically the same.
>
> You could try creating views for each source and then doing a group by on
> the union of those views... that *might* get you the results you want.
>
> You can give it a shot, but I suspect it won't be as performant. Let us
> know if you find it otherwise.
>
> ~ Kunal
>
> On 4/10/2019 9:02:24 PM, Sarnath K <st...@gmail.com> wrote:
> Hi Kunal,
>
> Thank you for your response. But what I read in this URL says it can be
> done (though my own interpretation is muddled)
> https://drill.apache.org/docs/rdbms-storage-plugin/
>
> There is a statement in the documentation that says:
>
> As with any source, Drill supports joins within and between all systems.
> Drill additionally has powerful pushdown capabilities with RDBMS sources.
> This includes support to push down join, where, group by, intersect and
> other SQL operations into a particular RDBMS source (as appropriate).
>
>
> >> That said, even if the feature existed, by design, only one fragment can
> read from a JDBC storage plugin, as it uses a single connection to stream
> out the resultset.
>
> I did not understand this. Say, I GROUP BY a particular column and perform
> "max", "min" and "sum" aggregation. These are all associative group summary
> operations. So, I have send MAX Query to A and then MAX query to B. Get the
> results from both into Drill cluster and then perform a MAX on the
> partially reduced result. This will be cheaper than loading all data from A
> and B into Drill and then performing the GROUP BY operation.
>
> Can Drill do these smart group-by operations as on today? The documentation
> I read above is encouraging (its pretty recent - Dec 2018).
>
> Thanks for your time,
> Best,
> Sarnath
>
>
>
> On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:
>
> > Hi Sarnath
> >
> > From what I understand by your description, you are looking to see if
> > Drill can push down the GROUP BY clause to the underlying JDBC sources A
> > and B.
> >
> > Unfortunately, Drill does not support pushdown for the JDBC storage
> plugin
> > as yet. That said, even if the feature existed, by design, only one
> > fragment can read from a JDBC storage plugin, as it uses a single
> > connection to stream out the resultset.
> >
> > ~ Kunal
> >
> > On 4/9/2019 8:59:49 AM, Sarnath K wrote:
> > Hi,
> >
> > I have a requirement where I need to split data between a fast RDBMS
> system
> > (A) that will have HOT data and a slower cold storage (B)
> >
> > Both A and B provide JDBC drivers
> >
> > I am looking to see if Drill will help me in coming with a JDBC URL (C)
> > which will hide the fact that data is split between A and B. i.e. Can
> Drill
> > be used to implement Data Virtualization?
> >
> > As much as I can read about Drill, I can definitely create 2 tables in
> > Drill one pointing to A and another to B.
> > However when I do GROUP BY queries or FILTER queries -- Does Drill take
> > advantage of the existing JDBC systems by actually sending a part of the
> > GROUP BY to A and another to B and then reduce the result again? i.e.
> Some
> > kind of smart predicate push-down for Analytical queries?
> >
> > Hope I sound clear to you. Appreciate your response much.
> >
> > Thank you,
> >
> > Best,
> > Sarnath
> >
>

Re: Drill for Data Virtualization

Posted by Kunal Khatua <ku...@apache.org>.

Hi Sarnath

I haven't tried your specific requirement, and it is possible that if you are querying only A or only B, Drill would be able to push it down to the source. 

However, it gets tricky when you are querying 2 or more sources in the same query, because (from my limited knowledge of Calcite) the Calcite parser needs to be aware that it can push filters down to both sources. With GROUP BY, multiple groupings across a single source versus across multiple sources are not semantically the same. 

You could try creating views for each source and then doing a group by on the union of those views... that *might* get you the results you want.

You can give it a shot, but I suspect it won't be as performant. Let us know if you find it otherwise.

~ Kunal

On 4/10/2019 9:02:24 PM, Sarnath K <st...@gmail.com> wrote:
Hi Kunal,

Thank you for your response. But what I read in this URL says it can be
done (though my own interpretation is muddled)
https://drill.apache.org/docs/rdbms-storage-plugin/

There is a statement in the documentation that says:

As with any source, Drill supports joins within and between all systems.
Drill additionally has powerful pushdown capabilities with RDBMS sources.
This includes support to push down join, where, group by, intersect and
other SQL operations into a particular RDBMS source (as appropriate).

>> That said, even if the feature existed, by design, only one fragment can
read from a JDBC storage plugin, as it uses a single connection to stream
out the resultset.

I did not understand this. Say, I GROUP BY a particular column and perform
"max", "min" and "sum" aggregation. These are all associative group summary
operations. So, I have send MAX Query to A and then MAX query to B. Get the
results from both into Drill cluster and then perform a MAX on the
partially reduced result. This will be cheaper than loading all data from A
and B into Drill and then performing the GROUP BY operation.

Can Drill do these smart group-by operations as on today? The documentation
I read above is encouraging (its pretty recent - Dec 2018).

Thanks for your time,
Best,
Sarnath

On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua wrote:

> Hi Sarnath
>
> From what I understand by your description, you are looking to see if
> Drill can push down the GROUP BY clause to the underlying JDBC sources A
> and B.
>
> Unfortunately, Drill does not support pushdown for the JDBC storage plugin
> as yet. That said, even if the feature existed, by design, only one
> fragment can read from a JDBC storage plugin, as it uses a single
> connection to stream out the resultset.
>
> ~ Kunal
>
> On 4/9/2019 8:59:49 AM, Sarnath K wrote:
> Hi,
>
> I have a requirement where I need to split data between a fast RDBMS system
> (A) that will have HOT data and a slower cold storage (B)
>
> Both A and B provide JDBC drivers
>
> I am looking to see if Drill will help me in coming with a JDBC URL (C)
> which will hide the fact that data is split between A and B. i.e. Can Drill
> be used to implement Data Virtualization?
>
> As much as I can read about Drill, I can definitely create 2 tables in
> Drill one pointing to A and another to B.
> However when I do GROUP BY queries or FILTER queries -- Does Drill take
> advantage of the existing JDBC systems by actually sending a part of the
> GROUP BY to A and another to B and then reduce the result again? i.e. Some
> kind of smart predicate push-down for Analytical queries?
>
> Hope I sound clear to you. Appreciate your response much.
>
> Thank you,
>
> Best,
> Sarnath
>

Re: Drill for Data Virtualization

Posted by Sarnath K <st...@gmail.com>.

Hi Kunal,

Thank you for your response. But what I read in this URL says it can be
done (though my own interpretation is muddled)
https://drill.apache.org/docs/rdbms-storage-plugin/

There is a statement in the documentation that says:

As with any source, Drill supports joins within and between all systems.
Drill additionally has powerful pushdown capabilities with RDBMS sources.
This includes support to push down join, where, group by, intersect and
other SQL operations into a particular RDBMS source (as appropriate).

>> That said, even if the feature existed, by design, only one fragment can
read from a JDBC storage plugin, as it uses a single connection to stream
out the resultset.

I did not understand this. Say, I GROUP BY a particular column and perform
"max", "min" and "sum" aggregation. These are all associative group summary
operations. So, I have send MAX Query to A and then MAX query to B. Get the
results from both into Drill cluster and then perform a MAX on the
partially reduced result. This will be cheaper than loading all data from A
and B into Drill and then performing the GROUP BY operation.

Can Drill do these smart group-by operations as on today? The documentation
I read above is encouraging (its pretty recent - Dec 2018).

Thanks for your time,
Best,
Sarnath

On Thu, Apr 11, 2019 at 1:54 AM Kunal Khatua <ku...@apache.org> wrote:

> Hi Sarnath
>
> From what I understand by your description, you are looking to see if
> Drill can push down the GROUP BY clause to the underlying JDBC sources A
> and B.
>
> Unfortunately, Drill does not support pushdown for the JDBC storage plugin
> as yet. That said, even if the feature existed, by design, only one
> fragment can read from a JDBC storage plugin, as it uses a single
> connection to stream out the resultset.
>
> ~ Kunal
>
> On 4/9/2019 8:59:49 AM, Sarnath K <st...@gmail.com> wrote:
> Hi,
>
> I have a requirement where I need to split data between a fast RDBMS system
> (A) that will have HOT data and a slower cold storage (B)
>
> Both A and B provide JDBC drivers
>
> I am looking to see if Drill will help me in coming with a JDBC URL (C)
> which will hide the fact that data is split between A and B. i.e. Can Drill
> be used to implement Data Virtualization?
>
> As much as I can read about Drill, I can definitely create 2 tables in
> Drill one pointing to A and another to B.
> However when I do GROUP BY queries or FILTER queries -- Does Drill take
> advantage of the existing JDBC systems by actually sending a part of the
> GROUP BY to A and another to B and then reduce the result again? i.e. Some
> kind of smart predicate push-down for Analytical queries?
>
> Hope I sound clear to you. Appreciate your response much.
>
> Thank you,
>
> Best,
> Sarnath
>

Re: Drill for Data Virtualization

Posted by Kunal Khatua <ku...@apache.org>.

Hi Sarnath

From what I understand by your description, you are looking to see if Drill can push down the GROUP BY clause to the underlying JDBC sources A and B.

Unfortunately, Drill does not support pushdown for the JDBC storage plugin as yet. That said, even if the feature existed, by design, only one fragment can read from a JDBC storage plugin, as it uses a single connection to stream out the resultset.

~ Kunal

On 4/9/2019 8:59:49 AM, Sarnath K <st...@gmail.com> wrote:
Hi,

I have a requirement where I need to split data between a fast RDBMS system
(A) that will have HOT data and a slower cold storage (B)

Both A and B provide JDBC drivers

I am looking to see if Drill will help me in coming with a JDBC URL (C)
which will hide the fact that data is split between A and B. i.e. Can Drill
be used to implement Data Virtualization?

As much as I can read about Drill, I can definitely create 2 tables in
Drill one pointing to A and another to B.
However when I do GROUP BY queries or FILTER queries -- Does Drill take
advantage of the existing JDBC systems by actually sending a part of the
GROUP BY to A and another to B and then reduce the result again? i.e. Some
kind of smart predicate push-down for Analytical queries?

Hope I sound clear to you. Appreciate your response much.

Thank you,

Best,
Sarnath