You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Andrew Pashkin <an...@gmx.co.uk> on 2019/12/18 14:26:39 UTC

How to force Drill to offload performing LIMIT onto the target storage?

Hello!

I have a setup where I'm querying Impala server through JDBC connection
using Drill and I noticed that when I'm doing a simple GROUP BY query
with LIMIT with a small value the query runs for a very long time. And
in the query profile I see that hundreds of thousands of rows are being
fetched from the Impala server.

The plan looks like this:

00-00    Screen : rowType = RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67521
00-01      Project(payment_date=[$0], total_sum=[$1]) : rowType = RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67520
00-02        SelectionVectorRemover : rowType = RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {320.0 rows, 2650.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67519
00-03          Limit(fetch=[10]) : rowType = RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67518
00-04            HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType = RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67517
00-05              Project(payment_date=[TO_DATE($0)], payed_in_usd_amt=[$11]) : rowType = RecordType(DATE date_field, REAL value_field): rowcount = 100.0, cumulative cost = {200.0 rows, 600.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 67516
00-06                Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my table>` ]) : rowType = RecordType(<my fields>): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 67435

It seems like Drill issues SELECT * FROM <Impala> query to perform
group-by and limit on the host which seems terribly inefficient and
could be performed by Impala itself with much less effort.

I wonder - is it possible to tune Drill somehow to perform
limit/group-by using the storage capabilities (Impala in my case)? Or if
such optimization is something do be developed - is it absent only for
JDBC connections or for all storage types?

--
With kind regards, Andrew Pashkin.
cell phone - +375 (44) 492-16-85
Skype - waves_in_fluids
e-mail - andrew.pashkin@gmx.co.uk


Re: How to force Drill to offload performing LIMIT onto the target storage?

Posted by Vova Vysotskyi <vv...@gmail.com>.
From the code, it looks like it happens to all functions that are present
in Drill even if such functions are part of the SQL standard (it was done
since function behavior may not be strictly described by standard and
therefore will be implementation-dependent).
But functions that absent in Drill and are present in SqlStdOperatorTable
<https://github.com/apache/calcite/blob/52a57078ba081b24b9d086ed363c715485d1a519/core/src/main/java/org/apache/calcite/sql/fun/SqlStdOperatorTable.java>
(if such exist) will be pushed to JDBC.

Kind regards,
Volodymyr Vysotskyi


On Wed, Dec 18, 2019 at 8:09 PM Andrew Pashkin <an...@gmx.co.uk>
wrote:

> You are right! Group by is being pushed to the storage if I remove
> TO_DATE(). Does it happen with all functions or only with some (perhaps
> the ones that are not in the SQL standard)?
>
> On 18.12.19 20:08, Vova Vysotskyi wrote:
> > Hi Andrew,
> >
> > Drill in general case pushes aggregations into the JDBC storage, but
> Drill
> > doesn't push functions into JDBC storage since some databases may not
> have
> > specific function implementations.
> >
> > In the plan, you have provided, the grouping is produced on top of
> > TO_DATE() function call. If it is possible, please rewrite the query to
> do
> > aggregations, and then call functions. In this case, Drill should be able
> > to push aggregation into JDBC storage.
> >
> > Regarding pushing limit into the JDBC storage, it is a bug which should
> be
> > fixed, I have created a Jira ticket for this: DRILL-7490
> > <https://issues.apache.org/jira/browse/DRILL-7490>.
> >
> > Kind regards,
> > Volodymyr Vysotskyi
> >
> >
> > On Wed, Dec 18, 2019 at 4:26 PM Andrew Pashkin <andrew.pashkin@gmx.co.uk
> >
> > wrote:
> >
> >> Hello!
> >>
> >> I have a setup where I'm querying Impala server through JDBC connection
> >> using Drill and I noticed that when I'm doing a simple GROUP BY query
> >> with LIMIT with a small value the query runs for a very long time. And
> >> in the query profile I see that hundreds of thousands of rows are being
> >> fetched from the Impala server.
> >>
> >> The plan looks like this:
> >>
> >> 00-00    Screen : rowType = RecordType(DATE payment_date, ANY
> total_sum):
> >> rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0
> >> network, 1760.0000000000002 memory}, id = 67521
> >> 00-01      Project(payment_date=[$0], total_sum=[$1]) : rowType =
> >> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0,
> cumulative
> >> cost = {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> >> memory}, id = 67520
> >> 00-02        SelectionVectorRemover : rowType = RecordType(DATE
> >> payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {320.0
> >> rows, 2650.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id =
> >> 67519
> >> 00-03          Limit(fetch=[10]) : rowType = RecordType(DATE
> payment_date,
> >> ANY total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0
> cpu,
> >> 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67518
> >> 00-04            HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType =
> >> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0,
> cumulative
> >> cost = {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> >> memory}, id = 67517
> >> 00-05              Project(payment_date=[TO_DATE($0)],
> >> payed_in_usd_amt=[$11]) : rowType = RecordType(DATE date_field, REAL
> >> value_field): rowcount = 100.0, cumulative cost = {200.0 rows, 600.0
> cpu,
> >> 0.0 io, 0.0 network, 0.0 memory}, id = 67516
> >> 00-06                Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my
> >> table>` ]) : rowType = RecordType(<my fields>): rowcount = 100.0,
> >> cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0
> memory},
> >> id = 67435
> >>
> >> It seems like Drill issues SELECT * FROM <Impala> query to perform
> >> group-by and limit on the host which seems terribly inefficient and
> >> could be performed by Impala itself with much less effort.
> >>
> >> I wonder - is it possible to tune Drill somehow to perform
> >> limit/group-by using the storage capabilities (Impala in my case)? Or if
> >> such optimization is something do be developed - is it absent only for
> >> JDBC connections or for all storage types?
> >>
> >> --
> >> With kind regards, Andrew Pashkin.
> >> cell phone - +375 (44) 492-16-85
> >> Skype - waves_in_fluids
> >> e-mail - andrew.pashkin@gmx.co.uk
> >>
> >>
>

Re: How to force Drill to offload performing LIMIT onto the target storage?

Posted by Vova Vysotskyi <vv...@gmail.com>.
Hi Andrew,

Drill in general case pushes aggregations into the JDBC storage, but Drill
doesn't push functions into JDBC storage since some databases may not have
specific function implementations.

In the plan, you have provided, the grouping is produced on top of
TO_DATE() function call. If it is possible, please rewrite the query to do
aggregations, and then call functions. In this case, Drill should be able
to push aggregation into JDBC storage.

Regarding pushing limit into the JDBC storage, it is a bug which should be
fixed, I have created a Jira ticket for this: DRILL-7490
<https://issues.apache.org/jira/browse/DRILL-7490>.

Kind regards,
Volodymyr Vysotskyi


On Wed, Dec 18, 2019 at 4:26 PM Andrew Pashkin <an...@gmx.co.uk>
wrote:

> Hello!
>
> I have a setup where I'm querying Impala server through JDBC connection
> using Drill and I noticed that when I'm doing a simple GROUP BY query
> with LIMIT with a small value the query runs for a very long time. And
> in the query profile I see that hundreds of thousands of rows are being
> fetched from the Impala server.
>
> The plan looks like this:
>
> 00-00    Screen : rowType = RecordType(DATE payment_date, ANY total_sum):
> rowcount = 10.0, cumulative cost = {331.0 rows, 2671.0 cpu, 0.0 io, 0.0
> network, 1760.0000000000002 memory}, id = 67521
> 00-01      Project(payment_date=[$0], total_sum=[$1]) : rowType =
> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative
> cost = {330.0 rows, 2670.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> memory}, id = 67520
> 00-02        SelectionVectorRemover : rowType = RecordType(DATE
> payment_date, ANY total_sum): rowcount = 10.0, cumulative cost = {320.0
> rows, 2650.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002 memory}, id =
> 67519
> 00-03          Limit(fetch=[10]) : rowType = RecordType(DATE payment_date,
> ANY total_sum): rowcount = 10.0, cumulative cost = {310.0 rows, 2640.0 cpu,
> 0.0 io, 0.0 network, 1760.0000000000002 memory}, id = 67518
> 00-04            HashAgg(group=[{0}], total_sum=[SUM($1)]) : rowType =
> RecordType(DATE payment_date, ANY total_sum): rowcount = 10.0, cumulative
> cost = {300.0 rows, 2600.0 cpu, 0.0 io, 0.0 network, 1760.0000000000002
> memory}, id = 67517
> 00-05              Project(payment_date=[TO_DATE($0)],
> payed_in_usd_amt=[$11]) : rowType = RecordType(DATE date_field, REAL
> value_field): rowcount = 100.0, cumulative cost = {200.0 rows, 600.0 cpu,
> 0.0 io, 0.0 network, 0.0 memory}, id = 67516
> 00-06                Jdbc(sql=[SELECT * FROM `Impala`.`<my schema>`.`<my
> table>` ]) : rowType = RecordType(<my fields>): rowcount = 100.0,
> cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory},
> id = 67435
>
> It seems like Drill issues SELECT * FROM <Impala> query to perform
> group-by and limit on the host which seems terribly inefficient and
> could be performed by Impala itself with much less effort.
>
> I wonder - is it possible to tune Drill somehow to perform
> limit/group-by using the storage capabilities (Impala in my case)? Or if
> such optimization is something do be developed - is it absent only for
> JDBC connections or for all storage types?
>
> --
> With kind regards, Andrew Pashkin.
> cell phone - +375 (44) 492-16-85
> Skype - waves_in_fluids
> e-mail - andrew.pashkin@gmx.co.uk
>
>