You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@drill.apache.org by Kunal Khatua <ku...@apache.org> on 2019/04/24 04:52:51 UTC

RE: Strange join table criteria in apache drill

(Adding this to the user mailing list for future reference and see if anyone else has alternative solutions)

Hi Charles

What you pointed out is the same issue as the mail I shared regarding the GROUP BY clause. There seems to be a limitation in the Calcite planner w.r.t. the JDBC storage plugin. I'm actually wondering how come you do not have the need for table aliases when defining the predicate. Perhaps that *might* resolve the issue of the filter being pushed down to the source.

It seems that the GROUP BY (or in your case, the WHERE) clause is being pushed down to the source only for the first source.

This isn't a bug in the sense that you are getting wrong results, but is definitely a performance bug. As suggested in that thread, one option would be to identify which source will be smaller to stream in the absence of the filter, and make that the second source in the union query.

An alternate would be to make this a two-step (three actually) query, where you create a temporary table with one of the tables (say, Header) and then run the actual query modified to do a union of this temporary table and the other JDBC source - (model2train). Of course, the third step would be to have this temp table dropped for the next run. You'll need a bit of experimentation, but my hunch is that you want to use the faster of the 2 subqueries to create the temporary table. 

Hope this helps. 

~ Kunal

On 4/23/2019 7:13:57 PM, Charles C X CHEN <ch...@hsbc.com> wrote:
Hi Kunal,
 
For your example, it is using group by. We are using union all to join the two table. However, we find the data cannot be filtered in the data source side. Please look into detail in the mail content below. Thx
 
Below is our PoC steps.
 
1)       Catalog the bigquery to apache drill
2)       Catch the oracle db to the apache drill
3)       Run a join table statement and monitor the log
 
We issue the sql select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP'. Both the two database has a selection criteria (feature = ‘GBP’). However, just the first data source can put the selection criteria “FEATURE" = 'GBP' into the sql statement and drill will retrieve all the table of the table from the second data source. This will impact the performance much as it will make more data to be transferred from remote data source to our drill node. Please help to advise the cause. thx
 
SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP'
 
SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train`
 
2019-04-23 11:00:40,284 [23411dbf-7fab-c825-e2c2-6bffc88c00ef:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - HEP_BOTTOM_UP:Physical Partition Prune Planning (1ms):
ScreenPrel: rowcount = 30.0, cumulative cost = {308.0 rows, 938.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 862
  ProjectPrel(FEATURE=[$0], TITLE=[$1], LABEL=[$2]): rowcount = 30.0, cumulative cost = {305.0 rows, 935.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 860
    UnionExchangePrel: rowcount = 30.0, cumulative cost = {275.0 rows, 845.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 858
      UnionAllPrel(all=[true]): rowcount = 30.0, cumulative cost = {245.0 rows, 605.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 856
        JdbcPrel(sql=[SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP' ]): rowcount = 15.0, cumulative cost = {15.0 rows, 15.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 825
        FilterPrel(condition=[=($0, 'GBP')]): rowcount = 15.0, cumulative cost = {200.0 rows, 560.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 854
          JdbcPrel(sql=[SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` ]): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 828
 
2
 9 rows selected (88.054 seconds)
0: jdbc:drill:zk=local> select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP';
+----------+----------+--------+
| FEATURE  |  TITLE   | LABEL  |
+----------+----------+--------+
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
+----------+----------+--------+
9 rows selected (10.043 seconds)
 
 
Regards,


Charles C X Chen
 
Senior Consultant Specialist, Development
Payments and Global Liquidity and Cash Management (GLCM) | HSBC Operations, Services and Technology (HOST)
L22, Office Tower 2, Taikoo Hui, 381 Tianhe Road, Tianhe District, Guangzhou, China
___________________________________________________________________________________


Phone.   8620 38589062 Tie-line 718630-89062
Email.   charlescxchen@hsbc.com.cn [mailto:charlescxchen@hsbc.com.cn] 
___________________________________________________________________________________
 
From: kunal@apache.org [mailto:kunal@apache.org]
Sent: Wednesday, April 24, 2019 1:41 AM
To: Charles C X CHEN <ch...@hsbc.com>
Subject: RE: Strange join table criteria in apache drill
 
Hi Charles
That is a current Calcite limitation that would need enhancement.
Here is a similar issue:
https://lists.apache.org/thread.html/f21839ec2c95f7e93980a9d2fafa89adb13998b2303517138f06b4f8@%3Cuser.drill.apache.org%3E [https://lists.apache.org/thread.html/f21839ec2c95f7e93980a9d2fafa89adb13998b2303517138f06b4f8@%3Cuser.drill.apache.org%3E]
In the future, please send an unencrypted mail to the mailing list because questions like this help other users discover similar issues and solutions. It is very unlikely that someone would open these mails from unknown senders even if the domains look genuine.
~ Kunal
 
From: charlescxchen@hsbc.com [mailto:charlescxchen@hsbc.com]
Sent: Tue Apr 23, 2019 3:29 AM
To: user-subscribe@drill.apache.org [mailto:user-subscribe@drill.apache.org], kunal@apache.org [mailto:kunal@apache.org]
Cc: johnnykpchan@hsbc.com.hk [mailto:johnnykpchan@hsbc.com.hk], barnabaschiu@hsbc.com.hk [mailto:barnabaschiu@hsbc.com.hk], conniehchan@hsbc.com.hk [mailto:conniehchan@hsbc.com.hk]
Subject: Strange join table criteria in apache drill
Hi support / kunal,
 
We are exploring if can use apache drill as the data virtualization tool. However, we find there is some issue in handling the join table sql. From the processing log, just the first table can put the filtering criteria into the sql statement to the database. Kindly please help to check if anything wrong in our configuration and how to improve it. Thx
 
Below is our PoC steps.
 
1)       Catalog the bigquery to apache drill
2)       Catch the oracle db to the apache drill
3)       Run a join table statement and monitor the log
 
We issue the sql select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP'. Both the two database has a selection criteria (feature = ‘GBP’). However, just the first data source can put the selection criteria “FEATURE" = 'GBP' into the sql statement and drill will retrieve all the table of the table from the second data source. This will impact the performance much as it will make more data to be transferred from remote data source to our drill node. Please help to advise the cause. thx
 
SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP'
 
SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train`
 
2019-04-23 11:00:40,284 [23411dbf-7fab-c825-e2c2-6bffc88c00ef:foreman] DEBUG o.a.d.e.p.s.h.DefaultSqlHandler - HEP_BOTTOM_UP:Physical Partition Prune Planning (1ms):
ScreenPrel: rowcount = 30.0, cumulative cost = {308.0 rows, 938.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 862
  ProjectPrel(FEATURE=[$0], TITLE=[$1], LABEL=[$2]): rowcount = 30.0, cumulative cost = {305.0 rows, 935.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 860
    UnionExchangePrel: rowcount = 30.0, cumulative cost = {275.0 rows, 845.0 cpu, 0.0 io, 368640.0 network, 0.0 memory}, id = 858
      UnionAllPrel(all=[true]): rowcount = 30.0, cumulative cost = {245.0 rows, 605.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 856
        JdbcPrel(sql=[SELECT * FROM "ETX01DA1"."ASP_PMT_HEADER" WHERE "FEATURE" = 'GBP' ]): rowcount = 15.0, cumulative cost = {15.0 rows, 15.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 825
        FilterPrel(condition=[=($0, 'GBP')]): rowcount = 15.0, cumulative cost = {200.0 rows, 560.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 854
          JdbcPrel(sql=[SELECT * FROM `hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` ]): rowcount = 100.0, cumulative cost = {100.0 rows, 100.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 828
 
2
 9 rows selected (88.054 seconds)
0: jdbc:drill:zk=local> select * from oracle_sit.etx01da1.ASP_PMT_HEADER where feature ='GBP' union all select * from bigquery.`hsbc-1220317-pcminsight-dev`.`asprecon`.`model2train` where feature ='GBP';
+----------+----------+--------+
| FEATURE  |  TITLE   | LABEL  |
+----------+----------+--------+
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
| GBP      | GBP#GBP  | 3.0    |
+----------+----------+--------+
9 rows selected (10.043 seconds)
 
Regards,
Charles C X Chen
 
Senior Consultant Specialist, Development
Payments and Global Liquidity and Cash Management (GLCM) | HSBC Operations, Services and Technology (HOST)
L22, Office Tower 2, Taikoo Hui, 381 Tianhe Road, Tianhe District, Guangzhou, China
___________________________________________________________________________________
Phone.   8620 38589062 Tie-line 718630-89062
Email.   charlescxchen@hsbc.com.cn [mailto:charlescxchen@hsbc.com.cn] 
___________________________________________________________________________________
 
 
 
*******************************************************************
This e-mail is confidential. It may also be legally privileged.
If you are not the addressee you may not copy, forward, disclose
or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return e-mail.

Internet communications cannot be guaranteed to be timely,
secure, error or virus-free. The sender does not accept liability
for any errors or omissions.
*******************************************************************
"SAVE PAPER - THINK BEFORE YOU PRINT!"
******************************************************************
This E-mail is confidential. It may also be legally privileged. If
you are not the addressee you may not copy, forward, disclose or
use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return E-mail.

Internet communications cannot be guaranteed to be timely, secure,
error or virus-free. The sender does not accept liability for any
errors or omissions.
******************************************************************
SAVE PAPER - THINK BEFORE YOU PRINT!
*******************************************************************
This e-mail is confidential. It may also be legally privileged.
If you are not the addressee you may not copy, forward, disclose
or use any part of it. If you have received this message in error,
please delete it and all copies from your system and notify the
sender immediately by return e-mail.

Internet communications cannot be guaranteed to be timely,
secure, error or virus-free. The sender does not accept liability
for any errors or omissions.
*******************************************************************
"SAVE PAPER - THINK BEFORE YOU PRINT!"