Posted to dev@drill.apache.org by Magnus Pierre <mp...@maprtech.com> on 2015/09/16 18:43:23 UTC

Enhancement suggestions for the Drill JDBC Plugin

Hello,

I am not a developer and don’t aim to become one, even though I cheat by writing code now and then. I do, however, have lots of ideas for optimizing the recently completed JDBC plugin (DRILL-3180).
Since a JDBC query currently runs on a single drillbit, it would be good to use the fact that most databases store data in partitions and provide ways of accessing individual partitions, so that the query could run over multiple drillbits.
For instance, Oracle has named partitions that can be queried individually:
SELECT * FROM employees PARTITION (p1);
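
To make the idea concrete, here is a minimal Python sketch of the splitting step only. It is not Drill plugin code, and the function name `partition_queries` and the partition names `p2` and `p3` are made up for illustration. It assumes the partition names have already been looked up from the database's metadata (in Oracle, for example, the ALL_TAB_PARTITIONS dictionary view lists them).

```python
def partition_queries(table, partitions):
    """Build one SELECT per named partition, using Oracle's PARTITION clause.

    `partitions` is assumed to come from a metadata lookup done elsewhere.
    """
    return [f"SELECT * FROM {table} PARTITION ({name})" for name in partitions]


# Hypothetical table with three partitions; each query could go to its own drillbit.
for query in partition_queries("employees", ["p1", "p2", "p3"]):
    print(query)
```

Each generated statement reads exactly one partition, so together they cover the table without overlap.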

DB2 has partition elimination: if you query within the partition ranges, only one or a few partitions provide the data. By probing metadata, you could divide a select query across the partitions and so extract the data in parallel.
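
A rough Python sketch of that division, under stated assumptions: the table is range-partitioned on a single numeric column, and the partition boundaries have already been read from the catalog. The names `range_queries`, `sales` and `order_id` are hypothetical, and the metadata lookup itself is not shown.

```python
def range_queries(table, column, boundaries):
    """Turn sorted partition boundaries into one range query per partition.

    Ranges are half-open (lo <= value < hi) so no row is read twice.
    Rows outside the boundaries, or with NULL in `column`, are not covered.
    """
    if len(boundaries) < 2:
        raise ValueError("need at least two boundaries to form a range")
    return [
        f"SELECT * FROM {table} WHERE {column} >= {lo} AND {column} < {hi}"
        for lo, hi in zip(boundaries, boundaries[1:])
    ]


# Hypothetical boundaries; in practice they would come from the database metadata.
for query in range_queries("sales", "order_id", [0, 1000, 2000, 3000]):
    print(query)
```

Because each predicate lines up with one partition range, the database can eliminate all other partitions for that query.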

Teradata does not have partitions in the same sense, since everything is hash distributed, but there you could optimize the query execution in other ways.
Teradata supports something called multi-statement requests: one request consisting of many queries separated by ;. When it executes, Teradata combines all the queries into shared steps, i.e. one query plan, which makes the complete execution more efficient and less costly (basically eliminating lots of spool usage and table access). Each query's result is then returned as an individual result set, and the result sets could therefore be read in parallel.

Example code:
https://developer.teradata.com/doc/connectivity/jdbc/reference/current/samp/T20701JD.java.txt

The key point here is that the queries need to be issued as a multi-statement request; otherwise the optimizations will not take place.
With some simple knowledge of the source table, you could turn a simple query into a set of range queries and run them as one multi-statement request, getting spool elimination as well as parallel reads.
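
As a small illustration, this Python sketch packs a list of already-built queries into a single request string separated by semicolons. The function name `multi_statement_request` and the example table are hypothetical, and submitting the request and reading back the individual result sets (as the linked Teradata sample does over JDBC) is left out.

```python
def multi_statement_request(queries):
    """Join several queries into one semicolon-separated request string."""
    # Drop stray whitespace and trailing semicolons so separators are not doubled.
    cleaned = [q.strip().rstrip(";").strip() for q in queries]
    cleaned = [q for q in cleaned if q]
    if not cleaned:
        raise ValueError("need at least one query")
    return "; ".join(cleaned)


# Two hypothetical range queries over the same table, sent as one request.
request = multi_statement_request([
    "SELECT * FROM sales WHERE order_id >= 0 AND order_id < 1000",
    "SELECT * FROM sales WHERE order_id >= 1000 AND order_id < 2000",
])
print(request)
```

The whole string has to be sent as one request; sending the queries one by one would lose the shared-step optimization described above.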

These are just a few options. Is anyone interested in picking this up? I don’t think there is one strategy that fits all databases, but these would be very good enhancements for the databases that do support partitions or functionality like multi-statement requests.

Regards,
Magnus