Posted to user@drill.apache.org by Jiang Wu <ji...@mulesoft.com.INVALID> on 2020/01/15 21:48:40 UTC

RDBMS Storage Plugin Configurations

Question on the RDBMS Storage Plugin: is it possible to set various options
for the database connection pool used for this storage plugin?  For
example, max number of connections, idle timeout, etc?

Thanks.

-- Jiang

Re: RDBMS Storage Plugin Configurations

Posted by Jiang Wu <ji...@mulesoft.com.INVALID>.
Thank you all for the replies!  Yes, it looks like DRILL-7467 is what I am
looking for.  The use case is to do an occasional join across two storage
plugins, one of them being the JDBC storage plugin.  We would then like
to release the JDBC connection afterward.

-- Jiang




Re: RDBMS Storage Plugin Configurations

Posted by Arina Yelchiyeva <ar...@gmail.com>.
In the scope of DRILL-7467, one of the enhancements will be the ability to control basic data source setup through the storage plugin config.
You will be able to specify basic data source parameters and their values: https://commons.apache.org/proper/commons-dbcp/configuration.html

The feature will be available in Drill 1.18, or in the master branch as soon as it is committed.
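
To illustrate the idea, here is a minimal Python sketch that assembles such a storage plugin config as JSON. The key name `sourceParameters`, the connection URL, the driver, and the chosen values are assumptions for illustration only, since the feature had not been committed at the time of writing. The property names `maxTotal`, `maxIdle`, and `minEvictableIdleTimeMillis` are taken from the Commons DBCP configuration page linked above.

```python
import json


def jdbc_plugin_config(url, driver, pool_params):
    """Build a JDBC storage plugin config as a JSON string.

    pool_params maps DBCP property names to values.
    """
    config = {
        "type": "jdbc",
        "driver": driver,
        "url": url,
        "enabled": True,
        # Assumption: hypothetical key under which the data source
        # parameters would be passed through to the connection pool.
        "sourceParameters": dict(pool_params),
    }
    return json.dumps(config, indent=2)


# Example values are placeholders, not recommendations.
print(jdbc_plugin_config(
    "jdbc:postgresql://db.example.com:5432/sales",
    "org.postgresql.Driver",
    {
        "maxTotal": 10,                       # cap on open connections
        "maxIdle": 2,                         # cap on idle connections kept
        "minEvictableIdleTimeMillis": 60000,  # idle time before eviction
    },
))
```

A short idle eviction time is the kind of setting that would let the connection be released soon after an occasional cross-plugin join finishes.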

Kind regards,
Arina



Re: RDBMS Storage Plugin Configurations

Posted by Charles Givre <cg...@gmail.com>.
Hi Jiang, 
Welcome to Drill!
Just as an FYI, there are several improvements underway for the JDBC plugin:
https://issues.apache.org/jira/browse/DRILL-7467
https://issues.apache.org/jira/projects/DRILL/issues/DRILL-7490

With respect to the non-relational model, I'd echo Ted's question and ask what you are looking for specifically.  There is work underway to get Drill to natively support additional non-relational source systems, as well as the ability to natively query REST endpoints.

Best,
-- C




Re: RDBMS Storage Plugin Configurations

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Jiang,

Welcome to the Drill mailing list.

I think you may be making some assumptions about how Drill works, perhaps based on how other DB-driven applications work.

Drill is not primarily a front-end for an RDBMS. Instead, it is primarily designed to scan distributed data as fast as possible to extract records of interest. Drill does support JDBC data sources, but this is not the main use case.

In Drill, each query is stand-alone: Drill opens connections as needed to whatever data source you use, reads data, and releases all resources. Since Drill is distributed, this happens on each node. Since Drill is multi-threaded, this work also happens for each "minor fragment" (thread of execution) on each node. Drill is also multi-user; each user might have their own DB security restrictions.

This makes sense: if we want to read at maximum speed across 10 minor fragments (say) then all 10 need their own DB connections and all will try to keep those connections 100% busy.

As a result, Drill has no DB connection pool: not within a query and not across queries. So, there is no idle timeout. The maximum number of connections is set by the maximum "slice width" (number of fragments per node) and the total number of nodes. Slice width is, by default, 70% of your CPU count. So, if you have 10 nodes with 8 cores each, you will have roughly 60 open DB connections for the duration of the query. (This assumes that the DB storage plugin knows how to shard queries across all those minor fragments. I'm not sure that the JDBC storage plugin knows how to do this. Can anyone clarify this point?)
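
As a rough check of that estimate, the following Python sketch multiplies out nodes, cores, and the 70% default. The function name and parameters are illustrative and not part of Drill. It ignores any per-node rounding Drill may apply and assumes the plugin really does spread the scan across every minor fragment, which is the open question above.

```python
def estimate_max_connections(nodes, cores_per_node, width_percent=70):
    """Upper-bound estimate of DB connections open during one query.

    Assumes one connection per minor fragment and a slice width of
    width_percent of the cores on each node.
    """
    return nodes * cores_per_node * width_percent / 100


# 10 nodes with 8 cores each at the default 70% slice width.
print(estimate_max_connections(10, 8))  # 56.0
```

So the "roughly 60" figure corresponds to 56 connections before any rounding.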

It sounds like you have a particular use-case in mind that might benefit from connection caching. Can you share that use case to help us understand? And, of course, Drill is open source; if you find you need this ability, it can certainly be added.

Drillers: please offer corrections if I've overlooked something; I'm not super familiar with the details of the JDBC data source.

Thanks,
- Paul

 
