You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@drill.apache.org by Idea Everywhere <ma...@gmail.com> on 2018/12/24 15:45:46 UTC

Query on tables

Hi Team,

My current situation:
I have apache drill installed in AWS EC2 (M4.4x large) instances cluster of 3 nodes. My source data is coming from S3 bucket.
I want to engage drill to read that data from S3, create tables within itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2 instances created as mentioned above and allow the user to read the data from those tables.
Tables and Partitioned tables are created as of now.

Questions:
1. It is observed that, when the tables are created, it reads the data from source and the table is created along with that data (ie., if the original source is 10GB, the tables stored in the file system are comparable to that size). However, I have a question, if the source is growing, how it gets into the CTAS tables or CTAS Partition by tables, so that queries will result latest output

Kind Regards
Kiran

Re: Query on tables

Posted by Kiran Kumar NS <ma...@gmail.com>.

Hi Team,

Is there any update on below questions, please 

Kind Regards 
Kiran

Sent from my iPhone

> On Dec 26, 2018, at 10:41 AM, Kiran Kumar NS <ma...@gmail.com> wrote:
> 
> Hi Vitalii,
> 
> Thanks for getting back.
> Yes, I did try it. But, I got the table is already present. It is because I don’t want to create another table with new table name for that changed data .
> 
> Also, if it works in any which way, every time the source changes, is there any automatic way to get the drill tables also updated?
> 
> Also, I have another set of queries.
> 
> If I set max proc mem, max direct mem it is not reflected in cluster. I see it in UI. But if I change java heap mem, it reflects. Wanted to understand when we set these settings, how still distributes this setting to cluster. I did observe whatever the settings I do in options link of UI, it reflects in the whole cluster. But, if I do it in drill-env.sh, it does not. I have to go and change in all the nodes. Also, it is mentioned that distrib-env.sh should not be used by users. Essentially, I want to understand, how the configurations are propagated to all nodes in cluster and how can I reap the benefit of additional compute of the cluster so that my queries are executed fast. Currently one CTAS command and partition by commands are taking hours to process GBs of data.
> 
> I can provide statistics of elapsed time of query execution if required, for your analysis.
> 
> Kind Regards 
> Kiran
> 
> Sent from my iPhone
> 
>> On Dec 25, 2018, at 2:16 PM, Vitalii Diravka <vi...@apache.org> wrote:
>> 
>> Hello!
>> 
>> Can you filter the newly coming data and run CTAS only for those data? It will allow to avoid extra work.
>> 
>> Kind regards 
>> Vitalii 
>> 
>> 
>>> On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <ma...@gmail.com> wrote:
>>> Hi Team,
>>> 
>>> My current situation:
>>> I have apache drill installed in AWS EC2 (M4.4x large) instances cluster of 3 nodes. My source data is coming from S3 bucket.
>>> I want to engage drill to read that data from S3, create tables within itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2 instances created as mentioned above and allow the user to read the data from those tables.
>>> Tables and Partitioned tables are created as of now.
>>> 
>>> Questions:
>>> 1. It is observed that, when the tables are created, it reads the data from source and the table is created along with that data (ie., if the original source is 10GB, the tables stored in the file system are comparable to that size). However, I have a question, if the source is growing, how it gets into the CTAS tables or CTAS Partition by tables, so that queries will result latest output
>>> 
>>> Kind Regards
>>> Kiran
>>>

Re: Query on tables

Posted by Vitalii Diravka <vi...@apache.org>.

Hi Kiran,

You ask something similar to INSERT statements, but Drill doesn't support
this functionality now [1].
Also CREATE OR REPLACE TABLE could be helpful for you (with additional
work), but most databases don't support it [2].
You can write scripts with creating tables with different names in the same
directory (and then possibly manage the newly created files).
Also consider temporary tables and views for your case.

Did you restart drillbits on your cluster? Max Directory memory should be
easily changed for you on the cluster.
Here is an info about it [3] and here is the last improvements in this area
[4].
There are also a lot of system/session options, which can help you to
accelerate your CTAS [5]. It mainly depends on the type and structure of
your data and also operators, which you use in your SQL query.
You can start from investigating the individual SELECT queries and profiles
for them.

[1] https://issues.apache.org/jira/browse/DRILL-3534
[2] https://issues.apache.org/jira/browse/DRILL-3979
[3] https://drill.apache.org/docs/configuring-drill-memory/
[4] https://issues.apache.org/jira/browse/DRILL-5741
[5] https://drill.apache.org/docs/configuration-options-introduction/

Kind regards
Vitalii


On Wed, Dec 26, 2018 at 5:41 PM Kiran Kumar NS <ma...@gmail.com>
wrote:

> Hi Vitalii,
>
> Thanks for getting back.
> Yes, I did try it. But, I got the table is already present. It is because
> I don’t want to create another table with new table name for that changed
> data .
>
> Also, if it works in any which way, every time the source changes, is
> there any automatic way to get the drill tables also updated?
>
> Also, I have another set of queries.
>
> If I set max proc mem, max direct mem it is not reflected in cluster. I
> see it in UI. But if I change java heap mem, it reflects. Wanted to
> understand when we set these settings, how still distributes this setting
> to cluster. I did observe whatever the settings I do in options link of UI,
> it reflects in the whole cluster. But, if I do it in drill-env.sh, it does
> not. I have to go and change in all the nodes. Also, it is mentioned that
> distrib-env.sh should not be used by users. Essentially, I want to
> understand, how the configurations are propagated to all nodes in cluster
> and how can I reap the benefit of additional compute of the cluster so that
> my queries are executed fast. Currently one CTAS command and partition by
> commands are taking hours to process GBs of data.
>
> I can provide statistics of elapsed time of query execution if required,
> for your analysis.
>
> Kind Regards
> Kiran
>
> Sent from my iPhone
>
> > On Dec 25, 2018, at 2:16 PM, Vitalii Diravka <vi...@apache.org> wrote:
> >
> > Hello!
> >
> > Can you filter the newly coming data and run CTAS only for those data?
> It will allow to avoid extra work.
> >
> > Kind regards
> > Vitalii
> >
> >
> >> On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <
> mailtonskiran@gmail.com> wrote:
> >> Hi Team,
> >>
> >> My current situation:
> >> I have apache drill installed in AWS EC2 (M4.4x large) instances
> cluster of 3 nodes. My source data is coming from S3 bucket.
> >> I want to engage drill to read that data from S3, create tables within
> itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2
> instances created as mentioned above and allow the user to read the data
> from those tables.
> >> Tables and Partitioned tables are created as of now.
> >>
> >> Questions:
> >> 1. It is observed that, when the tables are created, it reads the data
> from source and the table is created along with that data (ie., if the
> original source is 10GB, the tables stored in the file system are
> comparable to that size). However, I have a question, if the source is
> growing, how it gets into the CTAS tables or CTAS Partition by tables, so
> that queries will result latest output
> >>
> >> Kind Regards
> >> Kiran
> >>
>

Re: Query on tables

Posted by Kiran Kumar NS <ma...@gmail.com>.

Hi Team,

Is there any update on below questions, please 

Kind Regards 
Kiran

Sent from my iPhone

> On Dec 26, 2018, at 10:41 AM, Kiran Kumar NS <ma...@gmail.com> wrote:
> 
> Hi Vitalii,
> 
> Thanks for getting back.
> Yes, I did try it. But, I got the table is already present. It is because I don’t want to create another table with new table name for that changed data .
> 
> Also, if it works in any which way, every time the source changes, is there any automatic way to get the drill tables also updated?
> 
> Also, I have another set of queries.
> 
> If I set max proc mem, max direct mem it is not reflected in cluster. I see it in UI. But if I change java heap mem, it reflects. Wanted to understand when we set these settings, how still distributes this setting to cluster. I did observe whatever the settings I do in options link of UI, it reflects in the whole cluster. But, if I do it in drill-env.sh, it does not. I have to go and change in all the nodes. Also, it is mentioned that distrib-env.sh should not be used by users. Essentially, I want to understand, how the configurations are propagated to all nodes in cluster and how can I reap the benefit of additional compute of the cluster so that my queries are executed fast. Currently one CTAS command and partition by commands are taking hours to process GBs of data.
> 
> I can provide statistics of elapsed time of query execution if required, for your analysis.
> 
> Kind Regards 
> Kiran
> 
> Sent from my iPhone
> 
>> On Dec 25, 2018, at 2:16 PM, Vitalii Diravka <vi...@apache.org> wrote:
>> 
>> Hello!
>> 
>> Can you filter the newly coming data and run CTAS only for those data? It will allow to avoid extra work.
>> 
>> Kind regards 
>> Vitalii 
>> 
>> 
>>> On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <ma...@gmail.com> wrote:
>>> Hi Team,
>>> 
>>> My current situation:
>>> I have apache drill installed in AWS EC2 (M4.4x large) instances cluster of 3 nodes. My source data is coming from S3 bucket.
>>> I want to engage drill to read that data from S3, create tables within itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2 instances created as mentioned above and allow the user to read the data from those tables.
>>> Tables and Partitioned tables are created as of now.
>>> 
>>> Questions:
>>> 1. It is observed that, when the tables are created, it reads the data from source and the table is created along with that data (ie., if the original source is 10GB, the tables stored in the file system are comparable to that size). However, I have a question, if the source is growing, how it gets into the CTAS tables or CTAS Partition by tables, so that queries will result latest output
>>> 
>>> Kind Regards
>>> Kiran
>>>

Re: Query on tables

Posted by Vitalii Diravka <vi...@apache.org>.

Hi Kiran,

You ask something similar to INSERT statements, but Drill doesn't support
this functionality now [1].
Also CREATE OR REPLACE TABLE could be helpful for you (with additional
work), but most databases don't support it [2].
You can write scripts with creating tables with different names in the same
directory (and then possibly manage the newly created files).
Also consider temporary tables and views for your case.

Did you restart drillbits on your cluster? Max Directory memory should be
easily changed for you on the cluster.
Here is an info about it [3] and here is the last improvements in this area
[4].
There are also a lot of system/session options, which can help you to
accelerate your CTAS [5]. It mainly depends on the type and structure of
your data and also operators, which you use in your SQL query.
You can start from investigating the individual SELECT queries and profiles
for them.

[1] https://issues.apache.org/jira/browse/DRILL-3534
[2] https://issues.apache.org/jira/browse/DRILL-3979
[3] https://drill.apache.org/docs/configuring-drill-memory/
[4] https://issues.apache.org/jira/browse/DRILL-5741
[5] https://drill.apache.org/docs/configuration-options-introduction/

Kind regards
Vitalii


On Wed, Dec 26, 2018 at 5:41 PM Kiran Kumar NS <ma...@gmail.com>
wrote:

> Hi Vitalii,
>
> Thanks for getting back.
> Yes, I did try it. But, I got the table is already present. It is because
> I don’t want to create another table with new table name for that changed
> data .
>
> Also, if it works in any which way, every time the source changes, is
> there any automatic way to get the drill tables also updated?
>
> Also, I have another set of queries.
>
> If I set max proc mem, max direct mem it is not reflected in cluster. I
> see it in UI. But if I change java heap mem, it reflects. Wanted to
> understand when we set these settings, how still distributes this setting
> to cluster. I did observe whatever the settings I do in options link of UI,
> it reflects in the whole cluster. But, if I do it in drill-env.sh, it does
> not. I have to go and change in all the nodes. Also, it is mentioned that
> distrib-env.sh should not be used by users. Essentially, I want to
> understand, how the configurations are propagated to all nodes in cluster
> and how can I reap the benefit of additional compute of the cluster so that
> my queries are executed fast. Currently one CTAS command and partition by
> commands are taking hours to process GBs of data.
>
> I can provide statistics of elapsed time of query execution if required,
> for your analysis.
>
> Kind Regards
> Kiran
>
> Sent from my iPhone
>
> > On Dec 25, 2018, at 2:16 PM, Vitalii Diravka <vi...@apache.org> wrote:
> >
> > Hello!
> >
> > Can you filter the newly coming data and run CTAS only for those data?
> It will allow to avoid extra work.
> >
> > Kind regards
> > Vitalii
> >
> >
> >> On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <
> mailtonskiran@gmail.com> wrote:
> >> Hi Team,
> >>
> >> My current situation:
> >> I have apache drill installed in AWS EC2 (M4.4x large) instances
> cluster of 3 nodes. My source data is coming from S3 bucket.
> >> I want to engage drill to read that data from S3, create tables within
> itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2
> instances created as mentioned above and allow the user to read the data
> from those tables.
> >> Tables and Partitioned tables are created as of now.
> >>
> >> Questions:
> >> 1. It is observed that, when the tables are created, it reads the data
> from source and the table is created along with that data (ie., if the
> original source is 10GB, the tables stored in the file system are
> comparable to that size). However, I have a question, if the source is
> growing, how it gets into the CTAS tables or CTAS Partition by tables, so
> that queries will result latest output
> >>
> >> Kind Regards
> >> Kiran
> >>
>

Re: Query on tables

Posted by Kiran Kumar NS <ma...@gmail.com>.

Hi Vitalii,

Thanks for getting back.
Yes, I did try it. But, I got the table is already present. It is because I don’t want to create another table with new table name for that changed data .

Also, if it works in any which way, every time the source changes, is there any automatic way to get the drill tables also updated?

Also, I have another set of queries.

If I set max proc mem, max direct mem it is not reflected in cluster. I see it in UI. But if I change java heap mem, it reflects. Wanted to understand when we set these settings, how still distributes this setting to cluster. I did observe whatever the settings I do in options link of UI, it reflects in the whole cluster. But, if I do it in drill-env.sh, it does not. I have to go and change in all the nodes. Also, it is mentioned that distrib-env.sh should not be used by users. Essentially, I want to understand, how the configurations are propagated to all nodes in cluster and how can I reap the benefit of additional compute of the cluster so that my queries are executed fast. Currently one CTAS command and partition by commands are taking hours to process GBs of data.

I can provide statistics of elapsed time of query execution if required, for your analysis.

Kind Regards 
Kiran

Sent from my iPhone

> On Dec 25, 2018, at 2:16 PM, Vitalii Diravka <vi...@apache.org> wrote:
> 
> Hello!
> 
> Can you filter the newly coming data and run CTAS only for those data? It will allow to avoid extra work.
> 
> Kind regards 
> Vitalii 
> 
> 
>> On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <ma...@gmail.com> wrote:
>> Hi Team,
>> 
>> My current situation:
>> I have apache drill installed in AWS EC2 (M4.4x large) instances cluster of 3 nodes. My source data is coming from S3 bucket.
>> I want to engage drill to read that data from S3, create tables within itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2 instances created as mentioned above and allow the user to read the data from those tables.
>> Tables and Partitioned tables are created as of now.
>> 
>> Questions:
>> 1. It is observed that, when the tables are created, it reads the data from source and the table is created along with that data (ie., if the original source is 10GB, the tables stored in the file system are comparable to that size). However, I have a question, if the source is growing, how it gets into the CTAS tables or CTAS Partition by tables, so that queries will result latest output
>> 
>> Kind Regards
>> Kiran
>>

Re: Query on tables

Posted by Kiran Kumar NS <ma...@gmail.com>.

Hi Vitalii,

Thanks for getting back.
Yes, I did try it. But, I got the table is already present. It is because I don’t want to create another table with new table name for that changed data .

Also, if it works in any which way, every time the source changes, is there any automatic way to get the drill tables also updated?

Also, I have another set of queries.

If I set max proc mem, max direct mem it is not reflected in cluster. I see it in UI. But if I change java heap mem, it reflects. Wanted to understand when we set these settings, how still distributes this setting to cluster. I did observe whatever the settings I do in options link of UI, it reflects in the whole cluster. But, if I do it in drill-env.sh, it does not. I have to go and change in all the nodes. Also, it is mentioned that distrib-env.sh should not be used by users. Essentially, I want to understand, how the configurations are propagated to all nodes in cluster and how can I reap the benefit of additional compute of the cluster so that my queries are executed fast. Currently one CTAS command and partition by commands are taking hours to process GBs of data.

I can provide statistics of elapsed time of query execution if required, for your analysis.

Kind Regards 
Kiran

Sent from my iPhone

> On Dec 25, 2018, at 2:16 PM, Vitalii Diravka <vi...@apache.org> wrote:
> 
> Hello!
> 
> Can you filter the newly coming data and run CTAS only for those data? It will allow to avoid extra work.
> 
> Kind regards 
> Vitalii 
> 
> 
>> On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <ma...@gmail.com> wrote:
>> Hi Team,
>> 
>> My current situation:
>> I have apache drill installed in AWS EC2 (M4.4x large) instances cluster of 3 nodes. My source data is coming from S3 bucket.
>> I want to engage drill to read that data from S3, create tables within itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2 instances created as mentioned above and allow the user to read the data from those tables.
>> Tables and Partitioned tables are created as of now.
>> 
>> Questions:
>> 1. It is observed that, when the tables are created, it reads the data from source and the table is created along with that data (ie., if the original source is 10GB, the tables stored in the file system are comparable to that size). However, I have a question, if the source is growing, how it gets into the CTAS tables or CTAS Partition by tables, so that queries will result latest output
>> 
>> Kind Regards
>> Kiran
>>

Re: Query on tables

Posted by Vitalii Diravka <vi...@apache.org>.

Hello!

Can you filter the newly coming data and run CTAS only for those data? It
will allow to avoid extra work.

Kind regards
Vitalii


On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <ma...@gmail.com>
wrote:

> Hi Team,
>
> My current situation:
> I have apache drill installed in AWS EC2 (M4.4x large) instances cluster
> of 3 nodes. My source data is coming from S3 bucket.
> I want to engage drill to read that data from S3, create tables within
> itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2
> instances created as mentioned above and allow the user to read the data
> from those tables.
> Tables and Partitioned tables are created as of now.
>
> Questions:
> 1. It is observed that, when the tables are created, it reads the data
> from source and the table is created along with that data (ie., if the
> original source is 10GB, the tables stored in the file system are
> comparable to that size). However, I have a question, if the source is
> growing, how it gets into the CTAS tables or CTAS Partition by tables, so
> that queries will result latest output
>
> Kind Regards
> Kiran
>
>

Re: Query on tables

Posted by Vitalii Diravka <vi...@apache.org>.

Hello!

Can you filter the newly coming data and run CTAS only for those data? It
will allow to avoid extra work.

Kind regards
Vitalii


On Mon, Dec 24, 2018 at 6:53 PM Idea Everywhere <ma...@gmail.com>
wrote:

> Hi Team,
>
> My current situation:
> I have apache drill installed in AWS EC2 (M4.4x large) instances cluster
> of 3 nodes. My source data is coming from S3 bucket.
> I want to engage drill to read that data from S3, create tables within
> itself (using CTAS) while the table data is stored in AWS EFS mapped to ec2
> instances created as mentioned above and allow the user to read the data
> from those tables.
> Tables and Partitioned tables are created as of now.
>
> Questions:
> 1. It is observed that, when the tables are created, it reads the data
> from source and the table is created along with that data (ie., if the
> original source is 10GB, the tables stored in the file system are
> comparable to that size). However, I have a question, if the source is
> growing, how it gets into the CTAS tables or CTAS Partition by tables, so
> that queries will result latest output
>
> Kind Regards
> Kiran
>
>