Posted to user@drill.apache.org by PROJJWAL SAHA <pr...@gmail.com> on 2017/02/20 12:07:03 UTC

Query on performance using Drill and Amazon s3.

Hello all,

I am querying 1 GB of data in a .tsv file stored in Amazon S3, using
Drill 1.8. I am using Drill's default configuration with the S3 storage
plugin that ships out of the box. The drillbits are configured on a 5-node
cluster with 32 GB RAM and 4 vCPUs.

I see that a select * from xxx; query takes 23 minutes to fetch 1,040,000 rows.

Is this the expected behaviour?
I am looking for any quick tuning that can improve performance, or any
other suggestions.

Attached is the JSON profile for this query.

Regards,
Projjwal

RE: Query on performance using Drill and Amazon s3.

Posted by Shankar Mane <sh...@games24x7.com>.

1. How much memory have you configured for Drill?
2. What is the network bandwidth between S3 and your cluster?

RE: Query on performance using Drill and Amazon s3.

Posted by Chetan Kothari <ch...@oracle.com>.

My question is generic.

What I am asking is: does Drill run the query on the target data store and fetch only the result, or does it fetch the data first and then run the query?

Regards

Chetan

RE: Query on performance using Drill and Amazon s3.

Posted by Nitin Pawar <ni...@gmail.com>.

Hi Chetan,

Projjwal is the one who has the issue. I asked the same question too.

RE: Query on performance using Drill and Amazon s3.

Posted by Chetan Kothari <ch...@oracle.com>.

Hi Nitin

Where does the query execute?

Does Drill execute the query on AWS and fetch only the results to be displayed?

Regards

Chetan

Re: Query on performance using Drill and Amazon s3.

Posted by PROJJWAL SAHA <pr...@gmail.com>.

Thanks Nitin for the metrics you provided and the suggestions.

Re: Query on performance using Drill and Amazon s3.

Posted by Nitin Pawar <ni...@gmail.com>.

Instead of doing select * on the first go,
can you run a query like select count(1)?
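
A minimal sketch of that comparison, timing a count-only query against a full select through Drill's REST endpoint. The host, the helper names (build_request, time_query, compare) and the storage plugin name used in the usage note are assumptions for illustration, not details from this thread; 8047 is Drill's default web port.

```python
import json
import time
import urllib.request

# Assumption: a Drillbit's REST port is reachable at this address.
DRILL_URL = "http://localhost:8047/query.json"


def build_request(sql, url=DRILL_URL):
    """Build a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def time_query(sql, url=DRILL_URL):
    """Run one query and return (elapsed_seconds, response_size_in_bytes)."""
    start = time.monotonic()
    with urllib.request.urlopen(build_request(sql, url)) as response:
        body = response.read()
    return time.monotonic() - start, len(body)


def compare(table):
    """Time a count-only query and a full select on the same table."""
    for sql in (f"select count(1) from {table}", f"select * from {table}"):
        elapsed, size = time_query(sql)
        print(f"{sql!r}: {elapsed:.1f} s, {size} response bytes")
```

Calling compare("s3.`xxx.tsv`") (with s3 standing in for whatever the storage plugin is actually named) would print both timings. If count(1) is nearly as slow as select *, the time is probably going into reading the file from S3 rather than into returning and displaying rows. Note that the REST endpoint returns the whole result set as one JSON document, so the select * call can be heavy for a million rows.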

When your data is in CSV files then yes, all the data is transferred to the
Drill node and then the query is executed on top of it.
We noticed that query time on CSV was significantly higher than on
Parquet files, so we moved our data from CSV to Parquet and have not seen
any issues since then.

We did a test run on 125M records, 8 GB in Parquet, and it took
roughly 30 seconds.
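
One way such a conversion is commonly done in Drill is a CREATE TABLE AS (CTAS) statement, sketched below as the SQL strings one would submit. The plugin name s3, the target table name xxx_parquet and the helper ctas_sql are hypothetical; dfs.tmp is the writable workspace in Drill's default dfs plugin, and parquet is already the default value of store.format.

```python
def ctas_sql(source, target):
    """Return a Drill CTAS statement that copies `source` into a new table `target`."""
    return f"create table {target} as select * from {source}"


statements = [
    # store.format defaults to parquet; it is set here only to be explicit.
    "alter session set `store.format` = 'parquet'",
    # Hypothetical names: `s3` plugin as the source, dfs.tmp as the writable target.
    ctas_sql("s3.`xxx.tsv`", "dfs.tmp.`xxx_parquet`"),
]

for statement in statements:
    print(statement + ";")
```

Drill exposes a headerless delimited file as a single columns array, so a real conversion would usually replace select * with a projection such as cast(columns[0] as int) as id to get named, typed Parquet columns.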

I would suggest checking two things:
1) Which AWS region is your S3 bucket hosted in, and which region are your EC2
servers hosted in?
2) If the answer to the above is two different regions, then you might want
to move them into a single region.

In either case, from the AWS console you can figure out how much network
throughput you are getting, in case that is the bottleneck.
Also, Drill machines need CPU, so along with 32 GB of memory, having 8
cores would be desirable.

-- 
Nitin Pawar

Re: Query on performance using Drill and Amazon s3.

Posted by PROJJWAL SAHA <pr...@gmail.com>.

Hi Nitin,

I am executing the SQL query on a drillbit node using drill-conf.
We have configured a 5-node Drill cluster external to Amazon with 32 GB
RAM. From one of the nodes, we are using the drill-conf utility to fire the SQL
query.

One observation I had concerns these two queries:
select * from `xxx.tsv`
select * from `xxx.tsv` where yyy = 'zzz'

Both queries take almost the same time for 1 GB of data with
1000000 rows. So if network data transfer is the major time-consuming
component compared with query execution, does that mean the entire
data is first transferred to the Drill cluster and the query is then executed
on the Drill cluster?
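
A back-of-the-envelope calculation of the throughput implied by the figures in this thread (1 GB and 1,040,000 rows in 23 minutes). It assumes decimal gigabytes and that the whole file crosses the network exactly once, so treat the result as a rough estimate only.

```python
# Figures reported in this thread; 1 GB is taken as 10**9 bytes (assumption).
size_bytes = 1 * 10**9
elapsed_seconds = 23 * 60
rows = 1_040_000

bytes_per_second = size_bytes / elapsed_seconds
megabits_per_second = bytes_per_second * 8 / 10**6
rows_per_second = rows / elapsed_seconds

print(f"{bytes_per_second / 10**6:.2f} MB/s")
print(f"{megabits_per_second:.1f} Mbit/s")
print(f"{rows_per_second:.0f} rows/s")
```

That works out to roughly 0.7 MB/s, or under 6 Mbit/s. If the link between the cluster and S3 can only sustain about that much, the transfer alone would account for the 23 minutes, whatever the query does.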

Regards,
Projjwal

Re: Query on performance using Drill and Amazon s3.

Posted by Nitin Pawar <ni...@gmail.com>.

How are you running the select *: using the Drill UI or sqlline?
Where are you running it from?
Is Drill hosted in AWS or on your local machine?

I think the majority of the time is spent displaying the result set rather
than querying the file, if the Drill server is on AWS.
If the Drill server is local, then it might be your network that takes
a lot of time, depending on the S3 bucket location and where your Drill server is.

-- 
Nitin Pawar