Posted to user@drill.apache.org by Satish Cattamanchi <sc...@4info.com> on 2015/06/08 00:47:14 UTC

Apache Drill and S3 performance

We are evaluating Apache Drill performance, and we have set up Apache Drill on Amazon.

All EC2 machines are of the r3.2xlarge instance type.

Model        vCPU   Mem (GiB)   SSD Storage (GB)
r3.2xlarge   8      61          1 x 160

Zookeeper - 1 EC2 machine
Drillbits - 25 EC2 machines.
Data on - Amazon S3
Data Format - Flat files, pipe-separated (PSV) and gzip-compressed.
Storage Hierarchy  - /logs/requests/y=2015/m=01/d=01/hh=-01/
Daily Data Size - 2TB approx.
Daily Rows - 3.5B

Using Apache Drill with Default Configuration.

I was able to configure Apache Drill, connect to S3, and query the data from S3.

But when I do count(*) on the day folder, it takes around 45-50 min with the above setup. Other queries with a WHERE condition take a similar amount of time. I was wondering whether the slowness is due to copying data back and forth from S3.

Could anyone give some suggestions on setup/configuration to achieve better performance with Apache Drill?

Thanks,
Satish
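
A rough way to put the numbers above in perspective is to compute the scan throughput they imply. The sketch below is only back-of-the-envelope arithmetic on the figures quoted in the post; it assumes the "2TB approx." refers to the size as stored on S3 (the post does not say whether that is compressed or uncompressed) and uses decimal terabytes. The function name is illustrative.

```python
# Back-of-the-envelope throughput implied by the figures in the post.
# Assumptions (not stated in the post): 2 TB is the on-S3 size, decimal units.
DAILY_BYTES = 2 * 10**12     # "Daily Data Size - 2TB approx."
DRILLBITS = 25               # "Drillbits - 25 EC2 machines"
VCPUS_PER_NODE = 8           # r3.2xlarge
QUERY_MINUTES = (45, 50)     # reported count(*) runtime range


def throughput_mb_per_s(total_bytes, minutes):
    """Average scan rate in MB/s for a query that reads total_bytes."""
    return total_bytes / (minutes * 60) / 10**6


for minutes in QUERY_MINUTES:
    cluster = throughput_mb_per_s(DAILY_BYTES, minutes)
    per_node = cluster / DRILLBITS
    per_core = per_node / VCPUS_PER_NODE
    print(f"{minutes} min: cluster {cluster:.0f} MB/s, "
          f"per Drillbit {per_node:.1f} MB/s, per vCPU {per_core:.2f} MB/s")
```

Under those assumptions each Drillbit averages roughly 27 to 30 MB/s, which is a useful baseline to compare against what a query profile reports for the scan operators.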

Re: Apache Drill and S3 performance

Posted by Alexander Zarei <al...@gmail.com>.

It might be because of using S3 as your file system.

We did a similar experiment, loading data into HDFS on m1.xlarge
machines. Query profile analysis showed that reading from the
magnetic storage on the m1.xlarge machines was the bottleneck, so we
switched to m3.xlarge instances (which have SSD storage).

I would suggest loading your data into HDFS or MapR-FS on the cluster
instead of reading it from S3; that will probably improve performance. As
you know, m3.2xlarge machines have SSD storage, which in general performs
better than S3.

Cheers,
Alex

On Sun, Jun 7, 2015 at 5:12 PM, Jacques Nadeau <ja...@apache.org> wrote:

> Can you post a query profile json?  It might help us to determine where the
> time is being spent.
>
> How many files are being queried?

Re: Apache Drill and S3 performance

Posted by Jacques Nadeau <ja...@apache.org>.

Can you post a query profile JSON? It might help us determine where the
time is being spent.

How many files are being queried?
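
One reason the file count matters: a gzip stream cannot be split, so a reader has to decompress each file from the beginning. The sketch below assumes that each gzip file is therefore read by a single scan thread and that a Drillbit can run at most one reader per vCPU; both are simplifying assumptions, and the file counts are hypothetical, since the thread does not say how many files a day folder holds.

```python
# Upper bound on concurrent readers when each gzip file needs one reader.
# Assumptions: one reader per file, at most one reader per vCPU.
NODES = 25            # Drillbits in the cluster described above
CORES_PER_NODE = 8    # r3.2xlarge vCPUs


def max_parallel_readers(num_files, nodes=NODES, cores_per_node=CORES_PER_NODE):
    """Readers that can work at once: limited by files and by cores."""
    return min(num_files, nodes * cores_per_node)


def core_utilisation(num_files, nodes=NODES, cores_per_node=CORES_PER_NODE):
    """Fraction of the cluster's cores that can be kept busy scanning."""
    return max_parallel_readers(num_files, nodes, cores_per_node) / (
        nodes * cores_per_node)


# Hypothetical file counts for one day of data.
for files in (24, 200, 2400):
    print(f"{files} files: {max_parallel_readers(files)} readers, "
          f"{core_utilisation(files):.0%} of cores usable")
```

If a day folder held only a few dozen large files, most of the cluster's cores would sit idle during the scan regardless of where the data is stored.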
