Posted to user@drill.apache.org by Surneni Tilak <su...@ericsson.com> on 2018/07/30 08:57:06 UTC

Drill Configuration Requirements To Query Data in Tera Bytes

Hi Team,

May I know the ideal configuration requirements to query data of size 10 TB with a query time under 5 minutes? Please suggest the number of Drillbits that I should use and the RAM (direct memory & heap memory) that each Drillbit should have to complete the queries within the desired time.

Best regards,
_________
Tilak



Re: Drill Configuration Requirements To Query Data in Tera Bytes

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Tilak,

Since we only had your 10 TB and 5 minute numbers to play with, I did the back-of-envelope calcs using those.

Of course, you can/should look for ways to reduce the data read per query. There are two places to start: vertical and horizontal partitioning.

Vertical partitioning comes from storing the data in the columnar Parquet format. Suppose your data has 100 columns, but a typical query uses some combination of just five columns. Parquet reads only those columns, so the scan touches just 5/100 = 5% of the data, or 500 GB. Assuming good distribution of data blocks on HDFS, that greatly reduces the number of spindles needed to meet SLAs: down from ~300 to ~15.

Horizontal partitioning comes from doing directory partitioning. The most common pattern is by date for time series data. Suppose your 10 TB represents three years of data. If data is partitioned by date, you'll get roughly 1000 partitions. If each query touches just one day of data, you'll read just 0.1% of the data. If a query touches a month, you'll read just 3% of data.

The good news is that the two forms of partitioning combine. If you apply both to our hypothetical example, you reduce the scan to 3% (for the scan over one month of data) * 5% (for reading 5/100 columns) = 0.15% of the data, or just 15 GB. That can be scanned in far less than five minutes.
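
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Python. Every input is one of the hypothetical numbers above (100 columns with 5 used per query, about three years of daily partitions, 100 MB/s per spindle), not a measurement from a real cluster.

TOTAL_BYTES = 10e12      # 10 TB of raw data
DISK_BPS    = 100e6      # assumed sustained read rate per spindle (100 MB/s)
BUDGET_SEC  = 5 * 60     # 5-minute query budget

def disks_needed(bytes_scanned):
    """Spindles that must read concurrently to finish within the budget."""
    return bytes_scanned / (DISK_BPS * BUDGET_SEC)

# Vertical partitioning: Parquet reads only 5 of the 100 columns.
vertical = TOTAL_BYTES * (5 / 100)          # ~500 GB

# Horizontal partitioning: a one-month query over ~3 years of daily partitions.
month_fraction = 30 / (3 * 365)             # ~3% of the partitions

combined = vertical * month_fraction        # roughly the 15 GB figure above

for label, scanned in [("full scan", TOTAL_BYTES),
                       ("columns only", vertical),
                       ("columns + one month", combined)]:
    print(f"{label:20s} {scanned / 1e9:10,.0f} GB  ~{disks_needed(scanned):5.1f} disks")

Under these assumptions, the combined pruning leaves well under one spindle's worth of work, so the scan itself stops being the bottleneck.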

Of course, your results completely depend on your data and queries. And, at this level of partitioning, other costs begin to dominate, such as the metadata preparation discussed in a previous message and the cost of sorts, aggregations, and so on. Good measurements will tell you which costs are now dominant.


Thanks,
- Paul

 


RE: Drill Configuration Requirements To Query Data in Tera Bytes

Posted by Surneni Tilak <su...@ericsson.com>.
Hi Paul,

Thanks a lot for your response. It is great, as this is exactly the kind of approach I was looking for to reduce my query execution time.

Based on your response, I can check multiple hardware-related parameters in my cluster. It's true that I did not provide many details about my cluster, but thanks for understanding my question. I will try tuning these parameters and will reach out to the group if I need any further help.


Best regards, 
__________
Tilak 


RE: Drill Configuration Requirements To Query Data in Tera Bytes

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.
Hi Tilak,

 Quick follow-up to this old thread. Few details were provided other than a desire to scan 10 TB of data in 5 minutes. Here is a quick back of the envelope calculation.

Assume that the Drill query is the only operation active on your cluster. Assume you are doing a simple SELECT query with no joins or sorts. You must get the entire 10 TB of data off disk and into Drill. If the query is highly selective (returning, say, 1 GB or less of the 10 TB), then we can ignore most of Drill's own overhead for now.

Assume you have good hard disks, with sustained 100 MB/sec read performance. It will thus take 10 TB / 100 MB/s = 10^13 / 10^8 = 10^5 = 100,000 seconds for a single disk. Your budget is 5 * 60 = 300 seconds. So, you will need 100,000 / 300 ≈ 334 disks working concurrently. A sobering, but useful, first estimate.
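
The same calculation in a few lines of Python, so you can plug in your own numbers (the 100 MB/s sustained rate is an assumption about decent spinning disks, not a measured figure):

import math

data_bytes = 10e12        # 10 TB
disk_bps   = 100e6        # assumed sustained read rate of one disk (100 MB/s)
budget_sec = 5 * 60       # 5-minute target

single_disk_sec = data_bytes / disk_bps                  # ~100,000 seconds on one disk
disks_needed = math.ceil(single_disk_sec / budget_sec)   # round up to whole disks: 334
print(f"{single_disk_sec:,.0f} s on one disk -> {disks_needed} disks in parallel")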


Given that, you can then work out a hardware setup. If you are in the cloud, you'll have to work out how to use cloud resources to get this throughput. If on prem, then you probably have a standard hardware configuration. And, if you have SSDs, the calculations will be different.


You'll want to test with plain HDFS (if that is what you're using) to figure out the disks-per-node combination that gets the needed throughput. For example, with 10 disks per node, a full 10 TB scan would need roughly 34 nodes, far more than the 4-10 nodes Abhishek suggested as a starting point, unless you can reduce how much data each query actually reads.

Maximum throughput depends on Drill being able to consume data at the desired rate. Here you can do performance tests to see how fast a single Drill minor fragment (thread) can read data on your hardware for your data. In my own tests (on a Mac), I found something like 50 MB/s for a CSV file with simple 100-byte records. Your results will completely depend on record structure and file format (CSV vs. JSON vs. Parquet).
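
If you want a rough feel for the raw single-threaded number before involving Drill at all, a crude Python test like the one below gives an upper bound. The file path is hypothetical, and a real Drill minor fragment does more work per row (building value vectors, type conversion), so expect Drill's own figure to be lower.

import csv, os, time

def single_thread_csv_mbps(path):
    """Rough single-threaded read-and-parse rate for one CSV file, in MB/s."""
    size = os.path.getsize(path)
    start = time.perf_counter()
    with open(path, newline="") as f:
        for _row in csv.reader(f):
            pass                      # parse and discard; we only want the rate
    return size / (time.perf_counter() - start) / 1e6

# Hypothetical sample file:
# print(single_thread_csv_mbps("/data/sample/part-00000.csv"))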

Once you know per-fragment throughput, you can target the needed number of fragments to consume the disk I/O on each node. This will tell you the number of cores. (Or, more realistically, given the number of available CPUs, how much disk I/O can Drill consume on a single node?) If, say, you have 24 cores and each can do 50 MB/sec, then your optimal Drill ingest rate is 1200 MB/sec, which would, under ideal conditions, consume the output of 12 disks. Of course, things are seldom optimal. The point is, balance disks and cores per node to avoid bottlenecks.


Of course, you may not get the optimum sustained read rate on each disk for any number of reasons (limited controller bandwidth, disk contention, HDFS bottlenecks, etc.). So, having more nodes is better, with data spread across more disks. If you get only 50 MB/s sustained, then you need roughly 670 disks. If you have 16 cores per machine, and each core can consume, say, 25 MB/s, then each machine can keep up with about 8 such disks, for a total of roughly 670 / 8 ≈ 84 machines.
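
Here is the same balancing act as a small Python sketch; every input is an assumption to be replaced by your own measurements:

import math

def cluster_estimate(total_tb, budget_sec, disk_mbps, cores_per_node, core_mbps):
    """Rough sizing: balance disk throughput against what each node's cores can consume."""
    required_mbps  = total_tb * 1e6 / budget_sec            # aggregate MB/s needed
    disks          = math.ceil(required_mbps / disk_mbps)   # spindles to supply it
    node_ingest    = cores_per_node * core_mbps             # MB/s one node can consume
    disks_per_node = max(1, int(node_ingest // disk_mbps))  # don't outrun the cores
    nodes          = math.ceil(disks / disks_per_node)
    return disks, disks_per_node, nodes

# The degraded scenario above: 50 MB/s disks, 16 cores at ~25 MB/s each.
print(cluster_estimate(total_tb=10, budget_sec=300,
                       disk_mbps=50, cores_per_node=16, core_mbps=25))  # (667, 8, 84)

With the more optimistic 100 MB/s disks and 24 cores at 50 MB/s each, the same function gives roughly 334 disks at 12 per node, or about 28 nodes.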

The point is, make some assumptions then test them. Use the results to work out an estimate of the number of disks and cores. Use testing to refine from there.

Thanks,
- Paul

 


RE: Drill Configuration Requirements To Query Data in Tera Bytes

Posted by Surneni Tilak <su...@ericsson.com>.
Hi Abhishek,

Thanks for your response. I will try the approach that you have suggested and come back if I need any further help.

Best regards, 
________
Tilak 


Re: Drill Configuration Requirements To Query Data in Tera Bytes

Posted by Abhishek Girish <ag...@apache.org>.
Hey Tilak,

We don't have any official sizing guidelines for planning a Drill cluster. A lot depends on the type of queries being executed (simple look-ups vs. complex joins), the data format (columnar formats such as Parquet perform best), and the system load (for example, whether a single query runs on nodes dedicated to Drill).

It also depends on the type of machines you have - for example, with beefy nodes that have lots of RAM and CPU, you'll need fewer nodes running Drill.

I would recommend getting started with a 4-10 node cluster with as much memory as you can spare. Based on the results, work out your own sizing guideline (whether to add more nodes or to increase memory [1]).

If you share more details, we may be able to suggest more.

[1] http://drill.apache.org/docs/configuring-drill-memory/

