Posted to user@drill.apache.org by Yijie Shen <he...@gmail.com> on 2016/05/22 16:06:10 UTC

Performance tuning for TPC-H Q1 on a three nodes cluster

Hi all,

I've been trying out Drill built from the master branch lately and have deployed
a cluster on three physical servers.

The input data `lineitem` is in Parquet format with a total size of 150GB:
1516 files of about 101MB each.

Each server has two Intel(R) Xeon(R) E5645 @ 2.40GHz CPUs (24 cores in total)
and 32GB of memory.

While executing Q1 using:

SELECT
  L_RETURNFLAG, L_LINESTATUS,
  SUM(L_QUANTITY), SUM(L_EXTENDEDPRICE),
  SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT)),
  SUM(L_EXTENDEDPRICE * (1 - L_DISCOUNT) * (1 + L_TAX)),
  AVG(L_QUANTITY), AVG(L_EXTENDEDPRICE), AVG(L_DISCOUNT),
  COUNT(1)
FROM
  dfs.tpch.`lineitem`
WHERE
  L_SHIPDATE <= '1998-09-02'
GROUP BY L_RETURNFLAG, L_LINESTATUS
ORDER BY L_RETURNFLAG, L_LINESTATUS

I've noticed the parallelism was 51 (planner.width.max_per_node = 17) in my
case for Major Fragment 03 (Scan, Filter, Project, HashAgg and Project), and
each minor fragment lasts about 8 to 9 minutes. One record, for example:

03-00-xx hw080 7.309s 42.358s 9m35s 118,758,489 14,540 22:31:32 22:31:32
33MB FINISHED

Is this a normal speed (more than 10 minutes) for Drill on my current
cluster? Did I miss something important in the configuration to accelerate
the execution?
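
For reference, this is roughly how I check the current parallelism settings
from sqlline, and how I would raise them (a sketch only; the value 24 below
simply matches the node's core count):

-- Show the current planner width settings
SELECT * FROM sys.options WHERE name LIKE 'planner.width%';

-- Raise the per-node parallelism for this session (24 = cores per node)
ALTER SESSION SET `planner.width.max_per_node` = 24;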

Thanks very much!

Yijie

Re: Performance tuning for TPC-H Q1 on a three nodes cluster

Posted by Dechang Gu <dg...@maprtech.com>.
On Tue, May 24, 2016 at 7:07 PM, Yijie Shen <he...@gmail.com>
wrote:

> Hi Dechang,
>
> Thanks very much for your help!
>
> I'm a little confused here: why does the skew exist?
>
> After some statistics gathering, I got this: 1516 files, 102.54MB on average,
> max of 104MB, min of 95MB.
> On a system with 72 compute resources, the input is distributed fairly evenly
> across the three nodes: Server80: 520 files, Server83: 486 files,
> Server166: 509 files.
>
> If handling a file takes equal time for each resource/minor fragment/slot,
> why does such a big skew exist? Is there any way to find out and eliminate it?
>

One theory could be this: Parquet files are stored as "row groups"; each
file contains a number of row groups. Drill assigns row groups to minor
fragments in a round-robin fashion, with a locality preference. If the row
groups are not distributed evenly among the files, we will see uneven loads
among the fragments. The Parquet metadata cache may provide some information
about the row group distribution.
To create metadata for the lineitem table, run in sqlline:
refresh table metadata lineitem;

A JSON file named .drill.parquet_metadata will be created under the lineitem
table directory.
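
Something like the following should work from sqlline (a sketch; I'm assuming
the table is reachable as dfs.tpch.`lineitem`, so adjust the workspace to
match your setup):

-- Build the Parquet metadata cache for the table
REFRESH TABLE METADATA dfs.tpch.`lineitem`;

-- Optionally confirm the cache is picked up: the Parquet group scan in the
-- plan should report something like usedMetadataFile=true (exact wording may
-- vary by version)
EXPLAIN PLAN FOR SELECT COUNT(*) FROM dfs.tpch.`lineitem`;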

Can you please create the metadata file and provide it to me so we can
check the row group distribution?

Thanks,
Dechang



>
> My HDFS block size is 256MB and my files are about 100MB on average, so
> perhaps not small IO?
>

Re: Performance tuning for TPC-H Q1 on a three nodes cluster

Posted by Yijie Shen <he...@gmail.com>.
Hi Dechang,

Thanks very much for your help!

I'm a little confused here: why does the skew exist?

After some statistics gathering, I got this: 1516 files, 102.54MB on average,
max of 104MB, min of 95MB.
On a system with 72 compute resources, the input is distributed fairly evenly
across the three nodes: Server80: 520 files, Server83: 486 files,
Server166: 509 files.

If handling a file takes equal time for each resource/minor fragment/slot,
why does such a big skew exist? Is there any way to find out and eliminate it?

My HDFS block size is 256MB and my files are about 100MB on average, so
perhaps not small IO?
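
In case it helps narrow the skew down, one check I could try (a sketch, and
only if this Drill build exposes an implicit filename column for Parquet
scans; I'm not sure mine does):

-- Hypothetical check: rows per input file, to see whether row counts are skewed
SELECT filename, COUNT(*) AS row_cnt
FROM dfs.tpch.`lineitem`
GROUP BY filename
ORDER BY row_cnt DESC
LIMIT 20;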


Re: Performance tuning for TPC-H Q1 on a three nodes cluster

Posted by Dechang Gu <dg...@maprtech.com>.
Hi Yijie,
Thanks for the profile. From the Operator Profile overview, it looks like 03-xx-02 HASH_AGGREGATE and 03-xx-06 PARQUET_ROW_GROUP_SCAN took most of the time:

Operator  Type                      Setup (min/avg/max)       Process (min/avg/max)        Wait (min/avg/max)        Peak Mem (avg/max)
03-xx-02  HASH_AGGREGATE            0.020s / 0.083s / 0.213s  1m06s / 1m55s / 3m12s        0.000s / 0.000s / 0.000s  16MB / 16MB
03-xx-03  PROJECT                   0.004s / 0.009s / 0.024s  6.539s / 11.249s / 16.504s   0.000s / 0.000s / 0.000s  536KB / 536KB
03-xx-04  SELECTION_VECTOR_REMOVER  0.002s / 0.004s / 0.014s  34.456s / 58.595s / 1m26s    0.000s / 0.000s / 0.000s  664KB / 664KB
03-xx-05  FILTER                    0.001s / 0.003s / 0.009s  2.556s / 4.355s / 6.543s     0.000s / 0.000s / 0.000s  8MB / 8MB
03-xx-06  PARQUET_ROW_GROUP_SCAN    0.000s / 0.000s / 0.000s  2m13s / 3m40s / 5m14s        5.392s / 6.372s / 7.371s  12MB / 12MB
For the SCAN, there is some skew among the batches/records processed by the minor fragments: from roughly 5,000 batches to over 15,000 batches. The ~5m process times are for the fragments with the larger batch counts.

Similar skew also shows up in 03-xx-02. If the load can be balanced among the fragments, it may speed up the query a little.
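
If the skew really does come from uneven row groups, one possible way to even
them out (untested here) would be to rewrite the table with Drill itself, so
that CTAS produces uniformly sized row groups. dfs.tmp.`lineitem_rebal` below
is just a hypothetical writable target, and the block size is only an example:

-- Sketch only: rewrite lineitem so Drill controls the row group size
ALTER SESSION SET `store.parquet.block-size` = 134217728;  -- ~128MB row groups
CREATE TABLE dfs.tmp.`lineitem_rebal` AS SELECT * FROM dfs.tpch.`lineitem`;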

On the other hand, the ~10 minute query time is not that far off the mark, considering that you have only 3 nodes and each node has only 7 HDDs. I did a similar run of the query on my
cluster (10 nodes, each with 32 cores and 23 HDDs). For 03-xx-02 and 03-xx-06, the process time per record is about the same as what you got, even though the total query time is much shorter because it runs on a much bigger system.

Another note: for your test, disk IO is about 24MB/s per HDD (150GB / 300s / 21 HDDs), which is about the IO limit if there are a lot of small IOs (what is the chunk size of your HDFS?).

Hope this helps; let me know if you have any further questions.

Thanks,
Dechang




On Mon, May 23, 2016 at 7:28 PM, Yijie Shen <henry.yijieshen@gmail.com> wrote:
> Hi Dechang, thanks for your reply.
>
> I've changed the parallelism of each node from 17 to 24 but saw no
> performance improvement.
>
> The 1516 files are distributed on HDFS across 3 datanodes, the same 3
> servers where I start the Drillbits. For each datanode, 7 HDDs are
> configured as dfs.datanode.data.dir.
> The data was generated with the TPC-H dbgen tool and written to Parquet
> format using Apache Spark; the files are 101MB on average, no skew seen.
>
> I've pasted the `Full json profile` content and attached it to this mail.
>
> Thank you.


Re: Performance tuning for TPC-H Q1 on a three nodes cluster

Posted by Dechang Gu <dg...@maprtech.com>.
Hi Yijie,
This is Dechang at MapR. I work on Drill performance.

From what you described, it looks like the scan took most of the time.
How are the files distributed on the disks? Is there any skew?
How many disks are there?
If possible, can you provide the profile for the run?
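
(The profile can usually be grabbed from the Drill Web UI, typically at
http://<drillbit-host>:8047/profiles with the host placeholder filled in for
your cluster; the `Full json profile` section on the query's page is the most
useful part to share.)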

Thanks,
Dechang
