You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@drill.apache.org by sreeparna bhabani <bh...@gmail.com> on 2020/05/04 16:00:40 UTC

Partition Pruning in Apache Drill

Hi Team,

Kindly check the below query regarding the partition pruning. We are using
the partition pruning for our current project in Apache Drill and have some
questions. Please find the below details of the scenario-

File Type-
Parquet generated from Python

Folder structure in hdfs-
/<root_folder>/<dir0>/<dir1>/<dir2>

Query used to select data under <dir2>-
To take advantage of partition pruning
select column1, column2, ... from dfs.`tmp`.`<root_folder>` where dir0 =
<dir0> and dir1 = <dir1> and dir2 = <dir2> and <filter> = ..;

Observation-
Although the execution is fast, the time taken for planning is quite high.
I didn't see VALUES operator in the physical plan of the query, rather
there was SCAN operator.
How can we ensure that the selected data is partition pruned here ?
As an alternative, I modified the query to bring down the planning time of
it and included the sub-directories in the root directory. The modified
query is-
select column1, column2, ... from
dfs.`tmp`.`<root_folder>/<dir0>/<dir1>/<dir2>`  where <filter> = ..;

Can you please tell me why the planning time is so high for the first
query? How can we take advantage of partition pruning from it ? Or should
we include sub-directories in the root directory ?

Thanks in advance.

*Sreeparna Bhabani*

Re: Partition Pruning in Apache Drill

Posted by sreeparna bhabani <bh...@gmail.com>.

Hi Paul,

Please find the details below for which I have tried Partition Pruning.
Structure of the directory is -> /<dir0>/<dir1>/<dir2>
The number of directories at dir1 level is 21. dir2 directories are
unevenly placed under dir1. Total number of directories of dir2 is 87.
There are 2 Parquet files in each leaf folder. There are 2 Parquet files
from one leaf folder in my query.

We are using Parquet files in HDFS storage.

The parquet files are generated from Python. Is it mandate to re-write the
parquet from other tools (here Python) to take the advantage of pruning ?

Thanks in advance,
Sreeparna Bhabani



On Tue, May 5, 2020 at 1:26 PM Paul Rogers <pa...@yahoo.com> wrote:

> Hi Sreeparna,
>
>
> There are various reasons that planning might be slow. You mentioned you
> have a partitioned directory structure, which is a good approach. How many
> directories exist at each level? How many files in the leaf folders? How
> many of those folders are included in your query?
>
>
> If the number is large, then the delay may be due to the fact that Drill
> must walk the tree to identify which files to include in the query.
>
>
> Also, which file system are you using? HDFS? S3? Each has different
> characteristics when doing directory operations. (S3 has no actual
> directories, for example.)
>
>
> Please provide the additional information so we can identify the source of
> the issue.
>
>
> Thanks,
>
> - Paul
>
>
>
> On Monday, May 4, 2020, 9:01:08 AM PDT, sreeparna bhabani <
> bhabani.sreeparna@gmail.com> wrote:
>
>
> Hi Team,
>
> Kindly check the below query regarding the partition pruning. We are using
> the partition pruning for our current project in Apache Drill and have some
> questions. Please find the below details of the scenario-
>
> File Type-
> Parquet generated from Python
>
> Folder structure in hdfs-
> /<root_folder>/<dir0>/<dir1>/<dir2>
>
> Query used to select data under <dir2>-
> To take advantage of partition pruning
> select column1, column2, ... from dfs.`tmp`.`<root_folder>` where dir0 =
> <dir0> and dir1 = <dir1> and dir2 = <dir2> and <filter> = ..;
>
> Observation-
> Although the execution is fast, the time taken for planning is quite high.
> I didn't see VALUES operator in the physical plan of the query, rather
> there was SCAN operator.
> How can we ensure that the selected data is partition pruned here ?
> As an alternative, I modified the query to bring down the planning time of
> it and included the sub-directories in the root directory. The modified
> query is-
> select column1, column2, ... from
> dfs.`tmp`.`<root_folder>/<dir0>/<dir1>/<dir2>`  where <filter> = ..;
>
> Can you please tell me why the planning time is so high for the first
> query? How can we take advantage of partition pruning from it ? Or should
> we include sub-directories in the root directory ?
>
> Thanks in advance.
>
> *Sreeparna Bhabani*
>


-- 

Thanks n Regards,
*Sreeparna Bhabani*

Re: Partition Pruning in Apache Drill

Posted by Paul Rogers <pa...@yahoo.com.INVALID>.

Hi Sreeparna,

There are various reasons that planning might be slow. You mentioned you have a partitioned directory structure, which is a good approach. How many directories exist at each level? How many files in the leaf folders? How many of those folders are included in your query?

If the number is large, then the delay may be due to the fact that Drill must walk the tree to identify which files to include in the query.

Also, which file system are you using? HDFS? S3? Each has different characteristics when doing directory operations. (S3 has no actual directories, for example.)

Please provide the additional information so we can identify the source of the issue.


Thanks,
- Paul

 

    On Monday, May 4, 2020, 9:01:08 AM PDT, sreeparna bhabani <bh...@gmail.com> wrote:  
 
 Hi Team,

Kindly check the below query regarding the partition pruning. We are using
the partition pruning for our current project in Apache Drill and have some
questions. Please find the below details of the scenario-

File Type-
Parquet generated from Python

Folder structure in hdfs-
/<root_folder>/<dir0>/<dir1>/<dir2>

Query used to select data under <dir2>-
To take advantage of partition pruning
select column1, column2, ... from dfs.`tmp`.`<root_folder>` where dir0 =
<dir0> and dir1 = <dir1> and dir2 = <dir2> and <filter> = ..;

Observation-
Although the execution is fast, the time taken for planning is quite high.
I didn't see VALUES operator in the physical plan of the query, rather
there was SCAN operator.
How can we ensure that the selected data is partition pruned here ?
As an alternative, I modified the query to bring down the planning time of
it and included the sub-directories in the root directory. The modified
query is-
select column1, column2, ... from
dfs.`tmp`.`<root_folder>/<dir0>/<dir1>/<dir2>`  where <filter> = ..;

Can you please tell me why the planning time is so high for the first
query? How can we take advantage of partition pruning from it ? Or should
we include sub-directories in the root directory ?

Thanks in advance.

*Sreeparna Bhabani*