Posted to user@drill.apache.org by Chris Mathews <ma...@uk2.net> on 2015/10/21 20:26:40 UTC

Externally created Parquet files and partition pruning

We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.

Some questions:

How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?

It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?

If I am, then is there any way, using a Drill API, of programmatically loading our data into Parquet files and utilising Drill's parallelisation techniques as CTAS does, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?

Another potential issue is that we are constantly writing Parquet files out to HDFS directories, so the data in these files eventually appears as additional data in a Drill query - how can we do this with CTAS ? Does CTAS append to an existing directory structure, or does it insist on a new table name each time it is executed ?

What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about, it is not really an option to write the files out only for them to be read back in again for conversion using CTAS, which is why we write the Parquet files out directly to HDFS and append them to existing directories.

Am I missing something obvious here - quite possibly yes ?

Thanks for any help.

Cheers — Chris


Re: Externally created Parquet files and partition pruning

Posted by rahul challapalli <ch...@gmail.com>.
Chris,

It's not sufficient just to specify which column is the partition column.
The data should also be organized accordingly. Below is a high-level
description of how partition pruning works with parquet files:

1. Use CTAS with a PARTITION BY clause: here Drill creates one (or
more) file for each distinct partition column value, and all the records
which have the same partition column value go into that file. The
metadata of each parquet file contains the information necessary to identify
the partition column(s) in that file.
2. Use a query with a filter on the partition column: during planning,
if Drill detects a filter on the partition column it will
instruct the execution engine to scan only the files whose partition column
value matches the filter condition (a short sketch of both steps follows
below).
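
To make the two steps concrete, here is a rough sketch in Drill SQL (the
workspace, table and column names below are just illustrative
placeholders, not anything from your setup):

-- step 1: CTAS with PARTITION BY; Drill writes one or more parquet
-- files per distinct event_date value and records that value in each
-- file's footer metadata
create table dfs.tmp.`events_by_day`
partition by (event_date) as
select event_date, device_id, metric
from dfs.`/staging/events`;

-- step 2: a filter on the partition column; at planning time Drill
-- should prune the scan down to the files whose footer shows
-- event_date = 2015-10-21
select device_id, metric
from dfs.tmp.`events_by_day`
where event_date = date '2015-10-21';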

Hope this helps.

- Rahul

On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net> wrote:

> We create a JSON format schema for the Parquet file using the Avro
> specification and use this schema when loading data.
>
> Is there anything special we have to do to flag a column as a partitioning
> column ?
> Sorry I don’t understand your answer. What do you mean by ‘discover the
> columns with a single value’ ?
>
> Cheers — Chris
>
> > On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com> wrote:
> >
> > The information is stored in the footer of the parquet files. Drill
> reads the metadata information stored in the parquet footer to discover the
> columns with a single value and treats them as partitioning columns.
> >
> > Thanks
> > Mehant
> >
> > On 10/21/15 11:52 AM, Chris Mathews wrote:
> >> Thank Mehant; yes we did look at doing this, but the advantages of
> using the new PARTITION BY feature is that the partitioned columns are
> automatically detected during any subsequent queries.  This is a major
> advantage as our customers are using the Tableau BI tool, and knowing
> details such as the exact partition levels and directories is not an option.
> >>
> >> By the way, having created a table using PARTITION BY and CTAS ,how
> does a query know how to action the pruning ?  Where is this information
> stored for the query to access the tables/files efficiently ?
> >>
> >> Cheers — Chris
> >>
> >>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
> >>>
> >>> In addition to the auto partitioning done by CTAS, Drill also supports
> directory based pruning. You could load data into different(nested)
> directories underneath the top level table location and use the 'where'
> clause to get the pruning performance benefits. Following is a typical
> example
> >>>
> >>> Table location: /home/user/table_name
> >>> Within this you could create nested directory structure of the form
> >>> /home/user/table_name/2010/jan
> >>> /home/user/table_name/2010/feb
> >>> ...
> >>> /home/user/table_name/2010/dec
> >>>
> >>> /home/user/table_name/2011/jan
> >>> ...
> >>> /home/user/table_name/2011/dec
> >>>
> >>> Given this directory structure you could have a query that looks like
> >>>
> >>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
> dir1 = jan;
> >>>
> >>> This would prune out scanning the parquet files under the other
> directories.
> >>>
> >>> Thanks
> >>> Mehant
> >>> On 10/21/15 11:26 AM, Chris Mathews wrote:
> >>>> We have an existing ETL framework processing machine generated data,
> which we are updating to write Parquet files out directly to HDFS using
> AvroParquetWriter for access by Drill.
> >>>>
> >>>> Some questions:
> >>>>
> >>>> How do we take advantage of Drill’s partition pruning capabilities
> with PARTITION BY if we are not using CTAS to load the Parquet files ?
> >>>>
> >>>> It seems there is no way of taking advantage of these features if the
> Parquet files are created externally to CTAS - am I correct ?
> >>>>
> >>>> If I am, then is there any way using a Drill API of programatically
> loading our data into Parquet files and utilise Drill's parallelisation
> techniques using CTAS, or do we have to write the data out to a file and
> then load that file again as input to a CTAS command ?
> >>>>
> >>>> Another potential issue is that we are constantly writing Parquet
> files out to HDFS directories so the data in these files eventually appears
> as additional data in a Drill query - so how can we do this with CTAS ?
> Does CTAS append to an existing directory structure or does it insist on a
> new table name each time it is executed ?
> >>>>
> >>>> What I am getting at here is that there seem to be performance
> enhancement features available to Drill when the Parquet files are created
> using an existing file as input to a CTAS that are not possible otherwise.
> With the volumes of data we are talking about it is not really an option to
> write the files out, form them to then be read back in again for conversion
> using CTAS; which is why we write the Parquet files out directly to HDFS
> and append them to existing directories.
> >>>>
> >>>> Am I missing something obvious here - quite possibly yes ?
> >>>>
> >>>> Thanks for any help.
> >>>>
> >>>> Cheers — Chris
> >>>>
> >
>
>

Re: Externally created Parquet files and partition pruning

Posted by Chris Mathews <ma...@uk2.net>.
@Jacques:

>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup required.

So as it stands at the moment, does this mean Drill has to read the footers of all the Parquet files in a directory to detect whether a column has a single value across all files in that directory ?
Or is this where the metadata comes into play ?

— Chris


> On 28 Oct 2015, at 00:30, Jacques Nadeau <ja...@dremio.com> wrote:
> 
> Sounds like a future optimization opportunity once someone has the hybrid
> issue and need.
> 
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni <ji...@gmail.com> wrote:
> 
>> @Jacques,
>> 
>> Steven probably could confirm whether my understanding of the code is
>> correct or not. From the code, it seems we enforce the checking that
>> only a column with unique value across all the files would be
>> considered for pruning.
>> 
>> I just tried two simple cases with TPC-H sample data. It seems to be
>> the case; the table that meets this unique value requirement shows
>> partition pruning works, while the table that has one file that
>> violates this requirement does not show so.
>> 
>> I think the code imposes such requirement due to the current way how
>> partition pruning works : we build a set of value vectors for each
>> input files before partition pruning does the interpreter evaluation.
>> Those value vectors assume that each input file should have single
>> value; otherwise it would not fit into this model.
>> 
>> I agree with you that we actually should do the partitioning if any of
>> the files are excludable. That sounds like we should change the
>> partition pruning rule such that the filter evaluation is moved into
>> ParquetGroupScan itself.
>> 
>> 
>> 
>> 
>> On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <ja...@dremio.com>
>> wrote:
>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup
>>> required.
>>> 
>>> Jinfeng, based on what you said, it sounds as if we don't apply
>>> partitioning unless all the files read are partitioned (single value
>>> columns). We should actually do the partitioning if any of the files are
>>> excludable, right?
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <ma...@uk2.net> wrote:
>>> 
>>>> Thanks guys this is very helpful.
>>>> I now need to go away and do some more research into this.
>>>> 
>>>> Cheers -- Chris
>>>> ____________________
>>>> Sent from my iPhone
>>>> 
>>>>> On 21 Oct 2015, at 21:32, Jinfeng Ni <ji...@gmail.com> wrote:
>>>>> 
>>>>> For each column in the parquet files, Drill will check column metadata
>>>>> and see if min == max across all parquet files. If yes, that indicates
>>>>> this column has a unique value for all the files, and Drill will use
>>>>> that column as partitioning columns.
>>>>> 
>>>>> The partitioning column could be a column specified in "PARTITION BY"
>>>>> clause of CTAS; in such case, CTAS will ensure each created parquet
>>>>> file will have unique value for the column.  If the parquet files are
>>>>> not created by CTAS PARTITION BY, but some column happens to be unique
>>>>> across all the parquet files, then it will be treated as partitioning
>>>>> column as well.
>>>>> 
>>>>> You may get more detail in the code [1].
>>>>> 
>>>>> [1]
>>>> 
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
>>>>> 
>>>>> 
>>>>>> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net>
>>>> wrote:
>>>>>> We create a JSON format schema for the Parquet file using the Avro
>>>> specification and use this schema when loading data.
>>>>>> 
>>>>>> Is there anything special we have to do to flag a column as a
>>>> partitioning column ?
>>>>>> Sorry I don’t understand your answer. What do you mean by ‘discover
>> the
>>>> columns with a single value’ ?
>>>>>> 
>>>>>> Cheers — Chris
>>>>>> 
>>>>>>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>> The information is stored in the footer of the parquet files. Drill
>>>> reads the metadata information stored in the parquet footer to discover
>> the
>>>> columns with a single value and treats them as partitioning columns.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Mehant
>>>>>>> 
>>>>>>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>>>>>>> Thank Mehant; yes we did look at doing this, but the advantages of
>>>> using the new PARTITION BY feature is that the partitioned columns are
>>>> automatically detected during any subsequent queries.  This is a major
>>>> advantage as our customers are using the Tableau BI tool, and knowing
>>>> details such as the exact partition levels and directories is not an
>> option.
>>>>>>>> 
>>>>>>>> By the way, having created a table using PARTITION BY and CTAS ,how
>>>> does a query know how to action the pruning ?  Where is this information
>>>> stored for the query to access the tables/files efficiently ?
>>>>>>>> 
>>>>>>>> Cheers — Chris
>>>>>>>> 
>>>>>>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com>
>> wrote:
>>>>>>>>> 
>>>>>>>>> In addition to the auto partitioning done by CTAS, Drill also
>>>> supports directory based pruning. You could load data into
>>>> different(nested) directories underneath the top level table location
>> and
>>>> use the 'where' clause to get the pruning performance benefits.
>> Following
>>>> is a typical example
>>>>>>>>> 
>>>>>>>>> Table location: /home/user/table_name
>>>>>>>>> Within this you could create nested directory structure of the
>> form
>>>>>>>>> /home/user/table_name/2010/jan
>>>>>>>>> /home/user/table_name/2010/feb
>>>>>>>>> ...
>>>>>>>>> /home/user/table_name/2010/dec
>>>>>>>>> 
>>>>>>>>> /home/user/table_name/2011/jan
>>>>>>>>> ...
>>>>>>>>> /home/user/table_name/2011/dec
>>>>>>>>> 
>>>>>>>>> Given this directory structure you could have a query that looks
>> like
>>>>>>>>> 
>>>>>>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
>>>> dir1 = jan;
>>>>>>>>> 
>>>>>>>>> This would prune out scanning the parquet files under the other
>>>> directories.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Mehant
>>>>>>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>>>>>>> We have an existing ETL framework processing machine generated
>>>> data, which we are updating to write Parquet files out directly to HDFS
>>>> using AvroParquetWriter for access by Drill.
>>>>>>>>>> 
>>>>>>>>>> Some questions:
>>>>>>>>>> 
>>>>>>>>>> How do we take advantage of Drill’s partition pruning
>> capabilities
>>>> with PARTITION BY if we are not using CTAS to load the Parquet files ?
>>>>>>>>>> 
>>>>>>>>>> It seems there is no way of taking advantage of these features if
>>>> the Parquet files are created externally to CTAS - am I correct ?
>>>>>>>>>> 
>>>>>>>>>> If I am, then is there any way using a Drill API of
>> programatically
>>>> loading our data into Parquet files and utilise Drill's parallelisation
>>>> techniques using CTAS, or do we have to write the data out to a file and
>>>> then load that file again as input to a CTAS command ?
>>>>>>>>>> 
>>>>>>>>>> Another potential issue is that we are constantly writing Parquet
>>>> files out to HDFS directories so the data in these files eventually
>> appears
>>>> as additional data in a Drill query - so how can we do this with CTAS ?
>>>> Does CTAS append to an existing directory structure or does it insist
>> on a
>>>> new table name each time it is executed ?
>>>>>>>>>> 
>>>>>>>>>> What I am getting at here is that there seem to be performance
>>>> enhancement features available to Drill when the Parquet files are
>> created
>>>> using an existing file as input to a CTAS that are not possible
>> otherwise.
>>>> With the volumes of data we are talking about it is not really an
>> option to
>>>> write the files out, form them to then be read back in again for
>> conversion
>>>> using CTAS; which is why we write the Parquet files out directly to HDFS
>>>> and append them to existing directories.
>>>>>>>>>> 
>>>>>>>>>> Am I missing something obvious here - quite possibly yes ?
>>>>>>>>>> 
>>>>>>>>>> Thanks for any help.
>>>>>>>>>> 
>>>>>>>>>> Cheers — Chris
>>>>>> 
>>>> 
>> 


Re: Externally created Parquet files and partition pruning

Posted by Jacques Nadeau <ja...@dremio.com>.
Sounds like a future optimization opportunity once someone has the hybrid
issue and need.

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni <ji...@gmail.com> wrote:

> @Jacques,
>
> Steven probably could confirm whether my understanding of the code is
> correct or not. From the code, it seems we enforce the checking that
> only a column with unique value across all the files would be
> considered for pruning.
>
> I just tried two simple cases with TPC-H sample data. It seems to be
> the case; the table that meets this unique value requirement shows
> partition pruning works, while the table that has one file that
> violates this requirement does not show so.
>
> I think the code imposes such requirement due to the current way how
> partition pruning works : we build a set of value vectors for each
> input files before partition pruning does the interpreter evaluation.
> Those value vectors assume that each input file should have single
> value; otherwise it would not fit into this model.
>
> I agree with you that we actually should do the partitioning if any of
> the files are excludable. That sounds like we should change the
> partition pruning rule such that the filter evaluation is moved into
> ParquetGroupScan itself.
>
>
>
>
> On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <ja...@dremio.com>
> wrote:
> > Jinfeng hit the nail on the head. If you have Parquet files with single
> > value columns (and have Parquet footer metadata stats), Drill will
> > automatically leverage the partitioning with zero additional setup
> > required.
> >
> > Jinfeng, based on what you said, it sounds as if we don't apply
> > partitioning unless all the files read are partitioned (single value
> > columns). We should actually do the partitioning if any of the files are
> > excludable, right?
> >
> > --
> > Jacques Nadeau
> > CTO and Co-Founder, Dremio
> >
> > On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <ma...@uk2.net> wrote:
> >
> >> Thanks guys this is very helpful.
> >> I now need to go away and do some more research into this.
> >>
> >> Cheers -- Chris
> >> ____________________
> >> Sent from my iPhone
> >>
> >> > On 21 Oct 2015, at 21:32, Jinfeng Ni <ji...@gmail.com> wrote:
> >> >
> >> > For each column in the parquet files, Drill will check column metadata
> >> > and see if min == max across all parquet files. If yes, that indicates
> >> > this column has a unique value for all the files, and Drill will use
> >> > that column as partitioning columns.
> >> >
> >> > The partitioning column could be a column specified in "PARTITION BY"
> >> > clause of CTAS; in such case, CTAS will ensure each created parquet
> >> > file will have unique value for the column.  If the parquet files are
> >> > not created by CTAS PARTITION BY, but some column happens to be unique
> >> > across all the parquet files, then it will be treated as partitioning
> >> > column as well.
> >> >
> >> > You may get more detail in the code [1].
> >> >
> >> > [1]
> >>
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
> >> >
> >> >
> >> >> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net>
> >> wrote:
> >> >> We create a JSON format schema for the Parquet file using the Avro
> >> specification and use this schema when loading data.
> >> >>
> >> >> Is there anything special we have to do to flag a column as a
> >> partitioning column ?
> >> >> Sorry I don’t understand your answer. What do you mean by ‘discover
> the
> >> columns with a single value’ ?
> >> >>
> >> >> Cheers — Chris
> >> >>
> >> >>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com>
> wrote:
> >> >>>
> >> >>> The information is stored in the footer of the parquet files. Drill
> >> reads the metadata information stored in the parquet footer to discover
> the
> >> columns with a single value and treats them as partitioning columns.
> >> >>>
> >> >>> Thanks
> >> >>> Mehant
> >> >>>
> >> >>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
> >> >>>> Thank Mehant; yes we did look at doing this, but the advantages of
> >> using the new PARTITION BY feature is that the partitioned columns are
> >> automatically detected during any subsequent queries.  This is a major
> >> advantage as our customers are using the Tableau BI tool, and knowing
> >> details such as the exact partition levels and directories is not an
> option.
> >> >>>>
> >> >>>> By the way, having created a table using PARTITION BY and CTAS ,how
> >> does a query know how to action the pruning ?  Where is this information
> >> stored for the query to access the tables/files efficiently ?
> >> >>>>
> >> >>>> Cheers — Chris
> >> >>>>
> >> >>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com>
> wrote:
> >> >>>>>
> >> >>>>> In addition to the auto partitioning done by CTAS, Drill also
> >> supports directory based pruning. You could load data into
> >> different(nested) directories underneath the top level table location
> and
> >> use the 'where' clause to get the pruning performance benefits.
> Following
> >> is a typical example
> >> >>>>>
> >> >>>>> Table location: /home/user/table_name
> >> >>>>> Within this you could create nested directory structure of the
> form
> >> >>>>> /home/user/table_name/2010/jan
> >> >>>>> /home/user/table_name/2010/feb
> >> >>>>> ...
> >> >>>>> /home/user/table_name/2010/dec
> >> >>>>>
> >> >>>>> /home/user/table_name/2011/jan
> >> >>>>> ...
> >> >>>>> /home/user/table_name/2011/dec
> >> >>>>>
> >> >>>>> Given this directory structure you could have a query that looks
> like
> >> >>>>>
> >> >>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
> >> dir1 = jan;
> >> >>>>>
> >> >>>>> This would prune out scanning the parquet files under the other
> >> directories.
> >> >>>>>
> >> >>>>> Thanks
> >> >>>>> Mehant
> >> >>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
> >> >>>>>> We have an existing ETL framework processing machine generated
> >> data, which we are updating to write Parquet files out directly to HDFS
> >> using AvroParquetWriter for access by Drill.
> >> >>>>>>
> >> >>>>>> Some questions:
> >> >>>>>>
> >> >>>>>> How do we take advantage of Drill’s partition pruning
> capabilities
> >> with PARTITION BY if we are not using CTAS to load the Parquet files ?
> >> >>>>>>
> >> >>>>>> It seems there is no way of taking advantage of these features if
> >> the Parquet files are created externally to CTAS - am I correct ?
> >> >>>>>>
> >> >>>>>> If I am, then is there any way using a Drill API of
> programatically
> >> loading our data into Parquet files and utilise Drill's parallelisation
> >> techniques using CTAS, or do we have to write the data out to a file and
> >> then load that file again as input to a CTAS command ?
> >> >>>>>>
> >> >>>>>> Another potential issue is that we are constantly writing Parquet
> >> files out to HDFS directories so the data in these files eventually
> appears
> >> as additional data in a Drill query - so how can we do this with CTAS ?
> >> Does CTAS append to an existing directory structure or does it insist
> on a
> >> new table name each time it is executed ?
> >> >>>>>>
> >> >>>>>> What I am getting at here is that there seem to be performance
> >> enhancement features available to Drill when the Parquet files are
> created
> >> using an existing file as input to a CTAS that are not possible
> otherwise.
> >> With the volumes of data we are talking about it is not really an
> option to
> >> write the files out, form them to then be read back in again for
> conversion
> >> using CTAS; which is why we write the Parquet files out directly to HDFS
> >> and append them to existing directories.
> >> >>>>>>
> >> >>>>>> Am I missing something obvious here - quite possibly yes ?
> >> >>>>>>
> >> >>>>>> Thanks for any help.
> >> >>>>>>
> >> >>>>>> Cheers — Chris
> >> >>
> >>
>

Re: Externally created Parquet files and partition pruning

Posted by Jinfeng Ni <ji...@gmail.com>.
@Jacques,

Steven could probably confirm whether my understanding of the code is
correct or not. From the code, it seems we enforce a check that
only a column with a unique value across all the files would be
considered for pruning.

I just tried two simple cases with TPC-H sample data. It seems to be
the case; the table that meets this unique-value requirement shows
that partition pruning works, while the table that has one file which
violates the requirement does not.

I think the code imposes this requirement because of the current way
partition pruning works: we build a set of value vectors for the
input files before partition pruning does the interpreter evaluation.
Those value vectors assume that each input file has a single
value; otherwise it would not fit into this model.

I agree with you that we actually should do the partitioning if any of
the files are excludable. That sounds like we should change the
partition pruning rule such that the filter evaluation is moved into
ParquetGroupScan itself.




On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <ja...@dremio.com> wrote:
> Jinfeng hit the nail on the head. If you have Parquet files with single
> value columns (and have Parquet footer metadata stats), Drill will
> automatically leverage the partitioning with zero additional setup
> required.
>
> Jinfeng, based on what you said, it sounds as if we don't apply
> partitioning unless all the files read are partitioned (single value
> columns). We should actually do the partitioning if any of the files are
> excludable, right?
>
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
>
> On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <ma...@uk2.net> wrote:
>
>> Thanks guys this is very helpful.
>> I now need to go away and do some more research into this.
>>
>> Cheers -- Chris
>> ____________________
>> Sent from my iPhone
>>
>> > On 21 Oct 2015, at 21:32, Jinfeng Ni <ji...@gmail.com> wrote:
>> >
>> > For each column in the parquet files, Drill will check column metadata
>> > and see if min == max across all parquet files. If yes, that indicates
>> > this column has a unique value for all the files, and Drill will use
>> > that column as partitioning columns.
>> >
>> > The partitioning column could be a column specified in "PARTITION BY"
>> > clause of CTAS; in such case, CTAS will ensure each created parquet
>> > file will have unique value for the column.  If the parquet files are
>> > not created by CTAS PARTITION BY, but some column happens to be unique
>> > across all the parquet files, then it will be treated as partitioning
>> > column as well.
>> >
>> > You may get more detail in the code [1].
>> >
>> > [1]
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
>> >
>> >
>> >> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net>
>> wrote:
>> >> We create a JSON format schema for the Parquet file using the Avro
>> specification and use this schema when loading data.
>> >>
>> >> Is there anything special we have to do to flag a column as a
>> partitioning column ?
>> >> Sorry I don’t understand your answer. What do you mean by ‘discover the
>> columns with a single value’ ?
>> >>
>> >> Cheers — Chris
>> >>
>> >>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com> wrote:
>> >>>
>> >>> The information is stored in the footer of the parquet files. Drill
>> reads the metadata information stored in the parquet footer to discover the
>> columns with a single value and treats them as partitioning columns.
>> >>>
>> >>> Thanks
>> >>> Mehant
>> >>>
>> >>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>> >>>> Thank Mehant; yes we did look at doing this, but the advantages of
>> using the new PARTITION BY feature is that the partitioned columns are
>> automatically detected during any subsequent queries.  This is a major
>> advantage as our customers are using the Tableau BI tool, and knowing
>> details such as the exact partition levels and directories is not an option.
>> >>>>
>> >>>> By the way, having created a table using PARTITION BY and CTAS ,how
>> does a query know how to action the pruning ?  Where is this information
>> stored for the query to access the tables/files efficiently ?
>> >>>>
>> >>>> Cheers — Chris
>> >>>>
>> >>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
>> >>>>>
>> >>>>> In addition to the auto partitioning done by CTAS, Drill also
>> supports directory based pruning. You could load data into
>> different(nested) directories underneath the top level table location and
>> use the 'where' clause to get the pruning performance benefits. Following
>> is a typical example
>> >>>>>
>> >>>>> Table location: /home/user/table_name
>> >>>>> Within this you could create nested directory structure of the form
>> >>>>> /home/user/table_name/2010/jan
>> >>>>> /home/user/table_name/2010/feb
>> >>>>> ...
>> >>>>> /home/user/table_name/2010/dec
>> >>>>>
>> >>>>> /home/user/table_name/2011/jan
>> >>>>> ...
>> >>>>> /home/user/table_name/2011/dec
>> >>>>>
>> >>>>> Given this directory structure you could have a query that looks like
>> >>>>>
>> >>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
>> dir1 = jan;
>> >>>>>
>> >>>>> This would prune out scanning the parquet files under the other
>> directories.
>> >>>>>
>> >>>>> Thanks
>> >>>>> Mehant
>> >>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>> >>>>>> We have an existing ETL framework processing machine generated
>> data, which we are updating to write Parquet files out directly to HDFS
>> using AvroParquetWriter for access by Drill.
>> >>>>>>
>> >>>>>> Some questions:
>> >>>>>>
>> >>>>>> How do we take advantage of Drill’s partition pruning capabilities
>> with PARTITION BY if we are not using CTAS to load the Parquet files ?
>> >>>>>>
>> >>>>>> It seems there is no way of taking advantage of these features if
>> the Parquet files are created externally to CTAS - am I correct ?
>> >>>>>>
>> >>>>>> If I am, then is there any way using a Drill API of programatically
>> loading our data into Parquet files and utilise Drill's parallelisation
>> techniques using CTAS, or do we have to write the data out to a file and
>> then load that file again as input to a CTAS command ?
>> >>>>>>
>> >>>>>> Another potential issue is that we are constantly writing Parquet
>> files out to HDFS directories so the data in these files eventually appears
>> as additional data in a Drill query - so how can we do this with CTAS ?
>> Does CTAS append to an existing directory structure or does it insist on a
>> new table name each time it is executed ?
>> >>>>>>
>> >>>>>> What I am getting at here is that there seem to be performance
>> enhancement features available to Drill when the Parquet files are created
>> using an existing file as input to a CTAS that are not possible otherwise.
>> With the volumes of data we are talking about it is not really an option to
>> write the files out, form them to then be read back in again for conversion
>> using CTAS; which is why we write the Parquet files out directly to HDFS
>> and append them to existing directories.
>> >>>>>>
>> >>>>>> Am I missing something obvious here - quite possibly yes ?
>> >>>>>>
>> >>>>>> Thanks for any help.
>> >>>>>>
>> >>>>>> Cheers — Chris
>> >>
>>

Re: Externally created Parquet files and partition pruning

Posted by Jacques Nadeau <ja...@dremio.com>.
Jinfeng hit the nail on the head. If you have Parquet files with single
value columns (and have Parquet footer metadata stats), Drill will
automatically leverage the partitioning with zero additional setup
required.

Jinfeng, based on what you said, it sounds as if we don't apply
partitioning unless all the files read are partitioned (single value
columns). We should actually do the partitioning if any of the files are
excludable, right?

--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <ma...@uk2.net> wrote:

> Thanks guys this is very helpful.
> I now need to go away and do some more research into this.
>
> Cheers -- Chris
> ____________________
> Sent from my iPhone
>
> > On 21 Oct 2015, at 21:32, Jinfeng Ni <ji...@gmail.com> wrote:
> >
> > For each column in the parquet files, Drill will check column metadata
> > and see if min == max across all parquet files. If yes, that indicates
> > this column has a unique value for all the files, and Drill will use
> > that column as partitioning columns.
> >
> > The partitioning column could be a column specified in "PARTITION BY"
> > clause of CTAS; in such case, CTAS will ensure each created parquet
> > file will have unique value for the column.  If the parquet files are
> > not created by CTAS PARTITION BY, but some column happens to be unique
> > across all the parquet files, then it will be treated as partitioning
> > column as well.
> >
> > You may get more detail in the code [1].
> >
> > [1]
> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
> >
> >
> >> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net>
> wrote:
> >> We create a JSON format schema for the Parquet file using the Avro
> specification and use this schema when loading data.
> >>
> >> Is there anything special we have to do to flag a column as a
> partitioning column ?
> >> Sorry I don’t understand your answer. What do you mean by ‘discover the
> columns with a single value’ ?
> >>
> >> Cheers — Chris
> >>
> >>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com> wrote:
> >>>
> >>> The information is stored in the footer of the parquet files. Drill
> reads the metadata information stored in the parquet footer to discover the
> columns with a single value and treats them as partitioning columns.
> >>>
> >>> Thanks
> >>> Mehant
> >>>
> >>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
> >>>> Thank Mehant; yes we did look at doing this, but the advantages of
> using the new PARTITION BY feature is that the partitioned columns are
> automatically detected during any subsequent queries.  This is a major
> advantage as our customers are using the Tableau BI tool, and knowing
> details such as the exact partition levels and directories is not an option.
> >>>>
> >>>> By the way, having created a table using PARTITION BY and CTAS ,how
> does a query know how to action the pruning ?  Where is this information
> stored for the query to access the tables/files efficiently ?
> >>>>
> >>>> Cheers — Chris
> >>>>
> >>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
> >>>>>
> >>>>> In addition to the auto partitioning done by CTAS, Drill also
> supports directory based pruning. You could load data into
> different(nested) directories underneath the top level table location and
> use the 'where' clause to get the pruning performance benefits. Following
> is a typical example
> >>>>>
> >>>>> Table location: /home/user/table_name
> >>>>> Within this you could create nested directory structure of the form
> >>>>> /home/user/table_name/2010/jan
> >>>>> /home/user/table_name/2010/feb
> >>>>> ...
> >>>>> /home/user/table_name/2010/dec
> >>>>>
> >>>>> /home/user/table_name/2011/jan
> >>>>> ...
> >>>>> /home/user/table_name/2011/dec
> >>>>>
> >>>>> Given this directory structure you could have a query that looks like
> >>>>>
> >>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
> dir1 = jan;
> >>>>>
> >>>>> This would prune out scanning the parquet files under the other
> directories.
> >>>>>
> >>>>> Thanks
> >>>>> Mehant
> >>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
> >>>>>> We have an existing ETL framework processing machine generated
> data, which we are updating to write Parquet files out directly to HDFS
> using AvroParquetWriter for access by Drill.
> >>>>>>
> >>>>>> Some questions:
> >>>>>>
> >>>>>> How do we take advantage of Drill’s partition pruning capabilities
> with PARTITION BY if we are not using CTAS to load the Parquet files ?
> >>>>>>
> >>>>>> It seems there is no way of taking advantage of these features if
> the Parquet files are created externally to CTAS - am I correct ?
> >>>>>>
> >>>>>> If I am, then is there any way using a Drill API of programatically
> loading our data into Parquet files and utilise Drill's parallelisation
> techniques using CTAS, or do we have to write the data out to a file and
> then load that file again as input to a CTAS command ?
> >>>>>>
> >>>>>> Another potential issue is that we are constantly writing Parquet
> files out to HDFS directories so the data in these files eventually appears
> as additional data in a Drill query - so how can we do this with CTAS ?
> Does CTAS append to an existing directory structure or does it insist on a
> new table name each time it is executed ?
> >>>>>>
> >>>>>> What I am getting at here is that there seem to be performance
> enhancement features available to Drill when the Parquet files are created
> using an existing file as input to a CTAS that are not possible otherwise.
> With the volumes of data we are talking about it is not really an option to
> write the files out, form them to then be read back in again for conversion
> using CTAS; which is why we write the Parquet files out directly to HDFS
> and append them to existing directories.
> >>>>>>
> >>>>>> Am I missing something obvious here - quite possibly yes ?
> >>>>>>
> >>>>>> Thanks for any help.
> >>>>>>
> >>>>>> Cheers — Chris
> >>
>

Re: Externally created Parquet files and partition pruning

Posted by Chris Mathews <ma...@uk2.net>.
Thanks guys this is very helpful. 
I now need to go away and do some more research into this. 

Cheers -- Chris
____________________
Sent from my iPhone

> On 21 Oct 2015, at 21:32, Jinfeng Ni <ji...@gmail.com> wrote:
> 
> For each column in the parquet files, Drill will check column metadata
> and see if min == max across all parquet files. If yes, that indicates
> this column has a unique value for all the files, and Drill will use
> that column as partitioning columns.
> 
> The partitioning column could be a column specified in "PARTITION BY"
> clause of CTAS; in such case, CTAS will ensure each created parquet
> file will have unique value for the column.  If the parquet files are
> not created by CTAS PARTITION BY, but some column happens to be unique
> across all the parquet files, then it will be treated as partitioning
> column as well.
> 
> You may get more detail in the code [1].
> 
> [1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
> 
> 
>> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net> wrote:
>> We create a JSON format schema for the Parquet file using the Avro specification and use this schema when loading data.
>> 
>> Is there anything special we have to do to flag a column as a partitioning column ?
>> Sorry I don’t understand your answer. What do you mean by ‘discover the columns with a single value’ ?
>> 
>> Cheers — Chris
>> 
>>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com> wrote:
>>> 
>>> The information is stored in the footer of the parquet files. Drill reads the metadata information stored in the parquet footer to discover the columns with a single value and treats them as partitioning columns.
>>> 
>>> Thanks
>>> Mehant
>>> 
>>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>>> Thank Mehant; yes we did look at doing this, but the advantages of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries.  This is a major advantage as our customers are using the Tableau BI tool, and knowing details such as the exact partition levels and directories is not an option.
>>>> 
>>>> By the way, having created a table using PARTITION BY and CTAS ,how does a query know how to action the pruning ?  Where is this information stored for the query to access the tables/files efficiently ?
>>>> 
>>>> Cheers — Chris
>>>> 
>>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
>>>>> 
>>>>> In addition to the auto partitioning done by CTAS, Drill also supports directory based pruning. You could load data into different(nested) directories underneath the top level table location and use the 'where' clause to get the pruning performance benefits. Following is a typical example
>>>>> 
>>>>> Table location: /home/user/table_name
>>>>> Within this you could create nested directory structure of the form
>>>>> /home/user/table_name/2010/jan
>>>>> /home/user/table_name/2010/feb
>>>>> ...
>>>>> /home/user/table_name/2010/dec
>>>>> 
>>>>> /home/user/table_name/2011/jan
>>>>> ...
>>>>> /home/user/table_name/2011/dec
>>>>> 
>>>>> Given this directory structure you could have a query that looks like
>>>>> 
>>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = jan;
>>>>> 
>>>>> This would prune out scanning the parquet files under the other directories.
>>>>> 
>>>>> Thanks
>>>>> Mehant
>>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>>> We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.
>>>>>> 
>>>>>> Some questions:
>>>>>> 
>>>>>> How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?
>>>>>> 
>>>>>> It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?
>>>>>> 
>>>>>> If I am, then is there any way using a Drill API of programatically loading our data into Parquet files and utilise Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?
>>>>>> 
>>>>>> Another potential issue is that we are constantly writing Parquet files out to HDFS directories so the data in these files eventually appears as additional data in a Drill query - so how can we do this with CTAS ? Does CTAS append to an existing directory structure or does it insist on a new table name each time it is executed ?
>>>>>> 
>>>>>> What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about it is not really an option to write the files out, form them to then be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.
>>>>>> 
>>>>>> Am I missing something obvious here - quite possibly yes ?
>>>>>> 
>>>>>> Thanks for any help.
>>>>>> 
>>>>>> Cheers — Chris
>> 

Re: Externally created Parquet files and partition pruning

Posted by Jinfeng Ni <ji...@gmail.com>.
For each column in the parquet files, Drill will check the column metadata
and see if min == max across all parquet files. If yes, that indicates
the column has a unique value for all the files, and Drill will use
that column as a partitioning column.

The partitioning column could be a column specified in the "PARTITION BY"
clause of CTAS; in that case, CTAS will ensure each created parquet
file has a unique value for the column.  If the parquet files are
not created by CTAS PARTITION BY, but some column happens to be unique
across all the parquet files, then it will be treated as a partitioning
column as well.

You can find more detail in the code [1].

[1] https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
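
As a purely hypothetical illustration of what this means for externally
created files (event_date, device_id, metric and the paths below are
made-up names, not from the actual setup): if the ETL job writes each
parquet file so that a chosen column, say event_date, holds one constant
value per file, e.g.

/data/events/2015-10-20_part0.parquet   (every row has event_date = 2015-10-20)
/data/events/2015-10-21_part0.parquet   (every row has event_date = 2015-10-21)

then a query such as

select device_id, metric
from dfs.`/data/events`
where event_date = date '2015-10-21';

should be prunable from the footer statistics alone, with no CTAS step
involved.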


On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net> wrote:
> We create a JSON format schema for the Parquet file using the Avro specification and use this schema when loading data.
>
> Is there anything special we have to do to flag a column as a partitioning column ?
> Sorry I don’t understand your answer. What do you mean by ‘discover the columns with a single value’ ?
>
> Cheers — Chris
>
>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com> wrote:
>>
>> The information is stored in the footer of the parquet files. Drill reads the metadata information stored in the parquet footer to discover the columns with a single value and treats them as partitioning columns.
>>
>> Thanks
>> Mehant
>>
>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>> Thank Mehant; yes we did look at doing this, but the advantages of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries.  This is a major advantage as our customers are using the Tableau BI tool, and knowing details such as the exact partition levels and directories is not an option.
>>>
>>> By the way, having created a table using PARTITION BY and CTAS ,how does a query know how to action the pruning ?  Where is this information stored for the query to access the tables/files efficiently ?
>>>
>>> Cheers — Chris
>>>
>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
>>>>
>>>> In addition to the auto partitioning done by CTAS, Drill also supports directory based pruning. You could load data into different(nested) directories underneath the top level table location and use the 'where' clause to get the pruning performance benefits. Following is a typical example
>>>>
>>>> Table location: /home/user/table_name
>>>> Within this you could create nested directory structure of the form
>>>> /home/user/table_name/2010/jan
>>>> /home/user/table_name/2010/feb
>>>> ...
>>>> /home/user/table_name/2010/dec
>>>>
>>>> /home/user/table_name/2011/jan
>>>> ...
>>>> /home/user/table_name/2011/dec
>>>>
>>>> Given this directory structure you could have a query that looks like
>>>>
>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = jan;
>>>>
>>>> This would prune out scanning the parquet files under the other directories.
>>>>
>>>> Thanks
>>>> Mehant
>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>> We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.
>>>>>
>>>>> Some questions:
>>>>>
>>>>> How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?
>>>>>
>>>>> It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?
>>>>>
>>>>> If I am, then is there any way using a Drill API of programatically loading our data into Parquet files and utilise Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?
>>>>>
>>>>> Another potential issue is that we are constantly writing Parquet files out to HDFS directories so the data in these files eventually appears as additional data in a Drill query - so how can we do this with CTAS ? Does CTAS append to an existing directory structure or does it insist on a new table name each time it is executed ?
>>>>>
>>>>> What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about it is not really an option to write the files out, form them to then be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.
>>>>>
>>>>> Am I missing something obvious here - quite possibly yes ?
>>>>>
>>>>> Thanks for any help.
>>>>>
>>>>> Cheers — Chris
>>>>>
>>
>

Re: Externally created Parquet files and partition pruning

Posted by Chris Mathews <ma...@uk2.net>.
We create a JSON format schema for the Parquet file using the Avro specification and use this schema when loading data.

Is there anything special we have to do to flag a column as a partitioning column ?
Sorry I don’t understand your answer. What do you mean by ‘discover the columns with a single value’ ?

Cheers — Chris

> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com> wrote:
> 
> The information is stored in the footer of the parquet files. Drill reads the metadata information stored in the parquet footer to discover the columns with a single value and treats them as partitioning columns.
> 
> Thanks
> Mehant
> 
> On 10/21/15 11:52 AM, Chris Mathews wrote:
>> Thank Mehant; yes we did look at doing this, but the advantages of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries.  This is a major advantage as our customers are using the Tableau BI tool, and knowing details such as the exact partition levels and directories is not an option.
>> 
>> By the way, having created a table using PARTITION BY and CTAS ,how does a query know how to action the pruning ?  Where is this information stored for the query to access the tables/files efficiently ?
>> 
>> Cheers — Chris
>>  
>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
>>> 
>>> In addition to the auto partitioning done by CTAS, Drill also supports directory based pruning. You could load data into different(nested) directories underneath the top level table location and use the 'where' clause to get the pruning performance benefits. Following is a typical example
>>> 
>>> Table location: /home/user/table_name
>>> Within this you could create nested directory structure of the form
>>> /home/user/table_name/2010/jan
>>> /home/user/table_name/2010/feb
>>> ...
>>> /home/user/table_name/2010/dec
>>> 
>>> /home/user/table_name/2011/jan
>>> ...
>>> /home/user/table_name/2011/dec
>>> 
>>> Given this directory structure you could have a query that looks like
>>> 
>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = jan;
>>> 
>>> This would prune out scanning the parquet files under the other directories.
>>> 
>>> Thanks
>>> Mehant
>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>> We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.
>>>> 
>>>> Some questions:
>>>> 
>>>> How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?
>>>> 
>>>> It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?
>>>> 
>>>> If I am, then is there any way using a Drill API of programatically loading our data into Parquet files and utilise Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?
>>>> 
>>>> Another potential issue is that we are constantly writing Parquet files out to HDFS directories so the data in these files eventually appears as additional data in a Drill query - so how can we do this with CTAS ? Does CTAS append to an existing directory structure or does it insist on a new table name each time it is executed ?
>>>> 
>>>> What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about it is not really an option to write the files out, form them to then be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.
>>>> 
>>>> Am I missing something obvious here - quite possibly yes ?
>>>> 
>>>> Thanks for any help.
>>>> 
>>>> Cheers — Chris
>>>> 
> 


Re: Externally created Parquet files and partition pruning

Posted by Mehant Baid <ba...@gmail.com>.
The information is stored in the footer of the parquet files. Drill 
reads the metadata information stored in the parquet footer to discover 
the columns with a single value and treats them as partitioning columns.

Thanks
Mehant

On 10/21/15 11:52 AM, Chris Mathews wrote:
> Thank Mehant; yes we did look at doing this, but the advantages of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries.  This is a major advantage as our customers are using the Tableau BI tool, and knowing details such as the exact partition levels and directories is not an option.
>
> By the way, having created a table using PARTITION BY and CTAS ,how does a query know how to action the pruning ?  Where is this information stored for the query to access the tables/files efficiently ?
>
> Cheers — Chris
>   
>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
>>
>> In addition to the auto partitioning done by CTAS, Drill also supports directory based pruning. You could load data into different(nested) directories underneath the top level table location and use the 'where' clause to get the pruning performance benefits. Following is a typical example
>>
>> Table location: /home/user/table_name
>> Within this you could create nested directory structure of the form
>> /home/user/table_name/2010/jan
>> /home/user/table_name/2010/feb
>> ...
>> /home/user/table_name/2010/dec
>>
>> /home/user/table_name/2011/jan
>> ...
>> /home/user/table_name/2011/dec
>>
>> Given this directory structure you could have a query that looks like
>>
>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = jan;
>>
>> This would prune out scanning the parquet files under the other directories.
>>
>> Thanks
>> Mehant
>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>> We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.
>>>
>>> Some questions:
>>>
>>> How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?
>>>
>>> It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?
>>>
>>> If I am, then is there any way using a Drill API of programatically loading our data into Parquet files and utilise Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?
>>>
>>> Another potential issue is that we are constantly writing Parquet files out to HDFS directories so the data in these files eventually appears as additional data in a Drill query - so how can we do this with CTAS ? Does CTAS append to an existing directory structure or does it insist on a new table name each time it is executed ?
>>>
>>> What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about it is not really an option to write the files out, form them to then be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.
>>>
>>> Am I missing something obvious here - quite possibly yes ?
>>>
>>> Thanks for any help.
>>>
>>> Cheers — Chris
>>>


Re: Externally created Parquet files and partition pruning

Posted by Chris Mathews <ma...@uk2.net>.
Thanks Mehant; yes, we did look at doing this, but the advantage of using the new PARTITION BY feature is that the partitioned columns are automatically detected during any subsequent queries.  This is a major advantage as our customers are using the Tableau BI tool, and knowing details such as the exact partition levels and directories is not an option.

By the way, having created a table using PARTITION BY and CTAS, how does a query know how to action the pruning ?  Where is this information stored for the query to access the tables/files efficiently ?

Cheers — Chris
 
> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com> wrote:
> 
> In addition to the auto partitioning done by CTAS, Drill also supports directory based pruning. You could load data into different(nested) directories underneath the top level table location and use the 'where' clause to get the pruning performance benefits. Following is a typical example
> 
> Table location: /home/user/table_name
> Within this you could create nested directory structure of the form
> /home/user/table_name/2010/jan
> /home/user/table_name/2010/feb
> ...
> /home/user/table_name/2010/dec
> 
> /home/user/table_name/2011/jan
> ...
> /home/user/table_name/2011/dec
> 
> Given this directory structure you could have a query that looks like
> 
> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = jan;
> 
> This would prune out scanning the parquet files under the other directories.
> 
> Thanks
> Mehant
> On 10/21/15 11:26 AM, Chris Mathews wrote:
>> We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.
>> 
>> Some questions:
>> 
>> How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?
>> 
>> It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?
>> 
>> If I am, then is there any way using a Drill API of programatically loading our data into Parquet files and utilise Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?
>> 
>> Another potential issue is that we are constantly writing Parquet files out to HDFS directories so the data in these files eventually appears as additional data in a Drill query - so how can we do this with CTAS ? Does CTAS append to an existing directory structure or does it insist on a new table name each time it is executed ?
>> 
>> What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about it is not really an option to write the files out, form them to then be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.
>> 
>> Am I missing something obvious here - quite possibly yes ?
>> 
>> Thanks for any help.
>> 
>> Cheers — Chris
>> 
> 


Re: Externally created Parquet files and partition pruning

Posted by Mehant Baid <ba...@gmail.com>.
In addition to the auto-partitioning done by CTAS, Drill also supports
directory-based pruning. You could load data into different (nested)
directories underneath the top-level table location and use the 'where'
clause to get the pruning performance benefits. Following is a typical
example:

Table location: /home/user/table_name
Within this you could create nested directory structure of the form
/home/user/table_name/2010/jan
/home/user/table_name/2010/feb
...
/home/user/table_name/2010/dec

/home/user/table_name/2011/jan
...
/home/user/table_name/2011/dec

Given this directory structure you could have a query that looks like

select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and dir1 = 'jan';

This would prune out scanning the parquet files under the other 
directories.
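
On the append question: with a layout like this, newly written files
simply land in the appropriate (possibly new) subdirectory and the next
query over the table location sees them, with no reload or CTAS step. As
a sketch, assuming the same layout as above: if new files start arriving
under /home/user/table_name/2011/dec, a query such as

select col1 from dfs.`/home/user/table_name` where dir0 = 2011;

picks up the new December files automatically while still pruning away
everything under /home/user/table_name/2010.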

Thanks
Mehant
On 10/21/15 11:26 AM, Chris Mathews wrote:
> We have an existing ETL framework processing machine generated data, which we are updating to write Parquet files out directly to HDFS using AvroParquetWriter for access by Drill.
>
> Some questions:
>
> How do we take advantage of Drill’s partition pruning capabilities with PARTITION BY if we are not using CTAS to load the Parquet files ?
>
> It seems there is no way of taking advantage of these features if the Parquet files are created externally to CTAS - am I correct ?
>
> If I am, then is there any way using a Drill API of programatically loading our data into Parquet files and utilise Drill's parallelisation techniques using CTAS, or do we have to write the data out to a file and then load that file again as input to a CTAS command ?
>
> Another potential issue is that we are constantly writing Parquet files out to HDFS directories so the data in these files eventually appears as additional data in a Drill query - so how can we do this with CTAS ? Does CTAS append to an existing directory structure or does it insist on a new table name each time it is executed ?
>
> What I am getting at here is that there seem to be performance enhancement features available to Drill when the Parquet files are created using an existing file as input to a CTAS that are not possible otherwise.  With the volumes of data we are talking about it is not really an option to write the files out, form them to then be read back in again for conversion using CTAS; which is why we write the Parquet files out directly to HDFS and append them to existing directories.
>
> Am I missing something obvious here - quite possibly yes ?
>
> Thanks for any help.
>
> Cheers — Chris
>