Posted to user@drill.apache.org by Chris Mathews <ma...@uk2.net> on 2015/11/02 18:48:28 UTC

Re: Externally created Parquet files and partition pruning

@Jacques:

>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup required.

So as it stands at the moment, does this mean Drill has to read the footer of all Parquet files in a directory to detect if a column has a single value across all files in that directory ?
Or is this where the metadata comes into play ?
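
(I am guessing this is also where the new parquet metadata cache comes into
the picture, e.g. something along the lines of

  refresh table metadata dfs.`/path/to/table`;

so that the footer stats are gathered once up front rather than on every
query, but please correct me if that is a separate mechanism.)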

— Chris


> On 28 Oct 2015, at 00:30, Jacques Nadeau <ja...@dremio.com> wrote:
> 
> Sounds like a future optimization opportunity once someone has the hybrid
> issue and need.
> 
> --
> Jacques Nadeau
> CTO and Co-Founder, Dremio
> 
> On Sun, Oct 25, 2015 at 6:00 PM, Jinfeng Ni <ji...@gmail.com> wrote:
> 
>> @Jacques,
>> 
>> Steven could probably confirm whether my understanding of the code is
>> correct. From the code, it seems we enforce a check so that only a
>> column with a single value in each of the files is considered for
>> pruning.
>> 
>> I just tried two simple cases with TPC-H sample data. It seems to be
>> the case: partition pruning works for the table that meets this
>> single-value requirement, while it does not for the table that has one
>> file violating the requirement.
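>> 
>> (A rough way to check, with placeholder paths and column names, is to
>> look at the physical plan and see how many files the scan still reads
>> once the filter is in place. The exact plan text may differ by version:
>> 
>>   explain plan for
>>   select l_orderkey from dfs.`/path/to/lineitem_parquet`
>>   where l_shipdate = date '1996-03-01';
>> 
>> The parquet scan in the plan reports a numFiles value; comparing it with
>> and without the filter shows whether pruning kicked in.)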
>> 
>> I think the code imposes this requirement because of the way partition
>> pruning currently works: we build a set of value vectors for the input
>> files before partition pruning does the interpreter evaluation. Those
>> value vectors assume that each input file has a single value for the
>> column; otherwise it would not fit into this model.
>> 
>> I agree with you that we actually should do the partitioning if any of
>> the files are excludable. That sounds like we should change the
>> partition pruning rule such that the filter evaluation is moved into
>> ParquetGroupScan itself.
>> 
>> 
>> 
>> 
>> On Sun, Oct 25, 2015 at 4:32 PM, Jacques Nadeau <ja...@dremio.com>
>> wrote:
>>> Jinfeng hit the nail on the head. If you have Parquet files with single
>>> value columns (and have Parquet footer metadata stats), Drill will
>>> automatically leverage the partitioning with zero additional setup
>>> required.
>>> 
>>> Jinfeng, based on what you said, it sounds as if we don't apply
>>> partitioning unless all the files read are partitioned (single value
>>> columns). We should actually do the partitioning if any of the files are
>>> excludable, right?
>>> 
>>> --
>>> Jacques Nadeau
>>> CTO and Co-Founder, Dremio
>>> 
>>> On Wed, Oct 21, 2015 at 1:45 PM, Chris Mathews <ma...@uk2.net> wrote:
>>> 
>>>> Thanks guys this is very helpful.
>>>> I now need to go away and do some more research into this.
>>>> 
>>>> Cheers -- Chris
>>>> ____________________
>>>> Sent from my iPhone
>>>> 
>>>>> On 21 Oct 2015, at 21:32, Jinfeng Ni <ji...@gmail.com> wrote:
>>>>> 
>>>>> For each column in the parquet files, Drill will check the column
>>>>> metadata and see if min == max within each parquet file. If that holds
>>>>> for all the files, the column has a single value per file, and Drill
>>>>> will use that column as a partitioning column.
>>>>> 
>>>>> The partitioning column could be a column specified in the "PARTITION
>>>>> BY" clause of CTAS; in that case, CTAS ensures each created parquet
>>>>> file has a single value for the column.  If the parquet files are not
>>>>> created by CTAS PARTITION BY, but some column happens to have a single
>>>>> value in every parquet file, then it will be treated as a partitioning
>>>>> column as well.
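>>>>> 
>>>>> As a sketch (the table and column names here are made up), a CTAS like
>>>>> 
>>>>>   create table dfs.tmp.`events_by_day`
>>>>>   partition by (event_date)
>>>>>   as select event_date, event_type, payload
>>>>>   from dfs.`/data/events_raw`;
>>>>> 
>>>>> writes each distinct event_date value into its own parquet file(s), so
>>>>> a later query such as
>>>>> 
>>>>>   select count(*) from dfs.tmp.`events_by_day`
>>>>>   where event_date = date '2015-10-01';
>>>>> 
>>>>> only needs to scan the files whose footer stats match that value.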
>>>>> 
>>>>> You may get more detail in the code [1].
>>>>> 
>>>>> [1]
>>>> 
>> https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L324
>>>>> 
>>>>> 
>>>>>> On Wed, Oct 21, 2015 at 12:18 PM, Chris Mathews <ma...@uk2.net>
>>>> wrote:
>>>>>> We create a JSON format schema for the Parquet file using the Avro
>>>> specification and use this schema when loading data.
>>>>>> 
>>>>>> Is there anything special we have to do to flag a column as a
>>>> partitioning column ?
>>>>>> Sorry I don’t understand your answer. What do you mean by ‘discover
>> the
>>>> columns with a single value’ ?
>>>>>> 
>>>>>> Cheers — Chris
>>>>>> 
>>>>>>> On 21 Oct 2015, at 20:02, Mehant Baid <ba...@gmail.com>
>> wrote:
>>>>>>> 
>>>>>>> The information is stored in the footer of the parquet files. Drill
>>>> reads the metadata information stored in the parquet footer to discover
>> the
>>>> columns with a single value and treats them as partitioning columns.
>>>>>>> 
>>>>>>> Thanks
>>>>>>> Mehant
>>>>>>> 
>>>>>>>> On 10/21/15 11:52 AM, Chris Mathews wrote:
>>>>>>>> Thanks Mehant; yes we did look at doing this, but the advantage of
>>>> using the new PARTITION BY feature is that the partitioned columns are
>>>> automatically detected during any subsequent queries.  This is a major
>>>> advantage as our customers are using the Tableau BI tool, and knowing
>>>> details such as the exact partition levels and directories is not an
>> option.
>>>>>>>> 
>>>>>>>> By the way, having created a table using PARTITION BY and CTAS, how
>>>> does a query know how to apply the pruning ?  Where is this information
>>>> stored for the query to access the tables/files efficiently ?
>>>>>>>> 
>>>>>>>> Cheers — Chris
>>>>>>>> 
>>>>>>>>> On 21 Oct 2015, at 19:37, Mehant Baid <ba...@gmail.com>
>> wrote:
>>>>>>>>> 
>>>>>>>>> In addition to the auto partitioning done by CTAS, Drill also
>>>> supports directory-based pruning. You could load data into
>>>> different (nested) directories underneath the top-level table location
>> and
>>>> use the 'where' clause to get the pruning performance benefits.
>> Following
>>>> is a typical example
>>>>>>>>> 
>>>>>>>>> Table location: /home/user/table_name
>>>>>>>>> Within this you could create nested directory structure of the
>> form
>>>>>>>>> /home/user/table_name/2010/jan
>>>>>>>>> /home/user/table_name/2010/feb
>>>>>>>>> ...
>>>>>>>>> /home/user/table_name/2010/dec
>>>>>>>>> 
>>>>>>>>> /home/user/table_name/2011/jan
>>>>>>>>> ...
>>>>>>>>> /home/user/table_name/2011/dec
>>>>>>>>> 
>>>>>>>>> Given this directory structure you could have a query that looks
>> like
>>>>>>>>> 
>>>>>>>>> select col1 from dfs.`/home/user/table_name` where dir0 = 2011 and
>>>> dir1 = 'jan';
>>>>>>>>> 
>>>>>>>>> This would avoid scanning the parquet files under the other
>>>> directories.
>>>>>>>>> 
>>>>>>>>> Thanks
>>>>>>>>> Mehant
>>>>>>>>>> On 10/21/15 11:26 AM, Chris Mathews wrote:
>>>>>>>>>> We have an existing ETL framework processing machine-generated
>>>> data, which we are updating to write Parquet files out directly to HDFS
>>>> using AvroParquetWriter for access by Drill.
>>>>>>>>>> 
>>>>>>>>>> Some questions:
>>>>>>>>>> 
>>>>>>>>>> How do we take advantage of Drill’s partition pruning
>> capabilities
>>>> with PARTITION BY if we are not using CTAS to load the Parquet files ?
>>>>>>>>>> 
>>>>>>>>>> It seems there is no way of taking advantage of these features if
>>>> the Parquet files are created externally to CTAS - am I correct ?
>>>>>>>>>> 
>>>>>>>>>> If I am, then is there any way using a Drill API of
>> programmatically
>>>> loading our data into Parquet files and utilise Drill's parallelisation
>>>> techniques using CTAS, or do we have to write the data out to a file and
>>>> then load that file again as input to a CTAS command ?
>>>>>>>>>> 
>>>>>>>>>> Another potential issue is that we are constantly writing Parquet
>>>> files out to HDFS directories so the data in these files eventually
>> appears
>>>> as additional data in a Drill query - so how can we do this with CTAS ?
>>>> Does CTAS append to an existing directory structure or does it insist
>> on a
>>>> new table name each time it is executed ?
>>>>>>>>>> 
>>>>>>>>>> What I am getting at here is that there seem to be performance
>>>> enhancement features available to Drill when the Parquet files are
>> created
>>>> using an existing file as input to a CTAS that are not possible
>> otherwise.
>>>> With the volumes of data we are talking about it is not really an
>> option to
>>>> write the files out, for them to then be read back in again for
>> conversion
>>>> using CTAS; which is why we write the Parquet files out directly to HDFS
>>>> and append them to existing directories.
>>>>>>>>>> 
>>>>>>>>>> Am I missing something obvious here - quite possibly yes ?
>>>>>>>>>> 
>>>>>>>>>> Thanks for any help.
>>>>>>>>>> 
>>>>>>>>>> Cheers — Chris
>>>>>> 
>>>> 
>>