Posted to dev@parquet.apache.org by Hyukjin Kwon <gu...@gmail.com> on 2015/10/02 11:51:07 UTC
Re: Parquet, usage of statistics in DataPageHeader

Does Parquet support page-level filtering as well as rowgroup-level
filtering?
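
To make the question concrete, here is a minimal sketch of what min/max-based
skipping means at either level. It is an illustration under assumptions, not
parquet-mr code: PageSkipSketch, MinMax, canDropEq, canDropGt and pagesToRead
are hypothetical names, the column is assumed to hold long values, and null
handling is ignored.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these types are parquet-mr classes.
final class PageSkipSketch {

    /** Min/max summary of the values in one page or row group (hypothetical type). */
    static final class MinMax {
        final long min;
        final long max;

        MinMax(long min, long max) {
            this.min = min;
            this.max = max;
        }
    }

    /** Predicate "column == value": droppable if value lies outside [min, max]. */
    static boolean canDropEq(MinMax stats, long value) {
        return value < stats.min || value > stats.max;
    }

    /** Predicate "column > value": droppable if even the max is not greater than value. */
    static boolean canDropGt(MinMax stats, long value) {
        return stats.max <= value;
    }

    /** Indexes of the pages that still have to be read for "column == value". */
    static List<Integer> pagesToRead(List<MinMax> pageStats, long value) {
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < pageStats.size(); i++) {
            // A page is skipped only when its statistics prove that no value can match.
            if (!canDropEq(pageStats.get(i), value)) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<MinMax> pages = new ArrayList<>();
        pages.add(new MinMax(0, 9));
        pages.add(new MinMax(10, 19));
        pages.add(new MinMax(15, 30));
        System.out.println(pagesToRead(pages, 17)); // prints [1, 2]
    }
}
```

The same check applies whether the statistics come from a row group or from a
single page; the only difference is how much I/O, decompression and decoding
a successful drop saves.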

2015-09-18 15:06 GMT+09:00 Hyukjin Kwon <gu...@gmail.com>:

> Just in case: by skipping pages (or row groups) with statistics, I mean
> filtering them by comparing the value given to filter2 against the
> statistics (min, max, etc.) stored in DataPageHeader and ColumnMetaData.
>
> Thanks!
>
> 2015-09-18 14:58 GMT+09:00 Hyukjin Kwon <gu...@gmail.com>:
>
>> I see.
>>
>> However, does filtering at RowMaterializer (with
>> IncrementallyUpdatedFilterPredicate as filter2) actually happen only
>> after the values of a row have been read from the pages of its columns?
>>
>> I wonder whether some pages could be skipped using the statistics in
>> DataPageHeader before the data part of those pages is read, to reduce
>> the cost of I/O, decompression and decoding, just as row groups are
>> skipped using the statistics in ColumnMetaData (in a split) before a
>> Parquet file is actually read.
>>
>>
>> I may well be wrong, but, for example, I found that
>> ColumnChunkPageReadStore.ColumnChunkPageReader.readPage() is the function
>> that reads the actual page data.
>>
>>
>> public DataPage visit(DataPageV2 dataPageV2) {
>>
>>   if (!dataPageV2.isCompressed()) {
>>     return dataPageV2;
>>   }
>>   try {
>>     int uncompressedSize = Ints.checkedCast(
>>         dataPageV2.getUncompressedSize()
>>         - dataPageV2.getDefinitionLevels().size()
>>         - dataPageV2.getRepetitionLevels().size());
>>     return DataPageV2.uncompressed(
>>         dataPageV2.getRowCount(),
>>         dataPageV2.getNullCount(),
>>         dataPageV2.getValueCount(),
>>         dataPageV2.getRepetitionLevels(),
>>         dataPageV2.getDefinitionLevels(),
>>         dataPageV2.getDataEncoding(),
>>         // decompression happens here (line emphasized in the original mail)
>>         decompressor.decompress(dataPageV2.getData(), uncompressedSize),
>>         dataPageV2.getStatistics()
>>         );
>>   } catch (IOException e) {
>>     throw new ParquetDecodingException("could not decompress page", e);
>>   }
>> }
>>
>>
>> I think the page could be skipped here, before decompression and
>> decoding, by comparing the given filter value against the statistics in
>> DataPageHeader.
>>
>> Is there any existing logic that does this kind of skipping?
>>
>>
>> Thanks!
>>
>>
>>
>> 2015-09-18 2:31 GMT+09:00 Ryan Blue <bl...@cloudera.com>:
>>
>>> Hi Hyukjin,
>>>
>>> I think the code you're looking for is created by parquet-generator so
>>> we have one specific to each primitive type:
>>>
>>>
>>>
>>> https://github.com/apache/parquet-mr/blob/master/parquet-generator/src/main/java/org/apache/parquet/filter2/IncrementallyUpdatedFilterPredicateGenerator.java
>>>
>>> rb
>>>
>>>
>>> On 09/16/2015 06:57 PM, Hyukjin Kwon wrote:
>>>
>>>> Hi all,
>>>>
>>>> I am pretty new to Parquet and trying to learn Parquet structure.
>>>>
>>>> I understand that statistics such as min and max have been stored in
>>>> both ColumnMetaData and DataPageHeader since 1.6.0 (
>>>> https://github.com/Parquet/parquet-mr/pull/338)
>>>>
>>>> I see that the statistics in ColumnMetaData are used by filter2 to
>>>> filter blocks (or row groups) in RowGroupFilter, by calling canDrop().
>>>>
>>>> I thought the statistics in DataPageHeader were used to avoid reading
>>>> a page, by checking its statistics first.
>>>> However, I could not find where the statistics in DataPageHeader are
>>>> used, for either filter1 or filter2.
>>>> 
>>>>
>>>> Could you give me some comments on this please?
>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>
>>
>