Posted to dev@parquet.apache.org by "Dong Chen (JIRA)" <ji...@apache.org> on 2015/06/05 10:37:00 UTC

[jira] [Commented] (PARQUET-281) Statistic and Filter need a mechanism to get customized comparator from high layer user

    [ https://issues.apache.org/jira/browse/PARQUET-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574129#comment-14574129 ] 

Dong Chen commented on PARQUET-281:
-----------------------------------

Hi [~rdblue], as we discussed in HIVE-10254, here are some thoughts about adding a comparator at the column level rather than in the Binary class. Could you take a look when you have time? Thanks.

The customized comparator would be injected and used in 3 places:
* generating block statistics when writing
* filtering blocks with a predicate when reading
* filtering records with a predicate when reading

1. Writing
The {{Statistics}} instance holds the data and is compared and updated when a record is written. It is initialized in {{ColumnWriter}} inside Parquet and is not exposed to Hive.

To pass the comparator from Hive to Parquet, how about adding params (like {{parquet.customized.comparator.type}} and {{p.c.c.class}}) in the conf or in WriteContext.extraMetaData? Then we add a delegated comparator in {{Statistics}}. {{Statistics}} could extract the params and instantiate the comparator based on the data type.
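
To make the writing side concrete, here is a rough sketch of the idea and not a patch against Parquet. Everything in it is hypothetical: the class names, the config key {{parquet.customized.comparator.class}} (my expansion of the param name above), and the use of a plain {{Map}} in place of the real conf. It also assumes decimals are stored as big-endian two's-complement unscaled values with a shared scale. It shows a min/max holder that delegates to a comparator loaded by class name, and falls back to unsigned byte order when no param is set.

```java
import java.math.BigInteger;
import java.util.Comparator;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Parquet.
final class ComparatorSketch {

  // Hypothetical config key, following the param naming suggested above.
  static final String COMPARATOR_CLASS_KEY = "parquet.customized.comparator.class";

  /** Default ordering: unsigned, lexicographic comparison of the raw bytes. */
  static final class UnsignedBytesComparator implements Comparator<byte[]> {
    public UnsignedBytesComparator() {}

    @Override
    public int compare(byte[] a, byte[] b) {
      int n = Math.min(a.length, b.length);
      for (int i = 0; i < n; i++) {
        int cmp = Integer.compare(a[i] & 0xff, b[i] & 0xff);
        if (cmp != 0) {
          return cmp;
        }
      }
      return Integer.compare(a.length, b.length);
    }
  }

  /**
   * Decimal-aware ordering. Assumes big-endian two's-complement unscaled
   * values that all share the same scale.
   */
  static final class DecimalBytesComparator implements Comparator<byte[]> {
    public DecimalBytesComparator() {}

    @Override
    public int compare(byte[] a, byte[] b) {
      return new BigInteger(a).compareTo(new BigInteger(b));
    }
  }

  /** Stand-in for a statistics holder that delegates ordering to a comparator. */
  static final class MinMaxStatistics {
    private final Comparator<byte[]> comparator;
    private byte[] min;
    private byte[] max;

    MinMaxStatistics(Comparator<byte[]> comparator) {
      this.comparator = comparator;
    }

    void update(byte[] value) {
      if (min == null || comparator.compare(value, min) < 0) {
        min = value;
      }
      if (max == null || comparator.compare(value, max) > 0) {
        max = value;
      }
    }

    byte[] getMin() { return min; }
    byte[] getMax() { return max; }
  }

  /** Instantiates the configured comparator, or the default when the key is absent. */
  static Comparator<byte[]> loadComparator(Map<String, String> conf) {
    String className = conf.get(COMPARATOR_CLASS_KEY);
    if (className == null) {
      return new UnsignedBytesComparator();
    }
    try {
      @SuppressWarnings("unchecked")
      Comparator<byte[]> loaded =
          (Comparator<byte[]>) Class.forName(className).getDeclaredConstructor().newInstance();
      return loaded;
    } catch (ReflectiveOperationException e) {
      throw new IllegalArgumentException("Cannot instantiate comparator " + className, e);
    }
  }
}
```

With the bytes {{0xFF}} (decimal -1) and {{0x02}} (decimal 2), the default ordering picks {{0x02}} as the min, while the decimal-aware comparator picks {{0xFF}}, which is the mismatch described in the issue.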

2. Reading
Methods like {{FilterApi.binaryColumn}} are exposed, so we could pass the comparator from Hive through them. The {{Operators.Column}} class would then need an attribute to store the comparator.

For filtering blocks, modify the {{visit}} methods in {{StatisticsFilter}} to get the comparator through {{Column}} and use it if it exists.
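
As an illustration of the block-level check, here is a simplified sketch. The {{Column}} and {{canDropForEq}} below are hypothetical stand-ins and do not reproduce the real {{Operators.Column}} or {{StatisticsFilter}} code. The point is only that the "can this block be dropped" decision asks the column for a comparator and falls back to natural ordering when none was supplied.

```java
import java.util.Comparator;

// Hypothetical sketch; not Parquet's Operators.Column or StatisticsFilter.
final class BlockFilterSketch {

  /** Stand-in for a filter column that may carry a caller-supplied comparator. */
  static final class Column<T extends Comparable<T>> {
    private final String path;
    private final Comparator<T> comparator; // null means "use natural ordering"

    Column(String path, Comparator<T> comparator) {
      this.path = path;
      this.comparator = comparator;
    }

    String getPath() { return path; }

    Comparator<T> effectiveComparator() {
      return comparator != null ? comparator : Comparator.naturalOrder();
    }
  }

  /**
   * Returns true when the wanted value lies outside [min, max] under the
   * column's ordering, so no record in the block can match an equality predicate.
   */
  static <T extends Comparable<T>> boolean canDropForEq(
      Column<T> column, T wanted, T min, T max) {
    Comparator<T> cmp = column.effectiveComparator();
    return cmp.compare(wanted, min) < 0 || cmp.compare(wanted, max) > 0;
  }
}
```

For example, with numeric strings and stored statistics min "9" and max "10", natural string ordering would wrongly drop the block for the value "10", while a comparator that parses the numbers keeps it.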

For filtering records, modify the {{update}} methods in {{IncrementallyUpdatedFilterPredicate.ValueInspector}} (the impl is actually in {{IncrementallyUpdatedFilterPredicateGenerator}}) to get the comparator through {{Column}} and use it if it exists.
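
The record-level side could look roughly like the sketch below. The {{GtInspector}} class is hypothetical and only loosely modeled on the idea of a value inspector that is updated once per record. It is not the generated Parquet code. It evaluates a "greater than" predicate with the supplied comparator when one is given and with natural ordering otherwise.

```java
import java.util.Comparator;

// Hypothetical sketch; not the generated IncrementallyUpdatedFilterPredicate code.
final class RecordFilterSketch {

  /** Per-record inspector for the predicate "value > bound". */
  static final class GtInspector<T extends Comparable<T>> {
    private final T bound;
    private final Comparator<T> comparator;
    private boolean known;
    private boolean result;

    GtInspector(T bound, Comparator<T> customComparator) {
      this.bound = bound;
      // Use the custom comparator if it exists, else natural ordering.
      this.comparator =
          customComparator != null ? customComparator : Comparator.naturalOrder();
    }

    /** Called with the column value of the current record. */
    void update(T value) {
      result = comparator.compare(value, bound) > 0;
      known = true;
    }

    /** Clears the state before the next record. */
    void reset() {
      known = false;
      result = false;
    }

    boolean isKnown() { return known; }

    boolean getResult() {
      if (!known) {
        throw new IllegalStateException("update() has not been called for this record");
      }
      return result;
    }
  }
}
```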

How does this sound?

> Statistic and Filter need a mechanism to get customized comparator from high layer user
> ---------------------------------------------------------------------------------------
>
>                 Key: PARQUET-281
>                 URL: https://issues.apache.org/jira/browse/PARQUET-281
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Dong Chen
>            Assignee: Dong Chen
>
> As discussed in HIVE-10254, we might need a customized comparator from high layer user for generating statistic when writing and applying filter when reading. 
> The problem is that (use Decimal type in Hive as an example):
> Decimal in Hive is mapped to Binary in Parquet. When using predicate and statistic to filter values, comparing Binary values in Parquet cannot reflect the correct relationship of Decimal values in Hive. This type mapping causes 2 problems:
> 1. When writing Decimal column, Binary.compareTo() is used to judge and set the column statistic (min, max). The generated statistic value is not correct from a Decimal perspective.
> 2. When reading with a Predicate (also a Filter), in which the expected Decimal value is converted to Binary type, Binary.compareTo() is used to compare the expected value with the column statistic values. Both are compared from a Binary perspective, so the result is also incorrect.
> We could add an interface for customized comparator, and high level user like Hive provides the comparator to Parquet, since Hive knows how to decode the binary to Decimal and compare. Then Parquet could switch between customized and original comparison method.


