Posted to dev@parquet.apache.org by "Gabor Szadovszky (JIRA)" <ji...@apache.org> on 2017/11/20 13:14:00 UTC

[jira] [Commented] (PARQUET-1025) Support new min-max statistics in parquet-mr

    [ https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259224#comment-16259224 ] 

Gabor Szadovszky commented on PARQUET-1025:
-------------------------------------------

To implement the new statistics, we have to support the different comparison logic required by the spec. Currently, all the primitives are comparable through their natural ordering. There are two possible options to extend the API:
# Implement separate comparators which are to be used by parquet-mr internally as well as by the API users.
#* pros:
#** Backwards compatible (however, statistics and filtering would be based on a different ordering than before)
#** Keeps the primitive and logical types more loosely coupled in parquet-mr
#* cons:
#** It can be quite confusing to API users that they must use the provided comparators instead of the compareTo methods of the provided primitive types
#** More parts of the API have to be modified to ensure that the proper comparators are used and these comparators are accessible
#** Client developers also have to modify the parts of their code where primitives are compared
# Extend the actual primitive implementations for the logical types so the comparable objects would do the proper comparison by default.
#* pros:
#** Backwards compatible (the different Binary implementations would have different orderings than before, but the current ordering is incorrect anyway)
#** The API is cleaner, as API users can rely on the comparable primitive types
#** The other parts of the API (e.g. filtering, statistics) can be kept unmodified; we modify only the parts where the primitives are created
#** The client code can remain unmodified, as it can still rely on the comparable primitive types
#* cons:
#** Proper comparison logic for UINT types will not be implemented (we cannot override the natural ordering of the primitive java types int and long)
#** The primitive and logical types would get more tightly coupled in parquet-mr
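
To make the UINT limitation concrete, here is a minimal sketch of what the separate comparators of the first option could look like. The class and field names are hypothetical and are not taken from parquet-mr or the linked PR. Which ordering applies to which logical type is defined by the spec and is not modeled here.

```java
import java.util.Comparator;

// Illustrative sketch only: all names here are hypothetical, not the parquet-mr API.
final class ComparatorSketch {
    private ComparatorSketch() {}

    // Signed ordering: the same result as the natural ordering of Integer.
    static final Comparator<Integer> SIGNED_INT32 = Integer::compare;

    // Unsigned ordering for a 32-bit unsigned value stored in a Java int.
    // compareTo on Integer cannot provide this, hence the separate comparator.
    static final Comparator<Integer> UNSIGNED_INT32 = Integer::compareUnsigned;

    // Lexicographic ordering of byte arrays that treats each byte as unsigned.
    static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // When one array is a prefix of the other, the shorter one sorts first.
        return Integer.compare(a.length, b.length);
    };

    // Returns the larger of two values under the given ordering,
    // which is how a max statistic would be updated in this sketch.
    static <T> T max(Comparator<T> order, T a, T b) {
        return order.compare(a, b) >= 0 ? a : b;
    }
}
```

With these definitions, the int value -1 (bit pattern 0xFFFFFFFF) is smaller than 1 under SIGNED_INT32 but larger under UNSIGNED_INT32, so the min and max of the same column depend on which comparator is used.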

Which one should we prefer? I’m also curious about the ideas and comments of the API users (e.g. Hive, Spark, etc.).
The first option is more or less implemented; check the linked PR for details. I’m happy to implement the second option if it gets more support.

> Support new min-max statistics in parquet-mr
> --------------------------------------------
>
>                 Key: PARQUET-1025
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1025
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Zoltan Ivanfi
>            Assignee: Gabor Szadovszky
>
> Impala started using new min-max statistics that got specified as part of PARQUET-686. Support for these should be added to parquet-mr as well.


