Posted to issues@drill.apache.org by "Volodymyr Vysotskyi (JIRA)" <ji...@apache.org> on 2019/03/22 20:47:00 UTC

[jira] [Commented] (DRILL-7132) Metadata cache does not have correct min/max values for varchar and interval data types

    [ https://issues.apache.org/jira/browse/DRILL-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799339#comment-16799339 ] 

Volodymyr Vysotskyi commented on DRILL-7132:
--------------------------------------------

[~rhou], the parquet metadata cache stores min/max values for varchar, decimal, interval, and some other types as base64-encoded strings, so they differ from the values displayed by parquet tools.

There is no need to store the values in the same format or encoding. The main requirement is that Drill can handle these values from parquet metadata cache files correctly, and it does.

As a side note, DRILL-4139 changed the parquet metadata cache to use base64 encoding so that statistics for decimal and interval types are handled correctly.
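
To illustrate, here is a minimal sketch (plain Python, not Drill code) that base64-decodes the minValue/maxValue strings quoted in the issue description and recovers the values parquet tools prints. The helper name decode_cached_stat is hypothetical, and decoding the bytes as ASCII is an assumption that holds for these two sample values only; statistics for other types, such as decimal, would generally decode to raw bytes rather than readable text.

```python
import base64


def decode_cached_stat(encoded: str) -> bytes:
    """Decode a base64 minValue/maxValue string into raw bytes.

    Hypothetical helper for illustration; it is not part of Drill.
    """
    return base64.b64decode(encoded)


# Strings copied from the metadata cache excerpts in the issue description.
varchar_stat = decode_cached_stat("aW9lZ2pOSkt2bmtk")
interval_stat = decode_cached_stat("UDE4NTgyRA==")

# For these two sample values the decoded bytes happen to be ASCII text.
print(varchar_stat.decode("ascii"))   # ioegjNJKvnkd
print(interval_stat.decode("ascii"))  # P18582D
```

Both decoded strings match the "column values statistics" lines reported by parquet tools, so the cache holds the same values in a different encoding.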

> Metadata cache does not have correct min/max values for varchar and interval data types
> ---------------------------------------------------------------------------------------
>
>                 Key: DRILL-7132
>                 URL: https://issues.apache.org/jira/browse/DRILL-7132
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.14.0
>            Reporter: Robert Hou
>            Priority: Major
>             Fix For: 1.17.0
>
>         Attachments: 0_0_10.parquet
>
>
> The parquet metadata cache does not have correct min/max values for varchar and interval data types.
> I have attached a parquet file.  Here is what parquet tools shows for varchar:
> [varchar_col] BINARY 14.6% of all space [PLAIN, BIT_PACKED] min: 67 max: 67 average: 67 total: 67 (raw data: 65 saving -3%)
>   values: min: 1 max: 1 average: 1 total: 1
>   uncompressed: min: 65 max: 65 average: 65 total: 65
>   column values statistics: min: ioegjNJKvnkd, max: ioegjNJKvnkd, num_nulls: 0
> Here is what the metadata cache file shows:
>         "name" : [ "varchar_col" ],
>         "minValue" : "aW9lZ2pOSkt2bmtk",
>         "maxValue" : "aW9lZ2pOSkt2bmtk",
>         "nulls" : 0
> Here is what parquet tools shows for interval:
> [interval_col] BINARY 11.3% of all space [PLAIN, BIT_PACKED] min: 52 max: 52 average: 52 total: 52 (raw data: 50 saving -4%)
>   values: min: 1 max: 1 average: 1 total: 1
>   uncompressed: min: 50 max: 50 average: 50 total: 50
>   column values statistics: min: P18582D, max: P18582D, num_nulls: 0
> Here is what the metadata cache file shows:
>         "name" : [ "interval_col" ],
>         "minValue" : "UDE4NTgyRA==",
>         "maxValue" : "UDE4NTgyRA==",
>         "nulls" : 0


