Posted to issues@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2022/08/19 15:35:00 UTC

[jira] [Resolved] (IMPALA-9578) Read/write support for BINARY in Parquet

     [ https://issues.apache.org/jira/browse/IMPALA-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Csaba Ringhofer resolved IMPALA-9578.
-------------------------------------
    Resolution: Fixed

> Read/write support for BINARY in Parquet
> ----------------------------------------
>
>                 Key: IMPALA-9578
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9578
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>              Labels: parquet
>
> In Parquet, both STRING and BINARY are stored using the same physical type, BYTE_ARRAY.
> There is a String annotation among the logical types (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#string), which indicates UTF-8 encoding (and is of course ignored by Impala).
> Both reading and writing should work the same way as with STRING.
> There is one potential difference to consider during writing: in ORC, BinaryStatistics has no min/max stats (StringStatistics has them). My guess at the reason is that binary values are often very large and "random", so the stats are likely to need a lot of space while never being used successfully for filtering. Note that Parquet is a bit different because of its per-page statistics, and can potentially need even more space for stats.
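
The concern about min/max stats for BINARY can be illustrated with a rough sketch. This is not Impala or Parquet code: the helper names (make_pages, stats_overhead, skippable_pages) and the page and value sizes are hypothetical, and it assumes untruncated min/max values compared as unsigned bytes, which is how Python orders bytes objects.

```python
import random


def make_pages(num_pages, values_per_page, value_size, seed=0):
    """Build pages of fixed-size pseudo-random byte strings (made-up data)."""
    rng = random.Random(seed)
    return [
        [bytes(rng.getrandbits(8) for _ in range(value_size))
         for _ in range(values_per_page)]
        for _ in range(num_pages)
    ]


def stats_overhead(pages):
    """Bytes needed to keep an untruncated min and max for every page."""
    return sum(len(min(page)) + len(max(page)) for page in pages)


def skippable_pages(pages, probe):
    """Count pages whose [min, max] range excludes probe (equality filter)."""
    return sum(1 for page in pages if probe < min(page) or probe > max(page))


if __name__ == "__main__":
    # Hypothetical sizes: 20 pages, 200 values per page, 256 bytes per value.
    pages = make_pages(num_pages=20, values_per_page=200, value_size=256)
    print("bytes spent on min/max:", stats_overhead(pages))
    print("pages an equality filter could skip:",
          skippable_pages(pages, bytes([128]) * 256))
```

With fixed-size values the stats cost is 2 * value_size bytes per page, so it grows with both the value size and the number of pages. For random values the per-page min tends to start with a very low byte and the max with a very high one, so a typical probe value falls inside almost every page's range and few pages, if any, can be skipped.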



--
This message was sent by Atlassian Jira
(v8.20.10#820010)