Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/06/05 16:07:01 UTC
[jira] [Commented] (IMPALA-8450) Add support for zstd in parquet

    [ https://issues.apache.org/jira/browse/IMPALA-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856814#comment-16856814 ] 

ASF subversion and git services commented on IMPALA-8450:
---------------------------------------------------------

Commit 51e8175c622014064e5a6853317de13b6987c629 in impala's branch refs/heads/master from Abhishek
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=51e8175 ]

IMPALA-8450: Add support for zstd in parquet

Makefile was updated to include zstd in the ${IMPALA_HOME}/toolchain
directory. Other changes were made to make zstd headers and libs
accessible.

Classes ZstandardCompressor and ZstandardDecompressor were added to
provide interfaces for calling the ZSTD_compress/ZSTD_decompress
functions. Zstd supports compression levels (clevel) from 1 to
ZSTD_maxCLevel(). Zstd also supports negative clevels, but since the
negative values represent uncompressed data they are not supported. The
default clevel is ZSTD_CLEVEL_DEFAULT.

HdfsParquetTableWriter was updated to support the ZSTD codec. The
new codec can be set using the existing query option as follows:
  set COMPRESSION_CODEC=ZSTD:<clevel>;
  set COMPRESSION_CODEC=ZSTD; // uses ZSTD_CLEVEL_DEFAULT
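
As an illustration only, the accepted forms of the option value can be
sketched in Python. This is not the Impala implementation (which is C++);
parse_zstd_codec and CodecSetting are hypothetical names, and the constants
3 and 22 are assumptions based on recent zstd releases, where the real code
would use ZSTD_CLEVEL_DEFAULT and ZSTD_maxCLevel() from the library.

```python
from typing import NamedTuple

# Assumed values: recent zstd releases define ZSTD_CLEVEL_DEFAULT as 3 and
# return 22 from ZSTD_maxCLevel(). Real code should query the library.
ZSTD_CLEVEL_DEFAULT = 3
ZSTD_MAX_CLEVEL = 22


class CodecSetting(NamedTuple):
    codec: str
    clevel: int


def parse_zstd_codec(value: str) -> CodecSetting:
    """Parse 'ZSTD' or 'ZSTD:<clevel>' (hypothetical helper, not Impala code)."""
    name, sep, level_text = value.strip().partition(":")
    if name.upper() != "ZSTD":
        raise ValueError(f"unsupported codec: {name!r}")
    if not sep:
        # No explicit level given: fall back to the default clevel.
        return CodecSetting("ZSTD", ZSTD_CLEVEL_DEFAULT)
    try:
        clevel = int(level_text)
    except ValueError:
        raise ValueError(f"invalid clevel: {level_text!r}") from None
    # Only levels from 1 to the maximum are accepted; zero and negative
    # levels are rejected, matching the restriction described above.
    if not 1 <= clevel <= ZSTD_MAX_CLEVEL:
        raise ValueError(f"clevel must be in [1, {ZSTD_MAX_CLEVEL}], got {clevel}")
    return CodecSetting("ZSTD", clevel)
```

Under these assumptions, parse_zstd_codec("ZSTD:12") yields clevel 12, a
bare "ZSTD" yields the default clevel, and "ZSTD:-1" raises ValueError.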

Testing:
  - Added a unit test in the DecompressorTest class with the
    ZSTD_CLEVEL_DEFAULT clevel and a random clevel. The unit test
    decompresses compressed input data and validates the result. It
    also tests for expected behavior when passing an over- or
    undersized buffer for decompression.
  - Added unit tests for valid/invalid values for COMPRESSION_CODEC.
  - Added an e2e test in test_insert_parquet.py which tests writing and
    reading (null/non-null) data into/from a table (with columns of
    different data types) using multiple codecs. Other existing e2e
    tests were updated to also use the parquet/zstd table format.
  - Manual interoperability tests were run between Impala and Hive.
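
The shape of the round-trip and buffer-size checks described above can be
sketched as follows. This is not the DecompressorTest code: zlib from the
Python standard library stands in for zstd, roundtrip_ok and
decompress_into are hypothetical helpers, and the sketch assumes that an
undersized output buffer should produce an error while an exact or
oversized one should succeed.

```python
import zlib


def roundtrip_ok(payload: bytes, level: int) -> bool:
    """Compress then decompress, and check the bytes survive unchanged."""
    return zlib.decompress(zlib.compress(payload, level)) == payload


def decompress_into(compressed: bytes, buffer_size: int) -> bytes:
    """Decompress, treating buffer_size as a hard limit on the output size.

    Raises ValueError if the decompressed data does not fit.
    """
    decompressor = zlib.decompressobj()
    # Ask for one byte more than the limit: receiving that extra byte
    # proves the output is larger than the buffer.
    out = decompressor.decompress(compressed, buffer_size + 1)
    if len(out) > buffer_size:
        raise ValueError("output buffer too small")
    return out


payload = b"impala parquet zstd " * 500
compressed = zlib.compress(payload, 6)

# Round trip at a default-like level and at another level.
assert roundtrip_ok(payload, 6)
assert roundtrip_ok(payload, 1)

# Exact-size and oversized buffers succeed; an undersized one fails.
assert decompress_into(compressed, len(payload)) == payload
assert decompress_into(compressed, len(payload) + 1024) == payload
try:
    decompress_into(compressed, len(payload) - 1)
except ValueError:
    undersized_rejected = True
else:
    undersized_rejected = False
assert undersized_rejected
```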

Change-Id: Id2c0e26e6f7fb2dc4024309d733983ba5197beb7
Reviewed-on: http://gerrit.cloudera.org:8080/13507
Reviewed-by: Tim Armstrong <ta...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Add support for zstd in parquet
> -------------------------------
>
>                 Key: IMPALA-8450
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8450
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Tim Armstrong
>            Assignee: Abhishek Rawat
>            Priority: Major
>              Labels: parquet
>
> PARQUET-970 added these codecs to the format. We have LZ4 in the toolchain already and I just added zstd: [https://gerrit.cloudera.org/#/c/13079/]
> These codecs probably offer a better trade-off of density and speed than snappy or gzip.
> [https://github.com/apache/arrow/pull/807/files] might be a useful crib sheet for how to add a compressor.
> LZ4 support will be added using: https://issues.apache.org/jira/browse/IMPALA-8617


