Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/04/08 20:48:00 UTC

[jira] [Commented] (IMPALA-10629) bin/load-data.py does not respect compression codec for parquet

    [ https://issues.apache.org/jira/browse/IMPALA-10629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17317480#comment-17317480 ] 

ASF subversion and git services commented on IMPALA-10629:
----------------------------------------------------------

Commit d29fab1ad9a32c0200b71506c3b31f1ac8838e63 in impala's branch refs/heads/master from Joe McDonnell
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=d29fab1 ]

IMPALA-10629: Fix parquet compression codecs for data load scripts

Currently, the dataload scripts don't respect non-default
compression codecs when loading Parquet data. They always
load Snappy, even when something else is specified, such as
--table_formats=parquet/zstd.

This fixes the dataload scripts so that they specify the
compression_codec query option correctly and thus use the
right codec when loading Parquet.

For backwards compatibility, this preserves the behavior
that parquet/none corresponds to the default compression
codec (which is Snappy).
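
The mapping described above can be sketched roughly as follows. This is
an illustration only, not the code from the commit: the function names,
the DEFAULT_PARQUET_CODEC constant, and the idea of emitting a SET
statement ahead of the load are all assumptions made for the example.
The only details taken from the commit message are that the
compression_codec query option is set from the table format and that
parquet/none keeps meaning the default codec (Snappy).

```python
# Hypothetical sketch; not the actual Impala dataload implementation.

# Assumption: the default Parquet codec is Snappy, per the commit message.
DEFAULT_PARQUET_CODEC = "snappy"


def split_table_format(table_format):
    """Split a string such as 'parquet/zstd' into (file_format, codec).

    A missing codec component is treated as 'none'.
    """
    parts = table_format.lower().split("/")
    file_format = parts[0]
    codec = parts[1] if len(parts) > 1 and parts[1] else "none"
    return file_format, codec


def compression_codec_statement(table_format):
    """Return a SET statement for Parquet table formats, else None."""
    file_format, codec = split_table_format(table_format)
    if file_format != "parquet":
        # Other file formats are outside the scope of this sketch.
        return None
    if codec == "none":
        # Backwards compatibility: parquet/none means the default codec.
        codec = DEFAULT_PARQUET_CODEC
    return "SET COMPRESSION_CODEC={0};".format(codec)


if __name__ == "__main__":
    for fmt in ("parquet/zstd", "parquet/none", "text/none"):
        print(fmt, "->", compression_codec_statement(fmt))
```

Special-casing "none" matters because a literal "none" passed through
as the codec would presumably request uncompressed files, which would
change what existing parquet/none data loads produce.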

This should make it easier to do performance testing on
various Parquet codecs (like ZSTD).

Testing:
 - Ran bin/load-data.py -w tpch --table_formats=parquet/zstd
   and checked the codec in the file with the parquet-reader
   utility

Change-Id: I1a346de3e5c4e38328e5a8ce8162697b7dd6553a
Reviewed-on: http://gerrit.cloudera.org:8080/17259
Reviewed-by: Joe McDonnell <jo...@cloudera.com>
Tested-by: Joe McDonnell <jo...@cloudera.com>


> bin/load-data.py does not respect compression codec for parquet
> ---------------------------------------------------------------
>
>                 Key: IMPALA-10629
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10629
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>    Affects Versions: Impala 4.0
>            Reporter: Joe McDonnell
>            Priority: Major
>
> If I try to use bin/load-data.py to load TPC-H as ZSTD compressed Parquet, it silently ignores the codec and uses Snappy under the covers:
> {noformat}
> $ bin/load-data.py -w tpch --table_formats=parquet/zstd
> $ hdfs dfs -ls /test-warehouse/tpch.lineitem_parquet_zstd/
> Found 4 items
> -rw-r--r--   3 joe supergroup   72305126 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000000_1779607968_data.0.parq
> -rw-r--r--   3 joe supergroup   58526717 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000001_53336944_data.0.parq
> -rw-r--r--   3 joe supergroup   72584796 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
> drwxr-xr-x   - joe supergroup          0 2021-03-31 17:01 /test-warehouse/tpch.lineitem_parquet_zstd/_impala_insert_staging
> $ hdfs dfs -copyToLocal /test-warehouse/tpch.lineitem_parquet_zstd/02444051906c734d-3b49d6c900000002_53336944_data.0.parq
> $ parquet-reader 02444051906c734d-3b49d6c900000002_53336944_data.0.parq
> ...
>         [10] = ColumnChunk {
>           02: file_offset (i64) = 37053592,
>           03: meta_data (struct) = ColumnMetaData {
>             01: type (i32) = 6,
>             02: encodings (list) = list<i32>[2] {
>               [0] = 2,
>               [1] = 3,
>             },
>             03: path_in_schema (list) = list<string>[1] {
>               [0] = "l_shipdate",
>             },
>             04: codec (i32) = 1, <------ SNAPPY!!!!
> ...{noformat}
> Based on what I'm seeing, bin/load-data.py doesn't set the compression_codec query option when loading Parquet. Silently doing the wrong thing is a bug, though actually supporting other codecs is more of a feature request.
> Being able to load ZSTD (or other compression) parquet makes it easier to do performance comparisons for those compression codecs on the perf-AB-test upstream job ([https://jenkins.impala.io/job/perf-AB-test/]).
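
The "codec (i32) = 1" field in the dump above is the numeric value of
the CompressionCodec enum from the Parquet format's Thrift definition,
where 1 is SNAPPY and 6 is ZSTD. A small helper along these lines could
translate those integers when scanning such a dump. The helper name and
the regular expression are assumptions made for this example, and the
expression only matches the exact "codec (i32) = N" layout shown in the
output above.

```python
import re

# CompressionCodec enum values from the Parquet format specification.
PARQUET_CODEC_NAMES = {
    0: "UNCOMPRESSED",
    1: "SNAPPY",
    2: "GZIP",
    3: "LZO",
    4: "BROTLI",
    5: "LZ4",
    6: "ZSTD",
    7: "LZ4_RAW",
}

# Matches lines such as "04: codec (i32) = 1," in the metadata dump.
CODEC_LINE = re.compile(r"codec \(i32\) = (\d+)")


def codecs_in_dump(dump_text):
    """Return the sorted, de-duplicated codec names found in a dump."""
    names = set()
    for value in CODEC_LINE.findall(dump_text):
        number = int(value)
        names.add(PARQUET_CODEC_NAMES.get(number, "UNKNOWN(%d)" % number))
    return sorted(names)


if __name__ == "__main__":
    sample = "            04: codec (i32) = 1, <------ SNAPPY!!!!"
    print(codecs_in_dump(sample))
```

A table loaded as parquet/zstd would be expected to report only ZSTD
from a check like this once the fix is in place.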


