Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/03/02 02:02:00 UTC
[jira] [Commented] (IMPALA-11120) load-data.py does not load ORC files with specified codec

    [ https://issues.apache.org/jira/browse/IMPALA-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17499841#comment-17499841 ] 

ASF subversion and git services commented on IMPALA-11120:
----------------------------------------------------------

Commit b2e4b29f06141ad34eef2cbadfda259124792ac2 in impala's branch refs/heads/master from stiga-huang
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b2e4b29 ]

IMPALA-11120: Fix codec not set in generating ORC tables

We use 'mapred.output.compression.codec' to set the compression codec
when generating test files with Hive. However, it has no effect on ORC
files. Instead, we need to set 'orc.compress' in the tblproperties of
each ORC table. The default value of 'orc.compress' is ZLIB, which
corresponds to our 'def' codec, so we only need to set it for non-def
codecs.
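
As a rough illustration of the rule above (not the code from this patch),
the sketch below maps the short codec names seen in table formats such as
orc/snap/block to 'orc.compress' values and emits a TBLPROPERTIES clause
only for non-def codecs. The dictionary and the function name are
hypothetical and are not taken from load-data.py.

```python
# Hypothetical mapping from short codec names (as in orc/snap/block) to
# values of the 'orc.compress' table property. Not taken from the patch.
ORC_COMPRESS_VALUES = {
    "none": "NONE",
    "def": "ZLIB",
    "snap": "SNAPPY",
    "lz4": "LZ4",
}


def orc_tblproperties_clause(codec):
    """Return a TBLPROPERTIES clause for an ORC table, or '' for 'def'."""
    value = ORC_COMPRESS_VALUES[codec]  # raises KeyError for unknown codecs
    if codec == "def":
        # ZLIB is already the default for 'orc.compress', so nothing to set.
        return ""
    return "TBLPROPERTIES ('orc.compress'='{0}')".format(value)


if __name__ == "__main__":
    for name in ("none", "def", "snap", "lz4"):
        print(name, "->", repr(orc_tblproperties_clause(name)))
```

The returned clause would be appended to the CREATE TABLE statement of
each ORC table; where exactly that happens in the data-loading scripts is
not shown here.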

This patch also fixes a bug in build_compression_codec_statement() that
would raise KeyError when loading lz4 non-avro tables.

Tests
 - Loaded tpch data in orc/none/none, orc/def/block, orc/snap/block,
   orc/lz4/block and verified their compression codecs.
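
One possible way to script that verification is to read the
"Compression:" line from the text printed by
"hive --service orcfiledump <file>", as in the dump quoted in the issue
below. This is only a sketch: the helper name is hypothetical, and it
parses output that has already been captured rather than invoking Hive
itself.

```python
import re


def parse_orc_compression(dump_output):
    """Return the codec on the 'Compression:' line of orcfiledump output.

    Returns None if no such line is present. The 'Compression size:' line
    is not matched because the pattern requires the colon right after
    the word 'Compression'.
    """
    match = re.search(r"^[ \t]*Compression:[ \t]*(\w+)", dump_output,
                      re.MULTILINE)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "File Version: 0.12 with ORC_135\n"
        "Rows: 6001215\n"
        "Compression: ZLIB\n"
        "Compression size: 262144\n"
    )
    print(parse_orc_compression(sample))
```

Comparing the returned value with the expected codec (for example SNAPPY
for orc/snap/block) would flag tables that were written with the default
ZLIB compression.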

Change-Id: I02bd5d9400864145133ff019a3d076a6cab36fcc
Reviewed-on: http://gerrit.cloudera.org:8080/18228
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> load-data.py does not load ORC files with specified codec
> ---------------------------------------------------------
>
>                 Key: IMPALA-11120
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11120
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> I ran the following command to generate TPC-H tables in ORC format using SNAPPY compression:
> {code:java}
> bin/load-data.py -w tpch -e core --table_formats=orc/snap/block
> {code}
> After it succeeded, I realized the compression is still ZLIB:
> {code:java}
> $ hive --service orcfiledump hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> Processing data file hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0 [length: 149783256]
> Structure for hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> File Version: 0.12 with ORC_135
> Rows: 6001215
> Compression: ZLIB         <-------- not SNAPPY
> Compression size: 262144
> Calendar: Julian/Gregorian
> {code}
> The Hive statements we use to generate data are
> {code:sql}
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.type=BLOCK;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.max.dynamic.partitions=10000;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> set hive.auto.convert.join=true;
> SET mapred.max.split.size=256000000;
> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
> {code}
> Setting mapred.output.compression.codec has no effect on ORC files. Instead, we need to set the table property "orc.compress" to "SNAPPY".
> ref: [https://orc.apache.org/docs/hive-config.html]


