Posted to issues@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2022/02/14 02:28:00 UTC

[jira] [Created] (IMPALA-11120) load-data.py does not load ORC files with specified codec

Quanlong Huang created IMPALA-11120:
---------------------------------------

             Summary: load-data.py does not load ORC files with specified codec
                 Key: IMPALA-11120
                 URL: https://issues.apache.org/jira/browse/IMPALA-11120
             Project: IMPALA
          Issue Type: Bug
          Components: Infrastructure
            Reporter: Quanlong Huang
            Assignee: Quanlong Huang


I ran the following command to generate TPC-H tables in ORC format using SNAPPY compression:
{code:java}
bin/load-data.py -w tpch -e core --table_formats=orc/snap/block
{code}
The load succeeded, but the resulting files are still compressed with ZLIB:
{code:java}
$ hive --service orcfiledump hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
Processing data file hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0 [length: 149783256]
Structure for hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
File Version: 0.12 with ORC_135
Rows: 6001215
Compression: ZLIB         <-------- not SNAPPY
Compression size: 262144
Calendar: Julian/Gregorian
{code}
The Hive statements we use to generate the data are:
{code:sql}
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
SET hive.exec.max.dynamic.partitions=10000;
SET hive.exec.max.dynamic.partitions.pernode=10000;
set hive.auto.convert.join=true;
SET mapred.max.split.size=256000000;
SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
{code}
Setting mapred.output.compression.codec has no effect on ORC files. ORC takes its codec from the table property "orc.compress", which defaults to ZLIB, so we need to set that property to "SNAPPY" instead.
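
A minimal sketch of the workaround, assuming the target table already exists and reusing the table names from the statements above. Setting the property with ALTER TABLE just before the insert is only one option and is an assumption here; the actual fix may instead set it when the table is created.
{code:sql}
-- Tell the ORC writer which codec to use for files written from now on.
ALTER TABLE tpch_orc_snap.lineitem SET TBLPROPERTIES ('orc.compress'='SNAPPY');
-- Optional check that the property is in place before loading.
SHOW TBLPROPERTIES tpch_orc_snap.lineitem;
-- Rewrite the data so the new files are produced with the configured codec.
INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
{code}
Re-running the orcfiledump command shown earlier on the newly written file should then show whether the "Compression:" line has changed to SNAPPY.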

ref: [https://orc.apache.org/docs/hive-config.html]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)