Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2022/03/03 01:04:00 UTC

[jira] [Resolved] (IMPALA-11120) load-data.py does not load ORC files with specified codec

     [ https://issues.apache.org/jira/browse/IMPALA-11120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Quanlong Huang resolved IMPALA-11120.
-------------------------------------
    Fix Version/s: Impala 4.1.0
       Resolution: Fixed

> load-data.py does not load ORC files with specified codec
> ---------------------------------------------------------
>
>                 Key: IMPALA-11120
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11120
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Infrastructure
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>             Fix For: Impala 4.1.0
>
>
> I ran the following command to generate TPC-H tables in ORC format using SNAPPY compression:
> {code:java}
> bin/load-data.py -w tpch -e core --table_formats=orc/snap/block
> {code}
> The command succeeded, but the generated files are still compressed with ZLIB:
> {code:java}
> $ hive --service orcfiledump hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> Processing data file hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0 [length: 149783256]
> Structure for hdfs://localhost:20500/test-warehouse/tpch.lineitem_orc_snap/000000_0
> File Version: 0.12 with ORC_135
> Rows: 6001215
> Compression: ZLIB         <-------- not SNAPPY
> Compression size: 262144
> Calendar: Julian/Gregorian
> {code}
> The Hive statements we use to generate the data are:
> {code:sql}
> SET hive.exec.compress.output=true;
> SET mapred.output.compression.type=BLOCK;
> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
> SET hive.exec.dynamic.partition.mode=nonstrict;
> SET hive.exec.dynamic.partition=true;
> SET hive.exec.max.dynamic.partitions=10000;
> SET hive.exec.max.dynamic.partitions.pernode=10000;
> SET hive.auto.convert.join=true;
> SET mapred.max.split.size=256000000;
> SET hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
> INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
> {code}
> Setting mapred.output.compression.codec has no effect on ORC output, so the writer falls back to its default codec (ZLIB). Instead, the table property "orc.compress" must be set to "SNAPPY".
> ref: [https://orc.apache.org/docs/hive-config.html]
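>
> A minimal sketch of that workaround, assuming the target table tpch_orc_snap.lineitem already exists. Setting the property by hand with ALTER TABLE is an illustration only, not necessarily how load-data.py applies the fix:
> {code:sql}
> -- Set the ORC codec on the table itself; the ORC writer reads this
> -- table property rather than mapred.output.compression.codec.
> ALTER TABLE tpch_orc_snap.lineitem SET TBLPROPERTIES ("orc.compress"="SNAPPY");
> -- Rewrite the data so that the newly written files pick up the property.
> INSERT OVERWRITE TABLE tpch_orc_snap.lineitem SELECT * FROM tpch.lineitem;
> {code}
> Rerunning the orcfiledump command above on the rewritten file should then report "Compression: SNAPPY".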


