Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/06/19 16:58:01 UTC

[jira] [Commented] (IMPALA-8617) Add support for lz4 in parquet

    [ https://issues.apache.org/jira/browse/IMPALA-8617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16867825#comment-16867825 ] 

ASF subversion and git services commented on IMPALA-8617:
---------------------------------------------------------

Commit 97a6a3c8077affd5a181054d706cc7f30aecca91 in impala's branch refs/heads/master from Abhishek
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=97a6a3c ]

IMPALA-8617: Add support for lz4 in parquet

A new enum value LZ4_BLOCKED was added to the THdfsCompression enum to
distinguish it from the existing LZ4 codec. The LZ4_BLOCKED codec
represents the block compression scheme used by Hadoop. It is similar to
SNAPPY_BLOCKED as far as the block format is concerned; the only
difference is the codec used for compression and decompression.
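
For illustration, a minimal sketch of what the extended codec enum might
look like in C++. This is an assumption for readability: the real
THdfsCompression is generated from Impala's Thrift definitions and
contains more values than shown here.

  // Illustrative sketch only; the real enum is Thrift-generated and
  // has additional codecs (GZIP, ZSTD, ...).
  enum class THdfsCompression {
    NONE,
    SNAPPY,
    SNAPPY_BLOCKED,  // Hadoop block format, snappy codec
    LZ4,             // raw lz4 block compression
    LZ4_BLOCKED      // new: Hadoop block format, lz4 codec
  };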

Added Lz4BlockCompressor and Lz4BlockDecompressor classes for
compressing and decompressing parquet data using Hadoop's
lz4 block compression scheme.

The Lz4BlockCompressor treats the input as a single block and generates
a compressed block with the following layout (a sketch follows below):
  <4 byte big endian uncompressed size>
  <4 byte big endian compressed size>
  <lz4 compressed block>
The hdfs parquet table writer is expected to call the Lz4BlockCompressor
with the ideal input size (the unit of compression in parquet is a
page), so the Lz4BlockCompressor does not further break down the input
into smaller blocks.
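
As a hedged sketch of this single-block path, using the standard lz4
library API (LZ4_compressBound, LZ4_compress_default). The function
name and buffer handling here are illustrative assumptions, not the
actual Lz4BlockCompressor code:

  #include <lz4.h>
  #include <arpa/inet.h>  // htonl/ntohl for big-endian headers
  #include <cstdint>
  #include <cstring>

  // Write a 32-bit value in big endian at dst.
  static void WriteBE32(uint8_t* dst, uint32_t v) {
    uint32_t be = htonl(v);
    memcpy(dst, &be, sizeof(be));
  }

  // Compress one parquet page into
  // <BE uncompressed size><BE compressed size><lz4 block>.
  // 'out' must hold at least 8 + LZ4_compressBound(in_len) bytes.
  // Returns total bytes written, or -1 on failure.
  int64_t Lz4BlockCompress(const uint8_t* in, int in_len, uint8_t* out) {
    int compressed = LZ4_compress_default(
        reinterpret_cast<const char*>(in),
        reinterpret_cast<char*>(out + 8),
        in_len, LZ4_compressBound(in_len));
    if (compressed <= 0) return -1;
    WriteBE32(out, static_cast<uint32_t>(in_len));
    WriteBE32(out + 4, static_cast<uint32_t>(compressed));
    return 8 + compressed;
  }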

The Lz4BlockDecompressor, on the other hand, should be compatible with
blocks written by Impala and other engines in the Hadoop ecosystem. It
can decompress compressed data in the following format:
  <4 byte big endian uncompressed size>
  <4 byte big endian compressed size>
  <lz4 compressed block>
  ...
  <4 byte big endian compressed size>
  <lz4 compressed block>
  ...
  <repeated until the uncompressed size from the outer block is consumed>
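
A minimal sketch of that decompression loop, continuing the compressor
sketch above (same includes). Names are illustrative and well-formed
input is assumed; the real Lz4BlockDecompressor validates each size
against the remaining input and output buffers:

  // Read a 32-bit big endian value from src.
  static uint32_t ReadBE32(const uint8_t* src) {
    uint32_t be;
    memcpy(&be, src, sizeof(be));
    return ntohl(be);
  }

  // Decompress one outer block: <BE uncompressed size> followed by one
  // or more <BE compressed size><lz4 block> pairs. Returns the number
  // of uncompressed bytes produced, or -1 on corrupt input.
  int64_t Lz4BlockDecompress(const uint8_t* in, int64_t in_len,
                             uint8_t* out) {
    const uint8_t* in_end = in + in_len;
    int64_t remaining = ReadBE32(in);  // outer uncompressed size
    in += 4;
    uint8_t* out_pos = out;
    while (remaining > 0 && in + 4 <= in_end) {
      uint32_t compressed = ReadBE32(in);  // inner compressed size
      in += 4;
      int n = LZ4_decompress_safe(
          reinterpret_cast<const char*>(in),
          reinterpret_cast<char*>(out_pos),
          static_cast<int>(compressed), static_cast<int>(remaining));
      if (n < 0) return -1;  // corrupt inner block
      in += compressed;
      out_pos += n;
      remaining -= n;
    }
    return out_pos - out;
  }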

Externally, users can now set the lz4 codec for parquet using:
  set COMPRESSION_CODEC=lz4
This gets translated into the LZ4_BLOCKED codec for the
HdfsParquetTableWriter. Similarly, when reading lz4 compressed parquet
data, the LZ4_BLOCKED codec is used.
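
A hypothetical sketch of that translation step; the function name is
made up for illustration, and the real logic lives in the writer setup
path:

  // Map the user-visible codec to the codec the parquet writer uses on
  // disk: plain LZ4 from COMPRESSION_CODEC becomes the
  // Hadoop-compatible LZ4_BLOCKED scheme.
  THdfsCompression ToParquetWriterCodec(THdfsCompression requested) {
    if (requested == THdfsCompression::LZ4) {
      return THdfsCompression::LZ4_BLOCKED;
    }
    return requested;
  }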

Testing:
 - Added unit tests for LZ4_BLOCKED in decompress-test.cc
 - Added unit tests for Hadoop compatibility in decompress-test.cc,
   i.e., being able to decompress an outer block containing multiple
   inner blocks (the Lz4BlockDecompressor format described above); a
   sketch of such a round-trip check follows below
 - Added interoperability tests between Hive and Impala for all parquet
   codecs. New test added to
   tests/custom_cluster/test_hive_parquet_codec_interop.py
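
For illustration, the kind of round-trip check such a unit test might
perform, written against the sketched helpers above (gtest style; the
actual tests in decompress-test.cc exercise Impala's codec classes
directly):

  #include <gtest/gtest.h>
  #include <cstring>
  #include <string>
  #include <vector>

  TEST(Lz4BlockedSketch, RoundTrip) {
    std::string input(8 * 1024, 'x');  // easily compressible payload
    std::vector<uint8_t> buf(
        8 + LZ4_compressBound(static_cast<int>(input.size())));
    int64_t clen = Lz4BlockCompress(
        reinterpret_cast<const uint8_t*>(input.data()),
        static_cast<int>(input.size()), buf.data());
    ASSERT_GT(clen, 0);
    std::vector<uint8_t> out(input.size());
    int64_t dlen = Lz4BlockDecompress(buf.data(), clen, out.data());
    ASSERT_EQ(static_cast<int64_t>(input.size()), dlen);
    EXPECT_EQ(0, memcmp(input.data(), out.data(), input.size()));
  }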

Change-Id: Ia6850a39ef3f1e0e7ba48e08eef1d4f7cbb74d0c
Reviewed-on: http://gerrit.cloudera.org:8080/13582
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Add support for lz4 in parquet
> ------------------------------
>
>                 Key: IMPALA-8617
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8617
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Abhishek Rawat
>            Assignee: Abhishek Rawat
>            Priority: Major
>              Labels: parquet
>
> Hadoop uses its own native block format for LZ4 (the same one the parquet-mr api uses), which is incompatible with the standard LZ4 block format.
> As a result, Parquet/LZ4 files could have different block formats.
> The parquet-cpp api (now part of Apache Arrow) uses the LZ4 frame format, which is also incompatible with the LZ4 block format.
> The current decision is to use a format compatible with Hive, Spark, and parquet-mr.


