Posted to issues-all@impala.apache.org by "ASF subversion and git services (JIRA)" <ji...@apache.org> on 2019/08/08 01:07:00 UTC

[jira] [Commented] (IMPALA-8549) Add support for scanning DEFLATE text files

    [ https://issues.apache.org/jira/browse/IMPALA-8549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16902577#comment-16902577 ] 

ASF subversion and git services commented on IMPALA-8549:
---------------------------------------------------------

Commit 6d68c4f6c01c3d1f9d51a802476e0ef99fbfa208 in impala's branch refs/heads/master from Ethan Xue
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=6d68c4f ]

IMPALA-8549: Add support for scanning DEFLATE text files

This patch adds support to Impala for scanning .DEFLATE files of
tables stored as text. Note that although these files have a
compression type of DEFLATE in Impala, they should be treated as if
their compression type is DEFAULT.

Hadoop tools such as Hive and MapReduce support reading and writing
text files compressed using the deflate algorithm, which is the default
compression type. Hadoop uses the zlib library (an implementation of
the DEFLATE algorithm) to compress text files into .DEFLATE files.
These files are not in the raw deflate format but in the zlib format:
the zlib library supports three flavors of deflate, and Hadoop uses
the flavor that wraps the deflate stream in a zlib header and trailer
rather than emitting raw deflate.
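
The difference between the zlib format and raw deflate can be seen
with a minimal Python sketch. It is only an illustration using Python's
standard zlib module, not code from the patch; the sample data and the
helper name looks_like_zlib are made up for the example.

```python
import zlib

# Hypothetical sample rows standing in for a text table's contents.
data = b"1,alice\n2,bob\n" * 100

# zlib format: a 2-byte header, the deflate stream, and an Adler-32 trailer.
# This is the flavor the commit message says Hadoop writes into .deflate files.
zlib_stream = zlib.compress(data)

# Raw deflate: the same kind of stream without header or trailer.
# A negative wbits value asks zlib for this flavor.
raw_compressor = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_stream = raw_compressor.compress(data) + raw_compressor.flush()


def looks_like_zlib(buf):
    """Check the 2-byte zlib header: method 8 (deflate) and a valid check value."""
    if len(buf) < 2:
        return False
    cmf, flg = buf[0], buf[1]
    return (cmf & 0x0F) == 8 and ((cmf << 8) | flg) % 31 == 0


# A decompressor configured for the zlib format reads the wrapped stream.
assert zlib.decompress(zlib_stream) == data

# The raw stream needs a decompressor configured for raw deflate.
assert zlib.decompress(raw_stream, -zlib.MAX_WBITS) == data

# Feeding raw deflate to a zlib-format decompressor fails on the header check.
try:
    zlib.decompress(raw_stream)
    raw_accepted = True
except zlib.error:
    raw_accepted = False
```

A reader that assumes raw deflate for a .deflate file would hit the
mirror image of the last case, which is why the flavor matters when
choosing a decompressor.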

Testing:
There is a pre-existing unit test that validates compressing and
decompressing data with compression type DEFLATE. Also modified the
existing end-to-end tests that simulate querying files of various
formats and compression types. All core and exhaustive tests pass.

Change-Id: I45e41ab5a12637d396fef0812a09d71fa839b27a
Reviewed-on: http://gerrit.cloudera.org:8080/13857
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Tim Armstrong <ta...@cloudera.com>


> Add support for scanning DEFLATE text files
> -------------------------------------------
>
>                 Key: IMPALA-8549
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8549
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Ethan
>            Priority: Minor
>              Labels: ramp-up
>
> Several Hadoop tools (e.g. Hive and MapReduce) support reading and writing text files compressed with zlib / deflate (which results in files such as {{000000_0.deflate}}). Impala currently does not support reading {{.deflate}} text files and returns errors such as: {{ERROR: Scanner plugin 'DEFLATE' is not one of the enabled plugins: 'LZO'}}.
> Moreover, the default compression codec in Hadoop is zlib / deflate (see {{o.a.h.io.compress.DefaultCodec}}). So when writing to a text table in Hive, if users set {{hive.exec.compress.output}} to true, then {{.deflate}} files will be written by default.
> Impala does support zlib / deflate with other file formats though: Avro, RCFiles, SequenceFiles (see [https://impala.apache.org/docs/build/html/topics/impala_file_formats.html]).
> Currently, the frontend assigns a compression type to a file depending on its extension. For instance, the functional_text_def database is stored as a file with a .deflate extension and is assigned the compression type DEFLATE. The HdfsTextScanner class receives this value and uses it directly to create a decompressor. The functional_{avro,seq,rc}_databases are stored as files without extensions, so the frontend interprets their compression type as NONE. However, in the backend, each of their corresponding scanners implements its own custom logic to read file headers and override the NONE compression type assigned to files with new values, such as DEFAULT or DEFLATE, so that the appropriate decompressor can be instantiated.
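
The two paths described above (extension-based assignment for text
files, header-based override for Avro, SequenceFile and RCFile) can be
sketched roughly as follows. This is a hypothetical Python illustration,
not Impala's frontend or backend code: the mapping table, the function
names and the header_codec parameter are all assumptions made for the
example.

```python
import os

# Hypothetical extension table; only the .deflate entry comes from the issue text.
EXTENSION_TO_COMPRESSION = {
    ".deflate": "DEFLATE",
    ".gz": "GZIP",      # assumed entry, for illustration only
    ".bz2": "BZIP2",    # assumed entry, for illustration only
}


def compression_from_filename(path):
    """Frontend-style guess: derive the compression type from the extension."""
    _, ext = os.path.splitext(path)
    return EXTENSION_TO_COMPRESSION.get(ext.lower(), "NONE")


def resolve_compression(path, header_codec=None):
    """Pick the type a scanner would use to create its decompressor.

    Text files carry no header, so the extension-based value is used as is.
    Container formats pass the codec read from the file header
    (header_codec), which overrides the extension-based value.
    """
    frontend_type = compression_from_filename(path)
    if header_codec is not None:
        return header_codec
    return frontend_type


# A text file written by Hive with the default codec.
text_type = resolve_compression("000000_0.deflate")

# A container file without an extension whose header names a codec.
container_type = resolve_compression("000000_0", header_codec="DEFAULT")
```

In this sketch the text path has no second source of truth, so whatever
the extension implies is what the decompressor is built from.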



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
