You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Abhishek Rawat (Code Review)" <ge...@cloudera.org> on 2019/06/04 12:24:09 UTC
[Impala-ASF-CR] IMPALA-8450: Add support for zstd and lz4 in parquet

Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/13396 )

Change subject: IMPALA-8450: Add support for zstd and lz4 in parquet
......................................................................


Patch Set 7:

(13 comments)

This patch will be split into two parts. The first part (zstd support) can be found here: https://gerrit.cloudera.org/#/c/13507/.

http://gerrit.cloudera.org:8080/#/c/13396/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13396/7//COMMIT_MSG@17
PS7, Line 17: A new query
            : option COMPRESSION_LEVEL was added so that user can set appropriate
            : clevel
> Similar writer settings are currently configured with query options in Impa
In patch#8 we no longer have compression_level query option and instead we can specify the compression level for "zstd" using the compression_codec query option, such as "set compression_codec=zstd:5". This will also hopefully minimize the confusion whether its a table/column property since its scope is that of the related compression_codec query option. Also, specifying compression level in the above manner for codecs other than zstd will return an error.

I also completely agree that a finer granularity control over such variables such that end user can specify them at table/column scope is very much desirable and something we should definitely consider in future. Unfortunately, addressing it as part of this patch is out of scope for this work.


http://gerrit.cloudera.org:8080/#/c/13396/7//COMMIT_MSG@21
PS7, Line 21: other codecs. The default clevel is ZSTD_CLEVEL_DEFAULT.
> If we default to ZSTD_CLEVEL_DEFAULT, but then we later add other codecs, t
For now, one can specify a compression_level only for ZSTD (example below) and so the default only applies of ZSTD.
set compression_codec=zstd:<clevel>

When we do support similar properties for other codecs, each codec could have its own default compression level.


http://gerrit.cloudera.org:8080/#/c/13396/7//COMMIT_MSG@26
PS7, Line 26:   set COMPRESSION_CODEC=ZSTD;
> same question: seems like you'd want to set this as a table or partition pr
Completely agree but this is out of scope for this work. I will file a JIRA for future work.


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/exec/parquet/parquet-common.cc@28
PS6, Line 28:   THdfsCompression::SNAPPY, // BROTLI
> Is this just a placeholder? Confusing since they're not equivalent. Can you
Added BROTLI in the THdfsCompression enum.


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/exec/parquet/parquet-common.cc@41
PS6, Line 41:   parquet::CompressionCodec::SNAPPY, // BZIP2
> Can you leave a comment that it's just a placeholder since BZIP2 isn't a va
Done


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/exec/parquet/parquet-common.cc@46
PS6, Line 46:   parquet::CompressionCodec::SNAPPY, // ZLIB
> ZLIB <=> DEFLATE actually (not sure why we ended up with the same names).
Sounds like GZIP, DEFLATE and ZLIB are different formats supported by the zlib library. I think we can use GZIP as the mapping name here similar to GZIP/DEFLATE


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/service/query-options.cc
File be/src/service/query-options.cc:

http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/service/query-options.cc@759
PS6, Line 759:           return Status(Substitute("Invalid ZSTD compression level '$0'. Valid values"
> I do wonder if we'd be better off including the compression level in COMPRE
Good point. I tried not to do that initially since I did not see any query option supporting multiple fields per value. But, I agree that having <codec:clevel> will make validation cleaner.
 I have made the change such that user can set COMPRESSION_CODEC=ZSTD:<CLEVEL>. If compression level is set for any other codec we will return an error. 

I also added a new unit test CompressionCodec in query-opttions-test.cc.


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/compress.cc
File be/src/util/compress.cc:

http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/compress.cc@34
PS6, Line 34: #include "util/debug-util.h"
> Please put this with the block on l30-31
I ended up removing "util/debug-util.h". I think I added it for getStackTrace() during debugging but forgot to remove it.


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/compress.cc@338
PS6, Line 338:     return Status("ZSTD_compress failed with: "
> Please use Substitute() instead of string concatenation. That should handle
Made a mental note for using Substitute for string concatenations.

Added a general ZSTD_ERROR template in  common/thrift/generate_error_codes.py. It will print the ZSTD_compress/ZSTD_decompress function which hit an error and the related error code from the ZSTD_ErrorCode enum.


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/decompress-test.cc@483
PS6, Line 483:   const int clevel = (rand() % ZSTD_maxCLevel()) + 1;
> We had been trying to avoid using rand() in tests because it makes them non
Updated the test unit to use a test-local RNG and std::uniform_int_distribution.


http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/13396/6/be/src/util/decompress.cc@35
PS6, Line 35: #include "util/debug-util.h"
> Move to the previous block of includes.
This include is not needed anymore and has been removed.


http://gerrit.cloudera.org:8080/#/c/13396/7/tests/query_test/test_insert_parquet.py
File tests/query_test/test_insert_parquet.py:

http://gerrit.cloudera.org:8080/#/c/13396/7/tests/query_test/test_insert_parquet.py@154
PS7, Line 154:     test_result = self.execute_query("select * from {0} order by c3".format(test_table))
> Can we add code to this test that reads back the tables from Hive and makes
Good point. Thanks for the pointers. For zstd, currently its not possible to add such tests in the upstream since the libs are old and parquet doesn't recognize zstd codec. A follow on Jira will add such a test case. For now I did manual interop tests.


http://gerrit.cloudera.org:8080/#/c/13396/7/tests/query_test/test_insert_parquet.py@155
PS7, Line 155: ;
> flake8: E703 statement ends with a semicolon
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/13396
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I98c6dcf3d0a873380e4fa4cf03eb7e924e4ee768
Gerrit-Change-Number: 13396
Gerrit-PatchSet: 7
Gerrit-Owner: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Todd Lipcon <to...@apache.org>
Gerrit-Comment-Date: Tue, 04 Jun 2019 12:24:09 +0000
Gerrit-HasComments: Yes