You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org> on 2020/01/24 22:36:21 UTC

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Xiaomeng Zhang has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15023


Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
10 files changed, 248 insertions(+), 9 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/15023/3
-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 3
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/15023/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15023/5//COMMIT_MSG@9
PS5, Line 9: In this patch, we add support for reading zstd encoded text files.
> We should open a docs JIRA to update the docs for this new feature.
https://issues.apache.org/jira/browse/IMPALA-9389


http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc@616
PS5, Line 616: Status ZstandardDecompressor::Init() {
> Discussed offline with Xiaomeng. For now we decided to move the call to `ZS
Done


http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc@668
PS5, Line 668:       return Status(TErrorCode::ZSTD_ERROR, "ZSTD_decompress",
> Use ZSTD_decompressStream in the error message.
Done


http://gerrit.cloudera.org:8080/#/c/15023/5/tests/custom_cluster/test_hive_text_codec_interop.py
File tests/custom_cluster/test_hive_text_codec_interop.py:

http://gerrit.cloudera.org:8080/#/c/15023/5/tests/custom_cluster/test_hive_text_codec_interop.py@28
PS5, Line 28: TEXT_CODECS = ['snappy', 'gzip', 'zstd', 'lzo']
> I think we should also add Default, Bzip2 and Deflate to the list:
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Sat, 15 Feb 2020 01:16:09 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 9: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 9
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Tue, 25 Feb 2020 00:31:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
10 files changed, 248 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/15023/4
-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 4
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Abhishek Rawat (Code Review)" <ge...@cloudera.org>.
Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 6:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15023/6/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/6/be/src/util/decompress.cc@611
PS6, Line 611:   stream_ = NULL;
nit: Best to put this in the initializer list right after the call to base class constructor. In this case, it probably makes no difference perf-wise (since its a pointer assignment), but its usually a good practice to use initializer lists. (Note that most compilers will generate an initializer list and putting the initialization in the constructor body will result in double initialization)


http://gerrit.cloudera.org:8080/#/c/15023/6/be/src/util/decompress.cc@656
PS6, Line 656:     }
Should we also call `ZSTD_initDStream()` here after call to `ZSTD_createDCtx()`? I am basically reading the docs here: https://facebook.github.io/zstd/zstd_manual.html#Chapter9


http://gerrit.cloudera.org:8080/#/c/15023/6/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

http://gerrit.cloudera.org:8080/#/c/15023/6/tests/query_test/test_compressed_formats.py@322
PS6, Line 322:     assert int(result.get_data()) == 10000000
nit: instead of directly comparing with 10M here, you could run `select count(*)` against the uncompressed table and compare the results.



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 6
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Feb 2020 16:32:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 8: Code-Review+2

LGTM thanks Xiaomeng and Abhishek


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 8
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Mon, 24 Feb 2020 19:49:49 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/5392/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 9
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Mon, 24 Feb 2020 19:55:48 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5575/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Jan 2020 20:01:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 9: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 9
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Mon, 24 Feb 2020 19:55:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc@616
PS5, Line 616: Status ZstandardDecompressor::Init() {
> Init() gets called for both block and streaming code paths. Since it's only
Do you have suggestions where should it be init? I can't put it inside ProcessBlockStreaming as it might be called multiple times for one stream, and we should use one dctx for one stream.



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Thu, 13 Feb 2020 19:45:45 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15023/4/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/4/be/src/util/decompress.cc@665
PS4, Line 665:     *stream_end = false;
> The reason I add it here is that other stream decompressing functions write
I would have written the code the same way!
But, now the scary comment has been removed, we can see that the loop is really quite simple. If we remove another line then it is even simpler, which I like. 
Clearly the extra setting into *stream_end is safe, so it's not "wrong" but anything that makes things clearer for the future maintainer is good. 
If you feel strongly then we can talk more



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 4
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Thu, 30 Jan 2020 00:28:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
11 files changed, 10,000,275 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/15023/8
-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 8
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 5: Code-Review+1

LGTM, @arawat will review to +2


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Fri, 31 Jan 2020 19:23:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5518/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 3
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Jan 2020 23:20:40 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15023/3/tests/custom_cluster/test_hive_text_codec_interop.py
File tests/custom_cluster/test_hive_text_codec_interop.py:

http://gerrit.cloudera.org:8080/#/c/15023/3/tests/custom_cluster/test_hive_text_codec_interop.py@26
PS3, Line 26: from tests.util.filesystem_utils import get_fs_path
flake8: F401 'tests.util.filesystem_utils.get_fs_path' imported but unused


http://gerrit.cloudera.org:8080/#/c/15023/3/tests/custom_cluster/test_hive_text_codec_interop.py@63
PS3, Line 63: _
flake8: E501 line too long (106 > 90 characters)


http://gerrit.cloudera.org:8080/#/c/15023/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

http://gerrit.cloudera.org:8080/#/c/15023/3/tests/query_test/test_compressed_formats.py@264
PS3, Line 264: class TestReadZtsdLibCompressedFile(ImpalaTestSuite):
flake8: E302 expected 2 blank lines, found 1



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 3
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Fri, 24 Jan 2020 22:36:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Abhishek Rawat (Code Review)" <ge...@cloudera.org>.
Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 5:

(6 comments)

A few more comments..

http://gerrit.cloudera.org:8080/#/c/15023/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15023/5//COMMIT_MSG@9
PS5, Line 9: In this patch, we add support for reading zstd encoded text files.
We should open a docs JIRA to update the docs for this new feature.


http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress-test.cc@500
PS5, Line 500:   RunTestStreaming(THdfsCompression::ZSTD, clevel);
Note that in this testcase and other testcases we are using block/frame compression but streaming decompression. Its good that they are compatible!


http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc@616
PS5, Line 616: Status ZstandardDecompressor::Init() {
> Do you have suggestions where should it be init? I can't put it inside Proc
Discussed offline with Xiaomeng. For now we decided to move the call to `ZSTD_createDCtx()` inside ProcessBlockStreaming() under an `if null` condition. Similarly, the call to `ZSTD_freeDCtx()` in the destructor will also be under an `if not null` condition


http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc@668
PS5, Line 668:       return Status(TErrorCode::ZSTD_ERROR, "ZSTD_decompress",
Use ZSTD_decompressStream in the error message.


http://gerrit.cloudera.org:8080/#/c/15023/5/tests/custom_cluster/test_hive_text_codec_interop.py
File tests/custom_cluster/test_hive_text_codec_interop.py:

http://gerrit.cloudera.org:8080/#/c/15023/5/tests/custom_cluster/test_hive_text_codec_interop.py@28
PS5, Line 28: TEXT_CODECS = ['snappy', 'gzip', 'zstd', 'lzo']
I think we should also add Default, Bzip2 and Deflate to the list:
org.apache.hadoop.io.compress.DefaultCodec
org.apache.hadoop.io.compress.BZip2Codec
org.apache.hadoop.io.compress.DeflateCodec


http://gerrit.cloudera.org:8080/#/c/15023/5/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

http://gerrit.cloudera.org:8080/#/c/15023/5/tests/query_test/test_compressed_formats.py@309
PS5, Line 309:     assert int(result.get_data()) == 10000000
I think we should also keep the original uncompressed text file. And then load that into a table and compare the actual results along with the num rows. Since this is a huge file we could probably compare a subset of data say top 10K value using limit and order by.



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Fri, 14 Feb 2020 15:42:51 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5230/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 6
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Sat, 15 Feb 2020 02:00:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
10 files changed, 247 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/15023/5
-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 7:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5279/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 7
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Wed, 19 Feb 2020 23:18:11 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5292/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 8
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Feb 2020 22:28:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
11 files changed, 10,000,275 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/15023/6
-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 6
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5537/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 4
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Jan 2020 19:58:07 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has uploaded a new patch set (#7). ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
11 files changed, 10,000,278 insertions(+), 9 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/23/15023/7
-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 7
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Abhishek Rawat (Code Review)" <ge...@cloudera.org>.
Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 5:

(2 comments)

Overall looks good. Nice work adding hive interop tests and extending existing unit tests. I still need to do another round of review. Adding comments so far...

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/compress.h
File be/src/util/compress.h:

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/compress.h@147
PS5, Line 147:   virtual std::string file_extension() const override { return "zst"; }
I think we should add a comment here explaining that the extension is not used by parquet. Parquet uses '.parq' as the file extension. Text uses '.zst' as the extension for files compressed with zstd codec.


http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/5/be/src/util/decompress.cc@616
PS5, Line 616: Status ZstandardDecompressor::Init() {
Init() gets called for both block and streaming code paths. Since it's only needed for the streaming code path we should only call ZSTD_createDCtx and ZSTD_freeDCtx in the streaming code path.



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 5
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Mon, 10 Feb 2020 22:20:18 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................

IMPALA-9075: Add support for reading zstd text files

In this patch, we add support for reading zstd encoded text files.
This includes:
1. support reading zstd file written by Hive which uses streaming.
2. support reading zstd file compressed by standard zstd library which
uses block.
To support decompressing both formats, a function ProcessBlockStreaming
is added in zstd decompressor.

Testing done:
Added two backend tests:
1. streaming decompress test.
2. large data test for both block and streaming decompress.
Added two end to end tests:
1. hive and impala integration. For four compression codecs, write in
hive and read from impala.
2. zstd library and impala integration. Copy a zstd lib compressed file
to HDFS, and read from impala.

Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Reviewed-on: http://gerrit.cloudera.org:8080/15023
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/hdfs-text-scanner.cc
M be/src/exec/hdfs-text-scanner.h
M be/src/util/compress.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
M be/src/util/decompress.h
M bin/rat_exclude_files.txt
A testdata/data/text_large_zstd.txt
A testdata/data/text_large_zstd.zst
A tests/custom_cluster/test_hive_text_codec_interop.py
M tests/query_test/test_compressed_formats.py
11 files changed, 10,000,275 insertions(+), 9 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 10
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15023/4/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/4/be/src/util/decompress.cc@665
PS4, Line 665:     *stream_end = false;
> It seems to me that this line is redundant.
The reason I add it here is that other stream decompressing functions write in this way https://github.infra.cloudera.com/CDH/Impala/blob/cdpd-master/be/src/util/decompress.cc#L103 and https://github.infra.cloudera.com/CDH/Impala/blob/cdpd-master/be/src/util/decompress.cc#L375



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 4
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Jan 2020 23:50:54 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Andrew Sherman (Code Review)" <ge...@cloudera.org>.
Andrew Sherman has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 4:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15023/4/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/4/be/src/util/decompress.cc@665
PS4, Line 665:     *stream_end = false;
It seems to me that this line is redundant.
*stream_end = true
is only set if (input_buffer.pos == input_buffer.size) 
in which case we will break out of the loop.



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 4
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Wed, 29 Jan 2020 00:30:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Abhishek Rawat (Code Review)" <ge...@cloudera.org>.
Abhishek Rawat has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 7: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/15023/7/be/src/util/decompress.cc
File be/src/util/decompress.cc:

http://gerrit.cloudera.org:8080/#/c/15023/7/be/src/util/decompress.cc@673
PS7, Line 673:       ZSTD_DCtx_reset(stream_, ZSTD_reset_parameters); // reset parameters only
As discussed offline, we can safely remove this. ZSTD_decompressStream should take care of resets.



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 7
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Thu, 20 Feb 2020 21:04:20 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9075: Add support for reading zstd text files

Posted by "Xiaomeng Zhang (Code Review)" <ge...@cloudera.org>.
Xiaomeng Zhang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15023 )

Change subject: IMPALA-9075: Add support for reading zstd text files
......................................................................


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/15023/3/tests/custom_cluster/test_hive_text_codec_interop.py
File tests/custom_cluster/test_hive_text_codec_interop.py:

http://gerrit.cloudera.org:8080/#/c/15023/3/tests/custom_cluster/test_hive_text_codec_interop.py@26
PS3, Line 26: 
> flake8: F401 'tests.util.filesystem_utils.get_fs_path' imported but unused
Done


http://gerrit.cloudera.org:8080/#/c/15023/3/tests/custom_cluster/test_hive_text_codec_interop.py@63
PS3, Line 63: 
> flake8: E501 line too long (106 > 90 characters)
Done


http://gerrit.cloudera.org:8080/#/c/15023/3/tests/query_test/test_compressed_formats.py
File tests/query_test/test_compressed_formats.py:

http://gerrit.cloudera.org:8080/#/c/15023/3/tests/query_test/test_compressed_formats.py@264
PS3, Line 264: 
> flake8: E302 expected 2 blank lines, found 1
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15023
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I2adce9fe00190558525fa5cd3d50cf5e0f0b0aa4
Gerrit-Change-Number: 15023
Gerrit-PatchSet: 4
Gerrit-Owner: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Reviewer: Abhishek Rawat <ar...@cloudera.com>
Gerrit-Reviewer: Andrew Sherman <as...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Xiaomeng Zhang <xi...@cloudera.com>
Gerrit-Comment-Date: Tue, 28 Jan 2020 19:13:28 +0000
Gerrit-HasComments: Yes