You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Tianyi Wang (Code Review)" <ge...@cloudera.org> on 2017/09/11 17:24:51 UTC

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Tianyi Wang has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/8030

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................

IMPALA-5250: Unify decompressor output_length semantics

This patch makes the semantics output_length parameter in
Codec::ProcessBlock to be the same. In existing code different
decompressor treats output_length differently:
1. SnappyDecompressor needs output_length to be greater than or equal
   to actual decompressed length, but it does not set it to actual
   decompressed length after decompression.
2. SnappyBlockDecompressor and Lz4Decompressor require output_length to
   be exactly the same as actual decompressed length, otherwise
   decompression fails.
3. Other decompressors need output_length to be greater than or equal to
   actual decompressed length and will set it to actual decompressed
   length if oversized.
This inconsistency leads to a bug where the error message is
undeterministic when the compressed block is corrupted. This patch makes
all decompressor behave as 3 and adds a testcase checking that
decompressors can handle oversized output buffer correctly.
Lz4Decompressor will use the "safe" version of decompression function
instead of the "fast" version, for the latter is insecure with corrupted
data and requires decompressed length to be known.

Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
---
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
2 files changed, 31 insertions(+), 24 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/1
-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Hello Alex Behm,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/8030

to look at the new patch set (#4).

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................

IMPALA-5250: Unify decompressor output_length semantics

This patch makes the semantics of the output_length parameter in
Codec::ProcessBlock to be the same across all codecs. In existing code
different decompressor treats output_length differently:
1. SnappyDecompressor needs output_length to be greater than or equal
   to the actual decompressed length, but it does not set it to the
   actual decompressed length after decompression.
2. SnappyBlockDecompressor and Lz4Decompressor require output_length to
   be exactly the same as the actual decompressed length, otherwise
   decompression fails.
3. Other decompressors need output_length to be greater than or equal to
   the actual decompressed length and will set it to actual decompressed
   length if oversized.
This inconsistency leads to a bug where the error message is
undeterministic when the compressed block is corrupted. This patch makes
all decompressor behave like a modified version of 3:
Output_length should be greater than or equal to the actual decompressed
length and it will be set to actual decompressed length if oversized. A
decompression failure sets it to 0.
Lz4Decompressor will use the "safe" instead of the "fast" decompression
function, for the latter is insecure with corrupted data and requires
the decompressed length to be known.

Testing: A testcase is added checking that the decompressors can handle
an oversized output buffer correctly. A regression test for the exact
case described in IMPALA-5250 is also added. A benchmark is run on a
16-node cluster testing the performance impact of the LZ4Decompressor
change and no performance regression is found.

Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
---
M be/src/util/codec.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 74 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/4
-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 )

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 4: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-Change-Number: 8030
Gerrit-PatchSet: 4
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 17:43:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 )

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 5: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-Change-Number: 8030
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 18:18:43 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8030/3//COMMIT_MSG
Commit Message:

Line 33: 
> Add a separate section for testing. Mention the tests you added and the per
Done


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 196:       int64_t* output_len, uint8_t** output, bool output_preallocated) {
> Your interpretation is correct.
A test cases is added.


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 426: 
> Agree this in/out pattern is not great and we should consider your proposal
All output_length will be set to 0 if the decompression fails now.


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG
Commit Message:

Line 26: instead of the "fast" version, for the latter is insecure with corrupted
> Not sure how useful compression-test.cc is. I was thinking more along the l
Running single_node_perf_run.py locally shows that overall performance is 1.78% slower after this change. Given LZ4 is not used in parquet, we may leave LZ4 inconsistent with others for now.


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 )

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/1255/


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-Change-Number: 8030
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 18:19:27 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/8030 )

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................

IMPALA-5250: Unify decompressor output_length semantics

This patch makes the semantics of the output_length parameter in
Codec::ProcessBlock to be the same across all codecs. In existing code
different decompressor treats output_length differently:
1. SnappyDecompressor needs output_length to be greater than or equal
   to the actual decompressed length, but it does not set it to the
   actual decompressed length after decompression.
2. SnappyBlockDecompressor and Lz4Decompressor require output_length to
   be exactly the same as the actual decompressed length, otherwise
   decompression fails.
3. Other decompressors need output_length to be greater than or equal to
   the actual decompressed length and will set it to actual decompressed
   length if oversized.
This inconsistency leads to a bug where the error message is
undeterministic when the compressed block is corrupted. This patch makes
all decompressor behave like a modified version of 3:
Output_length should be greater than or equal to the actual decompressed
length and it will be set to actual decompressed length if oversized. A
decompression failure sets it to 0.
Lz4Decompressor will use the "safe" instead of the "fast" decompression
function, for the latter is insecure with corrupted data and requires
the decompressed length to be known.

Testing: A testcase is added checking that the decompressors can handle
an oversized output buffer correctly. A regression test for the exact
case described in IMPALA-5250 is also added. A benchmark is run on a
16-node cluster testing the performance impact of the LZ4Decompressor
change and no performance regression is found.

Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Reviewed-on: http://gerrit.cloudera.org:8080/8030
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M be/src/util/codec.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 74 insertions(+), 45 deletions(-)

Approvals:
  Alex Behm: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-Change-Number: 8030
Gerrit-PatchSet: 6
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has uploaded a new patch set (#3).

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................

IMPALA-5250: Unify decompressor output_length semantics

This patch makes the semantics of the output_length parameter in
Codec::ProcessBlock to be the same across all codecs. In existing code
different decompressor treats output_length differently:
1. SnappyDecompressor needs output_length to be greater than or equal
   to the actual decompressed length, but it does not set it to the
   actual decompressed length after decompression.
2. SnappyBlockDecompressor and Lz4Decompressor require output_length to
   be exactly the same as the actual decompressed length, otherwise
   decompression fails.
3. Other decompressors need output_length to be greater than or equal to
   the actual decompressed length and will set it to actual decompressed
   length if oversized.
This inconsistency leads to a bug where the error message is
undeterministic when the compressed block is corrupted. This patch makes
all decompressor behave like a modified version of 3:
Output_length should be greater than or equal to the actual decompressed
length and it will be set to actual decompressed length if oversized. A
decompression failure sets it to 0.
A testcase is added checking that decompressors can handle an oversized
output buffer correctly.
Lz4Decompressor will use the "safe" instead of the "fast" decompression
function, for the latter is insecure with corrupted data and requires
the decompressed length to be known. A benchmark is run on a 16-node
cluster and no performance impact is found.

Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
---
M be/src/util/codec.h
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
3 files changed, 74 insertions(+), 45 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/3
-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/8030 )

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-Change-Number: 8030
Gerrit-PatchSet: 5
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Sep 2017 22:25:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 1:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG
Commit Message:

Line 9: This patch makes the semantics output_length parameter in
the semantics of the output_length


Line 10: Codec::ProcessBlock to be the same. In existing code different
Codec::ProcessBlock() to be the same across all codecs. In the existing code different decompressors treat output_length differently.


Line 13:    to actual decompressed length, but it does not set it to actual
to the actual decompressed length


Line 16:    be exactly the same as actual decompressed length, otherwise
as the actual decompressed length


Line 19:    actual decompressed length and will set it to actual decompressed
the actual decompressed length


Line 21: This inconsistency leads to a bug where the error message is
This inconsistency lead to a bug where ....


Line 24: decompressors can handle oversized output buffer correctly.
can handle an oversized output buffer correctly.


Line 25: Lz4Decompressor will use the "safe" version of decompression function
will use the "safe" instead of the "fast" decompression function


Line 26: instead of the "fast" version, for the latter is insecure with corrupted
We use LZ4 for compressing exchange data, so we check the perf impact of these changes. I can help you with the details.


Line 27: data and requires decompressed length to be known.
and requires the decompressed length to be known


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 153:   // Test the behavior when the decompressor is given too little / too much space
"." at end of sentence


Line 155:   // correct output size when the space is enough, and does not write beyond the output
the correct output size


Line 157:   void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec *decompressor,
Out style is "Codec* compressor"


Line 196:     }
I think we should test more directly the scenario mentioned in the JIRA where a corrupt file can lead to the output_len not being set to 0 properly.


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is
Mention what output_len is set to if a non-OK status is returned.


Line 427: // also updated with actual output size.
the actual output size


Line 575:   // LZ4_decompress_fast will cause a segmentation fault if passed a nullptr output.
Comment still relevant?


Line 579:   if (ret < 0) {
single line


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has uploaded a new patch set (#2).

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................

IMPALA-5250: Unify decompressor output_length semantics

This patch makes the semantics of the output_length parameter in
Codec::ProcessBlock to be the same across all codecs. In existing code
different decompressor treats output_length differently:
1. SnappyDecompressor needs output_length to be greater than or equal
   to the actual decompressed length, but it does not set it to the
   actual decompressed length after decompression.
2. SnappyBlockDecompressor and Lz4Decompressor require output_length to
   be exactly the same as the actual decompressed length, otherwise
   decompression fails.
3. Other decompressors need output_length to be greater than or equal to
   the actual decompressed length and will set it to actual decompressed
   length if oversized.
This inconsistency leads to a bug where the error message is
undeterministic when the compressed block is corrupted. This patch makes
all decompressor behave as 3 and adds a testcase checking that
decompressors can handle an oversized output buffer correctly.
Lz4Decompressor will use the "safe" instead of the "fast" decompression
function, for the latter is insecure with corrupted data and requires
the decompressed length to be known.

Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
---
M be/src/util/decompress-test.cc
M be/src/util/decompress.cc
2 files changed, 31 insertions(+), 27 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/30/8030/2
-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Tianyi Wang (Code Review)" <ge...@cloudera.org>.
Tianyi Wang has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 1:

(18 comments)

http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG
Commit Message:

Line 9: This patch makes the semantics output_length parameter in
> the semantics of the output_length
Done


Line 10: Codec::ProcessBlock to be the same. In existing code different
> Codec::ProcessBlock() to be the same across all codecs. In the existing cod
Done


Line 13:    to actual decompressed length, but it does not set it to actual
> to the actual decompressed length
Done


Line 16:    be exactly the same as actual decompressed length, otherwise
> as the actual decompressed length
Done


Line 19:    actual decompressed length and will set it to actual decompressed
> the actual decompressed length
Done


Line 21: This inconsistency leads to a bug where the error message is
> This inconsistency lead to a bug where ....
Do you mean changing it to plural?


Line 24: decompressors can handle oversized output buffer correctly.
> can handle an oversized output buffer correctly.
Done


Line 25: Lz4Decompressor will use the "safe" version of decompression function
> will use the "safe" instead of the "fast" decompression function
Done


Line 26: instead of the "fast" version, for the latter is insecure with corrupted
> We use LZ4 for compressing exchange data, so we check the perf impact of th
Is there an individual benchmark?
There is one for compression in be/src/experiments/compression-test.cc.


Line 27: data and requires decompressed length to be known.
> and requires the decompressed length to be known
Done


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 153:   // Test the behavior when the decompressor is given too little / too much space
> "." at end of sentence
Done


Line 155:   // correct output size when the space is enough, and does not write beyond the output
> the correct output size
Done


Line 157:   void DecompressOverUnderSizedOutputBuffer(Codec *compressor, Codec *decompressor,
> Out style is "Codec* compressor"
Done


Line 196:     }
> I think we should test more directly the scenario mentioned in the JIRA whe
I'm not sure if my interpretation of JIRA is correct. In my understanding:
1. The parquet file is corrupted so either current_page_header_.uncompressed_page_size is greater or current_page_header_.compressed_page_size is less than what it should be. 
2. Either way we allocated a larger buffer and the decompressor decompressed the (probably incomplete) data successfully. 
3. The part of the buffer not written by the decompressor is uninitilized. Because the decompressor returned OK, somehow we read this uninitilized part later and trigger undeterministic errors.
The problem here is in 2: The decompressor should report how much data it decompressed even if the decompression is successful. Then ReadDataPage() can check whether it is the same as current_page_header_.uncompressed_page_size.
So I think we should test whether output_len is set properly with an successful decompression. I don't understand "set to 0" part. When should it be set to 0?


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is
> Mention what output_len is set to if a non-OK status is returned.
Done. It is set to 0 in this case, while other decompressor may leave it unmodified or set it to an arbitrary intermediate value.
I think the complication here is caused by the design that 'output_len' and 'output' are output parameters. When a parameter is both input and output, and could be passed as reference further into callees of callee, there could be all kinds of bugs forgetting to set the correct value here and there. I think if output and output_len is in the return value, creating such bugs is much more difficult, and the code will be much easier to follow since the data flow is clear and explicit.

I think we should have a Status<T> type for return values. It represents either a value or an error status, and then we can avoid output parameters in new codes.


Line 427: // also updated with actual output size.
> the actual output size
Done


Line 575:   // LZ4_decompress_fast will cause a segmentation fault if passed a nullptr output.
> Comment still relevant?
Removed.


Line 579:   if (ret < 0) {
> single line
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/8030/1//COMMIT_MSG
Commit Message:

Line 26: instead of the "fast" version, for the latter is insecure with corrupted
> Is there an individual benchmark?
Not sure how useful compression-test.cc is. I was thinking more along the lines of end-to-end tests that have heavy data exchanges.


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress-test.cc
File be/src/util/decompress-test.cc:

Line 196:     }
> I'm not sure if my interpretation of JIRA is correct. In my understanding:
Your interpretation is correct.

I'm just referring to a specific broken file I've seen where I believe the output_len should be 0. Can you please make sure that your change fixes the bug for that specific broken file and check whether these tests cover that scenario?


http://gerrit.cloudera.org:8080/#/c/8030/1/be/src/util/decompress.cc
File be/src/util/decompress.cc:

Line 426: // If size_only is false, output buffer size must be at least output_len. *output_len is
> Done. It is set to 0 in this case, while other decompressor may leave it un
Agree this in/out pattern is not great and we should consider your proposal (but not in this patch).

I'm ok with setting to 0 everywhere, but it might be easier to just leave output_len untouched. I saw in the code that in some places it is set to 0, but not in others. My main ask is for consistency and documenting the behavior in a comment.


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5250: Unify decompressor output length semantics

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-5250: Unify decompressor output_length semantics
......................................................................


Patch Set 3: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/8030/3//COMMIT_MSG
Commit Message:

Line 33: 
Add a separate section for testing. Mention the tests you added and the perf validation you did.


-- 
To view, visit http://gerrit.cloudera.org:8080/8030
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: Ifd42942b169921a7eb53940c3762bc45bb82a993
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Tianyi Wang <tw...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Tianyi Wang <tw...@cloudera.com>
Gerrit-HasComments: Yes