Posted to dev@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2016/06/03 15:39:29 UTC

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Alex Behm has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/3299

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................

IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.

The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.
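
As a rough illustration of the scenario described above (not the code from this patch), the check can be sketched in Python. The sketch assumes the run-header layout of Parquet's RLE/bit-packed hybrid encoding: a ULEB128 varint whose lowest bit selects a literal (bit-packed) run or a repeated run, with the remaining bits holding the number of 8-value groups or the repeat count. The names CorruptRleError, read_uleb128 and read_run_header are hypothetical and do not appear in rle-encoding.h.

```python
class CorruptRleError(ValueError):
    """Raised for an invalid run header (hypothetical name for this sketch)."""


def read_uleb128(buf, pos):
    """Decode an unsigned LEB128 varint from buf, starting at index pos.

    Returns (value, next_pos).
    """
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise CorruptRleError("truncated run header")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return result, pos
        shift += 7


def read_run_header(buf, pos):
    """Parse one run header and reject a count of 0.

    Returns (kind, count, next_pos), where kind is "literal" or "repeat".
    """
    header, pos = read_uleb128(buf, pos)
    kind = "literal" if header & 1 else "repeat"
    count = header >> 1
    if kind == "literal":
        count *= 8  # literal runs are stored as groups of 8 values
    if count == 0:
        # A run that holds no values can only come from corrupt data.
        raise CorruptRleError(kind + " count of 0")
    return kind, count, pos
```

A caller would typically turn CorruptRleError into a per-file error or a query abort, depending on how the scan is configured to treat corrupt data; that policy is outside this sketch.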

There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.

Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/bad_rle_literal_count.parquet
A testdata/data/bad_rle_repeat_count.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
M tests/query_test/test_scanners.py
9 files changed, 85 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/99/3299/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3299
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Tim Armstrong, Dan Hecht,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/3299

to look at the new patch set (#2).

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................

IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.

The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.

There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.

Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/bad_rle_literal_count.parquet
A testdata/data/bad_rle_repeat_count.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
M tests/query_test/test_scanners.py
9 files changed, 85 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/99/3299/2

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


Patch Set 2: Code-Review+2

Assuming the S3 build works.


Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


Patch Set 3: Code-Review+2 Verified+1

Private HDFS and S3 builds passed. Merging manually:
http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3339/
http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test-s3/167/


Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


Patch Set 1: Code-Review+2

Looks fine other than Tim's comments.


Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


Patch Set 1: Code-Review+1


Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: No

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3299/1/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

Line 284: TEST(Rle, BitWidthZeroRepeated) {
> It's good that we have some coverage of this.
I was thinking the same thing.


http://gerrit.cloudera.org:8080/#/c/3299/1/testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test:

PS1, Line 8: hdfs
> Does this work on all filesystems?
The test files used in test_file_metadata_discrepancy() follow exactly the same pattern, and they appear to work on the other filesystems. I've been struggling to complete a private S3 run (strange hangs), but I'll be sure to finish one before I GVM (or fix this test).


http://gerrit.cloudera.org:8080/#/c/3299/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS1, Line 234: \
> Continuation isn't needed, since it's in parentheses.
Done
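
For readers unfamiliar with the rule behind this comment: Python joins lines implicitly inside parentheses, brackets and braces, so a trailing backslash there is redundant. A minimal sketch follows; the helper pick_files and the table name are hypothetical and are not taken from test_scanners.py.

```python
# Hypothetical helper, used only to illustrate the syntax.
def pick_files(table_name, file_names):
    return [(table_name, name) for name in file_names]


# Legal, but the backslash is redundant inside the open parenthesis.
with_backslash = pick_files("bad_rle_counts", \
    ["bad_rle_literal_count.parquet", "bad_rle_repeat_count.parquet"])

# Equivalent: the open parenthesis already continues the logical line.
without_backslash = pick_files("bad_rle_counts",
    ["bad_rle_literal_count.parquet", "bad_rle_repeat_count.parquet"])
```

Both calls build the same list, so dropping the backslash changes only the style, not the behavior.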



Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has submitted this change and it was merged.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Adds handling and testing for a specific Parquet data corruption
scenario with plain dictionary encoded values.

The problematic scenario is when the repeat or literal count of
the RLE-encoded dictionary indexes is decoded as 0 - an invalid value.

There are several other cases of data corruption that are not yet
handled gracefully. This patch only handles one specific case.

Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Reviewed-on: http://gerrit.cloudera.org:8080/3299
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Alex Behm <al...@cloudera.com>
---
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/util/rle-encoding.h
M be/src/util/rle-test.cc
M testdata/data/README
A testdata/data/bad_rle_literal_count.parquet
A testdata/data/bad_rle_repeat_count.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
M tests/query_test/test_scanners.py
9 files changed, 85 insertions(+), 3 deletions(-)

Approvals:
  Alex Behm: Looks good to me, approved; Verified




Gerrit-MessageType: merged
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>

[Impala-CR](cdh5-trunk) IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3646: Handle corrupt RLE literal or repeat counts of 0.
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/3299/1/be/src/util/rle-test.cc
File be/src/util/rle-test.cc:

Line 284: TEST(Rle, BitWidthZeroRepeated) {
It's good that we have some coverage of this.


http://gerrit.cloudera.org:8080/#/c/3299/1/testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test:

PS1, Line 8: hdfs
Does this work on all filesystems?


http://gerrit.cloudera.org:8080/#/c/3299/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

PS1, Line 234: \
Continuation isn't needed, since it's in parentheses.



Gerrit-MessageType: comment
Gerrit-Change-Id: Ibf406c82cdded37966f09c81e4cc1446d2b60d63
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes