Posted to dev@impala.apache.org by "anujphadke (Code Review)" <ge...@cloudera.org> on 2016/06/29 20:50:17 UTC

[Impala-CR](cdh5-2.5.0 5.7.x) IMPALA-3441, IMPALA-3659: check for malformed Avro data

Hello Internal Jenkins, Dan Hecht,

I'd like you to do a code review.  Please visit

    http://gerrit.cloudera.org:8080/3537

to review the following change.

Change subject: IMPALA-3441, IMPALA-3659: check for malformed Avro data
......................................................................

IMPALA-3441, IMPALA-3659: check for malformed Avro data

This patch adds error checking to the Avro scanner (both the codegen'd
and interpreted paths), including out-of-bounds checks and data
validity checks.
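
For readers unfamiliar with the encoding, here is a minimal sketch of what
an out-of-bounds check on a zig-zag varlen long can look like. It is not the
code from this patch: DecodeResult and DecodeZigZagLong are hypothetical
names made up for illustration, and the only assumption is Avro's standard
encoding (7 data bits per byte, high bit set means "more bytes follow").

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch; not Impala's actual API.
struct DecodeResult {
  bool ok;          // false if the input was truncated or malformed
  int64_t value;    // decoded value, meaningful only when ok is true
  size_t consumed;  // bytes read from the buffer, meaningful only when ok is true
};

// Decodes one zig-zag varlen long from [buf, buf + len).
// Never reads past buf + len and never reads more than 10 bytes.
DecodeResult DecodeZigZagLong(const uint8_t* buf, size_t len) {
  const size_t kMaxBytes = 10;  // 64 bits need at most ten 7-bit groups
  uint64_t raw = 0;
  for (size_t i = 0; i < kMaxBytes; ++i) {
    // Out-of-bounds check: the value continues but the buffer has ended.
    if (i >= len) return {false, 0, 0};
    const uint8_t byte = buf[i];
    // Validity check: the tenth byte may only carry the top bit of the
    // value, so anything above 1 (including a continuation bit) is malformed.
    if (i == kMaxBytes - 1 && byte > 1) return {false, 0, 0};
    raw |= static_cast<uint64_t>(byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) {
      // Undo the zig-zag mapping: even raw values are non-negative,
      // odd raw values are negative.
      const int64_t value =
          static_cast<int64_t>(raw >> 1) ^ -static_cast<int64_t>(raw & 1);
      return {true, value, i + 1};
    }
  }
  return {false, 0, 0};  // not reached; kept so every path returns a value
}
```

With this sketch the single byte 0x02 decodes to 1, 0x01 decodes to -1, and
a lone 0x80 is rejected because the continuation bit promises a byte that
is not there.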

I ran a local benchmark using the following queries:
  set num_scanner_threads=1;
  select count(i) from default.avro_bigints_big; # file contains only longs
  select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema

Both benchmark queries see negligible or no performance impact.

This patch adds a new Avro scanner unit test and an end-to-end test
that queries several corrupted files, and updates the zig-zag
varlen int unit test.
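
The corrupted files added under testdata/bad_avro_snap are named for the
fault each one carries (invalid_union, negative_string_len, truncated_float,
truncated_string). As a rough sketch of the validity checks those names
suggest, assuming the length or branch index has already been decoded, the
helpers below are hypothetical and are not taken from the patch:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical status codes for this sketch; the real error codes are
// generated from common/thrift/generate_error_codes.py and will differ.
enum class FieldStatus { kOk, kNegativeLength, kTruncated, kInvalidUnionBranch };

// A string or bytes field is a decoded length followed by that many bytes.
FieldStatus CheckStringField(int64_t decoded_len, size_t bytes_remaining) {
  if (decoded_len < 0) return FieldStatus::kNegativeLength;
  // The cast is safe because decoded_len is known to be non-negative here.
  if (static_cast<uint64_t>(decoded_len) > bytes_remaining) {
    return FieldStatus::kTruncated;
  }
  return FieldStatus::kOk;
}

// A fixed-width field (4 bytes for float, 8 for double) must fit entirely
// in what is left of the buffer.
FieldStatus CheckFixedField(size_t width, size_t bytes_remaining) {
  return width > bytes_remaining ? FieldStatus::kTruncated : FieldStatus::kOk;
}

// A union value starts with a branch index that must name one of the
// branches declared in the schema.
FieldStatus CheckUnionBranch(int64_t branch, size_t num_branches) {
  if (branch < 0 || static_cast<uint64_t>(branch) >= num_branches) {
    return FieldStatus::kInvalidUnionBranch;
  }
  return FieldStatus::kOk;
}
```

Each helper only classifies the input; what the scanner does on failure
(skip the row, abort the query) is a separate decision not shown here.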

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Reviewed-on: http://gerrit.cloudera.org:8080/3072
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit fbb41c69a0102796979628b1a4925e96cbc967f0)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/13615
Reviewed-by: Anuj Phadke <ap...@cloudera.com>
Tested-by: Anuj Phadke <ap...@cloudera.com>
(cherry picked from commit ed6291885407e522ea58f208b24ab6fd127be072)
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/base-sequence-scanner.h
M be/src/exec/hdfs-avro-scanner-ir.cc
A be/src/exec/hdfs-avro-scanner-test.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/read-write-util.cc
M be/src/exec/read-write-util.h
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/exec/scanner-context.inline.h
M be/src/exec/zigzag-test.cc
M common/thrift/generate_error_codes.py
A testdata/bad_avro_snap/README
A testdata/bad_avro_snap/invalid_union.avro
A testdata/bad_avro_snap/negative_string_len.avro
A testdata/bad_avro_snap/truncated_float.avro
A testdata/bad_avro_snap/truncated_string.avro
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/avro-errors.test
M tests/common/test_result_verifier.py
M tests/data_errors/test_data_errors.py
26 files changed, 1,202 insertions(+), 199 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/37/3537/1
-- 
To view, visit http://gerrit.cloudera.org:8080/3537
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: anujphadke <ap...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Skye Wanderman-Milne <sk...@cloudera.com>

[Impala-CR](cdh5-2.5.0 5.7.x) IMPALA-3441, IMPALA-3659: check for malformed Avro data

Posted by "anujphadke (Code Review)" <ge...@cloudera.org>.
anujphadke has submitted this change and it was merged.

Change subject: IMPALA-3441, IMPALA-3659: check for malformed Avro data
......................................................................


IMPALA-3441, IMPALA-3659: check for malformed Avro data

This patch adds error checking to the Avro scanner (both the codegen'd
and interpreted paths), including out-of-bounds checks and data
validity checks.

I ran a local benchmark using the following queries:
  set num_scanner_threads=1;
  select count(i) from default.avro_bigints_big; # file contains only longs
  select max(l_orderkey) from biglineitem_avro; # file has tpch.lineitem schema

Both benchmark queries see negligible or no performance impact.

This patch adds a new Avro scanner unit test and an end-to-end test
that queries several corrupted files, and updates the zig-zag
varlen int unit test.

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Reviewed-on: http://gerrit.cloudera.org:8080/3072
Reviewed-by: Dan Hecht <dh...@cloudera.com>
Tested-by: Internal Jenkins
(cherry picked from commit fbb41c69a0102796979628b1a4925e96cbc967f0)
Reviewed-on: http://gerrit.sjc.cloudera.com:8080/13615
Reviewed-by: Anuj Phadke <ap...@cloudera.com>
Tested-by: Anuj Phadke <ap...@cloudera.com>
(cherry picked from commit ed6291885407e522ea58f208b24ab6fd127be072)
Reviewed-on: http://gerrit.cloudera.org:8080/3537
Reviewed-by: anujphadke <ap...@cloudera.com>
Tested-by: anujphadke <ap...@cloudera.com>
---
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/base-sequence-scanner.h
M be/src/exec/hdfs-avro-scanner-ir.cc
A be/src/exec/hdfs-avro-scanner-test.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-avro-table-writer.cc
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/read-write-util.cc
M be/src/exec/read-write-util.h
M be/src/exec/scanner-context.cc
M be/src/exec/scanner-context.h
M be/src/exec/scanner-context.inline.h
M be/src/exec/zigzag-test.cc
M common/thrift/generate_error_codes.py
A testdata/bad_avro_snap/README
A testdata/bad_avro_snap/invalid_union.avro
A testdata/bad_avro_snap/negative_string_len.avro
A testdata/bad_avro_snap/truncated_float.avro
A testdata/bad_avro_snap/truncated_string.avro
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/DataErrorsTest/avro-errors.test
M tests/common/test_result_verifier.py
M tests/data_errors/test_data_errors.py
26 files changed, 1,202 insertions(+), 199 deletions(-)

Approvals:
  anujphadke: Looks good to me, approved; Verified




Gerrit-MessageType: merged
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 2
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: anujphadke <ap...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Skye Wanderman-Milne <sk...@cloudera.com>
Gerrit-Reviewer: anujphadke <ap...@cloudera.com>

[Impala-CR](cdh5-2.5.0 5.7.x) IMPALA-3441, IMPALA-3659: check for malformed Avro data

Posted by "anujphadke (Code Review)" <ge...@cloudera.org>.
anujphadke has posted comments on this change.

Change subject: IMPALA-3441, IMPALA-3659: check for malformed Avro data
......................................................................


Patch Set 1: Code-Review+2 Verified+1

http://sandbox.jenkins.cloudera.com/view/Impala/view/Private-Utility/job/impala-private-build-and-test/3551/


Gerrit-MessageType: comment
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 1
Gerrit-Project: Impala
Gerrit-Branch: cdh5-2.5.0_5.7.x
Gerrit-Owner: anujphadke <ap...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Skye Wanderman-Milne <sk...@cloudera.com>
Gerrit-Reviewer: anujphadke <ap...@cloudera.com>
Gerrit-HasComments: No