Posted to dev@impala.apache.org by "Skye Wanderman-Milne (Code Review)" <ge...@cloudera.org> on 2016/05/18 22:57:33 UTC

[Impala-CR](cdh5-trunk) IMPALA-3441: check for malformed Avro data

Skye Wanderman-Milne has uploaded a new patch set (#4).

Change subject: IMPALA-3441: check for malformed Avro data
......................................................................

IMPALA-3441: check for malformed Avro data

This patch adds error checking to the Avro scanner (both the codegen'd
and interpreted paths), including out-of-bounds checks and data
validity checks.
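
As a rough illustration of the two kinds of checks, here is a minimal
sketch and not the code in this patch. ReadZLong and ReadString are
hypothetical helper names, and the raw-pointer interface is an
assumption made for brevity. The decoder refuses to read past the end
of the buffer (out-of-bounds check) and rejects negative or oversized
string lengths (data validity check):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not Impala's actual API): decodes one Avro
// zigzag-encoded varint from [*buf, end). Returns false if the input is
// truncated or the varint runs longer than 10 bytes.
bool ReadZLong(const uint8_t** buf, const uint8_t* end, int64_t* out) {
  uint64_t raw = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (*buf >= end) return false;  // out-of-bounds check
    uint8_t byte = *(*buf)++;
    raw |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      // Zigzag decode: the low bit carries the sign.
      *out = static_cast<int64_t>(raw >> 1) ^ -static_cast<int64_t>(raw & 1);
      return true;
    }
  }
  return false;  // continuation bit still set after 10 bytes: malformed
}

// Hypothetical helper: reads a length-prefixed Avro string without copying.
// Rejects negative lengths and lengths that exceed the remaining bytes.
bool ReadString(const uint8_t** buf, const uint8_t* end,
                const uint8_t** str, int64_t* len) {
  if (!ReadZLong(buf, end, len)) return false;
  if (*len < 0) return false;           // data validity: negative length
  if (*len > end - *buf) return false;  // data validity: truncated string
  *str = *buf;
  *buf += *len;
  return true;
}
```

Both helpers report failure through the return value, so a caller can
turn a malformed record into a query error instead of reading invalid
memory.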

I ran a local benchmark using the following query:
  set num_scanner_threads=1;
  select max(i) from default.avro_ints_big;

where avro_ints_big is an Avro table with a single int column
containing ~90MM values. With this patch, the total query time goes
from 1.6s to 1.8s (14% increase), with the MaterializeTupleTime going
from 975ms to 1195ms (22% increase).

The patch also adds a new Avro scanner unit test, as well as an
end-to-end test that queries several corrupted files.

Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
---
M be/src/exec/CMakeLists.txt
M be/src/exec/base-sequence-scanner.cc
M be/src/exec/base-sequence-scanner.h
M be/src/exec/hdfs-avro-scanner-ir.cc
A be/src/exec/hdfs-avro-scanner-test.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-avro-scanner.h
M be/src/exec/hdfs-scanner.cc
M be/src/exec/hdfs-scanner.h
M be/src/exec/read-write-util.cc
M be/src/exec/read-write-util.h
M be/src/exec/scanner-context.inline.h
M be/src/exec/zigzag-test.cc
M common/thrift/generate_error_codes.py
A testdata/bad_avro_snap/README
A testdata/bad_avro_snap/invalid_union.avro
A testdata/bad_avro_snap/negative_string_len.avro
A testdata/bad_avro_snap/truncated_string.avro
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M tests/data_errors/test_data_errors.py
21 files changed, 810 insertions(+), 182 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/72/3072/4
-- 
To view, visit http://gerrit.cloudera.org:8080/3072
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I801a11c496a128e02c564c2a9c44baa5a97be132
Gerrit-PatchSet: 4
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Skye Wanderman-Milne <sk...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Skye Wanderman-Milne <sk...@cloudera.com>