Posted to dev@impala.apache.org by "Tim Armstrong (Code Review)" <ge...@cloudera.org> on 2016/06/15 21:29:46 UTC

[Impala-CR](cdh5-trunk) IMPALA-3745: parquet invalid data handling

Tim Armstrong has uploaded a new patch set (#3).

Change subject: IMPALA-3745: parquet invalid data handling
......................................................................

IMPALA-3745: parquet invalid data handling

Added checks/error handling:
* Negative string lengths while decoding dictionary or data page.
* Buffer overruns while decoding dictionary or data page.
* Some metadata FILE_CHECKs were converted to statuses.

Testing:
Unit tests for:
* decoding of strings with negative lengths
* truncation of all parquet types
* dictionary creation correctly handling error returns from Decode().

End-to-end tests for handling of negative string lengths in
dictionary- and plain-encoded data in corrupt files, and for
handling of buffer overruns for string data. The corrupted
parquet files were generated by hacking Impala's parquet
writer to write invalid lengths, and by hacking it to
write plain-encoded data instead of dictionary-encoded
data by default.

Performance:
set num_nodes=1;
set num_scanner_threads=1;
select * from biglineitem where l_orderkey = -1;

I inspected MaterializeTupleTime. Before the change the average was 8.24s
and after it was 8.36s (a 1.4% slowdown, within the standard deviation of
1.8%).

Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
---
M be/src/exec/data-source-scan-node.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/parquet-common.h
M be/src/exec/parquet-plain-test.cc
M be/src/util/dict-encoding.h
M be/src/util/dict-test.cc
M common/thrift/generate_error_codes.py
A testdata/bad_parquet_data/README
A testdata/bad_parquet_data/dict-encoded-negative-len.parq
A testdata/bad_parquet_data/dict-encoded-out-of-bounds.parq
A testdata/bad_parquet_data/plain-encoded-negative-len.parq
A testdata/bad_parquet_data/plain-encoded-out-of-bounds.parq
M testdata/bin/create-load-data.sh
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
M testdata/workloads/functional-query/queries/QueryTest/parquet-continue-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts-abort.test
M testdata/workloads/functional-query/queries/QueryTest/parquet-corrupt-rle-counts.test
M tests/query_test/test_scanners.py
19 files changed, 272 insertions(+), 78 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/87/3387/3
-- 
To view, visit http://gerrit.cloudera.org:8080/3387
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Id565a2ccb7b82f9f92cc3b07f05642a3a835bece
Gerrit-PatchSet: 3
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: anujphadke <ap...@cloudera.com>