You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2018/10/01 09:32:06 UTC

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11521

to look at the new patch set (#5).

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to timestamp
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part is corrupt, then error about corrupt time is
returned.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
8 files changed, 57 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11521/5
-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>