You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org> on 2018/09/26 16:36:55 UTC

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Csaba Ringhofer has uploaded this change for review. ( http://gerrit.cloudera.org:8080/11521


Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to string
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. The same error is returned in both cases - it may make sense
to add a new error message for this kind of timestamp, but as this
error did not occur in production (as far as I know), I thought
that separate error message is not necessary.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
7 files changed, 49 insertions(+), 2 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11521/1
-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 1
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11521

to look at the new patch set (#3).

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to timestamp
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. The same error is returned in both cases - it may make sense
to add a new error message for this kind of timestamp, but as this
error did not occur in production (as far as I know), I thought
that separate error message is not necessary.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
7 files changed, 45 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11521/3
-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 5:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3250/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Oct 2018 09:40:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/864/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Sep 2018 13:40:53 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/813/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Sep 2018 17:21:33 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11521

to look at the new patch set (#4).

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to timestamp
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part is corrupt, then error about corrupt time is
returned.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
8 files changed, 57 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11521/4
-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Zoltan Borok-Nagy, Tim Armstrong, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11521

to look at the new patch set (#5).

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to timestamp
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part is corrupt, then error about corrupt time is
returned.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
8 files changed, 57 insertions(+), 4 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11521/5
-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 4:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11521/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11521/3//COMMIT_MSG@30
PS3, Line 30: 
> Yeah, but the current error message is a bit misleading IMO, since currentl
I have thought a bit about this and added a new error code. The rationale is that if someone will work on the encoder, then the distinct error codes can help in identifying bugs.


http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/exec/parquet-column-readers.cc@697
PS3, Line 697: !TimestampValue::IsValidDate(val->date())
             :       || !TimestampValue::IsValidTime(val->time())
> Would it make sense to merge these two functions and have a single Timestam
I have kept them as separate functions to be able to distinguish between the two kind of errors.

Another reason is that https://gerrit.cloudera.org/#/c/11183/ handles the two errors differently - out of range dates are replaced with not_a_date, while invalid times lead to DCHECK. It is done this way to avoid the performance impact of checking times by relying on callers giving correct values. I plan to check dates with DCHECK too, but that will need some additional changes.


http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/runtime/timestamp-value.h@195
PS3, Line 195: than
> nit: than
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Sep 2018 13:20:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 5: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Oct 2018 13:20:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 4: Code-Review+2

(1 comment)

Thank you for fixing Csaba!!

http://gerrit.cloudera.org:8080/#/c/11521/4/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/11521/4/common/thrift/generate_error_codes.py@370
PS4, Line 370:    "The time of day should be 0 <= and < 24 hour (in nanoseconds)."),
Worth including the bogus value?



-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Sep 2018 19:03:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to timestamp
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. A new error code is added for this case. If both the date
and the time part is corrupt, then error about corrupt time is
returned.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Reviewed-on: http://gerrit.cloudera.org:8080/11521
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M common/thrift/generate_error_codes.py
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
8 files changed, 57 insertions(+), 4 deletions(-)

Approvals:
  Csaba Ringhofer: Looks good to me, approved
  Impala Public Jenkins: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 6
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 3: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Sep 2018 20:22:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/11521

to look at the new patch set (#2).

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................

IMPALA-7595: Check the validity of the time part of Parquet timestamps

Before this fix Impala did not check whether a timestamp's time part
is out of the valid [0, 24 hour) range when reading Parquet files,
so these timestamps were memcopied as they were to slots, leading to
results like:
1970-01-01 -00:00:00.000000001
1970-01-01 24:00:00

Different parts of Impala treat these timestamp differently:
- string conversion leads to invalid representation that cannot be
  converted back to timestamp
- timezone conversions handle the overflowing time part and give
  a valid timestamp result (at least since CCTZ, I did not check
  older versions of Impala)
- Parquet writing inserts these timestamp as they are, so the
  resulting Parquet file will also contain corrupt timestamps

The fix adds a check that converts these corrupt timestamps to NULL,
similarly to the handling of timestamp outside the [1400..10000)
range. The same error is returned in both cases - it may make sense
to add a new error message for this kind of timestamp, but as this
error did not occur in production (as far as I know), I thought
that separate error message is not necessary.

Testing:
- added a new scanner test that reads a corrupted Parquet file
  with edge values

Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
---
M be/src/exec/parquet-column-readers.cc
M be/src/runtime/timestamp-value.h
M testdata/data/README
A testdata/data/out_of_range_time_of_day.parquet
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-abort-on-error.test
M testdata/workloads/functional-query/queries/QueryTest/out-of-range-timestamp-continue-on-error.test
M tests/query_test/test_scanners.py
7 files changed, 45 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/21/11521/2
-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 2
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/814/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Sep 2018 17:28:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/3225/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Sep 2018 16:44:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 4: Code-Review+1


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 4
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Sep 2018 16:07:30 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 3:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/11521/3//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/11521/3//COMMIT_MSG@30
PS3, Line 30: that separate error message is not necessary.
Yeah, but the current error message is a bit misleading IMO, since currently it says something like "dates should be between 1400-01-01 and 9999-12-31".

Or, maybe we can just change the error message to make it more general.


http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/exec/parquet-column-readers.cc@697
PS3, Line 697: !TimestampValue::IsValidDate(val->date())
             :       || !TimestampValue::IsValidTime(val->time())
Would it make sense to merge these two functions and have a single TimestampValue::IsValid() member function instead?


http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/runtime/timestamp-value.h
File be/src/runtime/timestamp-value.h:

http://gerrit.cloudera.org:8080/#/c/11521/3/be/src/runtime/timestamp-value.h@195
PS3, Line 195: then
nit: than



-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 3
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 28 Sep 2018 09:52:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/812/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 1
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Wed, 26 Sep 2018 17:14:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/877/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Oct 2018 10:06:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7595: Check the validity of the time part of Parquet timestamps

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/11521 )

Change subject: IMPALA-7595: Check the validity of the time part of Parquet timestamps
......................................................................


Patch Set 5: Code-Review+2

(1 comment)

Patch set 5 is just rebase + conflict resolution.

http://gerrit.cloudera.org:8080/#/c/11521/4/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

http://gerrit.cloudera.org:8080/#/c/11521/4/common/thrift/generate_error_codes.py@370
PS4, Line 370:   ("PARQUET_TIMESTAMP_INVALID_TIME_OF_DAY", 121,
> Worth including the bogus value?
Other value errors I checked do not include the value itself in the log, so I decided to do the same here.

It may make sense to create a flag like ALLOW_VALUES_IN_LOGS and append the values to relevant log messages.



-- 
To view, visit http://gerrit.cloudera.org:8080/11521
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ibc0ae651b6a0a028c61a15fd069ef9e904231058
Gerrit-Change-Number: 11521
Gerrit-PatchSet: 5
Gerrit-Owner: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 01 Oct 2018 09:39:52 +0000
Gerrit-HasComments: Yes