You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Norbert Luksa (Code Review)" <ge...@cloudera.org> on 2020/02/03 09:29:06 UTC

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Norbert Luksa has uploaded this change for review. ( http://gerrit.cloudera.org:8080/15152


Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................

IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

This feature adds support for schema evolution between date and
timestamp for the ORC scanner. This means that we can have two
tables, one with a date column, another with a timestamp column,
and they can both point to the same ORC file. The result will be
that for the first table everything will be converted to date,
and for the second, everything to timestamp.

In order to do that, the OrcTimestampReader and OrcDateColumnReader
are modified to be able to handle batches of the two types. Their
name now represents the destination Impala type.

Note that the life cycle of a OrcColumnReader is within the life cycle of the
HdfsOrcScanner which only reads a split of an ORC file, and an ORC file can't have
two types for one column.

Tests:
 * Added type conversion tests.
 * Tested manually following the use case steps of the Jira.

Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
---
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M tests/query_test/test_scanners.py
5 files changed, 115 insertions(+), 36 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/15152/1
-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/5585/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 10:16:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/15152/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15152/2//COMMIT_MSG@7
PS2, Line 7: between date and timestamp types
Looking at IMPALA-6373 it seems like Impala usually supports these conversions when there is no loss of information.

Timestamp and Date don't have the same range, but anyway, supporting Date -> Timestamp seems more reasonable.

What is the behavior of other file formats, e.g. Parquet? I don't think we want different behavior.


http://gerrit.cloudera.org:8080/#/c/15152/2//COMMIT_MSG@12
PS2, Line 12: to the same ORC file
nit: "to the same set of ORC files" probably expresses the intent better


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@121
PS2, Line 121:   Status HandleInvalidValue(Tuple* tuple, TErrorCode::type error_code)
nit: Please add comment about what is an invalid value and how it is handled.


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@214
PS2, Line 214: ORC support schema evolution
What does it mean "ORC support schema evolution"?


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@214
PS2, Line 214: Date and Timestamp tables
Does Hive convert in both directions?


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@253
PS2, Line 253: (source_type_ == orc::TypeKind::TIMESTAMP &&
             :           static_cast<orc::TimestampVectorBatch*>(batch_) ==
             :           dynamic_cast<orc::TimestampVectorBatch*>(orc_batch)) ||
             :         (source_type_ == orc::TypeKind::DATE &&
             :           static_cast<orc::LongVectorBatch*>(batch_) ==
             :           dynamic_cast<orc::LongVectorBatch*>(orc_batch))
nit: complicated and duplicated in OrcTimestampReader. Can you put it into a function wi name 'IsDateTime' or stg like that.



-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 17:18:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has abandoned this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Abandoned
-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: abandon
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................

IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

This feature adds support for schema evolution between date and
timestamp for the ORC scanner. This means that we can have two
tables, one with a date column, another with a timestamp column,
and they can both point to the same ORC file. The result will be
that for the first table everything will be converted to date,
and for the second, everything to timestamp.

In order to do that, the OrcTimestampReader and OrcDateColumnReader
are modified to be able to handle batches of the two types. Their
name now represents the destination Impala type.

Note that the life cycle of a OrcColumnReader is within the life
cycle of the HdfsOrcScanner which only reads a split of an ORC
file, and an ORC file can't have two types for one column.

Tests:
 * Added type conversion tests.
 * Tested manually following the use case steps of the Jira.

Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
---
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M tests/query_test/test_scanners.py
5 files changed, 125 insertions(+), 43 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/15152/2
-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 2:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/15152/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15152/1//COMMIT_MSG@20
PS1, Line 20: 
> nit: wrap at 72 chars
Done


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.h@224
PS1, Line 224: tampReader(const orc::Type* node, const SlotDes
> Do we need this, can't we just compare check != nullprt?
I just followed the convention of the other UpdateInputBatch functions where the static casted batch is compared to the dynamic casted one. Maybe we could simplify the DCHECKs everywhere?


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc@211
PS1, Line 211:   if (IsNull(DCHECK_NOTNULL(batch_), row_idx)) {
> nit: extra line
Done


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc@226
PS1, Line 226:     }
             :     *slot = DateValue(ts.DaysSinceUnixEpoch());
             :   }
> This is probably not speed critical, but it could be done faster by ignorin
Done


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc@229
PS1, Line 229: alid())) {
> This will hit a DCHECK if the timestamp is not valid, see https://github.co
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 15:47:41 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 1:

(5 comments)

The design + code seems good to me, but I am not too enthusiastic about the feature itself. We may need to support it, if there are already tables like that (and people want to read them with Impala), but I would prefer not to do it if we don't know whether it is needed. This kind of schema evolution seems generally a bad idea to me, as both Data->Timestamp and Timestamp->Data are lossy conversion in Impala. I also expect some complex work around this code related to timezones and predicate push down in the future, and supporting these two type mappings will make it harder.

http://gerrit.cloudera.org:8080/#/c/15152/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15152/1//COMMIT_MSG@20
PS1, Line 20: f
nit: wrap at 72 chars


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.h@224
PS1, Line 224: static_cast<orc::TimestampVectorBatch*>(batch_)
Do we need this, can't we just compare check != nullprt?


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc
File be/src/exec/orc-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc@211
PS1, Line 211: 
nit: extra line


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc@226
PS1, Line 226:     int64_t nanos = current_batch->nanoseconds.data()[row_idx];
             :     TimestampValue ts = TimestampValue::FromUnixTimeNanos(secs, nanos,
             :         scanner_->state_->local_time_zone());
This is probably not speed critical, but it could be done faster by ignoring nanoseconds and using FromUnixTime() directly.


http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.cc@229
PS1, Line 229: DaysSinceUnixEpoch
This will hit a DCHECK if the timestamp is not valid, see https://github.com/apache/impala/blob/master/be/src/runtime/timestamp-value.inline.h#L94 , so it can be only called after checking the timestamps validity.



-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 1
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 13:41:11 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/5593/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 2
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Comment-Date: Mon, 03 Feb 2020 16:32:29 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Quanlong Huang (Code Review)" <ge...@cloudera.org>.
Quanlong Huang has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 3: Code-Review+1

(1 comment)

The change looks good and concise to me. I think we don't need to worry about predicate push down in the OrcColumnReader layer. Becaues predicates will be pushed to orc::Reader and orc::RowReader instead, OrcColumnReaders just need to materialize the filtered results.

http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15152/1/be/src/exec/orc-column-readers.h@224
PS1, Line 224: ((source_type_ == orc::TypeKind::TIMESTAMP &&
> I just followed the convention of the other UpdateInputBatch functions wher
The main purpose is to call dynamic_cast here to check the type. The static_cast is just used to create something equals to it.

We can't use "!= nullptr" since orc_batch can really be null: 
https://github.com/apache/impala/blob/2c54dbe22507661664b39cb76849f794cf4743d6/be/src/exec/hdfs-orc-scanner.cc#L620
https://github.com/apache/impala/blob/2c54dbe22507661664b39cb76849f794cf4743d6/be/src/exec/orc-column-readers.cc#L358



-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Quanlong Huang <hu...@gmail.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Mon, 24 Feb 2020 08:17:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................

IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

This feature adds support for schema evolution between date and
timestamp for the ORC scanner. This means that we can have two
tables, one with a date column, another with a timestamp column,
and they can both point to the same set of ORC files. The result
will be that for the first table everything will be converted to
date, and for the second, everything to timestamp.

In order to do that, the OrcTimestampReader and OrcDateColumnReader
are modified to be able to handle batches of the two types. Their
name now represents the destination Impala type.

Note that the life cycle of a OrcColumnReader is within the life
cycle of the HdfsOrcScanner which only reads a split of an ORC
file, and an ORC file can't have two types for one column.

Tests:
 * Added type conversion tests.
 * Tested manually following the use case steps of the Jira.

Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
---
M be/src/exec/hdfs-orc-scanner.h
M be/src/exec/orc-column-readers.cc
M be/src/exec/orc-column-readers.h
M be/src/exec/orc-metadata-utils.cc
M testdata/workloads/functional-query/queries/DataErrorsTest/orc-type-checks.test
M tests/query_test/test_scanners.py
6 files changed, 132 insertions(+), 53 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/52/15152/3
-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 3:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/15152/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/15152/2//COMMIT_MSG@7
PS2, Line 7: between date and timestamp types
> Looking at IMPALA-6373 it seems like Impala usually supports these conversi
Parquet does not support this kind of compatibility between date and timestamp. I created date and timestamp tables and set them to point to the same location (as described in the Jira), and inserted some values to both tables. When I try to query any of them, I get an error that the schemas of the files are incompatible.

I think here we do not wish to follow Parquet and other file formats, but Hive.


http://gerrit.cloudera.org:8080/#/c/15152/2//COMMIT_MSG@12
PS2, Line 12: to the same set of O
> nit: "to the same set of ORC files" probably expresses the intent better
Done


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h
File be/src/exec/orc-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@121
PS2, Line 121: 
> nit: Please add comment about what is an invalid value and how it is handle
Done


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@214
PS2, Line 214:  split of an ORC file, an
> Does Hive convert in both directions?
Yes, it does.


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@214
PS2, Line 214: HdfsOrcScanner which only re
> What does it mean "ORC support schema evolution"?
Done


http://gerrit.cloudera.org:8080/#/c/15152/2/be/src/exec/orc-column-readers.h@253
PS2, Line 253: 
             : class OrcDateColumnReader : public OrcDateAndTimeStampReader {
             :  public:
             :   OrcDateColumnReader(const orc::Type* node, const SlotDescriptor* slot_desc,
             :       HdfsOrcScanner* scanner)
             :       : OrcDateAndTimeStampReader(node, slot_desc, scanne
> nit: complicated and duplicated in OrcTimestampReader. Can you put it into 
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 04 Feb 2020 10:01:59 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types

Posted by "Norbert Luksa (Code Review)" <ge...@cloudera.org>.
Norbert Luksa has posted comments on this change. ( http://gerrit.cloudera.org:8080/15152 )

Change subject: IMPALA-9290: ORC scanner should support schema evolution between date and timestamp types
......................................................................


Patch Set 3:

Regarding Csaba's concerns, I checked if Hive supports the same kind of schema evolution for Parquet, since we know that Impala doesn't. Found out that Hive doesn't support it either, only for ORC.


-- 
To view, visit http://gerrit.cloudera.org:8080/15152
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7979ecc61b2ab900090d01bc81e7bb7b28c99c9e
Gerrit-Change-Number: 15152
Gerrit-PatchSet: 3
Gerrit-Owner: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Norbert Luksa <no...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 18 Feb 2020 10:22:31 +0000
Gerrit-HasComments: No