You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Matthew Jacobs (Code Review)" <ge...@cloudera.org> on 2017/01/31 23:04:41 UTC

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Matthew Jacobs has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/5840

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Neither Impala nor the Kudu client
validates the schema immediately before reading, so Impala
may attempt to dereference pointers that aren't there. This
happens if a string column is dropped and then a new, non
string column is added with the old string column's name.

This change adds validation after opening a scan token to
ensure the projection schema matches the expected schema.
The scan is guaranteed to be valid while the KuduScanner is
open, even if the schema changes concurrently.

Testing: Adds a test case that previously would crash Impala.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/runtime/descriptors.h
M common/thrift/generate_error_codes.py
M tests/query_test/test_kudu.py
7 files changed, 99 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/1
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5840/7/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

Line 101:   def test_kudu_col_changed(self, cursor, kudu_client, unique_database):
> Tests lgtm, but there seems to be a lot of repeated boilerplate stuff. Mayb
I did the refactoring below but there are enough small differences between these functions that it's not working out nicely to refactor.


Line 106:     kudu_tbl_name = "impala::%s.foo" % unique_database
> seems worth factoring out this function and adding a comment that it must b
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 6: Code-Review+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 9:

Build started: http://jenkins.impala.io:8080/job/gerrit-verify-dryrun/283/

-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has uploaded a new patch set (#4).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner, but the issue is
that during planning, Impala assumes the types of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu scanner to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. A test
case was added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/runtime/descriptors.h
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/query_test/test_kudu.py
6 files changed, 150 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/4
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 4
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5840/2/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

Line 149: 
> I think Thomas is right and the query may crash if there is a difference in
I couldn't reproduce any failures, but looking at TupleDescriptor I agree it seems like it could be possible to crash. Good call Thomas & Alex. I'll check that here too.

Will the Kudu-side validation check that the projection is 100% identical to what the scan token expects? Nullability etc. included?
In theory it should, thought it looks like it's missing on the Kudu side as well, so I filed a JIRA on them https://issues.apache.org/jira/browse/KUDU-1881 . Thanks for pointing this out.


http://gerrit.cloudera.org:8080/#/c/5840/6/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

Line 161:       if (!colType.matchesType(kuduColType)) {
> Shouldn't this use equals() instead of matchesType()?
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has uploaded a new patch set (#2).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Neither Impala nor the Kudu client
validates the schema immediately before reading, so Impala
may attempt to dereference pointers that aren't there. This
happens if a string column is dropped and then a new, non
string column is added with the old string column's name.

This change adds validation after opening a scan token to
ensure the projection schema matches the expected schema.
The scan is guaranteed to be valid while the KuduScanner is
open, even if the schema changes concurrently. This is the
earliest point and coursest granularity with which Impala
can detect and gracefully handle a schema mismatch.

Also handles the cases where columns were added or removed.

Testing: Adds test cases for these scenarios.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/runtime/descriptors.h
M common/thrift/generate_error_codes.py
M tests/query_test/test_kudu.py
7 files changed, 179 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/2
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 5: Code-Review+1

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5840/5//COMMIT_MSG
Commit Message:

PS5, Line 22: Kudu scanner
nit: I got confused thinking it was our KuduScanner. It's happening on the Kudu side right? If so, maybe change this to "Kudu's client scanner"?


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Hello Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5840

to look at the new patch set (#6).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner, but the issue is
that during planning, Impala assumes the types of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu client to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. A test
case was added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/query_test/test_kudu.py
2 files changed, 116 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/6
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 2:

After discussing with Dan on the Kudu team, we can make this simpler by checking at plan time because the kudu scan token encodes the col metadata and deserializing it will fail if the projection schema is no longer valid. The issue for us was that we did not check the Kudu col type matched our col type at plan time.

-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 1:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/5840/1//COMMIT_MSG
Commit Message:

Line 20: open, even if the schema changes concurrently.
It would probably be worth adding that this is the earliest point and the coarsest granularity at which we can successfully detect a schema change if there is any, without leaving room for error, just to assuage any concerns of whether we're doing this check too often.


http://gerrit.cloudera.org:8080/#/c/5840/1/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

PS1, Line 133: for (int i = 0; i < tuple_desc->slots().size(); ++i) {
As we spoke, if an ALTER TABLE is done and a column is removed, a crash might occur in this loop.

Another point I wanted to call out: What's the expected behavior when a column is added?
 1) Currently, this might pass and we will probably return the results for N-1 columns successfully. Is that what we want?

 2) Or should we fail the query?

 3) Or should we return the results and also give a warning asking the user to REFRESH?


My vote is for the 3rd option, but I'm open to others take on this.


http://gerrit.cloudera.org:8080/#/c/5840/1/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

PS1, Line 77: PrimitiveType KuduColTypeToPrimitiveType(
            :     const KuduColumnSchema::DataType& type) {
This may be an over optimization, but what if we had an array mapping KuduColumnSchemas to PrimitiveTypes? Since they both are just enums.

Just worried about the case where there are a large number of columns and we bounce on the switch statement for every one of them for every scan token. If you don't think it's too pressing, we can forego that.


http://gerrit.cloudera.org:8080/#/c/5840/1/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS1, Line 314: is type
nit: is of type


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Thomas Tauber-Marshall (Code Review)" <ge...@cloudera.org>.
Thomas Tauber-Marshall has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5840/2/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

Line 149:     if (slot->type().type != KuduColTypeToPrimitiveType(col.type())) {
This doesn't address the case where you removed a column and then added a new one with the same type but with other attributes changed, eg. name and nullability.

Do we keep our column names consistent with the column names in Kudu? If so, maybe we should log an error in that case, though probably not fail the query.


http://gerrit.cloudera.org:8080/#/c/5840/2/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS2, Line 317: KUDU_COL_MISMATCH
> nit: maybe call this 'KUDU_NUM_COLS_MISMATCH', else it sounds like it could
Maybe also explicitly mention refreshing here, as above.


http://gerrit.cloudera.org:8080/#/c/5840/2/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

Line 148: 
nit: extra line


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Sailesh Mukil, Alex Behm,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5840

to look at the new patch set (#9).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner (with the
exception of KUDU-1881), but the issue is that during
planning, Impala assumes the types/nullability of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu client to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. Test
cases were added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/common/kudu_test_suite.py
M tests/query_test/test_kudu.py
3 files changed, 227 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/9
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner (with the
exception of KUDU-1881), but the issue is that during
planning, Impala assumes the types/nullability of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu client to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. Test
cases were added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Reviewed-on: http://gerrit.cloudera.org:8080/5840
Reviewed-by: Matthew Jacobs <mj...@cloudera.com>
Tested-by: Impala Public Jenkins
---
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/common/kudu_test_suite.py
M tests/query_test/test_kudu.py
3 files changed, 227 insertions(+), 6 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Matthew Jacobs: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 10
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5840

to look at the new patch set (#7).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner (with the
exception of KUDU-1881), but the issue is that during
planning, Impala assumes the types/nullability of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu client to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. Test
cases were added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/query_test/test_kudu.py
2 files changed, 216 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/7
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 9: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5840/2/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

Line 149: 
> Good point, though I'll just leave this as a TODO for now because we don't 
I think Thomas is right and the query may crash if there is a difference in column nullability.

We do rely on the nullability. The SlotDescriptors are set to nullable or non-nullable which affects the memory layout of the tuple. If the tuples returned by Kudu are not 100% identical to what Impala expects, you may get non-deterministic wrong results (and maybe a crash). See the class comment in TupleDescriptor.java regarding the Kudu layout.

In response to one of your comments:
"After discussing with Dan on the Kudu team, we can make this simpler by checking at plan time because the kudu scan token encodes the col metadata and deserializing it will fail if the projection schema is no longer valid. The issue for us was that we did not check the Kudu col type matched our col type at plan time."

Will the Kudu-side validation check that the projection is 100% identical to what the scan token expects? Nullability etc. included?


http://gerrit.cloudera.org:8080/#/c/5840/6/fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
File fe/src/main/java/org/apache/impala/planner/KuduScanNode.java:

Line 161:       if (!colType.matchesType(kuduColType)) {
Shouldn't this use equals() instead of matchesType()?


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has uploaded a new patch set (#5).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner, but the issue is
that during planning, Impala assumes the types of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu scanner to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. A test
case was added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/query_test/test_kudu.py
2 files changed, 116 insertions(+), 2 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/5
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has uploaded a new patch set (#3).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Neither Impala nor the Kudu client
validates the schema immediately before reading, so Impala
may attempt to dereference pointers that aren't there. This
happens if a string column is dropped and then a new, non
string column is added with the old string column's name.

This change adds validation after opening a scan token to
ensure the projection schema matches the expected schema.
The scan is guaranteed to be valid while the KuduScanner is
open, even if the schema changes concurrently. This is the
latest point and coursest granularity with which Impala
can detect and gracefully handle a schema mismatch.

Also handles the cases where columns were added or removed.

Testing: Adds test cases for these scenarios.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M be/src/exec/kudu-scanner.cc
M be/src/exec/kudu-scanner.h
M be/src/exec/kudu-util.cc
M be/src/exec/kudu-util.h
M be/src/runtime/descriptors.h
M common/thrift/generate_error_codes.py
M tests/query_test/test_kudu.py
7 files changed, 183 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/3
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5840/5//COMMIT_MSG
Commit Message:

PS5, Line 22: Kudu scanner
> nit: I got confused thinking it was our KuduScanner. It's happening on the 
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 5
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 2:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5840/2/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

PS2, Line 140:   } else if (scanner_schema.num_columns() > tuple_desc->slots().size()) {
             :     state_->LogError(ErrorMsg::Init(TErrorCode::KUDU_COL_MISMATCH,
             :         scan_node_->table_->name(), scanner_schema.num_columns(),
             :         tuple_desc->slots().size()));
After thinking about this more, I don't think it should be possible to find more than we expected in planning. The reason is that in planning we get a scan token which knows about the expected columns. That gets deserialized here, so there shouldn't be any new cols that could be added to it. Even if new cols were added to the table. I made this a DCHECK but I'll double check my thinking with the Kudu team.


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 2:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/5840/2/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

Line 149:     if (slot->type().type != KuduColTypeToPrimitiveType(col.type())) {
> This doesn't address the case where you removed a column and then added a n
Good point, though I'll just leave this as a TODO for now because we don't rely on either of these (in fact I think we won't even know about nullability in our tuple descriptor because I don't believe it gets returned by the catalog).


http://gerrit.cloudera.org:8080/#/c/5840/2/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS2, Line 317: KUDU_COL_MISMATCH
> Maybe also explicitly mention refreshing here, as above.
Done


PS2, Line 317: KUDU_COL_MISMATCH
> nit: maybe call this 'KUDU_NUM_COLS_MISMATCH', else it sounds like it could
Done


http://gerrit.cloudera.org:8080/#/c/5840/2/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

Line 148: 
> nit: extra line
Done


PS2, Line 162: cursor.execute("SELECT * FROM %s.foo" % (unique_database))
             :     assert cursor.fetchall() == [(0, )]
> Should we assert that the warning is printed here too?
It's a good idea but I don't think we can with the cursor mechanism. It'd be a good thing to add to the cursor, so I'll file a JIRA and leave a TODO here.


PS2, Line 191: Column 's' not found in kudu table impala::test_kudu_col_removed
> This is not the same as the 'KUDU_COL_MISMATCH' error. Does that mean this 
Yes you're right. This gets caught during planning. I still added the validation in the scanner to be safe though because it's possible for the schema to change after planning but before the scanner is opened. If that happens, the validation after the scanner opening will catch that. I think it's too hard to catch that case in a test though. Ideally we'd have some test that repeatedly makes DDL changes and issues queries, some sort of stress, but I think that's something we should consider for the future.


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Sailesh Mukil (Code Review)" <ge...@cloudera.org>.
Sailesh Mukil has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 2:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/5840/2/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS2, Line 317: KUDU_COL_MISMATCH
nit: maybe call this 'KUDU_NUM_COLS_MISMATCH', else it sounds like it could be a type mismatch too.


http://gerrit.cloudera.org:8080/#/c/5840/2/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

PS2, Line 162: cursor.execute("SELECT * FROM %s.foo" % (unique_database))
             :     assert cursor.fetchall() == [(0, )]
Should we assert that the warning is printed here too?


PS2, Line 191: Column 's' not found in kudu table impala::test_kudu_col_removed
This is not the same as the 'KUDU_COL_MISMATCH' error. Does that mean this gets caught earlier on?

In which case, does that extra code comparing the expected number of columns with the actual number of columns actually help?


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Hello Thomas Tauber-Marshall, Sailesh Mukil,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/5840

to look at the new patch set (#8).

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................

IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Creating a table in Impala, changing the column schema
outside of Impala, and then reading again in Impala may
result in a crash. Impala may attempt to dereference
pointers that aren't there. This happens if a string column
is dropped and then a new, non string column is added with
the old string column's name.

The Kudu scan token contains the projection schema, and that
is validated when opening the Kudu scanner (with the
exception of KUDU-1881), but the issue is that during
planning, Impala assumes the types/nullability of columns
haven't changed when creating the scan tokens. This is fixed
by adding a check when creating the scan token, and failing
the query if the column types changed. Impala then relies on
the Kudu client to properly validate that the underlying
schema is still represented by the scan token, and that
deserialization will fail if it no longer matches. Test
cases were added for this particular crash scenario, which now
fails during planning as expected. This does not attempt to
validate the Kudu client validation at deserialization time,
though that would be valuable coverage to add in the future.

Columns being removed don't produce a crash; the query fails
gracefully. A test was added for this case.

Columns being added should not affect this scenario, but a
test was added anyway.

Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
---
M fe/src/main/java/org/apache/impala/planner/KuduScanNode.java
M tests/common/kudu_test_suite.py
M tests/query_test/test_kudu.py
3 files changed, 227 insertions(+), 6 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/40/5840/8
-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Dan Hecht (Code Review)" <ge...@cloudera.org>.
Dan Hecht has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5840/1/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

Line 150:       "Unable to deserialize scan token");
a quick comment explaining why this is needed would be helpful.


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/5840/1//COMMIT_MSG
Commit Message:

Line 20: open, even if the schema changes concurrently.
> It would probably be worth adding that this is the earliest point and the c
Done


http://gerrit.cloudera.org:8080/#/c/5840/1/be/src/exec/kudu-scanner.cc
File be/src/exec/kudu-scanner.cc:

PS1, Line 133: for (int i = 0; i < tuple_desc->slots().size(); ++i) {
> As we spoke, if an ALTER TABLE is done and a column is removed, a crash mig
Really good point.

I handled the cases where there are more or fewer cols than expected. If there are fewer, I fail the query. If there are more, I add a warning but then continue. I added new test cases.


Line 150:       "Unable to deserialize scan token");
> a quick comment explaining why this is needed would be helpful.
Done


http://gerrit.cloudera.org:8080/#/c/5840/1/be/src/exec/kudu-util.cc
File be/src/exec/kudu-util.cc:

PS1, Line 77: PrimitiveType KuduColTypeToPrimitiveType(
            :     const KuduColumnSchema::DataType& type) {
> This may be an over optimization, but what if we had an array mapping KuduC
Good idea but we only use this on a per-partition basis, so not really on the hot path. Lets change it if it ever shows up in perf.


http://gerrit.cloudera.org:8080/#/c/5840/1/common/thrift/generate_error_codes.py
File common/thrift/generate_error_codes.py:

PS1, Line 314: is type
> nit: is of type
Done


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 9: Code-Review+2

fixed spelling and carry +2

-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 9
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Matthew Jacobs (Code Review)" <ge...@cloudera.org>.
Matthew Jacobs has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 6: Code-Review+1

carrying Sailesh's +1

-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 6
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 8: Code-Review+2

(1 comment)

http://gerrit.cloudera.org:8080/#/c/5840/8/tests/common/kudu_test_suite.py
File tests/common/kudu_test_suite.py:

Line 88:     names. This must be kept in sync with KuduUtil.getDefaultCreateKuduTableName() in the
name


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4828: Alter Kudu schema outside Impala may crash on read

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-4828: Alter Kudu schema outside Impala may crash on read
......................................................................


Patch Set 7:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/5840/7/tests/query_test/test_kudu.py
File tests/query_test/test_kudu.py:

Line 101:   def test_kudu_col_changed(self, cursor, kudu_client, unique_database):
Tests lgtm, but there seems to be a lot of repeated boilerplate stuff. Maybe you can take another pass and see if it makes sense to factor out some parts to condense the code?


Line 106:     kudu_tbl_name = "impala::%s.foo" % unique_database
seems worth factoring out this function and adding a comment that it must be kept in sync with the FE table-name-generation behavior


-- 
To view, visit http://gerrit.cloudera.org:8080/5840
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I6d43f5bb9811e728ad592933066d006c8fb4553a
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Dan Hecht <dh...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <mj...@cloudera.com>
Gerrit-Reviewer: Sailesh Mukil <sa...@cloudera.com>
Gerrit-Reviewer: Thomas Tauber-Marshall <tm...@cloudera.com>
Gerrit-HasComments: Yes