You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Alex Behm (Code Review)" <ge...@cloudera.org> on 2016/10/12 04:01:03 UTC

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Alex Behm has uploaded a new change for review.

  http://gerrit.cloudera.org:8080/4693

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................

IMPALA-3943: Do not throw scan errors for empty Parquet files.

For Parquet files with no row groups but with num_rows=0 in the
file footer the Parquet scanner returns an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions which used to accept such files
irrespective of abort_on_error.

This patch restores the previous behavior and adds tests.

Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
---
M be/src/exec/hdfs-parquet-scanner.cc
M testdata/data/README
A testdata/data/zero_rows_one_row_group.parquet
A testdata/data/zero_rows_zero_row_groups.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
M tests/query_test/test_scanners.py
6 files changed, 65 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/4693/1
-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newchange
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 2: Code-Review+2

Thanks for your speedy review, Marcel!

-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4693/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 413:     if (row_group.num_rows == 0 || file_metadata_.num_rows == 0) continue;
> You could have a corrupt file with num_rows=0 but with some row groups. Wit
So I'm right in thinking that the input file is corrupt rather than a valid empty file? I feel like we should call that out somehow, either with a warning or at least a comment here.


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Hello Marcel Kornacker,

I'd like you to reexamine a change.  Please visit

    http://gerrit.cloudera.org:8080/4693

to look at the new patch set (#2).

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................

IMPALA-3943: Do not throw scan errors for empty Parquet files.

For Parquet files with no row groups but with num_rows=0 in the
file footer the Parquet scanner returns an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions which used to accept such files.

This patch restores the previous behavior and adds tests.

Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
---
M be/src/exec/hdfs-parquet-scanner.cc
M testdata/data/README
A testdata/data/zero_rows_one_row_group.parquet
A testdata/data/zero_rows_zero_row_groups.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
M tests/query_test/test_scanners.py
6 files changed, 65 insertions(+), 1 deletion(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/93/4693/2
-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4693/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 413:     if (row_group.num_rows == 0 || file_metadata_.num_rows == 0) continue;
> How do we even get here if the checks on line 411 and line 413 pass? Line 4
You could have a corrupt file with num_rows=0 but with some row groups. Without this check we might return different results for a select * than a select count(*), so this chane is primarily for consistency,


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4693/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 413:     if (row_group.num_rows == 0 || file_metadata_.num_rows == 0) continue;
> What should we do with abort_on_error?
I actually find it really confusing because without context it's almost impossible to understand why the second condition isn't redundant. I'm fine with adding a comment.

Agree it may be easiest to just swallow the error here if we think that the inconsistency in the file is harmless, that other tools accept them and we'll see them in the wild.


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4693/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 413:     if (row_group.num_rows == 0 || file_metadata_.num_rows == 0) continue;
> So I'm right in thinking that the input file is corrupt rather than a valid
What should we do with abort_on_error?

I can add a comment in a follow-on patch, but the I don't think this ode here is particularly confusing, it seems to make sense.


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Tim Armstrong (Code Review)" <ge...@cloudera.org>.
Tim Armstrong has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4693/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 413:     if (row_group.num_rows == 0 || file_metadata_.num_rows == 0) continue;
How do we even get here if the checks on line 411 and line 413 pass? Line 411 should cover the case where there are no row groups, and the check on line 413 should cover the case when the row group is empty. 

Is the row group metadata just corrupt in these files?


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/4693/3/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 413:     if (row_group.num_rows == 0 || file_metadata_.num_rows == 0) continue;
> I actually find it really confusing because without context it's almost imp
http://gerrit.cloudera.org:8080/4696


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <ta...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has submitted this change and it was merged.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


IMPALA-3943: Do not throw scan errors for empty Parquet files.

For Parquet files with no row groups but with num_rows=0 in the
file footer the Parquet scanner returns an error indicating
that the file is invalid. This behavior is a regression from
previous Impala versions which used to accept such files.

This patch restores the previous behavior and adds tests.

Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Reviewed-on: http://gerrit.cloudera.org:8080/4693
Reviewed-by: Alex Behm <al...@cloudera.com>
Tested-by: Internal Jenkins
---
M be/src/exec/hdfs-parquet-scanner.cc
M testdata/data/README
A testdata/data/zero_rows_one_row_group.parquet
A testdata/data/zero_rows_zero_row_groups.parquet
A testdata/workloads/functional-query/queries/QueryTest/parquet-zero-rows.test
M tests/query_test/test_scanners.py
6 files changed, 65 insertions(+), 1 deletion(-)

Approvals:
  Internal Jenkins: Verified
  Alex Behm: Looks good to me, approved



-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: merged
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 3
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Alex Behm (Code Review)" <ge...@cloudera.org>.
Alex Behm has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4693/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 250:     succeeds without errors irrespective of abort_on_error."""
> why even mention abort_on_error? i think that's still a legal parquet file.
Removed.


Line 256:     check_call(['hdfs', 'dfs', '-copyFromLocal',
> out of scope, but would be better to have this set up as a permanent table 
Agree, out of scope. I think it is also debatable.

Imo, it's better for test infra stability to have have such one-off tables be self-contained in the test without "polluting" the snapshot and data loading paths. Think about it this way: Adding to the functional suite would require much more script/conf changes in several different files and potentially cause hiccups with stale snapshots. Doesn't seem worth it if only this single test needs those files.


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Internal Jenkins (Code Review)" <ge...@cloudera.org>.
Internal Jenkins has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 2: Verified+1

-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Internal Jenkins
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-3943: Do not throw scan errors for empty Parquet files.

Posted by "Marcel Kornacker (Code Review)" <ge...@cloudera.org>.
Marcel Kornacker has posted comments on this change.

Change subject: IMPALA-3943: Do not throw scan errors for empty Parquet files.
......................................................................


Patch Set 1: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/4693/1/tests/query_test/test_scanners.py
File tests/query_test/test_scanners.py:

Line 250:     succeeds without errors irrespective of abort_on_error."""
why even mention abort_on_error? i think that's still a legal parquet file.


Line 256:     check_call(['hdfs', 'dfs', '-copyFromLocal',
out of scope, but would be better to have this set up as a permanent table in the 'functional' db.


-- 
To view, visit http://gerrit.cloudera.org:8080/4693
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I50ac3df6ff24bc5c384ef22e0f804a5132adb62e
Gerrit-PatchSet: 1
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Alex Behm <al...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <ma...@cloudera.com>
Gerrit-HasComments: Yes