You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Michael Smith (Code Review)" <ge...@cloudera.org> on 2023/05/16 23:03:28 UTC

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Michael Smith has uploaded this change for review. ( http://gerrit.cloudera.org:8080/19898


Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixeus writing an empty parquet page when a page fills at the same time
that its dictionary fills. We had logic to detect both, but they were
handled separately and could trigger sequentially, resulting in creating
two new pages in a row. Now if a page becomes full, we also check
whether the dictionary is full and flush it if needed.

Dictionary size is checked after insert fails because if the item
already exists in the dictionary then we don't need a new page.

Testing: manually tested by copying a table with an empty page.

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
2 files changed, 13 insertions(+), 3 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/19898/1
-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Smith <mi...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19898

to look at the new patch set (#6).

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.

When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect the page was
full and create a new page. It would then detect the dictionary is full
and create another page, resulting in an empty page.

Parquet readers like Hive error if they encounter an empty page. This
patch attempts to make it impossible to generate an empty page by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is checked on FinalizeCurrentPage so whenever a page is written, we also
flush the dictionary if full.

Addresses clang-tidy by adding override in source files.

Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
  This seems useful as a theoretical corner-case; it currently writes
  the too-large value to the page anyway, but if we ever start checking
  whether the first value will fit the page this could become an issue.

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/empty_parquet_page_source_impala10186/data.csv
M tests/query_test/test_parquet_page_index.py
6 files changed, 102,267 insertions(+), 70 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/19898/6
-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 5: Code-Review+1

(2 comments)

Thanks for fixing the issue, and also simplifying the writer logic!

http://gerrit.cloudera.org:8080/#/c/19898/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/19898/4/be/src/util/dict-encoding.h@237
PS4, Line 237:     enum { INVALID_INDEX = 40000 };
Maybe add a comment here that changing this value might need tuning tests in test_parquet_page_index.py


http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py@452
PS4, Line 452: vector
Unfortunately we modify the vector all around in this file, but we should deepcopy the vector and modify the copy of it, so other tests won't be affected.



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 17:19:12 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19898

to look at the new patch set (#5).

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.

When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect the page was
full and create a new page. It would then detect the dictionary is full
and create another page, resulting in an empty page.

Parquet readers like Hive error if they encounter an empty page. This
patch attempts to make it impossible to generate an empty page by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is checked on FinalizeCurrentPage so whenever a page is written, we also
flush the dictionary if full.

Addresses clang-tidy by adding override in source files.

Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
  This seems useful as a theoretical corner-case; it currently writes
  the too-large value to the page anyway, but if we ever start checking
  whether the first value will fit the page this could become an issue.

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/empty_parquet_page_source_impala10186/data.csv
M tests/query_test/test_parquet_page_index.py
6 files changed, 102,266 insertions(+), 70 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/19898/5
-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 8:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9330/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 May 2023 06:38:10 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/13061/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 May 2023 23:25:35 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/13070/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 00:05:25 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 7:

Restarted the GVO as the error seemed irrelevant.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 May 2023 06:38:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 8: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 May 2023 06:38:09 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 6:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/19898/4/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/19898/4/be/src/util/dict-encoding.h@237
PS4, Line 237:     /// Changing this value will require re-tuning test_parquet_page_index.py.
> Maybe add a comment here that changing this value might need tuning tests i
Done


http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py@452
PS4, Line 452: vector
> I thought we got a new copy for every test run. I'll try to confirm that.
Every test invocation gets its own version of vector via https://github.com/apache/impala/blob/master/tests/conftest.py#L219-L242.



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 17:34:15 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 2:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/13069/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 May 2023 23:38:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 21:13:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.

When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect the page was
full and create a new page. It would then detect the dictionary is full
and create another page, resulting in an empty page.

Parquet readers like Hive error if they encounter an empty page. This
patch attempts to make it impossible to generate an empty page by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is checked on FinalizeCurrentPage so whenever a page is written, we also
flush the dictionary if full.

Addresses clang-tidy by adding override in source files.

Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
  This seems useful as a theoretical corner-case; it currently writes
  the too-large value to the page anyway, but if we ever start checking
  whether the first value will fit the page this could become an issue.

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Reviewed-on: http://gerrit.cloudera.org:8080/19898
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/empty_parquet_page_source_impala10186/data.csv
M tests/query_test/test_parquet_page_index.py
6 files changed, 102,267 insertions(+), 70 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 9
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/13075/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 17:56:17 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 5:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@537
PS3, Line 537:         // Shouldn't happen on an empty page as it should be sized for bytes_needed.
> A dcheck could be added for current_page_->header.uncompressed_page_size ==
Done


http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@928
PS3, Line 928: easons - because the current page is not bi
> after the patch we try again only once
Done


http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@946
PS3, Line 946:     }
> Can you add a comment about this being a second try and that it needs to su
Done


http://gerrit.cloudera.org:8080/#/c/19898/3/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19898/3/testdata/datasets/functional/functional_schema_template.sql@4160
PS3, Line 4160: ====
> a name like empty_parquet_page_source and a comment with the Jira number co
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 16:27:56 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9329/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 21:13:39 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 7: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/9329/


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 May 2023 02:27:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19898

to look at the new patch set (#3).

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.

When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect the page was
full and create a new page. It would then detect the dictionary is full
and create another page, resulting in an empty page.

Parquet readers like Hive error if they encounter an empty page. This
patch attempts to make it impossible to generate an empty page by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is checked on FinalizeCurrentPage so whenever a page is written, we also
flush the dictionary if full.

Addresses clang-tidy by adding override in source files.

Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
  This seems useful as a theoretical corner-case; it currently writes
  the too-large value to the page anyway, but if we ever start checking
  whether the first value will fit the page this could become an issue.

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/empty_parquet_page/data.csv
M tests/query_test/test_parquet_page_index.py
6 files changed, 102,262 insertions(+), 69 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/19898/3
-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 3:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/9325/ DRY_RUN=true


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 00:01:46 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 1:

(2 comments)

Thanks for fixing this issue, Michael.

Based on your investigation the following statements reproduce the bug:

set PARQUET_PAGE_ROW_COUNT_LIMIT=20000;

create table empty_parquet_page
stored as parquet
as select distinct l_orderkey as n from tpch_parquet.lineitem order by l_orderkey limit 41000;

Could you please add an e2e for this? test_parquet_page_index.py is a good place for this and has examples about verifying the Page index.

http://gerrit.cloudera.org:8080/#/c/19898/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19898/1//COMMIT_MSG@9
PS1, Line 9: Fixeus
nit: Fixes


http://gerrit.cloudera.org:8080/#/c/19898/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/19898/1/be/src/util/dict-encoding.h@206
PS1, Line 206: inline
nit: inline keyword is redundant here, see: https://en.cppreference.com/w/cpp/language/inline

"A function defined entirely inside a class/struct/union definition, whether it's a member function or a non-member friend function, is implicitly an inline function unless it is attached to a named module (since C++20)."



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 May 2023 09:58:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 1:

I haven't been able to generate a new test case that actually triggers this. The table I used isn't mine to publish.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 1
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 16 May 2023 23:56:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 6: Code-Review+2

(1 comment)

Thanks, Michael! The change looks great!

http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py@452
PS4, Line 452: vector
> Every test invocation gets its own version of vector via https://github.com
Interesting, I vaguely remember seeing weird test errors due to not deepcopying the vector. Also, other test files have a lot of deepcopy.

That said, I couldn't reproduce any errors by not deepcopying, and this file already doesn't deepcopy, so I'm OK with it.



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 6
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 18:23:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 3: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 05:20:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 5:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py
File tests/query_test/test_parquet_page_index.py:

http://gerrit.cloudera.org:8080/#/c/19898/4/tests/query_test/test_parquet_page_index.py@452
PS4, Line 452: vector
> Unfortunately we modify the vector all around in this file, but we should d
I thought we got a new copy for every test run. I'll try to confirm that.



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 17:25:48 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has removed a vote on this change.

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Removed Verified-1 by Impala Public Jenkins <im...@cloudera.com>
-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 7
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 8: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 8
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 19 May 2023 12:02:41 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Hello Gabor Kaszab, Zoltan Borok-Nagy, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/19898

to look at the new patch set (#2).

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................

IMPALA-10186: Fix writing empty parquet page

Fixes writing an empty parquet page when a page fills (or reaches
parquet_page_row_count_limit) at the same time that its dictionary
fills.

When a page filled (or reached parquet_page_row_count_limit) at the same
time that the dictionary filled, Impala would first detect the page was
full and create a new page. It would then detect the dictionary is full
and create another page, resulting in an empty page.

Parquet readers like Hive error if they encounter an empty page. This
patch attempts to make it impossible to generate an empty page by
reworking AppendRow and adding DCHECKs for empty pages. Dictionary size
is checked on FinalizeCurrentPage so whenever a page is written, we also
flush the dictionary if full.

Testing:
- new test for full page size reached with full dictionary
- new test for parquet_page_row_count_limit with full dictionary
- new test for parquet_page_row_count_limit followed by large value.
  This seems useful as a theoretical corner-case; it currently writes
  the too-large value to the page anyway, but if we ever start checking
  whether the first value will fit the page this could become an issue.

Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/util/dict-encoding.h
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/empty_parquet_page/data.csv
M tests/query_test/test_parquet_page_index.py
6 files changed, 102,255 insertions(+), 62 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/98/19898/2
-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Michael Smith (Code Review)" <ge...@cloudera.org>.
Michael Smith has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 2:

(3 comments)

I ended up doing a larger refactor to address the TODO around removing the loop.

http://gerrit.cloudera.org:8080/#/c/19898/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/19898/1//COMMIT_MSG@9
PS1, Line 9: Fixes 
> nit: Fixes
Done


http://gerrit.cloudera.org:8080/#/c/19898/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/19898/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc@918
PS2, Line 918:       bytes_needed = BytesNeededFor(value);
This is not currently an issue, and may actually represent a change in behavior. I couldn't generate a test scenario where ProcessValue would fail for the first value on a new page.


http://gerrit.cloudera.org:8080/#/c/19898/1/be/src/util/dict-encoding.h
File be/src/util/dict-encoding.h:

http://gerrit.cloudera.org:8080/#/c/19898/1/be/src/util/dict-encoding.h@206
PS1, Line 206:   for 
> nit: inline keyword is redundant here, see: https://en.cppreference.com/w/c
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 2
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 17 May 2023 23:20:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 3: Code-Review+1

(4 comments)

http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@537
PS3, Line 537:         return false;
A dcheck could be added for current_page_->header.uncompressed_page_size == 0 as this shouldn't occur in empty pages


http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@928
PS3, Line 928: so we may need to try again more than once.
after the patch we try again only once


http://gerrit.cloudera.org:8080/#/c/19898/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@946
PS3, Line 946: 
Can you add a comment about this being a second try and that it needs to succeed?


http://gerrit.cloudera.org:8080/#/c/19898/3/testdata/datasets/functional/functional_schema_template.sql
File testdata/datasets/functional/functional_schema_template.sql:

http://gerrit.cloudera.org:8080/#/c/19898/3/testdata/datasets/functional/functional_schema_template.sql@4160
PS3, Line 4160: empty_parquet_page
a name like empty_parquet_page_source and a comment with the Jira number could make the role of this table clearer



-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 3
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 12:32:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10186: Fix writing empty parquet page

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/19898 )

Change subject: IMPALA-10186: Fix writing empty parquet page
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/13072/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/19898
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I90d30d958f07c6289a1beba1b5df1ab3d7213799
Gerrit-Change-Number: 19898
Gerrit-PatchSet: 5
Gerrit-Owner: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Gabor Kaszab <ga...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Michael Smith <mi...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 18 May 2023 16:49:26 +0000
Gerrit-HasComments: No