You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org> on 2021/07/12 16:38:34 UTC

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Zoltan Borok-Nagy has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17678


Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................

IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

I introduced a new utility class, ParquetDataConverter which does
the data conversion. It also helps to decide whether data conversion
is needed or not.

Parquet column stats reader is also updated to convert the decimal
values.

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
---
M be/src/exec/parquet/hdfs-parquet-scanner.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
M testdata/workloads/functional-query/queries/QueryTest/iceberg-alter.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
16 files changed, 403 insertions(+), 149 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17678/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................

IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

A new utility class has been introduced, ParquetDataConverter which does
the data conversion. It also helps to decide whether data conversion
is needed or not.

NULL values are returned in case of overflows. This behavior is
consistent with Hive.

Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.

The status of different filterings:
 * dictionary filtering: disabled for columns that need conversion
 * runtime bloom filters: work on the converted values
 * runtime min/max filters: work on the converted values

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Reviewed-on: http://gerrit.cloudera.org:8080/17678
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Csaba Ringhofer <cs...@cloudera.com>
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
M testdata/workloads/functional-query/queries/QueryTest/iceberg-alter.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
16 files changed, 718 insertions(+), 155 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Csaba Ringhofer: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 8
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 3:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9085/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 14:32:16 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9088/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 17:08:37 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 1:

Build Failed 

https://jenkins.impala.io/job/gerrit-code-review-checks/9075/ : Initial code review checks failed. See linked job for details on the failure.


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 1
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 16:47:50 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 4: Code-Review+2

(4 comments)

Some minor comments.

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@25
PS2, Line 25: Parquet column stats reader is also updated to convert the decimal
            : values.
> Bloom filters work for decimals. I extended the commit message.
Are you sure that they work for decimals? https://github.com/apache/impala/commit/817ca5920d93442b591a00bb8b280bd4e05f470c writes in the commit message that decimals are not supported. Generally bloom filters would not work well with "lossy" type conversions - the bloom filters contains hashes of the type written to the files, while the eq predicate is evaluated against Impala's type. E.g. if the Parquet file contains 0.9, but the precision is 0, it will be converted to 1. The decimal_col = 1 predicate can't use the bloom filter, because many values will be rounded to 1 but they will have totally different hashes.


http://gerrit.cloudera.org:8080/#/c/17678/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/4//COMMIT_MSG@35
PS4, Line 35: It's OK for the filters to work on the converted values because we'd
            : use those values later anyway
This seems redundant as line 26/27/28 means nearly the same.


http://gerrit.cloudera.org:8080/#/c/17678/4/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test:

http://gerrit.cloudera.org:8080/#/c/17678/4/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@16
PS4, Line 16: 23.333
Maybe change this to 23.633 too?


http://gerrit.cloudera.org:8080/#/c/17678/4/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@43
PS4, Line 43: 23.33
Maybe change this to 23.63 too?



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 16:32:29 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Jul 2021 12:31:56 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 2:

(1 comment)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test:

http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@58
PS2, Line 58: ====
nit. Suggest to add an INSERT query inserting a value like 100000000.9999 that fits DECIMAL(32,8) and then select the value back.



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 13:50:34 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7296/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 19:58:18 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 7: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 14 Jul 2021 02:00:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 4: Code-Review+2

(2 comments)

Looks great!

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exprs/decimal-operators-ir.cc
File be/src/exprs/decimal-operators-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exprs/decimal-operators-ir.cc@119
PS2, Line 119: /*round=*/false
> I chose to preserve the old behavior to be on the safe side.
Sounds like a good idea. 

Rounding up or down can be complicated after seeing your new tests. If all values in a decimal column are always kept consistent to the column type, the matter probably is simpler.


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test:

http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@58
PS2, Line 58: ----
> I added it a bit later. I cannot add it here because in the subsequent test
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 15:14:43 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 4:

(4 comments)

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@25
PS2, Line 25: Parquet column stats reader is also updated to convert the decimal
            : values.
> Are you sure that they work for decimals? https://github.com/apache/impala/
Sorry, I was thinking about the runtime bloom filters.


http://gerrit.cloudera.org:8080/#/c/17678/4//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/4//COMMIT_MSG@35
PS4, Line 35: It's OK for the filters to work on the converted values because we'd
            : use those values later anyway
> This seems redundant as line 26/27/28 means nearly the same.
Done


http://gerrit.cloudera.org:8080/#/c/17678/4/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test:

http://gerrit.cloudera.org:8080/#/c/17678/4/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@16
PS4, Line 16: 23.333
> Maybe change this to 23.633 too?
Done


http://gerrit.cloudera.org:8080/#/c/17678/4/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@43
PS4, Line 43: 23.33
> Maybe change this to 23.63 too?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 16:49:28 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17678

to look at the new patch set (#5).

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................

IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

A new utility class has been introduced, ParquetDataConverter which does
the data conversion. It also helps to decide whether data conversion
is needed or not.

NULL values are returned in case of overflows. This behavior is
consistent with Hive.

Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.

The status of different filterings:
 * dictionary filtering: disabled for columns that need conversion
 * runtime bloom filters: work on the converted values
 * runtime min/max filters: work on the converted values

It's OK for the filters to work on the converted values because we'd
use those values later anyway.

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
M testdata/workloads/functional-query/queries/QueryTest/iceberg-alter.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
16 files changed, 717 insertions(+), 155 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17678/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 5:

PS5 fixes a bug: during the batched decoding of the page index, the values were not converted.

I think there's still a bug with timestamp values, opened IMPALA-10793 for that.


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 16:39:42 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Csaba Ringhofer (Code Review)" <ge...@cloudera.org>.
Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 2: Code-Review+1

(8 comments)

lgtm, only have small comments

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@25
PS2, Line 25: Parquet column stats reader is also updated to convert the decimal
            : values.
Can you mention how the different filters handle this conversion? e.g.:
dictionary filter: disabled for all column that need conversion
bloom filter: not yet supported for Decimal 
min-max filter: supported, because the conversion preservers the ordering of values, so if min<=val<=max then round(min)<=round(val)<=round(max) will be also true + stats that are converted to NULL are ignored

These may seem self-evident, but it took some time to convince myself that we handle all these cases correctly :)


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc
File be/src/exec/parquet/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@24
PS2, Line 24: #include "exec/parquet/parquet-data-converter.h"
            : 
            : #include "common/names.h"
> The order is not in ascending order.
common/names.h is usually an exception for this, I am not sure why


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@353
PS2, Line 353:   // No need for an extra buffer, we can do the conversion in-place.
Can you mention why we can handle this correctly? See my comment in the commit message.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h
File be/src/exec/parquet/parquet-data-converter.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h@32
PS2, Line 32: ParquetDataConverter
Thanks for moving this logic out for parquet-column-readers.cc!


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test:

http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@7
PS2, Line 7: 333
Can you change to a value that will be also rounded up at some scale, e.g. 23.633? This would allow a test for like the one I mention at line 99 but for max_value


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@60
PS2, Line 60: ALTER TABLE test_dec CHANGE COLUMN d d DECIMAL(32, 8);
This ALTER TABLE is not needed.


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@77
PS2, Line 77: ALTER TABLE test_dec CHANGE COLUMN d d DECIMAL(10, 2);
This ALTER TABLE is not needed.


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@99
PS2, Line 99: -23.7
Can you add a test where search for -23.7 specifically? This would cover the case when the value we look for is lower than the min_value in the stats, so it would ensure that the stats are rounded correctly.



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 23:38:02 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9076/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 17:11:45 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17678

to look at the new patch set (#2).

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................

IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

I introduced a new utility class, ParquetDataConverter which does
the data conversion. It also helps to decide whether data conversion
is needed or not.

We return NULL values in case of overflows. This behavior is consistent
with Hive.

Parquet column stats reader is also updated to convert the decimal
values.

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
M testdata/workloads/functional-query/queries/QueryTest/iceberg-alter.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
16 files changed, 583 insertions(+), 149 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17678/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17678

to look at the new patch set (#3).

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................

IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

A new utility class has been introduced, ParquetDataConverter which does
the data conversion. It also helps to decide whether data conversion
is needed or not.

NULL values are returned in case of overflows. This behavior is
consistent with Hive.

Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.

The status of different filterings:
 * dictionary filtering: disabled for columns that need conversion
 * runtime bloom filters: work on the converted values
 * runtime min/max filters: work on the converted values

It's OK for the filters to work on the converted values because we'd
use those values later anyway.

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
M testdata/workloads/functional-query/queries/QueryTest/iceberg-alter.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
16 files changed, 704 insertions(+), 152 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17678/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Hello Qifan Chen, Csaba Ringhofer, Impala Public Jenkins, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/17678

to look at the new patch set (#6).

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................

IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

IMPALA-7087 is about reading Parquet decimal columns with lower
precision/scale than table metadata.
IMPALA-8131 is about reading Parquet decimal columns with higher
scale than table metadata.

Both are resolved by this patch. It reuses some parts from an
earlier change request from Sahil Takiar:
https://gerrit.cloudera.org/#/c/12163/

A new utility class has been introduced, ParquetDataConverter which does
the data conversion. It also helps to decide whether data conversion
is needed or not.

NULL values are returned in case of overflows. This behavior is
consistent with Hive.

Parquet column stats reader is also updated to convert the decimal
values. The stats reader is used to evaluate min/max conjuncts. It
works well because later we also evaluate the conjuncts on the
converted values anyway.

The status of different filterings:
 * dictionary filtering: disabled for columns that need conversion
 * runtime bloom filters: work on the converted values
 * runtime min/max filters: work on the converted values

This patch also enables schema evolution of decimal columns of Iceberg
tables.

Testing:
 * added e2e tests

Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-stats.cc
M be/src/exec/parquet/parquet-column-stats.h
A be/src/exec/parquet/parquet-data-converter.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M be/src/exprs/decimal-operators-ir.cc
M be/src/runtime/decimal-value.h
M be/src/runtime/decimal-value.inline.h
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M testdata/data/README
A testdata/data/binary_decimal_precision_and_scale_widening.parquet
M testdata/workloads/functional-query/queries/QueryTest/iceberg-alter.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-negative.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
A testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-widening.test
M tests/query_test/test_scanners.py
16 files changed, 718 insertions(+), 155 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/78/17678/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9087/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 16:57:14 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Qifan Chen (Code Review)" <ge...@cloudera.org>.
Qifan Chen has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 2:

(12 comments)

Completed the 1st pass of code review. Looks good!

Will find time to look at the tests.

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@18
PS2, Line 18: I 
nit. Put the sentence in passive voice?


http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@22
PS2, Line 22: We
nit. Same as above.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-readers.cc@740
PS2, Line 740:       // The value is invalid but execution should continue - set the null indicator and
nit. 'The value or the conversion'


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.h
File be/src/exec/parquet/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.h@351
PS2, Line 351:   template <typename DecimalType>
nit. May need to add a comment.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.h@352
PS2, Line 352: bool DecodeDecimal(const std::string& stat_value,
             :       DecimalType* slot) const;
nit. It seems that they can appear in one line (instead of two)?


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc
File be/src/exec/parquet/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@24
PS2, Line 24: #include "exec/parquet/parquet-data-converter.h"
            : 
            : #include "common/names.h"
The order is not in ascending order.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@348
PS2, Line 348: ParquetDataConverter<DecimalType, true> data_converter(&element_, &col_type_);
This line can be moved between line 351 and 352 when the converter is needed.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@352
PS2, Line 352: if (!data_converter.NeedsConversion())
nit. LIKELY()?


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h
File be/src/exec/parquet/parquet-data-converter.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h@49
PS2, Line 49: ParquetTimestampDecoder
nit. const and &?


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h@179
PS2, Line 179: /*round=*/true
nit. Just wonder why the round flag is set to true here.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exprs/decimal-operators-ir.cc
File be/src/exprs/decimal-operators-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exprs/decimal-operators-ir.cc@119
PS2, Line 119: /*round=*/false
nit. I wonder if there is a situation in which 'round' should be set to true. This could be highly data/application dependent. 

If so, maybe we need to new query option?


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/runtime/decimal-value.inline.h
File be/src/runtime/decimal-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/runtime/decimal-value.inline.h@144
PS2, Line 144: if (delta_scale < 0) 
nit. It seems this IF can be removed.



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 2
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Comment-Date: Mon, 12 Jul 2021 19:39:40 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17678 )

Change subject: IMPALA-7087, IMPALA-8131: Read decimals from Parquet files with different precision/scale
......................................................................


Patch Set 3:

(19 comments)

Thanks for all the comments!

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@18
PS2, Line 18: A 
> nit. Put the sentence in passive voice?
Done


http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@22
PS2, Line 22: NU
> nit. Same as above.
Done


http://gerrit.cloudera.org:8080/#/c/17678/2//COMMIT_MSG@25
PS2, Line 25: Parquet column stats reader is also updated to convert the decimal
            : values.
> Can you mention how the different filters handle this conversion? e.g.:
Bloom filters work for decimals. I extended the commit message.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-readers.cc@740
PS2, Line 740:       // The value or the conversion is invalid but execution should continue - set the
> nit. 'The value or the conversion'
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.h
File be/src/exec/parquet/parquet-column-stats.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.h@351
PS2, Line 351:   /// Decodes decimal value into slot. Does conversion if needed.
> nit. May need to add a comment.
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.h@352
PS2, Line 352: template <typename DecimalType>
             :   bool DecodeDecimal(const std:
> nit. It seems that they can appear in one line (instead of two)?
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc
File be/src/exec/parquet/parquet-column-stats.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@24
PS2, Line 24: #include "exec/parquet/parquet-data-converter.h"
            : 
            : #include "common/names.h"
> common/names.h is usually an exception for this, I am not sure why
'common/names.h' has using declarations, i.e. it has side-effects. E.g. imagine the following scenario:

 #include "common/names.h"      // has dozens of using
 #include "some-other-header.h" // Now it compiles with the above usings


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@348
PS2, Line 348: bool ret = ColumnStats<DecimalType>::DecodePlainValue(stat_value, slot,
> This line can be moved between line 351 and 352 when the converter is neede
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@352
PS2, Line 352: // Let's convert the decimal value to 
> nit. LIKELY()?
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-column-stats.cc@353
PS2, Line 353:   // filters and min/max conjuncts against the converted values as later we'd also
> Can you mention why we can handle this correctly? See my comment in the com
Added comment.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h
File be/src/exec/parquet/parquet-data-converter.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h@49
PS2, Line 49: const ParquetTimestampD
> nit. const and &?
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exec/parquet/parquet-data-converter.h@179
PS2, Line 179: /*round=*/true
> nit. Just wonder why the round flag is set to true here.
I think both truncating and rounding would be valid based on the SQL standard, but I chose rounding to follow Hive's behavior.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exprs/decimal-operators-ir.cc
File be/src/exprs/decimal-operators-ir.cc:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/exprs/decimal-operators-ir.cc@119
PS2, Line 119: /*round=*/false
> nit. I wonder if there is a situation in which 'round' should be set to tru
I chose to preserve the old behavior to be on the safe side.
I didn't put too much thinking into it, probably rounding would be more rational.

Once this patch goes in I'll create a new Jira ticket to re-evaluate this behavior. We can make it tunable via a query option, but maybe it's better to choose a side in this case.


http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/runtime/decimal-value.inline.h
File be/src/runtime/decimal-value.inline.h:

http://gerrit.cloudera.org:8080/#/c/17678/2/be/src/runtime/decimal-value.inline.h@144
PS2, Line 144: {
> nit. It seems this IF can be removed.
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test:

http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@7
PS2, Line 7: 633
> Can you change to a value that will be also rounded up at some scale, e.g. 
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@58
PS2, Line 58: ----
> nit. Suggest to add an INSERT query inserting a value like 100000000.9999 t
I added it a bit later. I cannot add it here because in the subsequent tests we lower the precision to 10. And we still don't allow to read a file that has decimal with higher precision than table metadata.

Though maybe we could also allow it in this patch, and return NULL values in case of overflows.


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@60
PS2, Line 60: ---- RESULTS
> This ALTER TABLE is not needed.
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@77
PS2, Line 77: 23.63
> This ALTER TABLE is not needed.
Done


http://gerrit.cloudera.org:8080/#/c/17678/2/testdata/workloads/functional-query/queries/QueryTest/parquet-decimal-precision-and-scale-altering.test@99
PS2, Line 99: DECIM
> Can you add a test where search for -23.7 specifically? This would cover th
Done. Also added a search for 24.



-- 
To view, visit http://gerrit.cloudera.org:8080/17678
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Icefa7e545ca9f7df1741a2d1225375ecf54434da
Gerrit-Change-Number: 17678
Gerrit-PatchSet: 3
Gerrit-Owner: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <cs...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Qifan Chen <qc...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 13 Jul 2021 14:24:37 +0000
Gerrit-HasComments: Yes