You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@impala.apache.org by "Attila Jeges (Code Review)" <ge...@cloudera.org> on 2021/08/24 10:55:43 UTC

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Attila Jeges has uploaded this change for review. ( http://gerrit.cloudera.org:8080/17806


Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
17 files changed, 965 insertions(+), 17 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/1
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 3:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/17806/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17806/3/be/src/exec/parquet/hdfs-parquet-table-writer.cc@1725
PS3, Line 1725:       const ColumnDescriptor& col_desc = table_desc_->col_descs()[i + num_clustering_cols];
line too long (91 > 90)



-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 15:43:35 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#5). ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
A be/src/exec/parquet/serialize-single-value-test.cc
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
19 files changed, 1,036 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/5
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 7: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 19:06:21 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 2:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9356/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Aug 2021 12:38:03 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 6:

patch-set #6 contains the refactored flat buffer generation.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 18:18:34 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 10: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Sep 2021 21:34:13 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 10: Verified+1


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Sep 2021 19:28:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 1:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@24
PS1, Line 24: import avro.schema
> flake8: F401 'avro.schema' imported but unused
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@25
PS1, Line 25: from avro.datafile import DataFileReader, DataFileWriter
> flake8: F401 'avro.datafile.DataFileWriter' imported but unused
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@26
PS1, Line 26: from avro.io import DatumReader, DatumWriter
> flake8: F401 'avro.io.DatumWriter' imported but unused
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@199
PS1, Line 199:  
> flake8: E202 whitespace before ']'
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@199
PS1, Line 199:  
> flake8: E201 whitespace after '['
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@260
PS1, Line 260: ,
> flake8: E231 missing whitespace after ','
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@268
PS1, Line 268: ,
> flake8: E231 missing whitespace after ','
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@282
PS1, Line 282: ,
> flake8: E231 missing whitespace after ','
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@290
PS1, Line 290: ,
> flake8: E231 missing whitespace after ','
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@292
PS1, Line 292: :
> flake8: E231 missing whitespace after ':'
Done


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@307
PS1, Line 307: :
> flake8: E231 missing whitespace after ':'
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Aug 2021 12:01:33 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#6). ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
A be/src/exec/parquet/serialize-single-value-test.cc
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
19 files changed, 1,025 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/6
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 6: Code-Review+2

Great job!


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 19:06:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 1:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9355/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Aug 2021 11:17:12 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#4). ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
A be/src/exec/parquet/serialize-single-value-test.cc
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
19 files changed, 1,034 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/4
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 3:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9386/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 16:04:31 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#2). ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
17 files changed, 964 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/2
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#3). ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
A be/src/exec/parquet/serialize-single-value-test.cc
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
19 files changed, 1,033 insertions(+), 17 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/3
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 3
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 9: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7433/


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 9
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 12:32:38 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 1:

(11 comments)

http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py
File tests/query_test/test_iceberg.py:

http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@24
PS1, Line 24: import avro.schema
flake8: F401 'avro.schema' imported but unused


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@25
PS1, Line 25: from avro.datafile import DataFileReader, DataFileWriter
flake8: F401 'avro.datafile.DataFileWriter' imported but unused


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@26
PS1, Line 26: from avro.io import DatumReader, DatumWriter
flake8: F401 'avro.io.DatumWriter' imported but unused


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@199
PS1, Line 199:  
flake8: E201 whitespace after '['


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@199
PS1, Line 199:  
flake8: E202 whitespace before ']'


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@260
PS1, Line 260: ,
flake8: E231 missing whitespace after ','


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@268
PS1, Line 268: ,
flake8: E231 missing whitespace after ','


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@282
PS1, Line 282: ,
flake8: E231 missing whitespace after ','


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@290
PS1, Line 290: ,
flake8: E231 missing whitespace after ','


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@292
PS1, Line 292: :
flake8: E231 missing whitespace after ':'


http://gerrit.cloudera.org:8080/#/c/17806/1/tests/query_test/test_iceberg.py@307
PS1, Line 307: :
flake8: E231 missing whitespace after ':'



-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 1
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Tue, 24 Aug 2021 10:56:47 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 4:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9387/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 16:22:01 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 6:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9390/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 6
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 18:39:15 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 2:

(4 comments)

Did a first round, the change looks great!
Thanks for adding so many tests.

http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/hdfs-parquet-table-writer.cc@1723
PS2, Line 1723: col_name
Instead of 'col_name' we could use the Iceberg field_id.

E.g. we retrieve it here as well:
https://github.com/apache/impala/blob/237ed5e8738ec565bc8d3ce813d9b70c12ad4ce7/be/src/exec/parquet/hdfs-parquet-table-writer.cc#L1470

It would simplify the code in IcebergCatalogOpExecutor. It's also safer because someone might renames a column in parallel.


http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/parquet-column-stats.inline.h
File be/src/exec/parquet/parquet-column-stats.inline.h:

http://gerrit.cloudera.org:8080/#/c/17806/2/be/src/exec/parquet/parquet-column-stats.inline.h@369
PS2, Line 369: out->assign(data, sizeof(big_endian_val)); 
The spec says "using the minimum number of bytes for the value", but we are writing number of bytes corresponding to the data type.

Iceberg uses BigInteger.toByteArray():
https://github.com/apache/iceberg/blob/90225d6c9413016d611e2ce5eff37db1bc1b4fc5/api/src/main/java/org/apache/iceberg/types/Conversions.java#L121

which only writes a single byte for the value 42.
https://onlinegdb.com/i2QS7vqI3

But I'm OK with it if it's OK for Iceberg.


http://gerrit.cloudera.org:8080/#/c/17806/2/common/fbs/IcebergObjects.fbs
File common/fbs/IcebergObjects.fbs:

http://gerrit.cloudera.org:8080/#/c/17806/2/common/fbs/IcebergObjects.fbs@31
PS2, Line 31:   total_compressed_byte_size: FbLongStat;
            :   null_count: FbLongStat;
Can these be just longs? Like 'record_count' in L40?


http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto@33
PS2, Line 33: // Iceberg per column statistics.
            : message IcebergDmlColumnStatsPB {
            :   // The on disk byte size.
            :   optional int64 column_size = 1;
            :   optional int64 null_count = 2;
            :   // min and max stats are single-value serialized.
            :   optional string min_binary = 3;
            :   optional string max_binary = 4;
            : }
I wonder if we could create the Iceberg flat buffers at the data sinks again. Like it was in https://gerrit.cloudera.org/#/c/16545/

That way we wouldn't need to duplicate the stats struct in protobuf. But I'm OK with the current solution if changing it would bring more complexity elsewhere.



-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 2
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Wed, 25 Aug 2021 18:54:50 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 7:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7430/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 19:06:22 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 9:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7433/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 9
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 06:20:08 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 8:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9392/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 06:41:02 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has uploaded a new patch set (#8). ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
A be/src/exec/parquet/serialize-single-value-test.cc
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
19 files changed, 1,025 insertions(+), 33 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/06/17806/8
-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 8
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Attila Jeges (Code Review)" <ge...@cloudera.org>.
Attila Jeges has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................

IMPALA-10879: Add parquet stats to iceberg manifest

This patch adds parquet stats to iceberg manifest as per-datafile
metrics.

The following metrics are supported:
- column_sizes :
  Map from column id to the total size on disk of all regions that
  store the column. Does not include bytes necessary to read other
  columns, like footers.

- null_value_counts :
  Map from column id to number of null values in the column.

- lower_bounds :
  Map from column id to lower bound in the column serialized as
  binary. Each value must be less than or equal to all non-null,
  non-NaN values in the column for the file.

- upper_bounds :
  Map from column id to upper bound in the column serialized as
  binary. Each value must be greater than or equal to all non-null,
  non-Nan values in the column for the file.

The corresponding parquet stats are collected by 'ColumnStats'
(in 'min_value_', 'max_value_', 'null_count_' members) and
'HdfsParquetTableWriter::BaseColumnWriter' (in
'total_compressed_byte_size_' member).

Testing:
- New e2e test was added to verify that the metrics are written to the
  Iceberg manifest upon inserting data.
- New e2e test was added to verify that lower_bounds/upper_bounds
  metrics are used to prune data files on querying iceberg tables.
- Existing e2e tests were updated to work with the new behavior.
- BE test for single-value serialization.

Relevant Iceberg documentation:
- Manifest:
  https://iceberg.apache.org/spec/#manifests
- Values in lower_bounds and upper_bounds maps should be Single-value
  serialized to binary:
  https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Reviewed-on: http://gerrit.cloudera.org:8080/17806
Tested-by: Impala Public Jenkins <im...@cloudera.com>
Reviewed-by: Attila Jeges <at...@cloudera.com>
---
M be/src/exec/hdfs-table-sink.cc
M be/src/exec/hdfs-table-writer.h
M be/src/exec/parquet/CMakeLists.txt
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/hdfs-parquet-table-writer.h
M be/src/exec/parquet/parquet-column-stats.h
M be/src/exec/parquet/parquet-column-stats.inline.h
A be/src/exec/parquet/serialize-single-value-test.cc
M be/src/runtime/dml-exec-state.cc
M be/src/runtime/dml-exec-state.h
M be/src/util/bit-util.h
M common/fbs/IcebergObjects.fbs
M common/protobuf/control_service.proto
M fe/src/main/java/org/apache/impala/service/IcebergCatalogOpExecutor.java
M infra/python/deps/requirements.txt
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partition-transform-insert.test
M testdata/workloads/functional-query/queries/QueryTest/iceberg-partitioned-insert.test
A testdata/workloads/functional-query/queries/QueryTest/iceberg-upper-lower-bound-metrics.test
M tests/query_test/test_iceberg.py
19 files changed, 1,025 insertions(+), 33 deletions(-)

Approvals:
  Impala Public Jenkins: Verified
  Attila Jeges: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 11
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Zoltan Borok-Nagy (Code Review)" <ge...@cloudera.org>.
Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 4: Code-Review+1

(2 comments)

The change looks great! I only left some minor comments.

http://gerrit.cloudera.org:8080/#/c/17806/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc
File be/src/exec/parquet/hdfs-parquet-table-writer.cc:

http://gerrit.cloudera.org:8080/#/c/17806/4/be/src/exec/parquet/hdfs-parquet-table-writer.cc@1682
PS4, Line 1682: [field_id]
Can we retrieve the map element only once?


http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto
File common/protobuf/control_service.proto:

http://gerrit.cloudera.org:8080/#/c/17806/2/common/protobuf/control_service.proto@33
PS2, Line 33: // Iceberg per column statistics.
            : message IcebergDmlColumnStatsPB {
            :   // The on disk byte size.
            :   optional int64 column_size = 1;
            :   optional int64 null_count = 2;
            :   // min and max stats are single-value serialized.
            :   optional string min_binary = 3;
            :   optional string max_binary = 4;
            : }
> I will give this a try and upload it later in a separate patch-set for easi
Thanks. Maybe you just need to add a string field to DmlFileStatusPb storing the flat buffer. Or add a vector of strings to DmlPartitionStatusPB.

Then in ClientRequestState we just need to extract the list in UpdateCatalog().

But I'm OK with the current code if it gets more complicated than that.



-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 4
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 16:12:10 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 5:

Build Successful 

https://jenkins.impala.io/job/gerrit-code-review-checks/9389/ : Initial code review checks passed. Use gerrit-verify-dryrun-external or gerrit-verify-dryrun to run full precommit tests.


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 5
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Fri, 27 Aug 2021 18:34:47 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 7: Verified-1

Build failed: https://jenkins.impala.io/job/gerrit-verify-dryrun/7430/


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 7
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Sat, 28 Aug 2021 01:22:59 +0000
Gerrit-HasComments: No

[Impala-ASF-CR] IMPALA-10879: Add parquet stats to iceberg manifest

Posted by "Impala Public Jenkins (Code Review)" <ge...@cloudera.org>.
Impala Public Jenkins has posted comments on this change. ( http://gerrit.cloudera.org:8080/17806 )

Change subject: IMPALA-10879: Add parquet stats to iceberg manifest
......................................................................


Patch Set 10:

Build started: https://jenkins.impala.io/job/gerrit-verify-dryrun/7453/ DRY_RUN=false


-- 
To view, visit http://gerrit.cloudera.org:8080/17806
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: Ic31f2260bc6f6a7f307ac955ff05eb154917675b
Gerrit-Change-Number: 17806
Gerrit-PatchSet: 10
Gerrit-Owner: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Attila Jeges <at...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <im...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <bo...@cloudera.com>
Gerrit-Comment-Date: Thu, 02 Sep 2021 13:15:59 +0000
Gerrit-HasComments: No