Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/08/26 08:10:01 UTC
[GitHub] [iceberg] dotjdk opened a new issue, #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
dotjdk opened a new issue, #5641:
URL: https://github.com/apache/iceberg/issues/5641
### Apache Iceberg version
0.14.0 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
If we create a table with `format-version = 2` and replace the partition spec after table creation, we are unable to read data from the table after running `rewrite_manifests`.
If we look at the metadata after the manifest rewrite, the min/max stats for the partition field have been nulled in both the manifest and the manifest list. As a result, queries on the table that rely on metadata pruning return no results.
The issue does not appear if the table has the correct partition spec from creation, or if it uses `format-version = 1`.
See the attached script for steps to reproduce, and similar steps to show that it works correctly with format 1 tables, or when the partition spec is specified on table creation and not modified after.
We are using Spark 3.3.0 with Iceberg 0.14.0, and can reproduce the issue easily with a few steps in spark shell on a newly created table, as in the attached script.
[script.scala.zip](https://github.com/apache/iceberg/files/9431000/script.scala.zip)
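To illustrate why nulled min/max stats can make a filtered query come back empty, here is a minimal Python sketch of manifest pruning on a partition field summary. It is a toy model, not Iceberg code: the class `PartitionFieldSummary` and the function `might_match_less_than` are hypothetical names, and the rule that a missing lower bound is read as "all values are null" is an assumption about how the planner interprets the summary, not something stated in this report.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class PartitionFieldSummary:
    """Toy stand-in for one entry of `partitions` in a manifest list."""
    contains_null: bool
    lower_bound: Optional[int]  # days since 1970-01-01, or None if missing
    upper_bound: Optional[int]


def might_match_less_than(summary: PartitionFieldSummary, literal: int) -> bool:
    """Return True if the manifest may hold partitions with value < literal."""
    if summary.lower_bound is None:
        # Assumption: no bounds means every value is null, so nothing can match.
        return False
    return summary.lower_bound < literal


epoch = date(1970, 1, 1)
row_day = (date(2022, 1, 1) - epoch).days
query_day = (date(2022, 8, 26) - epoch).days

healthy = PartitionFieldSummary(False, row_day, row_day)
nulled = PartitionFieldSummary(True, None, None)

print(might_match_less_than(healthy, query_day))  # manifest is scanned
print(might_match_less_than(nulled, query_day))   # manifest is skipped
```

Under that assumption, a manifest whose bounds were nulled is skipped during planning even though its data file still contains the matching row.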
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org
[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1228213426
## Broken V2
```
➜ python git:(master) pyiceberg --uri thrift://localhost:9083 describe data.rewrite_test
Table format version 2
Metadata location file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/00004-23f9a545-96f2-4c67-a6cd-e415676bfdf0.metadata.json
Table UUID 325fd058-afd1-48fd-846a-26a491ec8d68
Last Updated 1661498735055
Partition spec [
1001: ts_day: unknown(2)
]
Sort order []
Current schema Schema, id=0
├── 1: id: optional int
├── 2: ts: optional timestamptz
└── 3: day_of_ts: optional date
Current snapshot Operation.REPLACE: id=5015440624103305225, parent_id=4893783402811059498, schema_id=0
Snapshots Snapshots
├── Snapshot 4893783402811059498, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/snap-4893783402811059498-1-1bd5b04a-7a85-4014-a703-6a56e8a7e741.avro
└── Snapshot 5015440624103305225, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/snap-5015440624103305225-1-58d11599-f485-4b55-9e36-198e7ad1be7e.avro
Properties owner root
```
```json
➜ python git:(master) avro-tools tojson /Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/snap-5015440624103305225-1-58d11599-f485-4b55-9e36-198e7ad1be7e.avro | jq
22/08/26 10:26:24 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
{
"manifest_path": "file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test/metadata/58d11599-f485-4b55-9e36-198e7ad1be7e-m0.avro",
"manifest_length": 6874,
"partition_spec_id": 1,
"content": 0,
"sequence_number": 2,
"min_sequence_number": 1,
"added_snapshot_id": 5015440624103305000,
"added_data_files_count": 0,
"existing_data_files_count": 1,
"deleted_data_files_count": 0,
"added_rows_count": 0,
"existing_rows_count": 1,
"deleted_rows_count": 0,
"partitions": {
"array": [
{
"contains_null": true,
"contains_nan": {
"boolean": false
},
"lower_bound": null,
"upper_bound": null
}
]
}
}
```
## Working V1
```bash
➜ python git:(master) pyiceberg --uri thrift://localhost:9083 describe data.rewrite_test3
Table format version 1
Metadata location file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/00003-d324b9a3-d18c-4e6d-92b3-a6a991fa21b4.metadata.json
Table UUID c099df65-a04a-449f-8ab0-f1e2c7028d1f
Last Updated 1661499222654
Partition spec [
1000: ts_day: unknown(2)
]
Sort order []
Current schema Schema, id=0
├── 1: id: optional int
├── 2: ts: optional timestamptz
└── 3: day_of_ts: optional date
Current snapshot Operation.REPLACE: id=7085493087562750676, parent_id=2500621515342040057, schema_id=0
Snapshots Snapshots
├── Snapshot 2500621515342040057, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/snap-2500621515342040057-1-ab084d85-0f7c-4c75-b647-3dcd1107b07a.avro
└── Snapshot 7085493087562750676, schema 0: file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/snap-7085493087562750676-1-2bbe10ad-b311-4397-a645-c64c0ae44b45.avro
Properties owner root
```
```json
➜ python git:(master) avro-tools tojson file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/snap-7085493087562750676-1-2bbe10ad-b311-4397-a645-c64c0ae44b45.avro | jq
22/08/26 10:26:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
{
"manifest_path": "file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test3/metadata/2bbe10ad-b311-4397-a645-c64c0ae44b45-m0.avro",
"manifest_length": 6878,
"partition_spec_id": 0,
"content": 0,
"sequence_number": 2,
"min_sequence_number": 1,
"added_snapshot_id": 7085493087562751000,
"added_data_files_count": 0,
"existing_data_files_count": 1,
"deleted_data_files_count": 0,
"added_rows_count": 0,
"existing_rows_count": 1,
"deleted_rows_count": 0,
"partitions": {
"array": [
{
"contains_null": false,
"contains_nan": {
"boolean": false
},
"lower_bound": {
"bytes": "1J\u0000\u0000"
},
"upper_bound": {
"bytes": "1J\u0000\u0000"
}
}
]
}
}
```
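The `lower_bound` and `upper_bound` bytes in the working manifest list can be decoded by hand to confirm that they point at the inserted row's day. The sketch below assumes that `avro-tools tojson` renders each byte as the character with the same code point, and that the bound is stored as a 4-byte little-endian count of days since 1970-01-01; the helper name `decode_day_bound` is made up for this example.

```python
import struct
from datetime import date, timedelta


def decode_day_bound(json_bytes: str) -> date:
    """Decode a day-valued partition bound as printed by avro-tools tojson."""
    # Each character of the JSON string stands for one raw byte.
    raw = json_bytes.encode("latin-1")
    # Assumed layout: 4-byte little-endian signed int, days since the epoch.
    (days,) = struct.unpack("<i", raw)
    return date(1970, 1, 1) + timedelta(days=days)


print(decode_day_bound("1J\u0000\u0000"))  # 2022-01-01
```

The decoded value matches the `2022-01-01` row inserted into `data.rewrite_test3`, which is the bound the broken table is missing.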
[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1230001969
Found the root issue:
![image](https://user-images.githubusercontent.com/1134248/187167760-7342b309-5d79-4c4a-b3a1-73a895e4e902.png)
It turns out that we select all the partition fields (the old and new ones), but we only update the statistics on the current partition keys, and this one is null. PR follows.
[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1235113896
This has been fixed in https://github.com/apache/iceberg/pull/5691, thanks @rdblue. And thanks @dotjdk for reporting, much appreciated. Otherwise we wouldn't have caught this 🐛
[GitHub] [iceberg] Fokko commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
Posted by GitBox <gi...@apache.org>.
Fokko commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1228197861
I'm able to reproduce it on my side:
## Works with V1:
```sql
%%sql
DROP TABLE if EXISTS data.rewrite_test3
```
```sql
%%sql
create table data.rewrite_test3 (id int, ts timestamp, day_of_ts date) using iceberg partitioned by (days(ts))
```
```sql
%%sql
alter table data.rewrite_test3 SET TBLPROPERTIES ('format-version' = '2')
```
22/08/26 07:33:13 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
```sql
%%sql
insert into data.rewrite_test3 values (1, CAST('2022-01-01 10:00:00' AS TIMESTAMP), CAST('2022-01-01' AS DATE))
```
```sql
%%sql
select * from data.rewrite_test3 where ts < current_timestamp()
```
<table>
<thead>
<tr>
<th>id</th>
<th>ts</th>
<th>day_of_ts</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2022-01-01 10:00:00</td>
<td>2022-01-01</td>
</tr>
</tbody>
</table>
```sql
%%sql
call system.rewrite_manifests(table => 'data.rewrite_test3')
```
<table>
<thead>
<tr>
<th>rewritten_manifests_count</th>
<th>added_manifests_count</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>
```sql
%%sql
select * from data.rewrite_test3 where ts < current_timestamp()
```
<table>
<thead>
<tr>
<th>id</th>
<th>ts</th>
<th>day_of_ts</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2022-01-01 10:00:00</td>
<td>2022-01-01</td>
</tr>
</tbody>
</table>
## Seems to be broken with V2
```sql
%%sql
drop table if exists data.rewrite_test
```
```sql
%%sql
create table data.rewrite_test (id int, ts timestamp, day_of_ts date) using iceberg partitioned by (day_of_ts)
```
```sql
%%sql
describe table extended data.rewrite_test
```
<table>
<thead>
<tr>
<th>col_name</th>
<th>data_type</th>
<th>comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>id</td>
<td>int</td>
<td></td>
</tr>
<tr>
<td>ts</td>
<td>timestamp</td>
<td></td>
</tr>
<tr>
<td>day_of_ts</td>
<td>date</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># Partitioning</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Part 0</td>
<td>day_of_ts</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># Metadata Columns</td>
<td></td>
<td></td>
</tr>
<tr>
<td>_spec_id</td>
<td>int</td>
<td></td>
</tr>
<tr>
<td>_partition</td>
<td>struct<day_of_ts:date></td>
<td></td>
</tr>
<tr>
<td>_file</td>
<td>string</td>
<td></td>
</tr>
<tr>
<td>_pos</td>
<td>bigint</td>
<td></td>
</tr>
<tr>
<td>_deleted</td>
<td>boolean</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># Detailed Table Information</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Name</td>
<td>demo.data.rewrite_test</td>
<td></td>
</tr>
<tr>
<td>Location</td>
<td>file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test</td>
<td></td>
</tr>
<tr>
<td>Provider</td>
<td>iceberg</td>
<td></td>
</tr>
<tr>
<td>Owner</td>
<td>root</td>
<td></td>
</tr>
<tr>
<td>Table Properties</td>
<td>[current-snapshot-id=none,format=iceberg/parquet,format-version=1]</td>
<td></td>
</tr>
</tbody>
</table>
```sql
%%sql
alter table data.rewrite_test SET TBLPROPERTIES ('format-version' = '2')
```
22/08/26 07:24:13 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up
```sql
%%sql
describe table extended data.rewrite_test
```
<table>
<thead>
<tr>
<th>col_name</th>
<th>data_type</th>
<th>comment</th>
</tr>
</thead>
<tbody>
<tr>
<td>id</td>
<td>int</td>
<td></td>
</tr>
<tr>
<td>ts</td>
<td>timestamp</td>
<td></td>
</tr>
<tr>
<td>day_of_ts</td>
<td>date</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># Partitioning</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Part 0</td>
<td>day_of_ts</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># Metadata Columns</td>
<td></td>
<td></td>
</tr>
<tr>
<td>_spec_id</td>
<td>int</td>
<td></td>
</tr>
<tr>
<td>_partition</td>
<td>struct<day_of_ts:date></td>
<td></td>
</tr>
<tr>
<td>_file</td>
<td>string</td>
<td></td>
</tr>
<tr>
<td>_pos</td>
<td>bigint</td>
<td></td>
</tr>
<tr>
<td>_deleted</td>
<td>boolean</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td># Detailed Table Information</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Name</td>
<td>demo.data.rewrite_test</td>
<td></td>
</tr>
<tr>
<td>Location</td>
<td>file:/Users/fokkodriesprong/Desktop/docker-spark-iceberg/wh/data.db/rewrite_test</td>
<td></td>
</tr>
<tr>
<td>Provider</td>
<td>iceberg</td>
<td></td>
</tr>
<tr>
<td>Owner</td>
<td>root</td>
<td></td>
</tr>
<tr>
<td>Table Properties</td>
<td>[current-snapshot-id=none,format=iceberg/parquet,format-version=2]</td>
<td></td>
</tr>
</tbody>
</table>
```sql
%%sql
ALTER TABLE data.rewrite_test REPLACE PARTITION FIELD day_of_ts WITH days(ts)
```
```sql
%%sql
insert into data.rewrite_test values (1, CAST('2022-01-01 10:00:00' AS TIMESTAMP), CAST('2022-01-01' AS DATE))
```
```sql
%%sql
select * from data.rewrite_test where ts < current_timestamp()
```
<table>
<thead>
<tr>
<th>id</th>
<th>ts</th>
<th>day_of_ts</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2022-01-01 10:00:00</td>
<td>2022-01-01</td>
</tr>
</tbody>
</table>
```sql
%%sql
call system.rewrite_manifests(table => 'data.rewrite_test')
```
22/08/26 07:25:34 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
<table>
<thead>
<tr>
<th>rewritten_manifests_count</th>
<th>added_manifests_count</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>
```sql
%%sql
select * from data.rewrite_test where ts < current_timestamp()
```
<table>
<thead>
<tr>
<th>id</th>
<th>ts</th>
<th>day_of_ts</th>
</tr>
</thead>
<tbody>
</tbody>
</table>
```sql
%%sql
select * from data.rewrite_test where day_of_ts < current_timestamp()
```
<table>
<thead>
<tr>
<th>id</th>
<th>ts</th>
<th>day_of_ts</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>2022-01-01 10:00:00</td>
<td>2022-01-01</td>
</tr>
</tbody>
</table>
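For convenience, the V2 steps above can be collected into a single script. This is only a sketch: the statements are copied from the cells in this comment, while `run_repro` and its `run_sql` parameter are hypothetical names. The caller is assumed to supply a function that executes one SQL statement against a Spark session with the Iceberg extensions and the `data` namespace configured.

```python
from typing import Any, Callable, List

# Statements from the broken V2 walkthrough, in execution order.
REPRO_STATEMENTS: List[str] = [
    "DROP TABLE IF EXISTS data.rewrite_test",
    "CREATE TABLE data.rewrite_test (id int, ts timestamp, day_of_ts date) "
    "USING iceberg PARTITIONED BY (day_of_ts)",
    "ALTER TABLE data.rewrite_test SET TBLPROPERTIES ('format-version' = '2')",
    "ALTER TABLE data.rewrite_test REPLACE PARTITION FIELD day_of_ts WITH days(ts)",
    "INSERT INTO data.rewrite_test VALUES "
    "(1, CAST('2022-01-01 10:00:00' AS TIMESTAMP), CAST('2022-01-01' AS DATE))",
    "CALL system.rewrite_manifests(table => 'data.rewrite_test')",
]

# The query that returned one row before the rewrite and none after it.
CHECK_QUERY = "SELECT * FROM data.rewrite_test WHERE ts < current_timestamp()"


def run_repro(run_sql: Callable[[str], Any]) -> Any:
    """Run every setup statement, then return the result of the check query."""
    for statement in REPRO_STATEMENTS:
        run_sql(statement)
    return run_sql(CHECK_QUERY)
```

With a live session this could be called as `run_repro(lambda q: spark.sql(q).collect())`; on an affected build the returned list would be expected to be empty, as in the output above.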
[GitHub] [iceberg] Fokko closed issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
Posted by GitBox <gi...@apache.org>.
Fokko closed issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
URL: https://github.com/apache/iceberg/issues/5641
[GitHub] [iceberg] dotjdk commented on issue #5641: Core: Metadata min/max stats nulled after updating partition spec and rewriting manifests
Posted by GitBox <gi...@apache.org>.
dotjdk commented on issue #5641:
URL: https://github.com/apache/iceberg/issues/5641#issuecomment-1228200072
Yes, v1 doesn't have the issue. And v2 doesn't have the issue if the partition spec was specified on table creation and not modified after.