Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/04/28 18:05:47 UTC

[GitHub] [iceberg] Fokko opened a new issue, #7463: Incorrect summary

Fokko opened a new issue, #7463:
URL: https://github.com/apache/iceberg/issues/7463

   ### Apache Iceberg version
   
   1.2.1 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Make sure to set the number of shuffle partitions to one, to avoid ending up with single-record files (with merge-on-read, a file whose rows are all marked as deleted is simply discarded, so no delete file would be produced).
   ```sql
   SET spark.sql.shuffle.partitions=1;
   ```
   
   Create a table:
   ```sql
   create table default.t1
   (
       foo          string,
       bar          int,
       dt           timestamp
   ) TBLPROPERTIES (
       'format-version'='2',
       'write.delete.mode'='merge-on-read',
       'write.update.mode'='merge-on-read',
       'write.merge.mode'='merge-on-read'
   )
   PARTITIONED BY (days(dt));
   ```
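
   To confirm the merge-on-read settings took effect, the table properties can be read back with standard Spark SQL:

   ```sql
   SHOW TBLPROPERTIES default.t1;
   ```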
   
   Insert some data:
   ```sql
   INSERT INTO default.t1 VALUES
       ('a', 1, now()), 
       ('b', 2, now()), 
       ('c', 3, now());
   ```
   
   SELECT returns:
   
   foo | bar | dt
   -- | -- | --
   a | 1 | 2023-04-28 13:51:28.522796
   b | 2 | 2023-04-28 13:51:28.522817
   c | 3 | 2023-04-28 13:51:28.522837
   
   Delete one row:
   ```sql
   DELETE FROM default.t1 WHERE foo = 'b'
   ```
   
   Look at the files:
   ```sql
   SELECT * FROM default.t1.files
   ```
   
   We see one data file, and one delete file:
   
   content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics
   -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
   0 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 3 | 897 | {1: 54, 2: 52, 3: 56} | {1: 3, 2: 3, 3: 3} | {1: 0, 2: 0, 3: 0} | {} | {1: bytearray(b'a'), 2: bytearray(b'\x01\x00\x00\x00'), 3: bytearray(b',\x92\x12\xc5e\xfa\x05\x00')} | {1: bytearray(b'c'), 2: bytearray(b'\x03\x00\x00\x00'), 3: bytearray(b'U\x92\x12\xc5e\xfa\x05\x00')} | None | [4] | None | 0 | Row(bar=Row(column_size=52, value_count=3, null_value_count=0, nan_value_count=None, lower_bound=1, upper_bound=3), dt=Row(column_size=56, value_count=3, null_value_count=0, nan_value_count=None, lower_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522796), upper_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522837)), foo=Row(column_size=54, value_count=3, null_value_count=0, nan_value_count=None, lower_bound='a', upper_bound='c'))
   1 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 1 | 1490 | {2147483546: 142, 2147483545: 46} | {2147483546: 1, 2147483545: 1} | {2147483546: 0, 2147483545: 0} | {} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | None | None | None | None | Row(bar=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), dt=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), foo=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None))
   
   I don't like delete files; they slow down my queries, so let's clean them up:
   
   ```sql
   CALL system.rewrite_data_files(table => 'default.t1', options => map (
       'delete-file-threshold', '1'
   ))
   ```
   
   Let's look again at the files:
   
   ```sql
   SELECT * FROM default.t1.files
   ```
   
   This is incorrect: we still get the delete file, even though the data file has been rewritten (the value counts show that it now holds two rows):
   
   content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics
   -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
   0 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2433-3108a2a5-b3f3-4c16-849b-4c54604584f5-00001.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 2 | 909 | {1: 55, 2: 53, 3: 60} | {1: 2, 2: 2, 3: 2} | {1: 0, 2: 0, 3: 0} | {} | {1: bytearray(b'a'), 2: bytearray(b'\x01\x00\x00\x00'), 3: bytearray(b',\x92\x12\xc5e\xfa\x05\x00')} | {1: bytearray(b'c'), 2: bytearray(b'\x03\x00\x00\x00'), 3: bytearray(b'U\x92\x12\xc5e\xfa\x05\x00')} | None | [4] | None | 0 | Row(bar=Row(column_size=53, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=1, upper_bound=3), dt=Row(column_size=60, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522796), upper_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522837)), foo=Row(column_size=55, value_count=2, null_value_count=0, nan_value_count=None, lower_bound='a', upper_bound='c'))
   1 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 1 | 1490 | {2147483546: 142, 2147483545: 46} | {2147483546: 1, 2147483545: 1} | {2147483546: 0, 2147483545: 0} | {} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | None | None | None | None | Row(bar=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), dt=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), foo=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None))
   
   Double-checking the table, it still looks correct:
   
   ```sql
   SELECT * FROM default.t1
   ```
   
   foo | bar | dt
   -- | -- | --
   a | 1 | 2023-04-28 13:51:28.522796
   c | 3 | 2023-04-28 13:51:28.522837
   
   Also in the snapshots:
   
   ```sql
   SELECT * FROM default.t1.snapshots
   ```
   
   The summary still counts the now-dangling delete file (`total-delete-files` and `total-position-deletes` remain 1):

   committed_at | snapshot_id | parent_id | operation | manifest_list | summary
   -- | -- | -- | -- | -- | --
   2023-04-28 13:51:28.654000 | 1830440247885882398 | None | append | s3://warehouse/default/t1/metadata/snap-1830440247885882398-1-e17ec3b6-01a0-49e8-8be4-fffca9dbdaef.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '3', 'total-position-deletes': '0', 'added-files-size': '897', 'total-delete-files': '0', 'total-files-size': '897', 'total-records': '3', 'total-data-files': '1'}
   2023-04-28 13:51:34.738000 | 1705447835275510977 | 1830440247885882398 | overwrite | s3://warehouse/default/t1/metadata/snap-1705447835275510977-1-44a6bbb7-a1fc-4e6b-bed7-5f8a578a2a11.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-position-deletes': '1', 'total-equality-deletes': '0', 'total-position-deletes': '1', 'added-position-delete-files': '1', 'added-files-size': '1490', 'total-delete-files': '1', 'added-delete-files': '1', 'total-files-size': '2387', 'total-records': '3', 'total-data-files': '1'}
   2023-04-28 15:06:04.304000 | 1992524767636422274 | 1705447835275510977 | replace | s3://warehouse/default/t1/metadata/snap-1992524767636422274-1-8c696509-2fee-4dd8-ae4c-1729da3bfe17.avro | {'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '2', 'deleted-data-files': '1', 'deleted-records': '3', 'total-records': '2', 'removed-files-size': '897', 'changed-partition-count': '1', 'total-position-deletes': '1', 'added-files-size': '909', 'total-delete-files': '1', 'total-files-size': '2399', 'total-data-files': '1'}
   
   I thought, maybe we need to rewrite the manifests:
   
   ```sql
   CALL system.rewrite_manifests('default.t1')
   ```
   
   rewritten_manifests_count | added_manifests_count
   -- | --
   2 | 1
   
   ```sql
   SELECT * FROM default.t1.snapshots
   ```
   
   This adds a new snapshot, but the summary still counts the dangling delete:
   
   committed_at | snapshot_id | parent_id | operation | manifest_list | summary
   -- | -- | -- | -- | -- | --
   2023-04-28 13:51:28.654000 | 1830440247885882398 | None | append | s3://warehouse/default/t1/metadata/snap-1830440247885882398-1-e17ec3b6-01a0-49e8-8be4-fffca9dbdaef.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '3', 'total-position-deletes': '0', 'added-files-size': '897', 'total-delete-files': '0', 'total-files-size': '897', 'total-records': '3', 'total-data-files': '1'}
   2023-04-28 13:51:34.738000 | 1705447835275510977 | 1830440247885882398 | overwrite | s3://warehouse/default/t1/metadata/snap-1705447835275510977-1-44a6bbb7-a1fc-4e6b-bed7-5f8a578a2a11.avro | {'spark.app.id': 'local-1682689536619', 'changed-partition-count': '1', 'added-position-deletes': '1', 'total-equality-deletes': '0', 'total-position-deletes': '1', 'added-position-delete-files': '1', 'added-files-size': '1490', 'total-delete-files': '1', 'added-delete-files': '1', 'total-files-size': '2387', 'total-records': '3', 'total-data-files': '1'}
   2023-04-28 15:06:04.304000 | 1992524767636422274 | 1705447835275510977 | replace | s3://warehouse/default/t1/metadata/snap-1992524767636422274-1-8c696509-2fee-4dd8-ae4c-1729da3bfe17.avro | {'added-data-files': '1', 'total-equality-deletes': '0', 'added-records': '2', 'deleted-data-files': '1', 'deleted-records': '3', 'total-records': '2', 'removed-files-size': '897', 'changed-partition-count': '1', 'total-position-deletes': '1', 'added-files-size': '909', 'total-delete-files': '1', 'total-files-size': '2399', 'total-data-files': '1'}
   2023-04-28 18:01:58.283000 | 5057249889609572796 | 1992524767636422274 | replace | s3://warehouse/default/t1/metadata/snap-5057249889609572796-1-31f5ecc8-a81f-40c6-8a9d-205a52f1c64c.avro | {'changed-partition-count': '0', 'total-equality-deletes': '0', 'manifests-created': '1', 'total-position-deletes': '1', 'total-delete-files': '1', 'manifests-kept': '0', 'manifests-replaced': '2', 'entries-processed': '0', 'total-files-size': '2399', 'total-records': '2', 'total-data-files': '1'}
   
   Also, the files are still incorrect:
   
   ```sql
   SELECT * FROM default.t1.files
   ```
   
   
   
   content | file_path | file_format | spec_id | partition | record_count | file_size_in_bytes | column_sizes | value_counts | null_value_counts | nan_value_counts | lower_bounds | upper_bounds | key_metadata | split_offsets | equality_ids | sort_order_id | readable_metrics
   -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
   0 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2433-3108a2a5-b3f3-4c16-849b-4c54604584f5-00001.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 2 | 909 | {1: 55, 2: 53, 3: 60} | {1: 2, 2: 2, 3: 2} | {1: 0, 2: 0, 3: 0} | {} | {1: bytearray(b'a'), 2: bytearray(b'\x01\x00\x00\x00'), 3: bytearray(b',\x92\x12\xc5e\xfa\x05\x00')} | {1: bytearray(b'c'), 2: bytearray(b'\x03\x00\x00\x00'), 3: bytearray(b'U\x92\x12\xc5e\xfa\x05\x00')} | None | [4] | None | 0 | Row(bar=Row(column_size=53, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=1, upper_bound=3), dt=Row(column_size=60, value_count=2, null_value_count=0, nan_value_count=None, lower_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522796), upper_bound=datetime.datetime(2023, 4, 28, 13, 51, 28, 522837)), foo=Row(column_size=55, value_count=2, null_value_count=0, nan_value_count=None, lower_bound='a', upper_bound='c'))
   1 | s3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2428-3743fe19-d841-4d1c-b940-bd73692b29c7-00001-deletes.parquet | PARQUET | 0 | Row(dt_day=datetime.date(2023, 4, 28)) | 1 | 1490 | {2147483546: 142, 2147483545: 46} | {2147483546: 1, 2147483545: 1} | {2147483546: 0, 2147483545: 0} | {} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | {2147483546: bytearray(b's3://warehouse/default/t1/data/dt_day=2023-04-28/00000-2424-8d75df2c-4639-49a4-82e6-1772365a9355-00001.parquet'), 2147483545: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')} | None | None | None | None | Row(bar=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), dt=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None), foo=Row(column_size=None, value_count=None, null_value_count=None, nan_value_count=None, lower_bound=None, upper_bound=None))
   
   This looks like an inconsistency; can anyone confirm?
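
   As a possible workaround, assuming a later Iceberg release (the `rewrite_position_delete_files` procedure was added after 1.2.1, in 1.3.0), the dangling position deletes can be rewritten away:

   ```sql
   -- Sketch, assuming Iceberg 1.3.0+: rewrites position delete files
   -- and drops deletes that no longer apply to a live data file.
   CALL system.rewrite_position_delete_files(table => 'default.t1');
   ```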
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


Re: [I] Spark: inconsistency in rewrite data and summary [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7463:
URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1797064259

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.




[GitHub] [iceberg] singhpk234 commented on issue #7463: Spark: inconsistency in rewrite data and summary

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on issue #7463:
URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1528386404

   Also, I am still not sure why the docs describe the `.files` table as showing the current data files, when it displays delete files as well:
   
   https://iceberg.apache.org/docs/latest/spark-queries/#files
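
   If the goal is to look at only one kind of file, the metadata tables can be queried separately; assuming the standard Iceberg Spark metadata tables, `data_files` and `delete_files` are the content-filtered views of `files`:

   ```sql
   -- Only data files (content = 0)
   SELECT file_path, record_count FROM default.t1.data_files;

   -- Only delete files (content = 1 or 2)
   SELECT file_path, record_count FROM default.t1.delete_files;
   ```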




[GitHub] [iceberg] szehon-ho commented on issue #7463: Spark: inconsistency in rewrite data and summary

Posted by "szehon-ho (via GitHub)" <gi...@apache.org>.
szehon-ho commented on issue #7463:
URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1528684496

   Yea, my writeup (linked by @singhpk234) explains the problem of not cleaning up dangling deletes (if that is the problem you refer to). Actually, I have been working on #7389 to solve it. Maybe it's also possible to improve rewrite_data_files to do this automatically after that change.
   
   > Also I am not still sure, why the doc mentions .files table to show current data_files but displays delete files as well, imho we should fix the doc or the behaviour
   > 
   > https://iceberg.apache.org/docs/latest/spark-queries/#files
   
   Totally agree; I forgot to document those tables when I added them.
   
   




Re: [I] Spark: inconsistency in rewrite data and summary [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] closed issue #7463: Spark: inconsistency in rewrite data and summary
URL: https://github.com/apache/iceberg/issues/7463




[GitHub] [iceberg] singhpk234 commented on issue #7463: Spark: inconsistency in rewrite data and summary

Posted by "singhpk234 (via GitHub)" <gi...@apache.org>.
singhpk234 commented on issue #7463:
URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1528384960

   As per my understanding, it's because of this, @Fokko:
   - https://github.com/apache/iceberg/issues/4127
   
   RewriteDataFiles runs with 'use-starting-sequence-number'=true (the default!). Example:
   - A partition has a data file with sequence number 1, and an applicable delete file with sequence number 2.
   - RewriteDataFiles is run on this partition.
   - New data files are created with sequence number 2.
   - The now-invalid delete file with sequence number 2 will not be removed.
   
   This is something @szehon-ho's proposal also mentions: https://docs.google.com/document/d/11d-cIUR_89kRsMmWnEoxXGZCvp7L4TUmPJqUC60zB5M/edit
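
   The sequence numbers can be checked directly; assuming the `entries` metadata table (which exposes each manifest entry's `sequence_number`), something like:

   ```sql
   -- Inspect sequence numbers of the data and delete file entries
   SELECT status, sequence_number, data_file.content, data_file.file_path
   FROM default.t1.entries;
   ```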
   




Re: [I] Spark: inconsistency in rewrite data and summary [iceberg]

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #7463:
URL: https://github.com/apache/iceberg/issues/7463#issuecomment-1820009219

   This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.

