Posted to issues@iceberg.apache.org by "abmo-x (via GitHub)" <gi...@apache.org> on 2023/02/13 06:47:06 UTC

[GitHub] [iceberg] abmo-x opened a new issue, #6817: add_files data corruption - old partitions get lost when new partition is added

abmo-x opened a new issue, #6817:
URL: https://github.com/apache/iceberg/issues/6817

   ### Apache Iceberg version
   
   1.1.0 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   ### Problem:
   When a Spark application adds data files with the `add_files` Spark procedure, the partition that was added earlier by a previous `add_files` call is lost.
   
   ```
   spark.sql("select * from testpartitionoverwrite2.snapshots").take(100)
   
   [Row(committed_at=datetime.datetime(2023, 2, 12, 2, 19, 3, 467000), snapshot_id=5981366128498817719, parent_id=None, operation='append', manifest_list='s3a://..../testpartitionoverwrite2/metadata/snap-5981366128498817719-1-f64207fd-5e7d-49ef-959e-50edd08806d9.avro', summary={'added-data-files': '20', 'total-equality-deletes': '0', 'added-records': '12105994', 'total-position-deletes': '0', 'total-delete-files': '0', 'total-files-size': '0', 'total-records': '12105994', 'total-data-files': '20'}),
    Row(committed_at=datetime.datetime(2023, 2, 12, 18, 25, 44, 660000), snapshot_id=7331297376463770912, parent_id=5981366128498817719, operation='append', manifest_list='s3a://aiml-prod-data-warehouse-default/warehouse/cdotest.db/testpartitionoverwrite2/metadata/snap-7331297376463770912-1-f3151f52-55ff-4f66-80ee-2e019544de94.avro', summary={'added-data-files': '20', 'total-equality-deletes': '0', 'added-records': '11737162', 'total-position-deletes': '0', 'total-delete-files': '0', 'total-files-size': '0', 'total-records': '23843156', 'total-data-files': '40'})]
   ```
   
   After the two append snapshots, the table should contain the number of records reported by the latest snapshot: **'total-records': '23843156'** (12105994 + 11737162).
   
   However, only the records added in the newest snapshot are readable. A count of all records in the table returns **11737162**, which is incorrect.
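   The arithmetic behind that expectation can be checked directly from the snapshot summaries. The following is a minimal sketch, not part of the original report: the `snapshots` list is hand-copied from the output above (only the fields used here), and `expected_total` is a hypothetical helper name.

   ```python
   # Snapshot summaries hand-copied from the output above (only the fields used here).
   snapshots = [
       {"snapshot_id": 5981366128498817719,
        "added-records": "12105994", "total-records": "12105994"},
       {"snapshot_id": 7331297376463770912,
        "added-records": "11737162", "total-records": "23843156"},
   ]


   def expected_total(snaps):
       """Sum added-records over a chain of append-only snapshots (hypothetical helper)."""
       return sum(int(s["added-records"]) for s in snaps)


   observed_count = 11737162  # row count reported in the issue
   expected = expected_total(snapshots)
   missing = expected - observed_count

   print("expected:", expected)
   print("observed:", observed_count)
   print("missing:", missing)
   ```

   The number of missing rows equals the `added-records` value of the first snapshot, which is consistent with the first `add_files` call's data no longer being visible.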
   
   <img width="623" alt="image" src="https://user-images.githubusercontent.com/69539469/218331278-f1f17a4b-7f7d-4775-b73e-1252220db575.png">
   
   ### What to expect
   All partitions added using `add_files`, both earlier and latest, should still be available, and the total number of records in the table should match `total-records` in the latest snapshot.
   
   ### Steps to reproduce
   
   1) Create a table `test`.
   2) Use `add_files` to add `a` data files with partition `x`.
   3) Verify that partition `x` is present in `test.partitions`.
   4) Use `add_files` to add `b` data files with partition `y`.
   5) Observe that `test.partitions` now contains only partition `y`; partition `x` is missing.
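   Steps 2 to 5 could be scripted roughly as follows. This is a sketch under assumptions, not the reporter's actual code: the catalog name, table identifier, and source paths are placeholders supplied by the caller, and `add_files_sql` and `reproduce` are hypothetical helper names. It assumes the table from step 1 already exists and that the source files are Parquet.

   ```python
   def add_files_sql(catalog, table, source_path):
       """Build the add_files procedure call for a directory of Parquet files.

       All three arguments are placeholders chosen by the caller.
       """
       return (
           f"CALL {catalog}.system.add_files("
           f"table => '{table}', "
           f"source_table => '`parquet`.`{source_path}`')"
       )


   def reproduce(spark, catalog, table, path_x, path_y):
       """Run steps 2-5 and return the partitions rows seen after each add_files call."""
       partitions_query = f"SELECT * FROM {catalog}.{table}.partitions"

       spark.sql(add_files_sql(catalog, table, path_x))   # step 2
       after_x = spark.sql(partitions_query).collect()    # step 3

       spark.sql(add_files_sql(catalog, table, path_y))   # step 4
       after_y = spark.sql(partitions_query).collect()    # step 5

       return after_x, after_y
   ```

   On a healthy table, `after_y` would list both partitions. The renamed issue title suggests the loss occurs when `snapshot-id-inheritance.enabled=true` is set on the table, so a reproduction attempt would presumably need that setting in place when the table is created.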
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] abmo-x closed issue #6817: Spark: add_files loses data when snapshot-id-inheritance.enabled=true

Posted by "abmo-x (via GitHub)" <gi...@apache.org>.
abmo-x closed issue #6817: Spark: add_files loses data when snapshot-id-inheritance.enabled=true
URL: https://github.com/apache/iceberg/issues/6817




[GitHub] [iceberg] dramaticlly commented on issue #6817: add_files data corruption - old partitions get lost when new partition is added

Posted by "dramaticlly (via GitHub)" <gi...@apache.org>.
dramaticlly commented on issue #6817:
URL: https://github.com/apache/iceberg/issues/6817#issuecomment-1427476988

   That's a good finding!




[GitHub] [iceberg] github-actions[bot] commented on issue #6817: Spark: add_files loses data when snapshot-id-inheritance.enabled=true

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on issue #6817:
URL: https://github.com/apache/iceberg/issues/6817#issuecomment-1678253650

   This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

