Posted to issues@iceberg.apache.org by "zohar-plutoflume (via GitHub)" <gi...@apache.org> on 2023/04/19 17:02:20 UTC

[GitHub] [iceberg] zohar-plutoflume opened a new issue, #7379: Delete command from iceberg table does not delete all the data it should delete.

zohar-plutoflume opened a new issue, #7379:
URL: https://github.com/apache/iceberg/issues/7379

   ### Apache Iceberg version
   
   0.14.1
   
   ### Query engine
   
   EMR
   
   ### Please describe the bug 🐞
   
   We noticed that a delete command that executes successfully does not actually delete all of the matching data.
   An example query would be:
   ```
   delete from table where tenant_id=690
   ```
   We would expect this to delete everything for this tenant, but some records remain.
   When we query the table after the delete:
   ```
   select count(*) from table where tenant_id=690
   ```
   it returns 7 records
   
   Now for the details (EMR 6.9.0, Iceberg 0.14.1, Spark 3.3.0):
   I can't reproduce the issue locally, so unfortunately I can only show what I found in the logs while trying to debug it:
   
   1. job correctly loads the table:
   ```
   2023-04-19T12:32:12,561 INFO iceberg.BaseMetastoreTableOperations: Refreshing table metadata from new version: s3://prod-tessian-platform.com-data-lake/email_check_outbound_priority/metadata/32444-ecaf012a-6ff8-4485-a4a5-3343cbc46e00.metadata.json
   ```
   2. job correctly understands that the column we delete from is a partition column and the operation is a metadata operation only:
   ```
   2023-04-19T12:32:15,732 INFO iceberg.BaseTableScan: Scanning table iceberg.iceberg_db.email_check_outbound_priority snapshot 8530920662702686267 created at 2023-04-19 12:20:25.224 with filter tenant_id = (3-digit-int)
   2023-04-19T12:32:17,625 INFO v2.OptimizeMetadataOnlyDeleteFromIcebergTable$: Optimizing delete expression: EqualTo(tenant_id,690) as metadata delete
   ```
   3. job correctly commits a new iceberg snapshot:
   ```
   2023-04-19T12:32:21,859 INFO iceberg.BaseMetastoreTableOperations: Successfully committed to table iceberg.iceberg_db.email_check_outbound_priority in 456 ms
   2023-04-19T12:32:21,859 INFO iceberg.SnapshotProducer: Committed snapshot 1441441847084407586 (StreamingDelete)
   ```
   4. snapshot is found in the table:
   ```
   2023-04-19 12:32:21.233|1441441847084407586|8530920662702686267|delete|s3://prod-tessian-platform.com-data-lake/email_check_outbound_priority/metadata/snap-1441441847084407586-1-e688a3b4-a062-4464-8b78-47c432cedd69.avro|
   {
     spark.app.id -> application_1681907352232_0001,
     changed-partition-count -> 0,
     total-records -> 28127739,
     total-files-size -> 3851046875,
     total-data-files -> 234706,
     total-delete-files -> 0,
     total-position-deletes -> 0,
     total-equality-deletes -> 0
   }
   ```
   and yet the data is still there:
   ```
   "SELECT COUNT (*) FROM iceberg.iceberg_db.email_check_outbound_priority WHERE tenant_id = 690"
   +--------+
   |count(1)|
   +--------+
   |7       |
   +--------+
   ```
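   
   To see which data files still hold the leftover rows, a query along these lines might help. It is only a sketch: it assumes the `_file` metadata column is exposed by this Iceberg and Spark combination, and the `remaining_rows` alias is just an illustrative name.
   ```sql
   -- Assumption: _file is available as an Iceberg metadata column in this setup;
   -- it holds the path of the data file each row was read from.
   SELECT _file, count(*) AS remaining_rows
   FROM iceberg.iceberg_db.email_check_outbound_priority
   WHERE tenant_id = 690
   GROUP BY _file;
   ```
   The returned paths can then be compared with the files listed for that partition in the table metadata.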
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] zohar-plutoflume commented on issue #7379: Delete command from iceberg table does not delete all the data it should delete.

Posted by "zohar-plutoflume (via GitHub)" <gi...@apache.org>.
zohar-plutoflume commented on issue #7379:
URL: https://github.com/apache/iceberg/issues/7379#issuecomment-1515956542

   Hi @RussellSpitzer, thank you for the quick response.
   As we are running on EMR, it will be a bit hard to test 1.2 (the latest version available there is 1.1).
   Some people in the Slack pointed me to this issue:
   https://github.com/apache/iceberg/issues/6670
   and this possible fix:
   https://github.com/apache/iceberg/commit/d6e770e3491b75fa20c02336db89d269abc05070
   
   I'll take a look at the file metrics as you suggested (although I'm not sure what I'm looking for).
   I'm not sure I mentioned it, but the column we are running the delete on is a partition column.
   
   Thank you again!




[GitHub] [iceberg] zohar-plutoflume commented on issue #7379: Delete command from iceberg table does not delete all the data it should delete.

Posted by "zohar-plutoflume (via GitHub)" <gi...@apache.org>.
zohar-plutoflume commented on issue #7379:
URL: https://github.com/apache/iceberg/issues/7379#issuecomment-1541903846

   thanks for the help, we managed to test it out with 1.2 and it indeed solved the issue we had. 
   




[GitHub] [iceberg] RussellSpitzer commented on issue #7379: Delete command from iceberg table does not delete all the data it should delete.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #7379:
URL: https://github.com/apache/iceberg/issues/7379#issuecomment-1515142051

   I would check whether the error reproduces on Iceberg 1.2.0. I know we saw some similar behavior when dealing with NaNs and some other edge-case values in the file metrics. To debug on this version, I would check the file metrics of all the files whose partition matches your clause.
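   
   One way to pull those metrics is through the table's `files` metadata table. The query below is a sketch rather than a definitive recipe: it assumes `tenant_id` is an identity partition, so the partition struct carries a field of that name, and it reuses the table name from the logs in this issue.
   ```sql
   -- Assumption: identity partition on tenant_id, so `partition`.tenant_id exists;
   -- adjust the field name if the table uses a different partition transform.
   SELECT
     file_path,
     `partition`,
     record_count,
     null_value_counts,
     nan_value_counts,
     lower_bounds,
     upper_bounds
   FROM iceberg.iceberg_db.email_check_outbound_priority.files
   WHERE `partition`.tenant_id = 690;
   ```
   Files whose counts or bounds look unusual (for example, non-empty `nan_value_counts` or missing bounds) would be the first candidates to inspect.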
   
   




[GitHub] [iceberg] RussellSpitzer commented on issue #7379: Delete command from iceberg table does not delete all the data it should delete.

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on issue #7379:
URL: https://github.com/apache/iceberg/issues/7379#issuecomment-1516052664

   That could definitely be it.




[GitHub] [iceberg] zohar-plutoflume closed issue #7379: Delete command from iceberg table does not delete all the data it should delete.

Posted by "zohar-plutoflume (via GitHub)" <gi...@apache.org>.
zohar-plutoflume closed issue #7379: Delete command from iceberg table does not delete all the data it should delete.
URL: https://github.com/apache/iceberg/issues/7379

