Posted to issues@iceberg.apache.org by "Fokko (via GitHub)" <gi...@apache.org> on 2023/02/28 07:33:55 UTC

[GitHub] [iceberg] Fokko commented on issue #6956: Spark: Data file rewriting spark job fails with oom

Fokko commented on issue #6956:
URL: https://github.com/apache/iceberg/issues/6956#issuecomment-1447709562

   Ah, I see, using merge-on-read with Flink makes sense.
   
   > And I have a question: with merge on read mode, in the worst case, does an executor have to read all delete records (in my case maybe all the rows before the whole table delete)?
   
   There is some logic to optimize this, but equality deletes aren't the best choice when it comes to performance, because at some point Flink will write a delete (`id=5`) that has to be applied to all subsequent reads of the older data files, which is quite costly as you might imagine. Of course, this is limited to the partitions that you're reading: deletes belonging to partitions outside the scope of the query will be pruned.
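   
   As a rough illustration of why this gets costly, here is a minimal, hypothetical Python sketch (not Iceberg's actual reader code) of how an equality delete is applied at read time: every data file written before the delete has to check each of its rows against the accumulated delete values.
   
   ```python
   # Hypothetical illustration of merge-on-read with equality deletes.
   # Data files and delete files carry a sequence number; a delete applies
   # to every data file with a lower sequence number, so a backlog of
   # equality deletes means more work for every scan of the older files.
   
   data_files = [
       {"seq": 1, "rows": [{"id": 1, "v": "a"}, {"id": 5, "v": "b"}]},
       {"seq": 3, "rows": [{"id": 5, "v": "b2"}, {"id": 7, "v": "c"}]},
   ]
   # Equality delete written by Flink at sequence number 2: "delete rows with id=5".
   equality_deletes = [{"seq": 2, "key": ("id",), "values": {(5,)}}]
   
   def live_rows(data_file):
       for row in data_file["rows"]:
           # A delete only applies to data files written before it.
           deleted = any(
               d["seq"] > data_file["seq"]
               and tuple(row[c] for c in d["key"]) in d["values"]
               for d in equality_deletes
           )
           if not deleted:
               yield row
   
   for f in data_files:
       print(list(live_rows(f)))
   # File seq 1 loses id=5; file seq 3 keeps its rows because it was written after the delete.
   ```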
   
   What would also work is to periodically compact the table with a Spark job (ideally targeting the partitions that are no longer being written to), so you get rid of the deletes.
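   
   A minimal sketch of such a compaction job, assuming a SparkSession already configured with an Iceberg catalog; `my_catalog`, `db.events`, and the `where` filter are placeholders you'd adapt to your table:
   
   ```python
   # Periodic compaction via the rewrite_data_files Spark procedure.
   # The where filter restricts compaction to partitions that are no longer
   # receiving writes; delete-file-threshold (if supported by your Iceberg
   # version) rewrites any data file with at least one attached delete file.
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()
   
   spark.sql("""
       CALL my_catalog.system.rewrite_data_files(
           table => 'db.events',
           options => map('delete-file-threshold', '1'),
           where => 'event_date <= "2023-02-01"'
       )
   """)
   ```
   
   Scheduling something like this (e.g. from cron or an orchestrator) on partitions that have stopped receiving updates keeps the number of delete files that readers have to merge bounded.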


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

