Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/10 06:43:36 UTC

[GitHub] [iceberg] SHuixo commented on issue #6104: Rewrite iceberg small files with flink succeeds but no snapshot is generated (V2 - upsert model)

SHuixo commented on issue #6104:
URL: https://github.com/apache/iceberg/issues/6104#issuecomment-1309843130

   Thank you @luoyuxia for your reply.
   
   I tried a few more times and found that when the accumulated volume of small Iceberg data files was relatively small, the compaction (small-file rewrite) job on Flink 1.13.5 ran normally and generated snapshot files.
   
   However, once a large number of files has accumulated, each rewrite takes a long time and easily fails with **OOM** exceptions. Below are my attempts with **Flink 1.13.5 / 1.15.2, iceberg 1.14.1** and the logs generated by the task.
   
   > The following figure shows that the compaction task stays in the Map stage:
   <img width="872" alt="dag-13-1" src="https://user-images.githubusercontent.com/20868410/201018299-e64b3a02-3ff2-4d49-b1cc-e7bdf703f3aa.PNG">
   
   > OOM exception logs produced while the compaction task runs:
   
   **flink 1.13.5:**
   [error-flink-1.13.5.log](https://github.com/apache/iceberg/files/9978008/error-flink-1.13.5.log)
   
   
   **flink 1.15.2:**
   [error-flink-1.15.2.log](https://github.com/apache/iceberg/files/9978009/error-flink-1.15.2.log)
   
   
   My question: if data is written to Iceberg continuously, is an OOM during compaction inevitable, and will the compaction time keep growing?
   
   I see that the source code has the API methods **appendsBetween() / appendsAfter()**, which appear to be related to incremental compaction. Does this mean that incremental compaction could replace the repeated compaction of the full data set in the future?
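
To make the question concrete, here is a toy model of the difference between full and incremental compaction. It is only a sketch and does not use the Iceberg API: `Snapshot`, `files_appended_after`, and `bytes_to_rewrite` are hypothetical names, file sizes are made-up numbers, and delete files (which a V2 upsert table also produces) are ignored. The helper mimics the idea suggested by the name `appendsAfter()`: select only the files added by snapshots after a given snapshot id.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Snapshot:
    """Hypothetical stand-in for a table snapshot (not an Iceberg class)."""
    snapshot_id: int
    appended_files: Dict[str, int]  # file name -> size in bytes


def files_appended_after(
    snapshots: List[Snapshot], from_snapshot_id: Optional[int]
) -> Dict[str, int]:
    """Return files added by snapshots strictly after from_snapshot_id.

    Passing None selects every file, which models a full compaction.
    """
    if from_snapshot_id is not None and all(
        s.snapshot_id != from_snapshot_id for s in snapshots
    ):
        raise ValueError(f"unknown snapshot id: {from_snapshot_id}")

    selected: Dict[str, int] = {}
    started = from_snapshot_id is None
    for snap in snapshots:  # snapshots are assumed to be in commit order
        if started:
            selected.update(snap.appended_files)
        elif snap.snapshot_id == from_snapshot_id:
            started = True
    return selected


def bytes_to_rewrite(
    snapshots: List[Snapshot], last_compacted_id: Optional[int]
) -> int:
    """Total size of the files a compaction run would have to read."""
    return sum(files_appended_after(snapshots, last_compacted_id).values())


# Made-up history: three commits of small files.
snapshots = [
    Snapshot(1, {"a.parquet": 10, "b.parquet": 20}),
    Snapshot(2, {"c.parquet": 30}),
    Snapshot(3, {"d.parquet": 5, "e.parquet": 15}),
]

full = bytes_to_rewrite(snapshots, None)      # every file: 80 bytes
incremental = bytes_to_rewrite(snapshots, 2)  # only snapshot 3: 20 bytes
```

In this model the full run reads all 80 bytes every time, while the incremental run reads only the 20 bytes appended since snapshot 2, so its cost depends on the new data rather than on the total table size. Whether the Iceberg rewrite action can work this way is exactly what I am asking.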
   
   Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

