Posted to issues@iceberg.apache.org by "yegangy0718 (via GitHub)" <gi...@apache.org> on 2023/04/24 05:09:32 UTC

[GitHub] [iceberg] yegangy0718 commented on issue #7393: The serialization problem caused by Flink shuffling design

yegangy0718 commented on issue #7393:
URL: https://github.com/apache/iceberg/issues/7393#issuecomment-1519392273

Hi @huyuanfeng2018, thanks for showing interest in the project.

We do plan to add a custom serializer for `DataStatisticsOrRecord`, as @stevenzwu commented at https://github.com/apache/iceberg/pull/7269#discussion_r1157718810.

We have run a perf test with the internal PoC implementation. The results were published at https://www.slideshare.net/FlinkForward/tame-the-small-files-problem-and-optimize-data-layout-for-streaming-ingestion-to-iceberg, from slide 44 to the end. We observed CPU usage increase from 35% to 57% for the simplest streaming job (one that consumes from Kafka and writes to Iceberg) after applying shuffling. This is expected, since we trade more CPU usage for better file sizes and data clustering.

To analyze the perf impact you are seeing, we may need more information about the test cases you ran, such as the Flink DAG structure, the data distribution, and so on.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
For additional commands, e-mail: issues-help@iceberg.apache.org