Posted to issues@iceberg.apache.org by "huyuanfeng2018 (via GitHub)" <gi...@apache.org> on 2023/04/24 09:36:40 UTC

[GitHub] [iceberg] huyuanfeng2018 commented on issue #7393: The serialization problem caused by Flink shuffling design

huyuanfeng2018 commented on issue #7393:
URL: https://github.com/apache/iceberg/issues/7393#issuecomment-1519739497

   Hi @stevenzwu @hililiwei,
   Thank you for your reply!
   My scenario is that server logs are written to Iceberg in real time, with a peak data volume of about 1.0M/s.
   <img width="456" alt="image" src="https://user-images.githubusercontent.com/40817998/233956036-25ef3566-3e6a-4d11-b8df-b7b7b171177a.png">
   We currently partition by day, hour, and an enumeration field. There are about 70 enumeration values, and two of them account for more than 70% of the total data, so Iceberg's current write modes cannot meet our requirements. We run 200 parallel writers online and shuffle according to ratios that we define ourselves, where each ratio is the share of the total data volume that falls under one enumeration value. The ratios are specified like this:
   
   'distribution-balance-column-ratio' = 'sysdk_android:0.0005,_wap:0.0003,android_tv:0.003.......'
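
To make the ratio-based shuffle concrete, here is a minimal sketch of how such a ratio string could be turned into writer assignments. It is not the code used in this job or anything in Iceberg: `parse_ratios`, `assign_subtasks`, and `route` are hypothetical names, the keys and ratios in the usage lines are made up, and it assumes each enumeration value gets a contiguous range of subtasks sized by its share of the data.

```python
import math
import random


def parse_ratios(spec):
    """Parse 'key:ratio,key:ratio' into a dict of floats."""
    ratios = {}
    for item in spec.split(","):
        # rpartition splits on the last ':' so keys may contain ':' themselves.
        key, _, value = item.strip().rpartition(":")
        ratios[key] = float(value)
    return ratios


def assign_subtasks(ratios, parallelism):
    """Give each key a contiguous, inclusive range of subtask indexes.

    The range width is proportional to the key's share of the total,
    so hot keys span many writers and cold keys share a single one.
    """
    total = sum(ratios.values())
    ranges = {}
    offset = 0.0
    for key in sorted(ratios):
        share = ratios[key] / total * parallelism
        first = min(int(offset), parallelism - 1)
        last = min(max(first, math.ceil(offset + share) - 1), parallelism - 1)
        ranges[key] = (first, last)
        offset += share
    return ranges


def route(key, ranges, rng=random):
    """Pick one subtask inside the range assigned to this key."""
    first, last = ranges[key]
    return rng.randint(first, last)


# Hypothetical keys and ratios, for illustration only.
example = parse_ratios("hot_a:0.5,hot_b:0.25,cold_c:0.125,cold_d:0.125")
example_ranges = assign_subtasks(example, parallelism=200)
print(example_ranges)
print(route("hot_a", example_ranges))
```

Because the ranges are fixed by the configured ratios, any change in the real distribution leaves some writers overloaded until the configuration is updated.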
   
   However, the proportion of each enumeration value changes during certain time periods, so the data is still skewed at those times, which causes a backlog in my job.
   
   So I tried to implement automatic balancing, but with the same cluster configuration my throughput was about 4 times lower, at roughly 200~300k/s. I have attached the flame graph; most of the processing time is spent serializing the statistics records. If you re-implement the serialization interface for the record, could you give me a sample? I can test it in my scenario to see how much it improves. In addition, if needed, I am happy to help as much as I can.
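
As a rough illustration of where serialization cost might be reduced, the sketch below counts rows per partition key locally and encodes only the key and its count in a fixed binary layout, instead of serializing a statistics record per row. This is an assumption about one possible approach, not the Iceberg implementation and not the serializer requested above; `LocalKeyStatistics`, `encode_counts`, `decode_counts`, and `to_ratios` are hypothetical names.

```python
import struct
from collections import Counter


def encode_counts(counts):
    """Encode {key: count} as: u32 entries, then (u16 key length, key bytes, u64 count)."""
    parts = [struct.pack(">I", len(counts))]
    for key, count in counts.items():
        raw = key.encode("utf-8")
        parts.append(struct.pack(">H", len(raw)))
        parts.append(raw)
        parts.append(struct.pack(">Q", count))
    return b"".join(parts)


def decode_counts(payload):
    """Inverse of encode_counts."""
    (entries,) = struct.unpack_from(">I", payload, 0)
    pos = 4
    counts = {}
    for _ in range(entries):
        (key_len,) = struct.unpack_from(">H", payload, pos)
        pos += 2
        key = payload[pos:pos + key_len].decode("utf-8")
        pos += key_len
        (count,) = struct.unpack_from(">Q", payload, pos)
        pos += 8
        counts[key] = count
    return counts


class LocalKeyStatistics:
    """Hypothetical per-subtask counter that is flushed once per interval."""

    def __init__(self):
        self._counts = Counter()

    def observe(self, partition_key):
        # Per-row work is a single dictionary increment, with no serialization.
        self._counts[partition_key] += 1

    def flush(self):
        """Return one compact payload for the interval and reset the counters."""
        payload = encode_counts(self._counts)
        self._counts.clear()
        return payload


def to_ratios(counts):
    """Turn aggregated counts into per-key shares of the total."""
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()} if total else {}
```

With about 70 enumeration values, each flush carries at most about 70 small entries regardless of how many rows were observed, and the decoded counts can feed `to_ratios` to refresh the shuffle ratios automatically.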


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
