You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by "huyuanfeng2018 (via GitHub)" <gi...@apache.org> on 2023/05/04 02:31:24 UTC

[GitHub] [iceberg] huyuanfeng2018 commented on a diff in pull request #7494: Flink: change sink shuffle to use RowData as data type and statistics key type

huyuanfeng2018 commented on code in PR #7494:
URL: https://github.com/apache/iceberg/pull/7494#discussion_r1184480444


##########
flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatistics.java:
##########
@@ -42,12 +43,19 @@
    *
    * @param key generate from data by applying key selector
    */
-  void add(K key);
+  void add(RowData key);

Review Comment:
   Is it necessary to add a method here to count statistical information with values? For example, data under different keys may have different sizes. Is it also a consideration when controlling subsequent balance,like:``` add(Rowdata key, V v)``` Among them, v may represent the record bytes of the row corresponding to the current key. What do you think?



##########
flink/v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java:
##########
@@ -40,50 +40,49 @@
  * shuffle record to improve data clustering while maintaining relative balanced traffic
  * distribution to downstream subtasks.
  */
-class DataStatisticsOperator<T, K> extends AbstractStreamOperator<DataStatisticsOrRecord<T, K>>
-    implements OneInputStreamOperator<T, DataStatisticsOrRecord<T, K>>, OperatorEventHandler {
+class DataStatisticsOperator<D extends DataStatistics<D, S>, S>

Review Comment:
   Ok, I think this DataStatisticsOperator is good for collecting some statistical information. It seems that we also need an Operator to determine the partitionID, such as PartitionIdAssignerOperator and then pass ```org.apache.flink.api.java.functions.IdPartitioner``` From custom data distribution, I haven’t seen the design of this piece in the design document, can you briefly introduce the follow-up implementation



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org