Posted to commits@hudi.apache.org by "alexeykudinkin (via GitHub)" <gi...@apache.org> on 2023/02/13 21:24:06 UTC

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #7362: [HUDI-5315] The record size is dynamically estimated when the table i…

alexeykudinkin commented on code in PR #7362:
URL: https://github.com/apache/hudi/pull/7362#discussion_r1105023710


##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/action/commit/BaseSparkCommitActionExecutor.java:
##########
@@ -418,4 +426,23 @@ public Partitioner getLayoutPartitioner(WorkloadProfile profile, String layoutPa
   protected void runPrecommitValidators(HoodieWriteMetadata<HoodieData<WriteStatus>> writeMetadata) {
     SparkValidatorUtils.runValidators(config, writeMetadata, context, table, instantTime);
   }
+
+  private int dynamicSampleRecordSize(JavaRDD<HoodieRecord<T>> inputRecords) {
+    int dynamicSampleRecordSize = config.getCopyOnWriteRecordSizeEstimate();
+    long inputRecordsCount = inputRecords.count();
+    if (inputRecordsCount == 0) {
+      LOG.warn("inputRecords is empty.");
+      return dynamicSampleRecordSize;
+    }
+    int maxSampleRecordNum = (int) Math.ceil(Math.min(inputRecordsCount * config.getRecordSizeDynamicSamplingRatio(), config.getRecordSizeDynamicSamplingMaxnum()));
+    try {
+      List<HoodieRecord<T>> sampleRecords = inputRecords.takeSample(false, maxSampleRecordNum);
+      dynamicSampleRecordSize = (int) (ObjectSizeCalculator.getObjectSize(sampleRecords) / maxSampleRecordNum);

Review Comment:
   This is not the right way to estimate the object size -- this will give us the size of the Java object as stored in memory (including pointers, etc.).
   
   Instead, we should:
    - Serialize all obtained records
    - Divide the total bytes by the number of records
   
   Note that some record implementations, like HoodieSparkRecord and HoodieAvroRecord, already hold serialized representations, so we don't need to serialize again in those cases and can simply expose an API (like `estimateSerializedSize`) on the HoodieRecord interface to provide it.
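   
   For illustration, a minimal sketch of the serialize-and-divide approach. The class and method names are hypothetical, and plain Java serialization is only a stand-in (it slightly overstates sizes due to stream headers); a real implementation would use each record's native serialized form, e.g. the `estimateSerializedSize` API suggested above:
   
   ```java
   import java.io.ByteArrayOutputStream;
   import java.io.IOException;
   import java.io.ObjectOutputStream;
   import java.io.Serializable;
   import java.util.List;
   
   public final class RecordSizeSampler {
   
     /**
      * Estimates the average per-record size by serializing each sampled record
      * and dividing the total byte count by the number of samples. Falls back
      * to the static estimate if nothing was sampled or serialization fails.
      */
     public static int averageSerializedSize(List<? extends Serializable> samples, int fallbackEstimate) {
       if (samples == null || samples.isEmpty()) {
         return fallbackEstimate;
       }
       long totalBytes = 0L;
       for (Serializable record : samples) {
         try (ByteArrayOutputStream buffer = new ByteArrayOutputStream();
              ObjectOutputStream out = new ObjectOutputStream(buffer)) {
           out.writeObject(record);
           out.flush();
           totalBytes += buffer.size();
         } catch (IOException e) {
           return fallbackEstimate; // serialization failed; fall back conservatively
         }
       }
       return (int) (totalBytes / samples.size());
     }
   }
   ```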



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -560,6 +560,25 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When table is upgraded from pre 0.12 to 0.12, we check for \"default\" partition and fail if found one. "
           + "Users are expected to rewrite the data in those partitions. Enabling this config will bypass this validation");
 
+  public static final ConfigProperty<String> RECORD_SIZE_DYNAMIC_SAMPLING_MAXNUM = ConfigProperty

Review Comment:
   @weimingdiit let's reduce the number of user-facing configurations we provide as part of this feature: IMO we should have just 2 (see the sketch below):
   
    - Whether dynamic sampling is enabled (by default we should switch this to true)
    - Total number of records to be sampled (this should deliberately be static, to avoid variation with the batch size if, say, one day we ingest a 10GB batch and a month later we start ingesting 100GB batches)
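   
   As a rough sketch, the two could look like this, following the `ConfigProperty` builder pattern already used in `HoodieWriteConfig`; the key names and default values below are hypothetical, not final:
   
   ```java
   // Hypothetical keys/defaults, mirroring the ConfigProperty pattern used
   // elsewhere in HoodieWriteConfig (org.apache.hudi.common.config.ConfigProperty).
   public static final ConfigProperty<Boolean> RECORD_SIZE_DYNAMIC_SAMPLING_ENABLE = ConfigProperty
       .key("hoodie.record.size.dynamic.sampling.enable")
       .defaultValue(true)
       .withDocumentation("Whether to dynamically sample incoming records to estimate the "
           + "average record size instead of relying on a static estimate.");
   
   public static final ConfigProperty<Integer> RECORD_SIZE_DYNAMIC_SAMPLING_TOTAL = ConfigProperty
       .key("hoodie.record.size.dynamic.sampling.total")
       .defaultValue(1000)
       .withDocumentation("Total number of records to sample when estimating the average record "
           + "size. Deliberately static so the estimate does not vary with batch size.");
   ```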
   


