You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/28 22:26:15 UTC

[GitHub] [hudi] xushiyan commented on a diff in pull request #5629: [HUDI-3384][HUDI-3385] Spark specific file reader/writer.

xushiyan commented on code in PR #5629:
URL: https://github.com/apache/hudi/pull/5629#discussion_r909038899


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -142,6 +146,11 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Easily configure one the built-in key generators, instead of specifying the key generator class."
           + "Currently supports SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
 
+  public static final ConfigProperty<String> RECORD_TYPE = ConfigProperty
+      .key("hoodie.datasource.write.record.type")
+      .defaultValue(HoodieRecordType.AVRO.toString())
+      .withDocumentation("test");

Review Comment:
   reminder to fix the doc



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/commit/HoodieDeleteHelper.java:
##########
@@ -84,8 +87,19 @@ public HoodieWriteMetadata<HoodieData<WriteStatus>> execute(String instantTime,
         dedupedKeys = keys.repartition(parallelism);
       }
 
-      HoodieData<HoodieRecord<T>> dedupedRecords =
-          dedupedKeys.map(key -> new HoodieAvroRecord(key, new EmptyHoodieRecordPayload()));
+      HoodieData dedupedRecords;
+      if (config.getRecordType() == HoodieRecordType.AVRO) {
+        dedupedRecords =
+            dedupedKeys.map(key -> new HoodieAvroRecord(key, new EmptyHoodieRecordPayload()));
+      } else if (config.getRecordType() == HoodieRecordType.SPARK) {
+        dedupedRecords = dedupedKeys.map(key -> {
+          Class<?> recordClazz = ReflectionUtils.getClass("org.apache.hudi.commmon.model.HoodieSparkRecord");
+          Method method = recordClazz.getMethod("empty", HoodieKey.class);
+          return method.invoke(null, key);

Review Comment:
   this reflection API call is gonna impact perf to quite some extent



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndexUtils.java:
##########
@@ -132,7 +133,8 @@ public static List<String> filterKeysFromFile(Path filePath, List<String> candid
       // Load all rowKeys from the file, to double-confirm
       if (!candidateRecordKeys.isEmpty()) {
         HoodieTimer timer = new HoodieTimer().startTimer();
-        HoodieAvroFileReader fileReader = HoodieFileReaderFactory.getFileReader(configuration, filePath);
+        // Just get row keys

Review Comment:
   this comment looks redundant



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -142,6 +146,11 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Easily configure one the built-in key generators, instead of specifying the key generator class."
           + "Currently supports SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
 
+  public static final ConfigProperty<String> RECORD_TYPE = ConfigProperty
+      .key("hoodie.datasource.write.record.type")
+      .defaultValue(HoodieRecordType.AVRO.toString())
+      .withDocumentation("test");

Review Comment:
   state valid values for this config?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -977,6 +988,16 @@ public String getKeyGeneratorClass() {
     return getString(KEYGENERATOR_CLASS_NAME);
   }
 
+  public HoodieRecord.HoodieRecordType getRecordType() {
+    HoodieRecordType recordType = HoodieRecord.HoodieRecordType.valueOf(getString(RECORD_TYPE));
+    String basePath = getString(BASE_PATH);
+    boolean metadataTable = HoodieTableMetadata.isMetadataTable(basePath);
+    if (metadataTable) {
+      recordType = HoodieRecordType.AVRO;
+    }
+    return recordType;

Review Comment:
   looks like this `getRecordType()` is invoked everywhere throughout the write path. we should optimize the perf here somehow. for e.g., the ENUM instantiation , metadata table check. since this info is not gonna change throughout a write operation, we should consider cache it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org