You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by "yihua (via GitHub)" <gi...@apache.org> on 2023/01/24 23:05:17 UTC

[GitHub] [hudi] yihua commented on a diff in pull request #7379: [HUDI-5323] Support virtual keys in Bloom Index and always write bloom filters to parquet files

yihua commented on code in PR #7379:
URL: https://github.com/apache/hudi/pull/7379#discussion_r1086027954


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java:
##########
@@ -141,11 +143,14 @@ protected HoodieTable(HoodieWriteConfig config, HoodieEngineContext context, Hoo
 
     this.viewManager = FileSystemViewManager.createViewManager(context, config.getMetadataConfig(), config.getViewStorageConfig(), config.getCommonConfig(), () -> metadata);
     this.metaClient = metaClient;
+    this.virtualKeyGeneratorOpt = createVirtualKeyGenerator(config);
     this.index = getIndex(config, context);
     this.storageLayout = getStorageLayout(config);
     this.taskContextSupplier = context.getTaskContextSupplier();
   }
 
+  protected abstract Option<BaseKeyGenerator> createVirtualKeyGenerator(HoodieWriteConfig config);

Review Comment:
   Adding this API so that the virtual key generator is served from either `HoodieTable` or `BaseHoodieWriteClient`, instead of being passed around through the constructor in many places.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java:
##########
@@ -159,10 +161,13 @@ public BaseHoodieWriteClient(HoodieEngineContext context,
                                Option<EmbeddedTimelineService> timelineService,
                                SupportsUpgradeDowngrade upgradeDowngradeHelper) {
     super(context, writeConfig, timelineService);
+    this.virtualKeyGeneratorOpt = createVirtualKeyGenerator(config);
     this.index = createIndex(writeConfig);
     this.upgradeDowngradeHelper = upgradeDowngradeHelper;
   }
 
+  protected abstract Option<BaseKeyGenerator> createVirtualKeyGenerator(HoodieWriteConfig config);

Review Comment:
   Adding this API so that the virtual key generator is served from either `HoodieTable` or `BaseHoodieWriteClient`, instead of being passed around through the constructor in many places.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/simple/HoodieGlobalSimpleIndex.java:
##########
@@ -50,8 +49,8 @@
  * joins with incoming records to find the tagged location.
  */
 public class HoodieGlobalSimpleIndex extends HoodieSimpleIndex {
-  public HoodieGlobalSimpleIndex(HoodieWriteConfig config, Option<BaseKeyGenerator> keyGeneratorOpt) {
-    super(config, keyGeneratorOpt);
+  public HoodieGlobalSimpleIndex(HoodieWriteConfig config) {

Review Comment:
   `keyGeneratorOpt` is removed from many constructors as it can be served from either `HoodieTable` or `BaseHoodieWriteClient` instance.



##########
hudi-common/src/main/java/org/apache/hudi/common/util/ParquetUtils.java:
##########
@@ -120,7 +125,9 @@ private static Set<String> filterParquetRowKeys(Configuration configuration, Pat
       Object obj = reader.read();
       while (obj != null) {
         if (obj instanceof GenericRecord) {
-          String recordKey = ((GenericRecord) obj).get(HoodieRecord.RECORD_KEY_METADATA_FIELD).toString();
+          String recordKey = keyGeneratorOpt.isPresent()
+              ? keyGeneratorOpt.get().getRecordKey((GenericRecord) obj)

Review Comment:
   This fixes how the row keys are generated when virtual keys are enabled.



##########
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/SparkHoodieIndexFactory.java:
##########
@@ -104,13 +100,4 @@ public static boolean isGlobalIndex(HoodieWriteConfig config) {
         return createIndex(config).isGlobal();
     }
   }
-
-  private static Option<BaseKeyGenerator> getKeyGeneratorForSimpleIndex(HoodieWriteConfig config) {

Review Comment:
   All similar logic is moved to `HoodieSparkKeyGeneratorFactory.createVirtualKeyGenerator`



##########
hudi-common/src/main/java/org/apache/hudi/avro/HoodieAvroUtils.java:
##########
@@ -286,6 +287,21 @@ public static Schema getRecordKeySchema() {
     return RECORD_KEY_SCHEMA;
   }
 
+  public static Schema getRecordKeySchema(Option<BaseKeyGenerator> keyGeneratorOpt, Option<SerializableSchema> schemaOpt) {
+    if (keyGeneratorOpt.isPresent()) {
+      List<Schema.Field> schemaFieldList = keyGeneratorOpt.get().getRecordKeyFieldNames().stream()
+          .map(name -> {
+            Schema.Field field = schemaOpt.get().get().getField(name);
+            return new Schema.Field(field.name(), field.schema(), "", field.defaultVal());
+          })
+          .collect(Collectors.toList());
+      Schema recordKeySchema = Schema.createRecord("HoodieRecordKey", "", "", false);

Review Comment:
   @alexeykudinkin This works for root-level fields.  Do you think we also need to handle nested fields separately?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org