You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/01/11 02:05:15 UTC

[GitHub] [hudi] vinothchandar commented on a change in pull request #4449: [HUDI-2763] Metadata table records - support for key deduplication based on hardcoded key field

vinothchandar commented on a change in pull request #4449:
URL: https://github.com/apache/hudi/pull/4449#discussion_r781704816



##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##########
@@ -83,6 +84,11 @@
       .withDocumentation("Lower values increase the size of metadata tracked within HFile, but can offer potentially "
           + "faster lookup times.");
 
+  public static final ConfigProperty<String> HFILE_SCHEMA_KEY_FIELD_ID = ConfigProperty
+      .key("hoodie.hfile.schema.key.field.id")

Review comment:
       rename: hoodie.hfile.key.field.name (drop id, since its just a name in reality) 

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##########
@@ -122,9 +118,9 @@ public HoodieLogBlockType getBlockType() {
       if (useIntegerKey) {
         recordKey = String.format("%" + keySize + "s", key++);
       } else {
-        recordKey = record.get(keyField.pos()).toString();
+        recordKey = record.get(schemaKeyField.pos()).toString();

Review comment:
       why not just look up by name? Don't understand this need for positional lookups to begin with?

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##########
@@ -83,6 +84,11 @@
       .withDocumentation("Lower values increase the size of metadata tracked within HFile, but can offer potentially "
           + "faster lookup times.");
 
+  public static final ConfigProperty<String> HFILE_SCHEMA_KEY_FIELD_ID = ConfigProperty
+      .key("hoodie.hfile.schema.key.field.id")
+      .defaultValue(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY)

Review comment:
       rename consistently everywhere else

##########
File path: hudi-common/src/main/java/org/apache/hudi/common/table/log/block/HoodieHFileDataBlock.java
##########
@@ -110,8 +106,8 @@ public HoodieLogBlockType getBlockType() {
     boolean useIntegerKey = false;
     int key = 0;
     int keySize = 0;
-    Field keyField = records.get(0).getSchema().getField(this.keyField);
-    if (keyField == null) {
+    final Field schemaKeyField = records.get(0).getSchema().getField(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY);

Review comment:
       should nt this use the value passed in from storage config? We cannot have this hardcoded here

##########
File path: hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieStorageConfig.java
##########
@@ -83,6 +84,11 @@
       .withDocumentation("Lower values increase the size of metadata tracked within HFile, but can offer potentially "
           + "faster lookup times.");
 
+  public static final ConfigProperty<String> HFILE_SCHEMA_KEY_FIELD_ID = ConfigProperty
+      .key("hoodie.hfile.schema.key.field.id")
+      .defaultValue(HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY)
+      .withDocumentation("Key field name for the HFile schema. This key field is used for on-disk storage optimization");

Review comment:
       explain what purpose this serves i.e reduce storage overhead in not storing keys with values.

##########
File path: hudi-common/src/main/java/org/apache/hudi/io/storage/HoodieHFileReader.java
##########
@@ -62,6 +64,7 @@
   // Scanner used to read individual keys. This is cached to prevent the overhead of opening the scanner for each
   // key retrieval.
   private HFileScanner keyScanner;
+  private final String keyField = HoodieMetadataPayload.SCHEMA_FIELD_ID_KEY;

Review comment:
       We could even stick this into the config object. But cannot do this hardcoding or referencing of metadata payload here.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org