Posted to commits@hudi.apache.org by "vinothchandar (via GitHub)" <gi...@apache.org> on 2023/03/02 01:25:57 UTC

[GitHub] [hudi] vinothchandar commented on a diff in pull request #7912: [HUDI-5739] Review Hudi Config Defaults

vinothchandar commented on code in PR #7912:
URL: https://github.com/apache/hudi/pull/7912#discussion_r1122450879


##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -40,10 +40,13 @@
         + " between writers to a Hudi table. Concurrency between Hudi's own table services "
         + " are auto managed internally.")
 public class DynamoDbBasedLockConfig extends HoodieConfig {
+  // Config for DynamoDB based lock provider
 
   // configs for DynamoDb based locks
   public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX + "dynamodb.";
 
+  // Ethan: Can we provide a default value here that is less likely to collide,
+  // So this config also becomes optional?

Review Comment:
   the lock provider should use sth like `hudi_locks` as the default table name? then the user can override if needed.
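   For illustration, a minimal sketch of what that could look like with the existing ConfigProperty builder (the key suffix and the `hudi_locks` default are the suggestion above, not current behavior):
   
       public static final ConfigProperty<String> DYNAMODB_LOCK_TABLE_NAME = ConfigProperty
           .key(DYNAMODB_BASED_LOCK_PROPERTY_PREFIX + "table")
           .defaultValue("hudi_locks")  // suggested shared default, so the config becomes optional
           .sinceVersion("0.10.0")
           .withDocumentation("For DynamoDB based lock provider, the name of the DynamoDB table used for locks;"
               + " override if the shared default could collide with another table.");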



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode");
 
+  // Ethan: 10 min.  Is this timeout too long?

Review Comment:
   shorten to 2?
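   A sketch of that change, keeping the millisecond encoding the property uses (2 minutes is the suggestion here, not a settled value):
   
       .defaultValue(String.valueOf(2 * 60 * 1000))  // 120,000 ms = 2 min, down from 10 * 60 * 1000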



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode");
 
+  // Ethan: 10 min.  Is this timeout too long?
   public static final ConfigProperty<String> DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT = ConfigProperty
       .key(DYNAMODB_BASED_LOCK_PROPERTY_PREFIX + "table_creation_timeout")
       .defaultValue(String.valueOf(10 * 60 * 1000))
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table");
 
+  // Ethan: Move the infer logic here

Review Comment:
   +1 for sane default



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?
   public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
       .key("hoodie.archive.automatic")
       .defaultValue("true")
       .withDocumentation("When enabled, the archival table service is invoked immediately after each commit,"
           + " to archive commits if we cross a maximum value of commits."
           + " It's recommended to enable this, to ensure number of active commits is bounded.");
 
+  // Ethan: Is Async archive tested?

Review Comment:
   We should ensure this is safe.. for now advanced config



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size grows."
           + " This config controls the maximum number of instants to retain in the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove this?
   public static final ConfigProperty<Integer> DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.archive.delete.parallelism")
       .defaultValue(100)
       .withDocumentation("Parallelism for deleting archived hoodie commits.");
 
+  // Ethan: pick one of "hoodie.keep.min.commits" and "hoodie.keep.max.commits" only?

Review Comment:
   min/max was added to batch up the archived files. we should leave this alone IMO.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?

Review Comment:
   should this become an unsafe/advanced config instead?
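   For example, assuming a marker like `markAdvanced()` on the builder (a hypothetical mechanism; the point is to hide it from the common-config docs while keeping it overridable):
   
       public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
           .key("hoodie.archive.automatic")
           .defaultValue("true")
           .markAdvanced()  // assumed marker: excluded from common-config docs, still settable
           .withDocumentation("When enabled, the archival table service is invoked immediately after each commit ...");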



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -47,6 +47,7 @@
     description = "Configurations that control the clustering table service in hudi, "
         + "which optimizes the storage layout for better query performance by sorting and sizing data files.")
 public class HoodieClusteringConfig extends HoodieConfig {
+  // Configs for Clustering operations

Review Comment:
   revisit all docs here, and make it very user friendly



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -102,6 +111,7 @@ public class HoodieCleanConfig extends HoodieConfig {
           + " in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full"
           + " table for each planning (even with a metadata table).");
 
+  // Ethan: this should be an internal config, based on concurrency control mode.  Should not be tweaked by user.

Review Comment:
   make this advanced and move on.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config to indicate the purpose:

Review Comment:
   we should simplify this. esp the default. IIRC we only cluster within the group? so picking group size is important?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be straightened up to be more concise
+  // Sth more intuitive like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other
+  // hoodie.compact.schedule.inline
+  // hoodie.compact.inline
+  // hoodie.datasource.compaction.async.enable
+  // // these are reused for inline/async
+  // hoodie.compact.inline.trigger.strategy
+  // hoodie.compact.inline.max.delta.seconds
+  // hoodie.compact.inline.max.delta.commits
   public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
       .key("hoodie.compact.inline")
       .defaultValue("false")

Review Comment:
   why is default false? surprised me.
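   To make the consolidation sketched in the hunk above concrete, one of the proposed knobs could look like this (key, values, and default are all hypothetical):
   
       public static final ConfigProperty<String> COMPACT_EXECUTE_MODE = ConfigProperty
           .key("hoodie.compact.execute.mode")  // hypothetical consolidated key
           .defaultValue("inline")              // hypothetical; picking a real default here would answer the surprise above
           .withDocumentation("How a scheduled compaction plan is executed: inline (within the writer),"
               + " async (by a separate process), or disabled.");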



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size grows."
           + " This config controls the maximum number of instants to retain in the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove this?

Review Comment:
   idk if there is a way to do it more intelligently, workload-based, i.e. based on the number of instants deleted or sth?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -132,13 +151,15 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "compaction during each compaction run. By default. Hudi picks the log file "
           + "with most accumulated unmerged data");
 
+  // Ethan: this seems to be internal
   public static final ConfigProperty<String> COMPACTION_LAZY_BLOCK_READ_ENABLE = ConfigProperty
       .key("hoodie.compaction.lazy.block.read")
       .defaultValue("true")
       .withDocumentation("When merging the delta log files, this config helps to choose whether the log blocks "
           + "should be read lazily or not. Choose true to use lazy block reading (low memory usage, but incurs seeks to each block"
           + " header) or false for immediate block read (higher memory usage)");
 
+  // Ethan: this seems to be internal

Review Comment:
   who uses this code path?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -175,6 +198,9 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + " record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely"
           + " cumbersome.");
 
+  // Ethan: Should this be removed?  This should only be used by the first batch.  For first batch,

Review Comment:
   don't we have this in write config as well? I also don't like that this is COW specific



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -78,6 +81,7 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .withDocumentation("For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service."
                          + " Would try to first get it from AWS_REGION environment variable. If not find, by default use us-east-1");
 
+  // Ethan: Need to confirm these are good defaults for cost

Review Comment:
   I think this is fine.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
       .sinceVersion("0.6.0")
       .withDocumentation("Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORD");
 
+  // Ethan: can this be aligned with key gen for write (no need for a separate bootstrap key gen config)?

Review Comment:
   +1 sane defaults



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -99,6 +103,7 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("Maximum number of times to retry to acquire lock additionally from the lock manager.");
 
+  // Ethan: is this too large? what about 30s?

Review Comment:
   idk



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -111,11 +116,12 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("For DFS based lock providers, path to store the locks under. use Table's meta path as default");
 
+  // Ethan: should the default also allow expiration? Need to dig into the code to understand what this controls.
   public static final ConfigProperty<Integer> FILESYSTEM_LOCK_EXPIRE = ConfigProperty
-        .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
-        .defaultValue(0)
-        .sinceVersion("0.12.0")
-        .withDocumentation("For DFS based lock providers, expire time in minutes, must be a nonnegative number, default means no expire");
+      .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
+      .defaultValue(0)
+      .sinceVersion("0.12.0")
+      .withDocumentation("For DFS based lock providers, expire time in minutes, must be a nonnegative number, default means no expire");

Review Comment:
   DFS lock providers, really?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -139,6 +141,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Field used in preCombining before actual write. When two records have the same key value, "
           + "we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)");
 
+  // Ethan: regarding the namespace, should we use "hoodie.write." instead of "hoodie.datasource.write." as the prefix?

Review Comment:
   payload should be common? let's see how the new merger stuff is named and fix there? idk how painful this migration will be. But agree on `hoodie.write.`



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
     description = "Payload related configs, that can be leveraged to "
         + "control merges based on specific business fields in the data.")
 public class HoodiePayloadConfig extends HoodieConfig {
+  // Configs for Hudi payload class
 
+  // Ethan: shall we use preCombine ("hoodie.datasource.write.precombine.field") only and remove this?
+  // It acts as the same purpose as the precombine field.  This confuses user too.

Review Comment:
   what about callers of RDDClient, e.g. DeltaStreamer?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -46,6 +46,7 @@
         + "The bootstrap operation can flexibly avoid copying data over before you can use Hudi and support running the existing "
         + " writers and new hudi writers in parallel, to validate the migration.")
 public class HoodieBootstrapConfig extends HoodieConfig {
+  // Configs for Bootstrap feature

Review Comment:
   a lot of them may overlap with write configs. good low hanging fruits here.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -62,11 +64,14 @@ public class HoodieMemoryConfig extends HoodieConfig {
   // Minimum memory size (100MB) for the spillable map.
   public static final long DEFAULT_MIN_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES = 100 * 1024 * 1024L;
 
+  // Ethan: Should this be automatically configured as a fraction of memory

Review Comment:
   leave alone?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -84,11 +91,13 @@ public class HoodieCleanConfig extends HoodieConfig {
       .withDocumentation("Controls how cleaning is scheduled. Valid options: "
           + Arrays.stream(CleaningTriggerStrategy.values()).map(Enum::name).collect(Collectors.joining(",")));
 
+  // Ethan: "hoodie.clean.trigger.max.commits" is what this means

Review Comment:
   +1 again on aligning with how we talk about scheduling and executing table services.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
           + "failed writes to re-claim space. Choose to perform this rollback of failed writes eagerly before "
           + "every writer starts (only supported for single writer) or lazily by the cleaner (required for multi-writers)");
 
+  // Ethan: same here.  Can we reuse Spark parallelism or derive automatically?
   public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.cleaner.parallelism")
       .defaultValue("200")
       .withDocumentation("Parallelism for the cleaning operation. Increase this if cleaning becomes slow.");
 
+  // Ethan: wondering if this is still needed.
   public static final ConfigProperty<Boolean> ALLOW_MULTIPLE_CLEANS = ConfigProperty
       .key("hoodie.clean.allow.multiple")
       .defaultValue(true)
       .sinceVersion("0.11.0")
       .withDocumentation("Allows scheduling/executing multiple cleans by enabling this config. If users prefer to strictly ensure clean requests should be mutually exclusive, "
           + ".i.e. a 2nd clean will not be scheduled if another clean is not yet completed to avoid repeat cleaning of same files, they might want to disable this config.");
 
+  // Ethan: should the default behavior be deleting original parquet files and then we can remove this config?

Review Comment:
   we should ask Udit why this is the default.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -41,10 +42,12 @@
  */
 @Immutable
 @ConfigClassProperty(name = "Clean Configs",
-        groupName = ConfigGroups.Names.WRITE_CLIENT,
-        description = "Cleaning (reclamation of older/unused file groups/slices).")
+    groupName = ConfigGroups.Names.WRITE_CLIENT,
+    description = "Cleaning (reclamation of older/unused file groups/slices).")
 public class HoodieCleanConfig extends HoodieConfig {
+  // Configs for Clean action
 
+  // Ethan: should we standardize on `.enable` instead of `.automatic`?

Review Comment:
   +1 as long as it's backwards compatible.
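   The alternatives mechanism already used elsewhere in this PR could carry the rename (the `hoodie.clean.enable` key is hypothetical; the old key stays honored):
   
       public static final ConfigProperty<String> AUTO_CLEAN = ConfigProperty
           .key("hoodie.clean.enable")                  // hypothetical new key
           .defaultValue("true")
           .withAlternatives("hoodie.clean.automatic")  // existing key still read for backwards compatibility
           .withDocumentation("When enabled, the cleaner table service is invoked immediately after each commit.");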



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -131,18 +134,21 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Turn on inline clustering - clustering will be run after each write operation is complete")
       .withAlternatives("hoodie.datasource.clustering.inline.enable");
 
+  // Ethan: this can apply to both inline or aysnc clustering.  Rename it for better readability.

Review Comment:
   +1



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -58,12 +61,16 @@ public class HoodieCleanConfig extends HoodieConfig {
       .withDocumentation("Only applies when " + AUTO_CLEAN.key() + " is turned on. "
           + "When turned on runs cleaner async with writing, which can speed up overall write performance.");
 
+  // Ethan: when either "hoodie.cleaner.commits.retained", "hoodie.cleaner.hours.retained",

Review Comment:
   so, we infer the cleaning policy and throw errors if multiple of them are set?
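   Roughly the inference being discussed, as a sketch (the property keys are the real retention configs; the validation itself is the proposal):
   
       int retentionConfigsSet = 0;
       for (String key : new String[] {"hoodie.cleaner.commits.retained",
           "hoodie.cleaner.hours.retained", "hoodie.cleaner.fileversions.retained"}) {
         if (props.containsKey(key)) {
           retentionConfigsSet++;
         }
       }
       if (retentionConfigsSet > 1) {
         throw new IllegalArgumentException("Only one cleaner retention config may be set");
       }
       // otherwise infer hoodie.cleaner.policy from whichever retention config is present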



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -53,12 +56,13 @@ public class HoodiePayloadConfig extends HoodieConfig {
       .withDocumentation("Table column/field name to derive timestamp associated with the records. This can"
           + "be useful for e.g, determining the freshness of the table.");
 
+  // Ethan: remove this and align with the write payload class ("hoodie.datasource.write.payload.class")?

Review Comment:
   yes, we should infer from table props?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "**Note** This is being actively worked on. Please use "
           + "`hoodie.datasource.write.keygenerator.class` instead.");
 
+  // Ethan: look like we can remove this as it is mature?

Review Comment:
   same for mdt as well?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before"
           + " writing to storage.");
 
+  // Ethan: should we have one "hoodie.combine.before.write" for all change operations instead of one config per operation?

Review Comment:
   write will include insert?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics),"
           + "running on each writer's driver process, accepting requests during the write from executors.");
 
+  // Ethan: is this still useful?

Review Comment:
   reuse is not that stable/provable?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -218,6 +229,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Enable running of clustering service, asynchronously as inserts happen on the table.")
       .withAlternatives("hoodie.datasource.clustering.async.enable");
 
+  // Ethan: this should be removed.  User should not tweak this to corrupt the hoodie_commit_time meta field.

Review Comment:
   user should not have control. we should preserve commit_time for internal write operations (compaction, clustering, ...). Think about any historical debt before making the final call.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -90,6 +91,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("End partition used to filter partition (inclusive), only effective when the filter mode '"
           + PLAN_PARTITION_FILTER_MODE + "' is " + ClusteringPlanPartitionFilterMode.SELECTED_PARTITIONS.name());
 
+  // Ethan: Should this be aligned with default parquet base file size?

Review Comment:
   we could infer this from base file size. 
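   A sketch of that inference, assuming the ConfigProperty infer hook referenced earlier in this PR (the exact API and the link to the parquet max file size are assumptions):
   
       public static final ConfigProperty<Long> PLAN_STRATEGY_SMALL_FILE_LIMIT = ConfigProperty
           .key("hoodie.clustering.plan.strategy.small.file.limit")
           .defaultValue(300 * 1024 * 1024L)
           .withInferFunction(cfg -> Option.ofNullable(cfg.getString(HoodieStorageConfig.PARQUET_MAX_FILE_SIZE))
               .map(Long::valueOf))  // assumed hook: derive the small-file threshold from the base file size
           .withDocumentation("Files smaller than this size (in bytes) are candidates for clustering.");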



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well?

Review Comment:
   logically makes sense. but I would check the code for side effects.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be straightened up to be more concise
+  // Sth more intuitive like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other

Review Comment:
   whatever we do here, we do across the board for all services? We need to factor in the TMS RFCs as well.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -449,6 +469,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Table services such as compaction and clustering can fail and prevent syncing to "
           + "the metaclient. Set this to true to fail writes when table services fail");
 
+  // Ethan: do we still need all of the consistency check logic given major cloud storage are now strongly consistent?

Review Comment:
   leave it in, but make sure its not affecting code paths by default?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -177,6 +179,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "When true, the incoming writes will cached to speed up index lookup by reducing IO "
           + "for computing parallelism or affected partitions");
 
+  // Ethan: should the parallelim below be dynamically determined by Spark?

Review Comment:
   probably makes sense to do this. but wondering if the index implementations will already do this.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -329,18 +341,22 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. "
           + "This controls the how that information is cached for inspection by clients. We rarely expect this to be changed.");
 
+  // Ethan: this should be an internal config.
   public static final ConfigProperty<String> AUTO_COMMIT_ENABLE = ConfigProperty
       .key("hoodie.auto.commit")
       .defaultValue("true")
       .withDocumentation("Controls whether a write operation should auto commit. This can be turned off to perform inspection"
           + " of the uncommitted write before deciding to commit.");
 
+  // Ethan: we never talk about this in docs :)

Review Comment:
   yes, that's coz Uber extends this :)



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -211,6 +214,7 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Only applies when #simpleIndexUseCaching is set. Determine what level of persistence is used to cache input RDDs. "
           + "Refer to org.apache.spark.storage.StorageLevel for different values");
 
+  // Ethan: should we remove these two and avoid data quality issues?

Review Comment:
   yes.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -521,11 +545,13 @@ public class HoodieWriteConfig extends HoodieConfig {
    * Given the importance of supporting such cases for the user's migration to 0.5.x, we are proposing a safety flag
    * (disabled by default) which will allow this old behavior.
    */
+  // Ethan: is this still needed?

Review Comment:
   this is used by uber. we should confirm with them.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?
   public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
       .key("hoodie.archive.automatic")
       .defaultValue("true")
       .withDocumentation("When enabled, the archival table service is invoked immediately after each commit,"
           + " to archive commits if we cross a maximum value of commits."
           + " It's recommended to enable this, to ensure number of active commits is bounded.");
 
+  // Ethan: Is Async archive tested?

Review Comment:
   mark for table service reliability efforts. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
  * Configs/params used for internal purposes.
  */
 public class HoodieInternalConfig extends HoodieConfig {
+  // Some internal configs (?)
 
   private static final long serialVersionUID = 0L;
 
+  // Ethan: why do we need this? BulkInsertPartitioner can already indicate `arePartitionRecordsSorted()`
   public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = "hoodie.bulkinsert.are.partitioner.records.sorted";
   public static final Boolean DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
 
+  // Ethan: why do we need this?

Review Comment:
   we should try and remove this.
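   For context, the partitioner contract already self-describes sortedness, which is what makes the flag look redundant (paraphrased from BulkInsertPartitioner):
   
       public interface BulkInsertPartitioner<I> {
         I repartitionRecords(I records, int outputPartitions);
         boolean arePartitionRecordsSorted();  // already tells the writer whether records arrive sorted
       }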



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieHBaseIndexConfig.java:
##########
@@ -36,6 +36,7 @@
         + "(when HBase based indexing is enabled), which tags incoming "
         + "records as either inserts or updates to older records.")
 public class HoodieHBaseIndexConfig extends HoodieConfig {
+  // Configs for HBase Index

Review Comment:
   let's leave this class alone.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -539,6 +565,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Whether to allow generation of empty commits, even if no data was written in the commit. "
           + "It's useful in cases where extra metadata needs to be published regardless e.g tracking source offsets when ingesting data");
 
+  // Ethan: maybe this can auto turned on for CDC?

Review Comment:
   idk if we add this for CDC. do we?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -148,6 +149,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "When true, interval tree based file pruning optimization is enabled. "
           + "This mode speeds-up file-pruning based on key ranges when compared with the brute-force mode");
 
+  // Ethan: remove this config and clean up relevant logic?

Review Comment:
   advanced



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well?
   public static final ConfigProperty<String> MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
       .key("hoodie.merge.allow.duplicate.on.inserts")
       .defaultValue("false")
       .withDocumentation("When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing)."
           + " This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained.");
 
+  // Ethan: should we make the default 0, to reduce the write amplication and make MOR write fast?

Review Comment:
   leave this alone.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size grows."
           + " This config controls the maximum number of instants to retain in the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove this?

Review Comment:
   tbh idk if this is a real problem to go after.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config to indicate the purpose:

Review Comment:
   30 is kind of arbitrary. We should rethink this fundamentally.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Control to enable release all persist rdds when the spark job finish.");
 
+  // Ethan: shall we remove this and do auto adjustment everywhere?

Review Comment:
   we should, to the extent that it's reliable.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -90,6 +91,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("End partition used to filter partition (inclusive), only effective when the filter mode '"
           + PLAN_PARTITION_FILTER_MODE + "' is " + ClusteringPlanPartitionFilterMode.SELECTED_PARTITIONS.name());
 
+  // Ethan: Should this be aligned with default parquet base file size?

Review Comment:
   I would read the code though to understand if it has side effects, i.e. Hudi will always leave 100MB files, so will clustering ever kick in? We need to trigger clustering based on col stats and how clustered the table is.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -66,14 +66,18 @@
     areCommonConfigs = true,
     description = "")
 public class HoodieLockConfig extends HoodieConfig {
+  // Configs for common and legacy lock providers
 
+  // Ethan: nit: this is initial delay
   public static final ConfigProperty<String> LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS = ConfigProperty
       .key(LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY)
       .defaultValue(DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS)
       .sinceVersion("0.8.0")
       .withDocumentation("Initial amount of time to wait between retries to acquire locks, "
           + " subsequent retries will exponentially backoff.");
 
+  // Ethan: default should be larger? like 16s?  we already have exponential backoff. sth like 1 -> 2 -> 4 -> 8 -> 16 -> 16 ...
+  // currently, retry times is 15.  The change increases the retry period from ~1min to ~4min.

Review Comment:
   +1
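   Back-of-the-envelope for the proposed schedule (initial 1s, doubling, 16s cap, 15 retries; a sketch of the arithmetic, not the lock manager's actual loop):
   
       long delayMs = 1000;  // proposed initial delay
       long totalMs = 0;
       for (int i = 0; i < 15; i++) {  // current retry count
         totalMs += delayMs;
         delayMs = Math.min(delayMs * 2, 16_000);  // proposed 16s cap
       }
       // delays: 1, 2, 4, 8, 16, 16, ... seconds; totalMs ≈ 191s, i.e. the multi-minute window noted above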



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before"
           + " writing to storage.");
 
+  // Ethan: should we have one "hoodie.combine.before.write" for all change operations instead of one config per operation?

Review Comment:
   leave this alone?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
     description = "Payload related configs, that can be leveraged to "
         + "control merges based on specific business fields in the data.")
 public class HoodiePayloadConfig extends HoodieConfig {
+  // Configs for Hudi payload class
 
+  // Ethan: shall we use preCombine ("hoodie.datasource.write.precombine.field") only and remove this?
+  // It acts as the same purpose as the precombine field.  This confuses user too.

Review Comment:
   could we infer this?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be straightened up to be more concise
+  // Sth more intuitive like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other
+  // hoodie.compact.schedule.inline
+  // hoodie.compact.inline
+  // hoodie.datasource.compaction.async.enable
+  // // these are reused for inline/async
+  // hoodie.compact.inline.trigger.strategy
+  // hoodie.compact.inline.max.delta.seconds
+  // hoodie.compact.inline.max.delta.commits
   public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
       .key("hoodie.compact.inline")
       .defaultValue("false")

Review Comment:
   `"hoodie.compact.schedule.inline"` can be redone similar to how cleaner, others are. Smaller bandaid?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, "
           + "and each partition is divided to N buckets.");
 
+  // Ethan: should the defaults based on "hoodie.bucket.index.num.buckets": e.g., min=0.5*num, max=4*num?

Review Comment:
   yeah, having some default that is overridable is ok
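   The suggested derivation, sketched (the 0.5x/4x factors come from the comment in the hunk; variable and accessor names are illustrative):
   
       int numBuckets = config.getInt(HoodieIndexConfig.BUCKET_INDEX_NUM_BUCKETS);
       int minBuckets = Math.max(1, numBuckets / 2);  // suggested default: 0.5 * num
       int maxBuckets = numBuckets * 4;               // suggested default: 4 * num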



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLayoutConfig.java:
##########
@@ -37,12 +38,15 @@
     description = "Configurations that control storage layout and data distribution, "
         + "which defines how the files are organized within a table.")
 public class HoodieLayoutConfig extends HoodieConfig {
+  // Configs for Hudi storage layer
 
+  // Ethan: should this be a table config? I assume this cannot be changed once determined

Review Comment:
   yes. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config to indicate the purpose:

Review Comment:
   These groups were added to run clustering in smaller rounds, for large-scale use cases. We should make it simple out of the box for medium-scale data.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java:
##########
@@ -38,13 +38,15 @@
     groupName = ConfigGroups.Names.SPARK_DATASOURCE,
     description = "The following set of configurations help validate new data before commits.")
 public class HoodiePreCommitValidatorConfig extends HoodieConfig {
+  // Configs for precommit validator
 
   public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES = ConfigProperty
       .key("hoodie.precommit.validators")
       .defaultValue("")
       .withDocumentation("Comma separated list of class names that can be invoked to validate commit");
   public static final String VALIDATOR_TABLE_VARIABLE = "<TABLE_NAME>";
 
+  // Ethan: it seems that these SQL query configs are not used?  ALso, could we have one config for all SQL queries?

Review Comment:
   +1 for simplification



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "**Note** This is being actively worked on. Please use "
           + "`hoodie.datasource.write.keygenerator.class` instead.");
 
+  // Ethan: look like we can remove this as it is mature?

Review Comment:
   restore does not use this. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -154,6 +161,7 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("Timeout in ms, to wait for establishing connection with Zookeeper.");
 
+  // Ethan: remove ZK_PORT so that ZK_CONNECT_URL also includes the port number?

Review Comment:
   I would leave this alone. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -212,6 +220,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs "
           + "etc in .hoodie directory under this base path directory.");
 
+  // Ethan: is this still necessary?

Review Comment:
   advanced config. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -177,6 +181,9 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "this queue has no need for additional memory and cpu resources due to lock or multithreading, but also lost some benefits such as speed limit. "
           + "Although DISRUPTOR is still experimental.");
 
+  // Ethan: Should this be automatically determined based on the partition and record key column(s)?
+  // e.g., single column -> simple key gen, multiple column -> complex key gen

Review Comment:
   +1. For other cases, users can explicitly configure.
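   The inference being +1'd here, sketched (the key generator classes are real Hudi ones; the dispatch itself is the proposal, not current behavior):
   
       String recordKeyFields = props.getProperty("hoodie.datasource.write.recordkey.field");
       String keyGenClass = recordKeyFields != null && recordKeyFields.contains(",")
           ? "org.apache.hudi.keygen.ComplexKeyGenerator"  // multiple record key columns
           : "org.apache.hudi.keygen.SimpleKeyGenerator";  // single record key column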


