Posted to commits@hudi.apache.org by "yihua (via GitHub)" <gi...@apache.org> on 2023/03/08 02:12:07 UTC

[GitHub] [hudi] yihua commented on a diff in pull request #7912: [HUDI-5739] Review Hudi Config Defaults

yihua commented on code in PR #7912:
URL: https://github.com/apache/hudi/pull/7912#discussion_r1128765984


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -93,6 +98,7 @@ public class HoodieArchivalConfig extends HoodieConfig {
       .withDocumentation("When enable, hoodie will auto merge several small archive files into larger one. It's"
           + " useful when storage scheme doesn't support append operation.");
 
+  // Ethan: Once this is well tested, we should remove the feature flag

Review Comment:
   @nsivabalan could we remove this config now?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
           + "failed writes to re-claim space. Choose to perform this rollback of failed writes eagerly before "
           + "every writer starts (only supported for single writer) or lazily by the cleaner (required for multi-writers)");
 
+  // Ethan: same here.  Can we reuse Spark parallelism or derive it automatically?

Review Comment:
   Tracked in HUDI-5894
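   For illustration, a minimal sketch of that auto-derivation idea — the helper and its fallback policy are hypothetical; only the ConfigProperty usage follows the pattern shown in the diffs in this thread:

     // Hypothetical helper: prefer an explicit user setting, otherwise derive
     // the parallelism from the input instead of a static default like 200.
     public static int resolveParallelism(HoodieConfig config, int inputPartitions) {
       String configured = config.getString(CLEANER_PARALLELISM_VALUE);
       if (configured != null && !configured.isEmpty()) {
         return Integer.parseInt(configured);
       }
       // Spark callers could pass rdd.getNumPartitions() or
       // sparkContext.defaultParallelism() here.
       return Math.max(1, inputPartitions);
     }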



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -108,6 +110,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Partitions to run clustering");
 
+  // Ethan: enum instead of class name to determine strategy type?  So this can be engine-agnostic

Review Comment:
   Tracked in HUDI-5892
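   For reference, a minimal sketch of the enum-based approach — the enum, key name, and default below are hypothetical, following the ConfigProperty pattern used throughout this PR:

     // Hypothetical engine-agnostic strategy type; each constant would be
     // resolved internally to the matching engine-specific class name.
     public enum ClusteringPlanStrategyType {
       SIZE_BASED_SORT, PARTITION_AWARE, SINGLE_FILE_SORT
     }

     public static final ConfigProperty<String> PLAN_STRATEGY_TYPE = ConfigProperty
         .key("hoodie.clustering.plan.strategy.type")
         .defaultValue(ClusteringPlanStrategyType.SIZE_BASED_SORT.name())
         .withDocumentation("Engine-agnostic clustering plan strategy type, resolved"
             + " to the engine-specific implementation class at runtime.");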



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
           + "failed writes to re-claim space. Choose to perform this rollback of failed writes eagerly before "
           + "every writer starts (only supported for single writer) or lazily by the cleaner (required for multi-writers)");
 
+  // Ethan: same here.  Can we reuse Spark parallelism or derive it automatically?
   public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.cleaner.parallelism")
       .defaultValue("200")
       .withDocumentation("Parallelism for the cleaning operation. Increase this if cleaning becomes slow.");
 
+  // Ethan: wondering if this is still needed.

Review Comment:
   @nsivabalan do you think we still need this?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -201,6 +210,8 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Determines how to handle updates, deletes to file groups that are under clustering."
           + " Default strategy just rejects the update");
 
+  // Ethan: these configs of inline, async, schedule, execute can be merged into one mode config:

Review Comment:
   Tracked in HUDI-5896



##########
hudi-aws/src/main/java/org/apache/hudi/config/HoodieAWSConfig.java:
##########
@@ -41,20 +42,23 @@
  */
 @Immutable
 @ConfigClassProperty(name = "Amazon Web Services Configs",
-        groupName = ConfigGroups.Names.AWS,
-        description = "Amazon Web Services configurations to access resources like Amazon DynamoDB (for locks),"
-            + " Amazon CloudWatch (metrics).")
+    groupName = ConfigGroups.Names.AWS,
+    description = "Amazon Web Services configurations to access resources like Amazon DynamoDB (for locks),"
+        + " Amazon CloudWatch (metrics).")
 public class HoodieAWSConfig extends HoodieConfig {
+  // AWS Access Configs
+
+  // Ethan: These configs should only be necessary if the corresponding environment variables are not set (to confirm with code)

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -196,6 +222,7 @@ public class HoodieCompactionConfig extends HoodieConfig {
       .withDocumentation("New optimized scan for log blocks that handles all multi-writer use-cases while appending to log files. "
           + "It also differentiates original blocks written by ingestion writers and compacted blocks written log compaction.");
 
+  // Ethan: These deprecated configs should be removed.  Similar in other classes

Review Comment:
   Leave these alone for 0.x.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
       .sinceVersion("0.6.0")
       .withDocumentation("Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORD");
 
+  // Ethan: can this be aligned with key gen for write (no need for a separate bootstrap key gen config)?
   public static final ConfigProperty<String> KEYGEN_CLASS_NAME = ConfigProperty
       .key("hoodie.bootstrap.keygen.class")
       .noDefaultValue()
       .sinceVersion("0.6.0")
       .withDocumentation("Key generator implementation to be used for generating keys from the bootstrapped dataset");
 
+  // Ethan: similar here
   public static final ConfigProperty<String> KEYGEN_TYPE = ConfigProperty
       .key("hoodie.bootstrap.keygen.type")
       .defaultValue(KeyGeneratorType.SIMPLE.name())
       .sinceVersion("0.9.0")
       .withDocumentation("Type of build-in key generator, currently support SIMPLE, COMPLEX, TIMESTAMP, CUSTOM, NON_PARTITION, GLOBAL_DELETE");
 
+  // Ethan: better default translator: if seeing hive-style partitioning, automatically parse the partition value?
   public static final ConfigProperty<String> PARTITION_PATH_TRANSLATOR_CLASS_NAME = ConfigProperty
       .key("hoodie.bootstrap.partitionpath.translator.class")
       .defaultValue(IdentityBootstrapPartitionPathTranslator.class.getName())
       .sinceVersion("0.6.0")
       .withDocumentation("Translates the partition paths from the bootstrapped data into how is laid out as a Hudi table.");
 
+  // Ethan: can this be aligned with Spark parallelism?

Review Comment:
   Tracked in HUDI-5894
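   On the translator comment further up in this snippet, a rough sketch of what a hive-style-aware default could look like — the method and its behavior are purely illustrative:

     // Hypothetical translation: strip hive-style "col=value" prefixes so that
     // "country=US/dt=2023-03-08" becomes "US/2023-03-08"; other paths pass through.
     public static String translatePartitionPath(String partitionPath) {
       String[] segments = partitionPath.split("/");
       StringBuilder result = new StringBuilder();
       for (int i = 0; i < segments.length; i++) {
         int eq = segments[i].indexOf('=');
         result.append(eq >= 0 ? segments[i].substring(eq + 1) : segments[i]);
         if (i < segments.length - 1) {
           result.append('/');
         }
       }
       return result.toString();
     }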



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -259,6 +263,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "Consistent hashing supports dynamic resizing of the number of bucket, solving potential data skew and file size "
           + "issues of the SIMPLE hashing engine. Consistent hashing only works with MOR tables, only use simple hashing on COW tables.");
 
+  // Ethan: is the default too large?

Review Comment:
   @yuzhaojing @minihippo could you confirm if the default is too large for a medium-sized table?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -77,6 +82,7 @@ public class HoodieMemoryConfig extends HoodieConfig {
       .defaultValue(16 * 1024 * 1024)
       .withDocumentation("Property to control the max memory in bytes for dfs input stream buffer size");
 
+  // Ethan: auto configured using table base path?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -135,6 +141,7 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("For Hive based lock provider, the Hive metastore URI to acquire locks against.");
 
+  // Ethan: set this automatically based on the table base path?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -279,6 +288,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .defaultValue("0")
       .withDocumentation("Parallelism used for “delete” operation. Delete operations also performs shuffles, similar to upsert operation.");
 
+  // Ethan: does this need to be automatically set based on the input like other parallelism?

Review Comment:
   Tracked in HUDI-5894



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteCommitCallbackConfig.java:
##########
@@ -65,6 +66,7 @@ public class HoodieWriteCommitCallbackConfig extends HoodieConfig {
       .sinceVersion("0.6.0")
       .withDocumentation("Http callback API key. hudi_write_commit_http_callback by default");
 
+  // Ethan: default is too low? 30s instead?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -329,18 +341,22 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Write status objects hold metadata about a write (stats, errors), that is not yet committed to storage. "
           + "This controls the how that information is cached for inspection by clients. We rarely expect this to be changed.");
 
+  // Ethan: this should be an internal config.
   public static final ConfigProperty<String> AUTO_COMMIT_ENABLE = ConfigProperty
       .key("hoodie.auto.commit")
       .defaultValue("true")
       .withDocumentation("Controls whether a write operation should auto commit. This can be turned off to perform inspection"
           + " of the uncommitted write before deciding to commit.");
 
+  // Ethan: we never talk about this in docs :)
   public static final ConfigProperty<String> WRITE_STATUS_CLASS_NAME = ConfigProperty
       .key("hoodie.writestatus.class")
       .defaultValue(WriteStatus.class.getName())
       .withDocumentation("Subclass of " + WriteStatus.class.getName() + " to be used to collect information about a write. Can be "
           + "overridden to collection additional metrics/statistics about the data if needed.");
 
+  // Ethan: similarly, we should revisit all parallelism configs and see if they can be auto-configured
+  // instead of a static default.

Review Comment:
   Tracked in HUDI-5894



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -289,6 +299,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .defaultValue(String.valueOf(4 * 1024 * 1024))
       .withDocumentation("Size of in-memory buffer used for parallelizing network reads and lake storage writes.");
 
+  // Ethan: is 1024B too small as the default?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -501,6 +524,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "OPTIMISTIC_CONCURRENCY_CONTROL: Multiple writers can operate on the table and exactly one of them succeed "
           + "if a conflict (writes affect the same file group) is detected.");
 
+  // Ethan: need to check why we need both "hoodie.avro.schema" and "hoodie.write.schema".

Review Comment:
   Will leave the schema configs alone for now.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Control to enable release all persist rdds when the spark job finish.");
 
+  // Ethan: shall we remove this and do auto adjustment everywhere?
   public static final ConfigProperty<Boolean> AUTO_ADJUST_LOCK_CONFIGS = ConfigProperty
       .key("hoodie.auto.adjust.lock.configs")
       .defaultValue(false)
       .sinceVersion("0.11.0")
       .withDocumentation("Auto adjust lock configurations when metadata table is enabled and for async table services.");
 
+  // Ethan: should we remove this and make the validation mandatory?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -375,6 +391,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.9.0")
       .withDocumentation("The batch interval in milliseconds for marker creation batch processing");
 
+  // Ethan: similar here for revisiting parallelism

Review Comment:
   Tracked in HUDI-5894



##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -40,6 +40,7 @@
         + "This table maintains the metadata about a given Hudi table (e.g file listings) "
         + " to avoid overhead of accessing cloud storage, during queries.")
 public final class HoodieMetadataConfig extends HoodieConfig {
+  //cfg

Review Comment:
   Tracked in HUDI-5900



##########
hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableConfig.java:
##########
@@ -162,6 +163,7 @@ public class HoodieTableConfig extends HoodieConfig {
       .noDefaultValue()
       .withDocumentation("Version of timeline used, by the table.");
 
+  // Ethan: to be merged with write payload class

Review Comment:
   Tracked in HUDI-5892



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode");
 
+  // Ethan: 10 min.  Is this timeout too long?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -96,16 +100,18 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, write capacity units when using PROVISIONED billing mode");
 
+  // Ethan: 10 min.  Is this timeout too long?
   public static final ConfigProperty<String> DYNAMODB_LOCK_TABLE_CREATION_TIMEOUT = ConfigProperty
       .key(DYNAMODB_BASED_LOCK_PROPERTY_PREFIX + "table_creation_timeout")
       .defaultValue(String.valueOf(10 * 60 * 1000))
       .sinceVersion("0.10.0")
       .withDocumentation("For DynamoDB based lock provider, the maximum number of milliseconds to wait for creating DynamoDB table");
 
+  // Ethan: Move the infer logic here

Review Comment:
   Tracked in HUDI-5892



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -40,10 +40,13 @@
         + " between writers to a Hudi table. Concurrency between Hudi's own table services "
         + " are auto managed internally.")
 public class DynamoDbBasedLockConfig extends HoodieConfig {
+  // Config for DynamoDB based lock provider
 
   // configs for DynamoDb based locks
   public static final String DYNAMODB_BASED_LOCK_PROPERTY_PREFIX = LOCK_PREFIX + "dynamodb.";
 
+  // Ethan: Can we provide a default value here that is less likely to collide,
+  // so this config also becomes optional?

Review Comment:
   Tracked in HUDI-5782
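   One possible shape for a collision-resistant default, deriving it from table identity — the helper below is purely illustrative:

     // Hypothetical default: combine the table name with a hash of its base path,
     // so tables with the same name in different locations get distinct lock keys.
     public static String defaultLockPartitionKey(String tableName, String basePath) {
       return tableName + "-" + Integer.toHexString(basePath.hashCode());
     }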



##########
hudi-aws/src/main/java/org/apache/hudi/config/DynamoDbBasedLockConfig.java:
##########
@@ -78,6 +81,7 @@ public class DynamoDbBasedLockConfig extends HoodieConfig {
       .withDocumentation("For DynamoDB based lock provider, the region used in endpoint for Amazon DynamoDB service."
                          + " Would try to first get it from AWS_REGION environment variable. If not find, by default use us-east-1");
 
+  // Ethan: Need to confirm these are good defaults for cost

Review Comment:
   Sounds good.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?
   public static final ConfigProperty<String> AUTO_ARCHIVE = ConfigProperty
       .key("hoodie.archive.automatic")
       .defaultValue("true")
       .withDocumentation("When enabled, the archival table service is invoked immediately after each commit,"
           + " to archive commits if we cross a maximum value of commits."
           + " It's recommended to enable this, to ensure number of active commits is bounded.");
 
+  // Ethan: Is Async archive tested?

Review Comment:
   Essential/advanced config tracked in HUDI-5893. 



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size grows."
           + " This config controls the maximum number of instants to retain in the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove this?

Review Comment:
   Tracked in HUDI-5894



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -38,14 +38,17 @@
     groupName = ConfigGroups.Names.WRITE_CLIENT,
     description = "Configurations that control archival.")
 public class HoodieArchivalConfig extends HoodieConfig {
+  //cfg
 
+  // Ethan: Should we remove this?  Archive should always be enabled?

Review Comment:
   Tracked in HUDI-5893



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieArchivalConfig.java:
##########
@@ -60,11 +63,13 @@ public class HoodieArchivalConfig extends HoodieConfig {
           + " keep the metadata overhead constant, even as the table size grows."
           + " This config controls the maximum number of instants to retain in the active timeline. ");
 
+  // Ethan: Can we use Spark parallelism or auto-derive the configs and remove this?
   public static final ConfigProperty<Integer> DELETE_ARCHIVED_INSTANT_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.archive.delete.parallelism")
       .defaultValue(100)
       .withDocumentation("Parallelism for deleting archived hoodie commits.");
 
+  // Ethan: pick one of "hoodie.keep.min.commits" and "hoodie.keep.max.commits" only?

Review Comment:
   Agree.  We can leave this as is.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -74,24 +75,28 @@ public class HoodieBootstrapConfig extends HoodieConfig {
       .sinceVersion("0.6.0")
       .withDocumentation("Class to use for reading the bootstrap dataset partitions/files, for Bootstrap mode FULL_RECORD");
 
+  // Ethan: can this be aligned with key gen for write (no need for a separate bootstrap key gen config)?

Review Comment:
   Tracked in HUDI-5895



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -58,12 +61,16 @@ public class HoodieCleanConfig extends HoodieConfig {
       .withDocumentation("Only applies when " + AUTO_CLEAN.key() + " is turned on. "
           + "When turned on runs cleaner async with writing, which can speed up overall write performance.");
 
+  // Ethan: when either "hoodie.cleaner.commits.retained", "hoodie.cleaner.hours.retained",

Review Comment:
   Yes.  Tracked in HUDI-5892.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -84,11 +91,13 @@ public class HoodieCleanConfig extends HoodieConfig {
       .withDocumentation("Controls how cleaning is scheduled. Valid options: "
           + Arrays.stream(CleaningTriggerStrategy.values()).map(Enum::name).collect(Collectors.joining(",")));
 
+  // Ethan: "hoodie.clean.trigger.max.commits" is what this means

Review Comment:
   Tracked in HUDI-5896



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieBootstrapConfig.java:
##########
@@ -46,6 +46,7 @@
         + "The bootstrap operation can flexibly avoid copying data over before you can use Hudi and support running the existing "
         + " writers and new hudi writers in parallel, to validate the migration.")
 public class HoodieBootstrapConfig extends HoodieConfig {
+  // Configs for Bootstrap feature

Review Comment:
   Tracked in HUDI-5895



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -41,10 +42,12 @@
  */
 @Immutable
 @ConfigClassProperty(name = "Clean Configs",
-        groupName = ConfigGroups.Names.WRITE_CLIENT,
-        description = "Cleaning (reclamation of older/unused file groups/slices).")
+    groupName = ConfigGroups.Names.WRITE_CLIENT,
+    description = "Cleaning (reclamation of older/unused file groups/slices).")
 public class HoodieCleanConfig extends HoodieConfig {
+  // Configs for Clean action
 
+  // Ethan: should we standardize on `.enable` instead of `.automatic`?

Review Comment:
   Tracked in HUDI-5896



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -117,18 +127,21 @@ public class HoodieCleanConfig extends HoodieConfig {
           + "failed writes to re-claim space. Choose to perform this rollback of failed writes eagerly before "
           + "every writer starts (only supported for single writer) or lazily by the cleaner (required for multi-writers)");
 
+  // Ethan: same here.  Can we reuse Spark parallelism or derive it automatically?
   public static final ConfigProperty<String> CLEANER_PARALLELISM_VALUE = ConfigProperty
       .key("hoodie.cleaner.parallelism")
       .defaultValue("200")
       .withDocumentation("Parallelism for the cleaning operation. Increase this if cleaning becomes slow.");
 
+  // Ethan: wondering if this is still needed.
   public static final ConfigProperty<Boolean> ALLOW_MULTIPLE_CLEANS = ConfigProperty
       .key("hoodie.clean.allow.multiple")
       .defaultValue(true)
       .sinceVersion("0.11.0")
       .withDocumentation("Allows scheduling/executing multiple cleans by enabling this config. If users prefer to strictly ensure clean requests should be mutually exclusive, "
           + ".i.e. a 2nd clean will not be scheduled if another clean is not yet completed to avoid repeat cleaning of same files, they might want to disable this config.");
 
+  // Ethan: should the default behavior be deleting original parquet files and then we can remove this config?

Review Comment:
   Tracked in HUDI-5895



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -162,6 +168,9 @@ public class HoodieClusteringConfig extends HoodieConfig {
           + "DAY_ROLLING: clustering partitions on a rolling basis by the hour to avoid clustering all partitions each time, "
           + "which strategy sorts the partitions asc and chooses the partition of which index is divided by 24 and the remainder is equal to the current hour.");
 
+  // Ethan: all these fine-tuning knobs can be simplified if we have a config to indicate the purpose:

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -131,18 +134,21 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Turn on inline clustering - clustering will be run after each write operation is complete")
       .withAlternatives("hoodie.datasource.clustering.inline.enable");
 
+  // Ethan: this can apply to both inline and async clustering.  Rename it for better readability.

Review Comment:
   Tracked in HUDI-5896



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -47,6 +47,7 @@
     description = "Configurations that control the clustering table service in hudi, "
         + "which optimizes the storage layout for better query performance by sorting and sizing data files.")
 public class HoodieClusteringConfig extends HoodieConfig {
+  // Configs for Clustering operations

Review Comment:
   Tracked in HUDI-5897



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieClusteringConfig.java:
##########
@@ -218,6 +229,7 @@ public class HoodieClusteringConfig extends HoodieConfig {
       .withDocumentation("Enable running of clustering service, asynchronously as inserts happen on the table.")
       .withAlternatives("hoodie.datasource.clustering.async.enable");
 
+  // Ethan: this should be removed.  User should not tweak this to corrupt the hoodie_commit_time meta field.

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be straightened up to be more concise
+  // Something more intuitive, like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other
+  // hoodie.compact.schedule.inline
+  // hoodie.compact.inline
+  // hoodie.datasource.compaction.async.enable
+  // // these are reused for inline/async
+  // hoodie.compact.inline.trigger.strategy
+  // hoodie.compact.inline.max.delta.seconds
+  // hoodie.compact.inline.max.delta.commits
   public static final ConfigProperty<String> INLINE_COMPACT = ConfigProperty
       .key("hoodie.compact.inline")
       .defaultValue("false")

Review Comment:
   Tracked in HUDI-5896
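   To make the proposal concrete, a sketch of the two mode configs — the key names come from the comment above, but the enum and defaults are assumptions, not actual Hudi configs:

     // Hypothetical consolidated modes, replacing the overlapping boolean flags.
     public enum CompactionMode {
       INLINE, ASYNC, DISABLED
     }

     public static final ConfigProperty<String> COMPACT_SCHEDULE_MODE = ConfigProperty
         .key("hoodie.compact.schedule.mode")
         .defaultValue(CompactionMode.INLINE.name())
         .withDocumentation("How the compaction plan is scheduled: INLINE, ASYNC, or DISABLED.");

     public static final ConfigProperty<String> COMPACT_EXECUTE_MODE = ConfigProperty
         .key("hoodie.compact.execute.mode")
         .defaultValue(CompactionMode.INLINE.name())
         .withDocumentation("How a scheduled compaction plan is executed: INLINE, ASYNC, or DISABLED.");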



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, "
           + "and each partition is divided to N buckets.");
 
+  // Ethan: should the defaults be based on "hoodie.bucket.index.num.buckets": e.g., min=0.5*num, max=4*num?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -45,7 +45,25 @@
     description = "Configurations that control compaction "
         + "(merging of log files onto a new base files).")
 public class HoodieCompactionConfig extends HoodieConfig {
-
+  // Configs for compaction
+
+  // Ethan: Similar to clustering, the compaction configs should be straightened up to be more concise
+  // Something more intuitive, like:
+  // hoodie.compact.schedule.mode -> [inline|async|disabled] // how to schedule compaction plan
+  // hoodie.compact.execute.mode -> [inline|async|disabled] // how to execute compaction plan
+  // // compaction parameters should be independent of inline/async
+  // hoodie.compact.trigger.strategy
+  // hoodie.compact.max.delta.seconds
+  // hoodie.compact.max.delta.commits
+  // ...
+  // instead of many configs that overlap with each other

Review Comment:
   Tracked in HUDI-5898



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -175,6 +198,9 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + " record sizes. It's recommended to keep this turned on, since hand tuning is otherwise extremely"
           + " cumbersome.");
 
+  // Ethan: Should this be removed?  This should only be used by the first batch.  For first batch,

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -271,6 +276,7 @@ public class HoodieIndexConfig extends HoodieConfig {
       .withDocumentation("Only applies if index type is BUCKET. Determine the number of buckets in the hudi table, "
           + "and each partition is divided to N buckets.");
 
+  // Ethan: should the defaults be based on "hoodie.bucket.index.num.buckets": e.g., min=0.5*num, max=4*num?

Review Comment:
   @yuzhaojing @minihippo any suggestion on the default value here?
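   Spelling out the arithmetic from the comment above (the 0.5x and 4x factors are taken from it; the helpers are illustrative):

     // Hypothetical derivation of the consistent-hashing bucket bounds from
     // "hoodie.bucket.index.num.buckets", per the suggested factors.
     public static int defaultMinBuckets(int numBuckets) {
       return Math.max(1, numBuckets / 2);  // min = 0.5 * num
     }

     public static int defaultMaxBuckets(int numBuckets) {
       return numBuckets * 4;               // max = 4 * num
     }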



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:
##########
@@ -132,13 +151,15 @@ public class HoodieCompactionConfig extends HoodieConfig {
           + "compaction during each compaction run. By default. Hudi picks the log file "
           + "with most accumulated unmerged data");
 
+  // Ethan: this seems to be internal
   public static final ConfigProperty<String> COMPACTION_LAZY_BLOCK_READ_ENABLE = ConfigProperty
       .key("hoodie.compaction.lazy.block.read")
       .defaultValue("true")
       .withDocumentation("When merging the delta log files, this config helps to choose whether the log blocks "
           + "should be read lazily or not. Choose true to use lazy block reading (low memory usage, but incurs seeks to each block"
           + " header) or false for immediate block read (higher memory usage)");
 
+  // Ethan: this seems to be internal

Review Comment:
   The reverse log traversal was introduced by #331.  @n3nash could you provide context around this?  Is it still useful?  How stable is it based on any previous production rollout?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieHBaseIndexConfig.java:
##########
@@ -36,6 +36,7 @@
         + "(when HBase based indexing is enabled), which tags incoming "
         + "records as either inserts or updates to older records.")
 public class HoodieHBaseIndexConfig extends HoodieConfig {
+  // Configs for HBase Index

Review Comment:
   Sounds good.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -177,6 +179,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "When true, the incoming writes will cached to speed up index lookup by reducing IO "
           + "for computing parallelism or affected partitions");
 
+  // Ethan: should the parallelism below be dynamically determined by Spark?

Review Comment:
   Tracked in HUDI-5894



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
  * Configs/params used for internal purposes.
  */
 public class HoodieInternalConfig extends HoodieConfig {
+  // Some internal configs (?)
 
   private static final long serialVersionUID = 0L;
 
+  // Ethan: why do we need this? BulkInsertPartitioner can already indicate `arePartitionRecordsSorted()`
   public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = "hoodie.bulkinsert.are.partitioner.records.sorted";
   public static final Boolean DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
 
+  // Ethan: why do we need this?

Review Comment:
   @nsivabalan I see this config was introduced by #2328 for bulk insert, which you reviewed.  Do you know if this is still needed?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePreCommitValidatorConfig.java:
##########
@@ -38,13 +38,15 @@
     groupName = ConfigGroups.Names.SPARK_DATASOURCE,
     description = "The following set of configurations help validate new data before commits.")
 public class HoodiePreCommitValidatorConfig extends HoodieConfig {
+  // Configs for precommit validator
 
   public static final ConfigProperty<String> VALIDATOR_CLASS_NAMES = ConfigProperty
       .key("hoodie.precommit.validators")
       .defaultValue("")
       .withDocumentation("Comma separated list of class names that can be invoked to validate commit");
   public static final String VALIDATOR_TABLE_VARIABLE = "<TABLE_NAME>";
 
+  // Ethan: it seems that these SQL query configs are not used?  Also, could we have one config for all SQL queries?

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -177,6 +181,9 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "this queue has no need for additional memory and cpu resources due to lock or multithreading, but also lost some benefits such as speed limit. "
           + "Although DISRUPTOR is still experimental.");
 
+  // Ethan: Should this be automatically determined based on the partition and record key column(s)?
+  // e.g., single column -> simple key gen, multiple columns -> complex key gen

Review Comment:
   Tracked in HUDI-5892
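   A minimal sketch of the suggested inference — KeyGeneratorType (SIMPLE, COMPLEX) appears elsewhere in this PR; the helper itself is hypothetical:

     // Hypothetical inference: a single record key column maps to the simple
     // key generator, comma-separated columns to the complex one.
     public static KeyGeneratorType inferKeyGeneratorType(String recordKeyFields) {
       return recordKeyFields.contains(",")
           ? KeyGeneratorType.COMPLEX
           : KeyGeneratorType.SIMPLE;
     }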



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -449,6 +469,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Table services such as compaction and clustering can fail and prevent syncing to "
           + "the metaclient. Set this to true to fail writes when table services fail");
 
+  // Ethan: do we still need all of the consistency check logic given that major cloud storage systems are now strongly consistent?

Review Comment:
   Tracked in HUDI-5899



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -154,6 +161,7 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("Timeout in ms, to wait for establishing connection with Zookeeper.");
 
+  // Ethan: remove ZK_PORT so that ZK_CONNECT_URL also includes the port number?

Review Comment:
   Sounds good.
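   Sketching the merged config — the connect-string form is ZooKeeper's standard "host:port[,host:port]"; the exact key and shape below are assumptions:

     // Hypothetical single connect-string config subsuming the separate
     // host and port configs.
     public static final ConfigProperty<String> ZK_CONNECT_URL = ConfigProperty
         .key("hoodie.write.lock.zookeeper.url")
         .noDefaultValue()
         .withDocumentation("ZooKeeper connect string, e.g. zk1:2181,zk2:2181,"
             + " replacing the separate host and port configs.");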



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics),"
           + "running on each writer's driver process, accepting requests during the write from executors.");
 
+  // Ethan: is this still useful?

Review Comment:
   Tracked in HUDI-5892 for now



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLayoutConfig.java:
##########
@@ -37,12 +38,15 @@
     description = "Configurations that control storage layout and data distribution, "
         + "which defines how the files are organized within a table.")
 public class HoodieLayoutConfig extends HoodieConfig {
+  // Configs for Hudi storage layer
 
+  // Ethan: should this be a table config? I assume this cannot be changed once determined

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -139,6 +141,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("Field used in preCombining before actual write. When two records have the same key value, "
           + "we will pick the one with the largest value for the precombine field, determined by Object.compareTo(..)");
 
+  // Ethan: regarding the namespace, should we use "hoodie.write." instead of "hoodie.datasource.write." as the prefix?

Review Comment:
   Tracked in HUDI-5896



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCleanConfig.java:
##########
@@ -102,6 +111,7 @@ public class HoodieCleanConfig extends HoodieConfig {
           + " in the timeline, since the last cleaner run. This is much more efficient than obtaining listings for the full"
           + " table for each planning (even with a metadata table).");
 
+  // Ethan: this should be an internal config, based on concurrency control mode.  Should not be tweaked by user.

Review Comment:
   Tracked in HUDI-5893



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -111,11 +116,12 @@ public class HoodieLockConfig extends HoodieConfig {
       .sinceVersion("0.8.0")
       .withDocumentation("For DFS based lock providers, path to store the locks under. use Table's meta path as default");
 
+  // Ethan: should the default also allow expiration? Need to dig into the code to understand what this controls.
   public static final ConfigProperty<Integer> FILESYSTEM_LOCK_EXPIRE = ConfigProperty
-        .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
-        .defaultValue(0)
-        .sinceVersion("0.12.0")
-        .withDocumentation("For DFS based lock providers, expire time in minutes, must be a nonnegative number, default means no expire");
+      .key(FILESYSTEM_LOCK_EXPIRE_PROP_KEY)
+      .defaultValue(0)
+      .sinceVersion("0.12.0")
+      .withDocumentation("For DFS based lock providers, expire time in minutes, must be a nonnegative number, default means no expire");

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Control to enable release all persist rdds when the spark job finish.");
 
+  // Ethan: shall we remove this and do auto adjustment everywhere?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieInternalConfig.java:
##########
@@ -25,12 +25,15 @@
  * Configs/params used for internal purposes.
  */
 public class HoodieInternalConfig extends HoodieConfig {
+  // Some internal configs (?)
 
   private static final long serialVersionUID = 0L;
 
+  // Ethan: why do we need this? BulkInsertPartitioner can already indicate `arePartitionRecordsSorted()`
   public static final String BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = "hoodie.bulkinsert.are.partitioner.records.sorted";
   public static final Boolean DEFAULT_BULKINSERT_ARE_PARTITIONER_RECORDS_SORTED = false;
 
+  // Ethan: why do we need this?

Review Comment:
   Tracked in HUDI-5782



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieMemoryConfig.java:
##########
@@ -62,11 +64,14 @@ public class HoodieMemoryConfig extends HoodieConfig {
   // Minimum memory size (100MB) for the spillable map.
   public static final long DEFAULT_MIN_MEMORY_FOR_SPILLABLE_MAP_IN_BYTES = 100 * 1024 * 1024L;
 
+  // Ethan: Should this be automatically configured as a fraction of memory

Review Comment:
   Sounds good.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well?

Review Comment:
   Tracked in HUDI-5892 for now



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -408,6 +425,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When true, spins up an instance of the timeline server (meta server that serves cached file listings, statistics),"
           + "running on each writer's driver process, accepting requests during the write from executors.");
 
+  // Ethan: is this still useful?

Review Comment:
   @bvaradar should we remove all relevant logic on reusing the timeline server?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well?
   public static final ConfigProperty<String> MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
       .key("hoodie.merge.allow.duplicate.on.inserts")
       .defaultValue("false")
       .withDocumentation("When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing)."
           + " This is only relevant for insert operation, since upsert, delete operations will ensure unique key constraints are maintained.");
 
+  // Ethan: should we make the default 0, to reduce the write amplification and make MOR writes fast?

Review Comment:
   Sounds good.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -212,6 +220,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "Hudi stores all the main meta-data about commits, savepoints, cleaning audit logs "
           + "etc in .hoodie directory under this base path directory.");
 
+  // Ethan: is this still necessary?

Review Comment:
   Tracked in HUDI-5893



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -564,12 +591,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .sinceVersion("0.11.0")
       .withDocumentation("Control to enable release all persist rdds when the spark job finish.");
 
+  // Ethan: shall we remove this and do auto adjustment everywhere?

Review Comment:
   @nsivabalan let's discuss this too



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -471,12 +492,14 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When enabled, data validation checks are performed during merges to ensure expected "
           + "number of records after merge operation.");
 
+  // Ethan: should "hoodie.combine.before.insert" controls the behavior this one intends to control as well?

Review Comment:
   I'll go through the code.  @nsivabalan do you have any concern around this?



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieLockConfig.java:
##########
@@ -66,14 +66,18 @@
     areCommonConfigs = true,
     description = "")
 public class HoodieLockConfig extends HoodieConfig {
+  // Configs for common and legacy lock providers
 
+  // Ethan: nit: this is the initial delay
   public static final ConfigProperty<String> LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS = ConfigProperty
       .key(LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS_PROP_KEY)
       .defaultValue(DEFAULT_LOCK_ACQUIRE_RETRY_WAIT_TIME_IN_MILLIS)
       .sinceVersion("0.8.0")
       .withDocumentation("Initial amount of time to wait between retries to acquire locks, "
           + " subsequent retries will exponentially backoff.");
 
+  // Ethan: should the default be larger, like 16s?  We already have exponential backoff, so the sequence would be something like 1 -> 2 -> 4 -> 8 -> 16 -> 16 ...
+  // currently, the retry count is 15.  The change would increase the total retry period from ~1min to ~4min.

Review Comment:
   Tracked in HUDI-5782
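   A quick worked check of the numbers in the comment above (the helper is illustrative; the 15 retries are from the comment):

     // Capped exponential backoff: 1 -> 2 -> 4 -> 8 -> 16 -> 16 ... seconds.
     // Over 15 retries this totals 1+2+4+8+16 + 10*16 = 191s, roughly the
     // ~4 minutes mentioned above, versus ~1 minute with the current defaults.
     public static long totalBackoffSeconds(long initialSec, long capSec, int retries) {
       long total = 0;
       long wait = initialSec;
       for (int i = 0; i < retries; i++) {
         total += wait;
         wait = Math.min(wait * 2, capSec);
       }
       return total;
     }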



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -185,6 +192,7 @@ public class HoodieWriteConfig extends HoodieConfig {
           + "**Note** This is being actively worked on. Please use "
           + "`hoodie.datasource.write.keygenerator.class` instead.");
 
+  // Ethan: looks like we can remove this as it is mature?

Review Comment:
   Tracked in HUDI-5782.  Likely we'll leave this alone.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -53,12 +56,13 @@ public class HoodiePayloadConfig extends HoodieConfig {
       .withDocumentation("Table column/field name to derive timestamp associated with the records. This can"
           + "be useful for e.g, determining the freshness of the table.");
 
+  // Ethan: remove this and align with the write payload class ("hoodie.datasource.write.payload.class")?

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##########
@@ -310,6 +321,7 @@ public class HoodieWriteConfig extends HoodieConfig {
       .withDocumentation("When inserted records share same key, controls whether they should be first combined (i.e de-duplicated) before"
           + " writing to storage.");
 
+  // Ethan: should we have one "hoodie.combine.before.write" for all change operations instead of one config per operation?

Review Comment:
   Sounds good.  Keep them the same.



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodiePayloadConfig.java:
##########
@@ -40,7 +40,10 @@
     description = "Payload related configs, that can be leveraged to "
         + "control merges based on specific business fields in the data.")
 public class HoodiePayloadConfig extends HoodieConfig {
+  // Configs for Hudi payload class
 
+  // Ethan: shall we use preCombine ("hoodie.datasource.write.precombine.field") only and remove this?
+  // It serves the same purpose as the precombine field.  This confuses users too.

Review Comment:
   Tracked in HUDI-5892



##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieIndexConfig.java:
##########
@@ -259,6 +263,7 @@ public class HoodieIndexConfig extends HoodieConfig {
           + "Consistent hashing supports dynamic resizing of the number of bucket, solving potential data skew and file size "
           + "issues of the SIMPLE hashing engine. Consistent hashing only works with MOR tables, only use simple hashing on COW tables.");
 
+  // Ethan: is the default too large?

Review Comment:
   Tracked in HUDI-5782



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org