Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/10/05 20:00:28 UTC

[GitHub] [hudi] voonhous opened a new pull request, #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

voonhous opened a new pull request, #6856:
URL: https://github.com/apache/hudi/pull/6856

   ### Change Logs
   
   Update misleading `read.streaming.skip_compaction` config. 
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   **Risk level: none | low | medium | high**
   
   _Choose one. If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
danny0405 commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1283727628

   > Understood. In such a case, should I still proceed with updating the description of this configuration?
   > 
   > Perhaps, we can modify it to:
   > 
   > To improve read performance, one can enable `read.streaming.skip_compaction` to skip reading and row-level filtering of parquet files.
   
   yeah, we can do that though ~




[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411418179

   ## CI report:
   
   * c1f659db4e4512f0a3dea07764a5542473b14da7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094) 
   * 5b1de9e25fa1e58d90d315b52683b3a506ccbdd4 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] nsivabalan commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1284903460

   @danny0405 : can you follow up on the review when you get a chance?




[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1265072776

   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] voonhous commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1095441321


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");

Review Comment:
   What about this:
   
   ```text
   Whether to skip compaction instants and avoid reading compacted base files.
   When performing streaming reads, setting this to true will help to improve read performance.
   
   Please set this configuration to true to avoid reading duplicates if:
   1) HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() is set to false 
   2) changelog mode is enabled, this option is a solution to keep data integrity
   ```
   
   Is this better?
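   
   For reference, a minimal sketch of how this proposed wording might be wired into `FlinkOptions` (the builder chain mirrors the diff above; the exact line breaks in the description string are illustrative only):
   
   ```java
   import org.apache.flink.configuration.ConfigOption;
   import org.apache.flink.configuration.ConfigOptions;
   
   // Sketch only: the ConfigOptions builder chain is taken from the diff above,
   // with the proposed description text inlined as a plain string.
   public final class SkipCompactionOptionSketch {
     public static final ConfigOption<Boolean> READ_STREAMING_SKIP_COMPACTION = ConfigOptions
         .key("read.streaming.skip_compaction")
         .booleanType()
         .defaultValue(false) // default read as batch
         .withDescription("Whether to skip compaction instants and avoid reading compacted base files.\n"
             + "When performing streaming reads, setting this to true will help to improve read performance.\n"
             + "Please set this configuration to true to avoid reading duplicates if:\n"
             + "1) hoodie.compaction.preserve.commit.metadata is set to false\n"
             + "2) changelog mode is enabled, this option is a solution to keep data integrity");
   
     private SkipCompactionOptionSketch() {
     }
   }
   ```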





[GitHub] [hudi] voonhous commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1095442936


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");
 
   // this option is experimental
   public static final ConfigOption<Boolean> READ_STREAMING_SKIP_CLUSTERING = ConfigOptions
           .key("read.streaming.skip_clustering")
           .booleanType()
           .defaultValue(false)
-          .withDescription("Whether to skip clustering instants for streaming read,\n"
-              + "to avoid reading duplicates");
+          .withDescription("Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read "
+              + "to improve read performance.\n"
+              + "This option toggled to true to avoid duplicates when: \n"
+              + "1) you are definitely sure that the consumer reads [faster than/completes before] any clustering instants "

Review Comment:
   ```text
   Whether to skip clustering instants and avoid reading base files of clustering operations.
   
   When performing streaming reads, setting this to true will help to improve read performance.
   
   Please set this configuration to true to avoid reading duplicates if:
   1) HoodieClusteringConfig.PRESERVE_COMMIT_METADATA.key() is set to false 
   ```
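   
   For illustration, the non-default combination this description guards against would look like the following (the clustering key name is an assumption here, mirroring the compaction key quoted elsewhere in this thread):
   
   ```properties
   # Assumed key for HoodieClusteringConfig.PRESERVE_COMMIT_METADATA
   hoodie.clustering.preserve.commit.metadata=false
   # Skip reading base files produced by clustering during streaming reads
   read.streaming.skip_clustering=true
   ```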





[GitHub] [hudi] voonhous commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
voonhous commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272944045

   Understood. In such a case, should I still proceed with updating the description of this configuration?
   
   Perhaps, we can modify it to:
   
   To improve read performance, one can enable `read.streaming.skip_compaction` to skip reading and row-level filtering of parquet files.




[GitHub] [hudi] danny0405 merged pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 merged PR #6856:
URL: https://github.com/apache/hudi/pull/6856




[GitHub] [hudi] voonhous commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1095380309


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");

Review Comment:
   CMIIW, if `read.streaming.skip_compaction=false` and `HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key()=false`, will there be duplicated data?
   
   I may have phrased it in a misleading manner; this might need to be removed if the above understanding is correct.



##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");
 
   // this option is experimental
   public static final ConfigOption<Boolean> READ_STREAMING_SKIP_CLUSTERING = ConfigOptions
           .key("read.streaming.skip_clustering")
           .booleanType()
           .defaultValue(false)
-          .withDescription("Whether to skip clustering instants for streaming read,\n"
-              + "to avoid reading duplicates");
+          .withDescription("Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read "
+              + "to improve read performance.\n"
+              + "This option toggled to true to avoid duplicates when: \n"
+              + "1) you are definitely sure that the consumer reads [faster than/completes before] any clustering instants "

Review Comment:
   Same comment as above: if `read.streaming.skip_clustering=false` and `HoodieClusteringConfig.PRESERVE_COMMIT_METADATA.key()=false`, will there be duplicated data?





[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411748896

   ## CI report:
   
   * 5b1de9e25fa1e58d90d315b52683b3a506ccbdd4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830) 
   * 0f5bc53c815f1e7b367ac4b8a245439d0bc40970 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>






[GitHub] [hudi] voonhous commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
voonhous commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r990894756


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -279,8 +279,9 @@ private FlinkOptions() {
       .defaultValue(false)// default read as batch
       .withDescription("Whether to skip compaction instants for streaming read,\n"
           + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+          + "1) `hoodie.compaction.preserve.commit.metadata` is set to `false` and you are definitely sure that the "
+          + "consumer reads faster than any compaction instants, usually with delta time compaction strategy that is "

Review Comment:
   I understand that `hoodie.compaction.preserve.commit.metadata` is `true` by default. 
   
   The default values for the configuration keys of interest are as such:
   ```properties
   hoodie.compaction.preserve.commit.metadata=true
   read.streaming.skip_compaction=false
   ```
   
   The phrasing of the `read.streaming.skip_compaction` configuration's description is VERY CONFUSING. 
   
   As of now, the description is as such:
   
   ```txt
   Whether to skip compaction instants for streaming read, there are two cases that this option can be used to avoid reading duplicates:
   
   1) you are definitely sure that the consumer reads faster than any compaction instants, usually with delta time compaction strategy that is long enough, for e.g, one week;
   
   2) changelog mode is enabled, this option is a solution to keep data integrity
   ```
   
   What I understand from this is:
   You can enable (set the configuration value to true) `read.streaming.skip_compaction` to prevent reading of duplicates under these 2 cases:
   
   1) you are definitely sure that the consumer reads faster than any compaction instants, usually with delta time compaction strategy that is long enough, for e.g, one week;
   
   # Consumer reads FASTER than compaction instants
   The consumer reading faster than any compaction instants means that compaction completes more slowly than the consumer reads.
   
   As such, compaction will complete after reading. A concrete example is shown below. I am not sure if my understanding is correct; please correct me for any conceptual mistakes.
   
   ## Read iteration 1
   Say the commit timeline looks like this at read iteration 1:
   1. compaction plan at `instant-0` is created but still ongoing
   2. deltacommit at `instant-1` has completed
   
   The read consumer will only read out newly inserted rows in deltacommit @ `instant-1`.
   
   Issued instant is updated as such:
   ```txt
   issuedInstant=1
   ```
   
   **Timeline when performing read iteration 1:**
   
   ```
   0.compaction.requested
   0.compaction.inflight
   
   1.deltacommit.requested
   1.deltacommit.inflight
   1.deltacommit
   ```
   
   ## Read iteration 2
   On read iteration 2:
   1. compaction plan at `instant-0` has completed
   2. deltacommit at `instant-2` has completed
   
   Since the `issuedInstant=1`, InstantRange is as such:
   ```txt
   InstantRange[type=OPEN_CLOSE, start=1, end=2]
   ```
   
   The read consumer will read out the newly inserted rows in the deltacommit @ `instant-2`. At this point, the rows generated by the base parquet files at compaction `instant-0` will be ignored, since the data file iterators skip rows whose `_hoodie_commit_time` lies outside the instantRange (see the sketch after the timeline below).
   
   TBH, I am not sure if the parquet files will be fetched. Regardless, even if they are fetched, the data file iterators will skip over the rows since they will NEVER fall within the instantRange of the current read iteration.
   
   **Timeline when performing read iteration 2:**
   
   ```
   0.compaction.requested
   0.compaction.inflight
   0.commit
   
   1.deltacommit.requested
   1.deltacommit.inflight
   1.deltacommit
   
   2.deltacommit.requested
   2.deltacommit.inflight
   2.deltacommit
   ```
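   
   A rough sketch of the row-level filter described above, using a simplified stand-in for Hudi's `InstantRange` (string-comparable instant times assumed; this is not the actual Hudi class):
   
   ```java
   // Minimal sketch of an OPEN_CLOSE instant range check: a row is emitted
   // only if start < _hoodie_commit_time <= end. Instant times are assumed
   // to be string-comparable, which holds for fixed-width Hudi timestamps.
   public class OpenCloseInstantRangeSketch {
     private final String start; // exclusive
     private final String end;   // inclusive
   
     public OpenCloseInstantRangeSketch(String start, String end) {
       this.start = start;
       this.end = end;
     }
   
     boolean isInRange(String commitTime) {
       return commitTime.compareTo(start) > 0 && commitTime.compareTo(end) <= 0;
     }
   
     public static void main(String[] args) {
       // InstantRange[type=OPEN_CLOSE, start=1, end=2] from read iteration 2
       OpenCloseInstantRangeSketch range = new OpenCloseInstantRangeSketch("1", "2");
       System.out.println(range.isInRange("0")); // false: compacted rows from instant-0 are skipped
       System.out.println(range.isInRange("2")); // true: rows from deltacommit instant-2 are read
     }
   }
   ```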
   
   ## Configuration description
   Circling back to the configuration description: when a consumer is reading faster than a compaction instant, the possibility of duplicated data being read out (due to the same records existing in both the base parquet file and the delta log file) is virtually 0. 
   
   `read.streaming.skip_compaction` SHOULD NOT be used to avoid duplicates if the **consumer reads faster than any compaction instants**. This is why I feel the original description is misleading.
   
   ## Proposed change
   That being said, my phrasing in the initial change is pretty misleading too.
   
   What I was trying to say is:
   
   `read.streaming.skip_compaction` should only be enabled when `hoodie.compaction.preserve.commit.metadata` is modified to its non-default value of `false`, and only if the compaction plan completes before a deltacommit that is to be read in the next read iteration.
   
   Building upon the previous examples, suppose a read iteration 3 is performed with the following configurations:
   ```properties
   hoodie.compaction.preserve.commit.metadata=false (non-default value)
   read.streaming.skip_compaction=false (default value)
   ```
   
   **Timeline when performing read iteration 3:** 
   
   `instant-3` and `instant-4` have completed and will be read in during read iteration 3.
   
   ```txt
   3.compaction.requested
   3.compaction.inflight
   3.commit
   
   4.deltacommit.requested
   4.deltacommit.inflight
   4.deltacommit
   
   issuedInstant=2
   InstantRange[type=OPEN_CLOSE, start=2, end=4]
   ```
   
   At this point, the newly compacted rows (which have already been read during read iteration 2) in the base-parquet files generated in `instant-3` will have a `_hoodie_commit_time` of `3`. (commit metadata is not preserved, hence overwritten)
   
   This falls within the InstantRange of the current read iteration, causing records that were already read in read iteration 2 to be read out again, i.e. duplicated data (see the check below).
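   
   Plugging the walkthrough's numbers into the same OPEN_CLOSE check (self-contained, for illustration only):
   
   ```java
   // Read iteration 3 from the walkthrough above: the compacted rows'
   // overwritten _hoodie_commit_time of "3" falls inside
   // InstantRange[type=OPEN_CLOSE, start=2, end=4], so they are emitted again.
   public class DuplicateReadDemo {
     public static void main(String[] args) {
       String start = "2";
       String end = "4";
       String overwrittenCommitTime = "3"; // rewritten by compaction at instant-3
       boolean emitted = overwrittenCommitTime.compareTo(start) > 0
           && overwrittenCommitTime.compareTo(end) <= 0;
       System.out.println(emitted); // true -> rows already read in iteration 2 are re-read
     }
   }
   ```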
   
   As such, `read.streaming.skip_compaction` should be used to avoid reading duplicates in such a case, when the user is definitely sure that compaction instants complete before the next deltacommit to be read in. 
   
   # Disclaimer
   I am ignoring changelog mode's description as I have yet to test this configuration under such a use case.
   
   We will also need to sync up the changes here with #6855 





[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272754421

   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969) 
   * c1f659db4e4512f0a3dea07764a5542473b14da7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272829473

   ## CI report:
   
   * c1f659db4e4512f0a3dea07764a5542473b14da7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1265078737

   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org






[GitHub] [hudi] danny0405 commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1095368804


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");

Review Comment:
   No matter whether `HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key()` is true or false, the base files would be skipped; check the latest changes to `IncrementalInputSplits` again for details.
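   
   For illustration, a conceptual sketch (not Hudi's actual `IncrementalInputSplits` code) of what skipping compaction instants means for split generation, assuming a simplified instant model:
   
   ```java
   import java.util.List;
   import java.util.stream.Collectors;
   
   // Conceptual sketch only: when skipCompaction is enabled, instants produced
   // by compaction ("commit" actions on a MOR timeline) are dropped before
   // input splits are generated, so their base files are never handed to the
   // reader -- regardless of whether commit metadata was preserved.
   public final class SkipCompactionFilterSketch {
     record Instant(String timestamp, String action) {}
   
     static List<Instant> filterInstants(List<Instant> timeline, boolean skipCompaction) {
       return timeline.stream()
           .filter(i -> !(skipCompaction && i.action().equals("commit")))
           .collect(Collectors.toList());
     }
   
     public static void main(String[] args) {
       List<Instant> timeline = List.of(
           new Instant("3", "commit"),       // compaction result
           new Instant("4", "deltacommit")); // regular delta commit
       System.out.println(filterInstants(timeline, true)); // only the deltacommit remains
     }
   }
   ```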







[GitHub] [hudi] voonhous commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
voonhous commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r990894756


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -279,8 +279,9 @@ private FlinkOptions() {
       .defaultValue(false)// default read as batch
       .withDescription("Whether to skip compaction instants for streaming read,\n"
           + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+          + "1) `hoodie.compaction.preserve.commit.metadata` is set to `false` and you are definitely sure that the "
+          + "consumer reads faster than any compaction instants, usually with delta time compaction strategy that is "

Review Comment:
   Understand that `hoodie.compaction.preserve.commit.metadata` is `true` by default. 
   
   The default values for the configuration keys of interest are as such:
   ```properties
   hoodie.compaction.preserve.commit.metadata=true
   read.streaming.skip_compaction=false
   ```
   
   The phrasing of the `read.streaming.skip_compaction` configuration's description is VERY CONFUSING. 
   
   As of now, the description is as such:
   
   ```txt
   Whether to skip compaction instants for streaming read, there are two cases that this option can be used to avoid reading duplicates:
   
   1) you are definitely sure that the consumer reads faster than any compaction instants, usually with delta time compaction strategy that is long enough, for e.g, one week;
   
   2) changelog mode is enabled, this option is a solution to keep data integrity
   ```
   
   What I understand from this is:
   You can enable (i.e. set the configuration value to `true`) `read.streaming.skip_compaction` to prevent reading of duplicates under these two cases:
   
   1) you are definitely sure that the consumer reads faster than any compaction instants, usually with delta time compaction strategy that is long enough, for e.g, one week;
   
   # Consumer reads FASTER than compaction instants
   A consumer reading faster than any compaction instant means that compaction is slower than the consumer's reads. 
   
   As such, compaction will complete after reading. A concrete example is shown below. I am not sure if my understanding is correct; please correct me for any conceptual mistakes.
   
   ## Read iteration 1
   Say the commit timeline looks like this at read iteration 1:
   1. compaction plan at `instant-0` is created but still ongoing
   2. deltacommit at `instant-1` has completed
   
   The read consumer will only read out newly inserted rows in deltacommit @ `instant-1`.
   
   Issued instant is updated as such:
   ```txt
   issuedInstant=1
   ```
   
   ```
   0.compaction.requested
   0.compaction.inflight
   
   1.deltacommit.requested
   1.deltacommit.inflight
   1.deltacommit
   ```
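   
   As a side note on the mechanism (a sketch of my understanding, not actual Hudi code), the streaming source remembers the last issued instant and uses it as the exclusive start of the next `OPEN_CLOSE` range, with the latest completed instant as the inclusive end:
   
   ```java
   // Sketch only: derive the next read range from the remembered issuedInstant.
   static String[] nextOpenCloseRange(String issuedInstant, String latestInstant) {
     return new String[] {issuedInstant, latestInstant}; // (start, end]
   }
   
   // e.g. nextOpenCloseRange("1", "2") ~ InstantRange[type=OPEN_CLOSE, start=1, end=2]
   ```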
   
   ## Read iteration 2
   On read iteration 2:
   1. compaction plan at `instant-0` has completed
   2. deltacommit at `instant-2` has completed
   
   Since the `issuedInstant=1`, InstantRange is as such:
   ```txt
   InstantRange[type=OPEN_CLOSE, start=1, end=2]
   ```
   
   The read consumer will read out newly inserted rows in deltacommit @ `instant-2`. At this point, the rows generated by the base parquet files at compaction `instant-0` will be ignored, since the data file iterators skip rows whose `_hoodie_commit_time` lies outside the InstantRange.
   
   ```
   0.compaction.requested
   0.compaction.inflight
   0.commit
   
   1.deltacommit.requested
   1.deltacommit.inflight
   1.deltacommit
   
   2.deltacommit.requested
   2.deltacommit.inflight
   2.deltacommit
   ```
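   
   The row-level filtering mentioned above can be pictured with a minimal sketch; this only illustrates the `OPEN_CLOSE` semantics implied by the examples (`start` exclusive, `end` inclusive, instant times compared lexicographically) and is not Hudi's actual implementation:
   
   ```java
   // Sketch only: a row survives the scan iff its _hoodie_commit_time
   // lies in (start, end].
   static boolean inOpenCloseRange(String commitTime, String start, String end) {
     return commitTime.compareTo(start) > 0 && commitTime.compareTo(end) <= 0;
   }
   
   // With InstantRange[type=OPEN_CLOSE, start=1, end=2]:
   // inOpenCloseRange("0", "1", "2") -> false (compacted rows from instant-0 are skipped)
   // inOpenCloseRange("2", "1", "2") -> true  (rows from deltacommit instant-2 are read)
   ```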
   
   ## Configuration description
   Circling back to the configuration description: when a consumer is reading faster than a compaction instant, the possibility of duplicated data being read out (due to records existing in both the base parquet file and the delta log file) is virtually zero. 
   
   `read.streaming.skip_compaction` SHOULD NOT be used to avoid duplicates if the **consumer reads faster than any compaction instants**. This is why I feel the original description is misleading.
   
   ## Proposed change
   That being said, my phrasing in the initial change is pretty misleading too.
   
   What I was trying to say is:
   
   `read.streaming.skip_compaction` should only be enabled when `hoodie.compaction.preserve.commit.metadata` is set to its non-default value of `false` and the compaction plan completes before the deltacommit to be read in the next read iteration.
   
   Building upon the previous examples, suppose a read iteration 3 is performed with the following configurations:
   ```properties
   hoodie.compaction.preserve.commit.metadata=false (non-default value)
   read.streaming.skip_compaction=false (default value)
   ```
   
   Timeline: 
   `instant-3` and `instant-4` have completed.
   
   ```txt
   3.compaction.requested
   3.compaction.inflight
   3.commit
   
   4.deltacommit.requested
   4.deltacommit.inflight
   4.deltacommit
   
   issuedInstant=2
   InstantRange[type=OPEN_CLOSE, start=2, end=4]
   ```
   
   At this point, the newly compacted rows (which have already been read during read iteration 2) in the base-parquet files generated in `instant-3` will have a `_hoodie_commit_time` of `3`. (commit metadata is not preserved, hence overwritten)
   
   This falls within the InstantRange of the current read iteration, causing the records that were already read in read iteration 2 to be read out again, i.e. duplicated data is read out.
   
   As such, `read.streaming.skip_compaction` should be used to avoid reading duplicates in such a case, when the user is definitely sure that the compaction instants complete before the next deltacommit to be read in. 
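   
   As an illustration only (the table name, schema, and path below are made up, not an official recommendation), such a streaming read could be declared from Flink like this:
   
   ```java
   import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
   import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
   
   StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
   StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
   
   // Hypothetical MOR streaming read that skips compaction instants, for the
   // case where the writer runs with hoodie.compaction.preserve.commit.metadata=false:
   tableEnv.executeSql(
       "CREATE TABLE hudi_read_tbl (uuid STRING, name STRING, ts TIMESTAMP(3)) WITH (\n"
       + "  'connector' = 'hudi',\n"
       + "  'path' = 'file:///tmp/hudi_tbl',\n"
       + "  'table.type' = 'MERGE_ON_READ',\n"
       + "  'read.streaming.enabled' = 'true',\n"
       + "  'read.streaming.skip_compaction' = 'true'\n"
       + ")");
   ```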
   
   # Disclaimer
   I am ignoring changelog mode's description as I have yet to test this configuration under such a use case.
   
   We will also need to sync up the changes here with #6855 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411414538

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969",
       "triggerID" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094",
       "triggerID" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * c1f659db4e4512f0a3dea07764a5542473b14da7 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094) 
   * 5b1de9e25fa1e58d90d315b52683b3a506ccbdd4 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1092782380


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when `" + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + "` is set to `false`.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");
 
   // this option is experimental
   public static final ConfigOption<Boolean> READ_STREAMING_SKIP_CLUSTERING = ConfigOptions
           .key("read.streaming.skip_clustering")
           .booleanType()
           .defaultValue(false)
-          .withDescription("Whether to skip clustering instants for streaming read,\n"
-              + "to avoid reading duplicates");
+          .withDescription("Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read "
+              + "to improve read performance.\n"
+              + "This option toggled to `true` to avoid duplicates when: \n"

Review Comment:
   ```suggestion
                 + "This option is toggled to true to avoid duplicates when: \n"
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1410210424

   @voonhous, are you planning to follow up and update this pull request according to the above review suggestions? If yes, please also update the description of `read.streaming.skip_clustering`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411537136

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969",
       "triggerID" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094",
       "triggerID" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830",
       "triggerID" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5b1de9e25fa1e58d90d315b52683b3a506ccbdd4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411986756

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969",
       "triggerID" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094",
       "triggerID" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830",
       "triggerID" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f5bc53c815f1e7b367ac4b8a245439d0bc40970",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14841",
       "triggerID" : "0f5bc53c815f1e7b367ac4b8a245439d0bc40970",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 0f5bc53c815f1e7b367ac4b8a245439d0bc40970 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14841) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
voonhous commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272935450

   Understood. 
   
   May I ask when `read.streaming.skip_compaction=true` should be used? I can't seem to find any cases where duplicated data may be read out when `read.streaming.skip_compaction=true/false`. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1265241676

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969",
       "triggerID" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
danny0405 commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272928930

   Did you notice that we recently landed a fix here: https://github.com/apache/hudi/commit/f171496de244992958fd3fd22fbcd2a7dc62c7a2. If `read.streaming.skip_compaction` is turned on, only log files are considered when reading.
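   
   To make that concrete, here is a minimal sketch of what "only log files are considered" could look like for a merge-on-read file slice; this is an assumption about the behavior, not the code in that commit:
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   // Sketch only: when skipCompaction is true, the compacted base file is
   // left out and just the delta log files of the slice are scanned.
   static List<String> filesToRead(String baseFile, List<String> logFiles, boolean skipCompaction) {
     List<String> files = new ArrayList<>(logFiles);
     if (!skipCompaction && baseFile != null) {
       files.add(0, baseFile);
     }
     return files;
   }
   ```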


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1095412714


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");

Review Comment:
   >  if read.streaming.skip_compaction=false and HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key()=false, will there be duplicated data?
   
   Yes, it may introduce duplicates, let's make the doc more clear.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "hudi-bot (via GitHub)" <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411758602

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969",
       "triggerID" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=12094",
       "triggerID" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "triggerType" : "PUSH"
     }, {
       "hash" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830",
       "triggerID" : "5b1de9e25fa1e58d90d315b52683b3a506ccbdd4",
       "triggerType" : "PUSH"
     }, {
       "hash" : "0f5bc53c815f1e7b367ac4b8a245439d0bc40970",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14841",
       "triggerID" : "0f5bc53c815f1e7b367ac4b8a245439d0bc40970",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 5b1de9e25fa1e58d90d315b52683b3a506ccbdd4 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14830) 
   * 0f5bc53c815f1e7b367ac4b8a245439d0bc40970 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=14841) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] voonhous commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "voonhous (via GitHub)" <gi...@apache.org>.
voonhous commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1411410995

   @SteNicholas Done, could you please take a look? 
   
   I updated the PR's description to match the changes too.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "danny0405 (via GitHub)" <gi...@apache.org>.
danny0405 commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1095369349


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");
 
   // this option is experimental
   public static final ConfigOption<Boolean> READ_STREAMING_SKIP_CLUSTERING = ConfigOptions
           .key("read.streaming.skip_clustering")
           .booleanType()
           .defaultValue(false)
-          .withDescription("Whether to skip clustering instants for streaming read,\n"
-              + "to avoid reading duplicates");
+          .withDescription("Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read "
+              + "to improve read performance.\n"
+              + "This option toggled to true to avoid duplicates when: \n"
+              + "1) you are definitely sure that the consumer reads [faster than/completes before] any clustering instants "

Review Comment:
   We actually filter the clustering instants by files not rows, so same with compaction, the ` HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key()` has no effect here.
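   
   A minimal sketch of the file-level filtering described above (an assumption for illustration, not Hudi's actual code): whole file slices written by the skipped instants are dropped before any rows are read, so the per-row commit-time metadata never comes into play.
   
   ```java
   import java.util.List;
   import java.util.Set;
   import java.util.stream.Collectors;
   
   // Sketch only: slices whose base instant belongs to a skipped
   // clustering/compaction instant are filtered out wholesale.
   record Slice(String baseInstantTime, String path) {}
   
   static List<Slice> dropSkippedInstants(List<Slice> slices, Set<String> skipped) {
     return slices.stream()
         .filter(s -> !skipped.contains(s.baseInstantTime()))
         .collect(Collectors.toList());
   }
   ```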



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
danny0405 commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r990583099


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -279,8 +279,9 @@ private FlinkOptions() {
       .defaultValue(false)// default read as batch
       .withDescription("Whether to skip compaction instants for streaming read,\n"
           + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+          + "1) `hoodie.compaction.preserve.commit.metadata` is set to `false` and you are definitely sure that the "
+          + "consumer reads faster than any compaction instants, usually with delta time compaction strategy that is "

Review Comment:
   Thanks for the enhancement, we need the compaction to preserve the commit time metadata field, and it is by default true.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272752482

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969",
       "triggerID" : "7fbe39a558949b0e0e8938546aad96e5ba0c1956",
       "triggerType" : "PUSH"
     }, {
       "hash" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "c1f659db4e4512f0a3dea07764a5542473b14da7",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 7fbe39a558949b0e0e8938546aad96e5ba0c1956 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11969) 
   * c1f659db4e4512f0a3dea07764a5542473b14da7 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] danny0405 commented on pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction config

Posted by GitBox <gi...@apache.org>.
danny0405 commented on PR #6856:
URL: https://github.com/apache/hudi/pull/6856#issuecomment-1272938059

   > I can't seem to find any cases where duplicated data may be read out when read.streaming.skip_compaction=true/false
   
   You are right, but when `read.streaming.skip_compaction` is false, the parquet base files would still be read and then all of their messages dropped, which causes unnecessary resource waste and a read performance regression.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1092781898


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when `" + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + "` is set to `false`.\n"
           + "2) changelog mode is enabled, this option is a solution to keep data integrity");
 
   // this option is experimental
   public static final ConfigOption<Boolean> READ_STREAMING_SKIP_CLUSTERING = ConfigOptions
           .key("read.streaming.skip_clustering")
           .booleanType()
           .defaultValue(false)
-          .withDescription("Whether to skip clustering instants for streaming read,\n"
-              + "to avoid reading duplicates");
+          .withDescription("Whether to skip clustering instants to avoid reading base files of clustering operations for streaming read "
+              + "to improve read performance.\n"
+              + "This option toggled to `true` to avoid duplicates when: \n"
+              + "1) you are definitely sure that the consumer reads [faster than/completes before] any clustering instants "
+              + "when `" + HoodieClusteringConfig.PRESERVE_COMMIT_METADATA.key() + "` is set to `false`.\n");

Review Comment:
   ```suggestion
                 + "when " + HoodieClusteringConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n");
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] SteNicholas commented on a diff in pull request #6856: [HUDI-4968] Update misleading read.streaming.skip_compaction/skip_clustering config

Posted by "SteNicholas (via GitHub)" <gi...@apache.org>.
SteNicholas commented on code in PR #6856:
URL: https://github.com/apache/hudi/pull/6856#discussion_r1092781756


##########
hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/configuration/FlinkOptions.java:
##########
@@ -297,19 +299,22 @@ private FlinkOptions() {
       .key("read.streaming.skip_compaction")
       .booleanType()
       .defaultValue(false)// default read as batch
-      .withDescription("Whether to skip compaction instants for streaming read,\n"
-          + "there are two cases that this option can be used to avoid reading duplicates:\n"
-          + "1) you are definitely sure that the consumer reads faster than any compaction instants, "
-          + "usually with delta time compaction strategy that is long enough, for e.g, one week;\n"
+      .withDescription("Whether to skip compaction instants and avoid reading compacted base files for streaming read to improve read performance.\n"
+          + "There are two cases that this option can be used to avoid reading duplicates:\n"
+          + "1) you are definitely sure that the consumer reads [faster than/completes before] any compaction instants "
+          + "when `" + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + "` is set to `false`.\n"

Review Comment:
   ```suggestion
             + "when " + HoodieCompactionConfig.PRESERVE_COMMIT_METADATA.key() + " is set to false.\n"
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org