Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/07/23 10:32:25 UTC

[GitHub] [hudi] codope opened a new pull request, #6196: [HUDI-4071] Enable schema reconciliation by default

codope opened a new pull request, #6196:
URL: https://github.com/apache/hudi/pull/6196



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1241126360

   ## CI report:
   
   * 1dfb9ffa267bce2c73bdc10e285a3ab2d3e15939 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11252) 
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240404127

   ## CI report:
   
   * 5a9d4eb8ff3160e20c534d4eff1912a07ba4e9fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11244) 
   * 04542752f07caf843d43cc25efacfb487b5b79d3 UNKNOWN
   


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r974656615


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   @kazdy I think we just need to disambiguate our configuration so that users can clearly understand what they can achieve and how (see my previous comment https://github.com/apache/hudi/pull/6196#discussion_r961984500): what you're describing can be achieved today by enabling both Reconciliation and Schema Evolution.
   


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r961984500


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   Initially I was not in favor of this change, but having thought about it a little more, especially in light of https://github.com/apache/hudi/pull/6358, I think this is the right thing to do: for example, after #6358 we'd be allowing writes to go through that might have columns dropped in the new batch. There are 2 scenarios, depending on whether reconciliation is enabled:
   
   1. If reconciliation is _enabled_: we will favor the table's schema and use it as the _writer-schema_. In that case we will rewrite the incoming batch into the table's schema before applying it to the table.
   
   2. If reconciliation is _disabled_: we will favor the incoming batch's schema and use it as the _writer-schema_. In this case, for example for COW, we will read the table in its existing schema, but the new base files will be written in the writer's schema (i.e. with the column dropped).
   
   Both of these approaches are legitimate and could be preferred in different circumstances. What's important for us here is to pick the default setting that minimizes the _surprise effect_.
   
   Having reflected on this for some time now, I think that enabling reconciliation by default makes more sense, as it protects the table's schema from accidental mishaps in the incoming batches. And if somebody prefers flow #2, they could easily opt in to it by simply disabling reconciliation.
   
   WDYT?
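
The two scenarios above can be illustrated with a toy model. This is only a sketch under loud assumptions: schemas are plain lists of column names rather than Avro schemas, records are plain maps, and the class and method names (`WriterSchemaSketch`, `writerSchema`, `rewrite`) are hypothetical and do not come from the Hudi codebase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: not Hudi's implementation. All names here are hypothetical.
class WriterSchemaSketch {

    // Choose the writer schema for an incoming batch.
    // reconcile == true  -> favor the table's schema (scenario 1)
    // reconcile == false -> favor the incoming batch's schema (scenario 2)
    static List<String> writerSchema(List<String> tableSchema,
                                     List<String> batchSchema,
                                     boolean reconcile) {
        return new ArrayList<>(reconcile ? tableSchema : batchSchema);
    }

    // Rewrite one record into the given schema: columns missing from the
    // record become null, columns absent from the schema are left out.
    static Map<String, Object> rewrite(Map<String, Object> record, List<String> schema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : schema) {
            out.put(column, record.get(column));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("id", "name", "price");
        List<String> batch = Arrays.asList("id", "name"); // "price" dropped in the new batch

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1);
        record.put("name", "widget");

        // Scenario 1: prints {id=1, name=widget, price=null}
        System.out.println(rewrite(record, writerSchema(table, batch, true)));
        // Scenario 2: prints {id=1, name=widget}
        System.out.println(rewrite(record, writerSchema(table, batch, false)));
    }
}
```

In this model, scenario 1 keeps `price` in every written record (filled with null), while scenario 2 produces output without that column, which is the difference the default setting decides.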





[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r974651748


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   @yihua it's more of a discussion of what the default behavior should be:
   
    - Should we (by default) favor the existing table's schema as the source of truth (SoT) and rewrite the incoming batch into it (unless Schema Evolution is enabled, in which case we will try to evolve the schema)?
    - Should we (by default) favor the incoming batch's schema as the schema we want the table to be rewritten in?
   
   I still think that the first option is safer as a default (optimizing for the least amount of surprise to the user).
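
Whichever default is chosen, a job can avoid depending on it by setting the option explicitly. The sketch below only builds an options map; the key is the one shown in the diff, while the class and helper names (`ReconcileOptionsSketch`, `writeOptions`) are hypothetical, and how the map is handed to a writer is deliberately left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi: pins the reconciliation flag explicitly
// so the job behaves the same regardless of the library default.
class ReconcileOptionsSketch {

    // Config key taken from the diff under review.
    static final String RECONCILE_KEY = "hoodie.datasource.write.reconcile.schema";

    static Map<String, String> writeOptions(boolean reconcile) {
        Map<String, String> options = new LinkedHashMap<>();
        options.put(RECONCILE_KEY, Boolean.toString(reconcile));
        return options;
    }

    public static void main(String[] args) {
        // Favor the table's schema (first option in the list above).
        System.out.println(writeOptions(true));
        // Favor the incoming batch's schema (second option).
        System.out.println(writeOptions(false));
    }
}
```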
   
   





[GitHub] [hudi] xiarixiaoyao commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r1004374358


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   @alexeykudinkin @kazdy, tables written with schema evolution cannot currently be read by Hive and Presto, but we already have PRs to support that:
   https://github.com/apache/hudi/pull/6989
   https://github.com/prestodb/presto/pull/18557
   https://github.com/apache/hudi/pull/7045
   
   Once those PRs are merged, I think it will be OK.





[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193108610

   ## CI report:
   
   * 59f8a0ed0a5d7ded9f3ffad67587c50874b3fb12 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10254) 
   


[GitHub] [hudi] codope commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1621780370

   Closing it as we first need to audit the full schema evolution scenarios.




[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240320365

   ## CI report:
   
   * 5a9d4eb8ff3160e20c534d4eff1912a07ba4e9fd UNKNOWN
   


[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r973511372


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   On the other hand, in its current form reconciliation doesn't allow the schema to evolve (unless comprehensive Schema Evolution is enabled), since it essentially always favors the table's schema (there's no way for you to add a new column, for example, other than switching off reconciliation).





[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193125451

   ## CI report:
   
   * 59f8a0ed0a5d7ded9f3ffad67587c50874b3fb12 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10254) 
   


[GitHub] [hudi] yihua commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r973558829


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   Is the difference mainly around the case of dropping a column?  





[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193108028

   ## CI report:
   
   * 59f8a0ed0a5d7ded9f3ffad67587c50874b3fb12 UNKNOWN
   


[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240561233

   ## CI report:
   
   * 5a9d4eb8ff3160e20c534d4eff1912a07ba4e9fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11244) 
   * 04542752f07caf843d43cc25efacfb487b5b79d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10262) 
   * 1dfb9ffa267bce2c73bdc10e285a3ab2d3e15939 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11252) 
   


[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240325589

   ## CI report:
   
   * 5a9d4eb8ff3160e20c534d4eff1912a07ba4e9fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11244) 
   


[GitHub] [hudi] yihua commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
yihua commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1224476707

   @alexeykudinkin could you also review this PR?




[GitHub] [hudi] codope closed pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope closed pull request #6196: [HUDI-4071] Enable schema reconciliation by default
URL: https://github.com/apache/hudi/pull/6196




[GitHub] [hudi] yihua commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
yihua commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r973559140


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   I’m wondering whether we should handle column drops separately instead of turning on “schema reconciliation” by default, e.g., we should still allow new columns to be added by default, instead of dropping them to favor the table’s schema, while properly handling the column drop (maybe with a different config?).
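
One way to picture the decoupled behavior suggested above is a merge that keeps every table column and still admits new ones. This is a sketch of a hypothetical policy, not of anything that exists in Hudi: schemas are modeled as lists of column names, and `DecoupledDropSketch` and `mergedSchema` are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical policy sketch, not Hudi behavior.
class DecoupledDropSketch {

    // Keep all table columns (a column missing from the batch is not dropped)
    // and append columns that appear only in the batch (additions are allowed).
    static List<String> mergedSchema(List<String> tableSchema, List<String> batchSchema) {
        List<String> merged = new ArrayList<>(tableSchema);
        for (String column : batchSchema) {
            if (!merged.contains(column)) {
                merged.add(column);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> table = Arrays.asList("id", "name", "price");
        // The batch drops "price" and adds "discount".
        List<String> batch = Arrays.asList("id", "name", "discount");

        // prints [id, name, price, discount]
        System.out.println(mergedSchema(table, batch));
    }
}
```

Under this policy the dropped column survives and the added column is accepted, which differs from both existing modes: favoring the table's schema would lose `discount`, and favoring the batch's schema would lose `price`.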





[GitHub] [hudi] codope commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
codope commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r965587959


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   I agree. That was precisely the intention behind flipping this default.





[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240555646

   ## CI report:
   
   * 5a9d4eb8ff3160e20c534d4eff1912a07ba4e9fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11244) 
   * 04542752f07caf843d43cc25efacfb487b5b79d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10262) 
   * 1dfb9ffa267bce2c73bdc10e285a3ab2d3e15939 UNKNOWN
   


[GitHub] [hudi] xushiyan commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
xushiyan commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193221383

   Some Spark SQL tests failed. Is it actually safe to enable this? I'm not sure whether this config could have unintended effects in some cases.




[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193163266

   ## CI report:
   
   * 59f8a0ed0a5d7ded9f3ffad67587c50874b3fb12 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10254) 
   * 04542752f07caf843d43cc25efacfb487b5b79d3 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     @hudi-bot supports the following commands:
   
    - `@hudi-bot run azure` re-run the last Azure build
   </details>




[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193163975

   ## CI report:
   
   * 59f8a0ed0a5d7ded9f3ffad67587c50874b3fb12 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10254) 
   * 04542752f07caf843d43cc25efacfb487b5b79d3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10262) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1193174022

   ## CI report:
   
   * 04542752f07caf843d43cc25efacfb487b5b79d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10262) 
   




[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
alexeykudinkin commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r975719469


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   @kazdy correct: once Schema Evolution becomes GA (cc @xiarixiaoyao), we will flip it to be on by default.
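
   Since a default can change between releases, a job that relies on one behaviour may prefer to pin the key explicitly. A minimal sketch, assuming the writer accepts its options as a string map; the class and method names are hypothetical, and only the key itself comes from the diff above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hudi: it only builds an options map.
public class ReconcileSchemaOptions {

    // Key copied verbatim from the diff under review.
    public static final String RECONCILE_SCHEMA_KEY = "hoodie.datasource.write.reconcile.schema";

    // Returns a copy of the given options with the reconcile flag set explicitly,
    // so the job does not depend on whatever the library default happens to be.
    public static Map<String, String> withReconcileSchema(Map<String, String> options, boolean enabled) {
        Map<String, String> copy = new HashMap<>(options);
        copy.put(RECONCILE_SCHEMA_KEY, Boolean.toString(enabled));
        return copy;
    }

    public static void main(String[] args) {
        Map<String, String> pinned = withReconcileSchema(new HashMap<>(), false);
        System.out.println(pinned.get(RECONCILE_SCHEMA_KEY)); // prints "false"
    }
}
```

   The resulting map would then be handed to whatever write API the job already uses.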





[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240410317

   ## CI report:
   
   * 5a9d4eb8ff3160e20c534d4eff1912a07ba4e9fd Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11244) 
   * 04542752f07caf843d43cc25efacfb487b5b79d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10262) 
   




[GitHub] [hudi] hudi-bot commented on pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
hudi-bot commented on PR #6196:
URL: https://github.com/apache/hudi/pull/6196#issuecomment-1240702987

   ## CI report:
   
   * 04542752f07caf843d43cc25efacfb487b5b79d3 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=10262) 
   * 1dfb9ffa267bce2c73bdc10e285a3ab2d3e15939 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=11252) 
   




[GitHub] [hudi] kazdy commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
kazdy commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r974709033


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   @alexeykudinkin AFAIK the Schema Evolution config exists because it's an experimental feature, and it will become GA soon? At that point this config should be either enabled by default or deprecated. Will this logic still hold then? I feel like Hudi's config surface is already very broad and therefore a bit hard to grasp, and users would appreciate a single switch instead of a combination of two.





[GitHub] [hudi] kazdy commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

Posted by GitBox <gi...@apache.org>.
kazdy commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r973573340


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   I'll add my 5 cents as a Hudi user :)
   AFAIK schema reconciliation was meant to apply the latest table schema to the incoming batch. But if a new batch contained new columns while some existing columns were missing, the new columns were added and the missing ones were dropped (at least on read; physically they still exist in the files).
   
   I feel like Hudi could have a `mergeSchema` option for both DataFrame writes and SQL MERGE INTO (currently the target table schema is applied), as in Delta, now Iceberg, and the Parquet datasource in Spark. It would behave the same as enabling both reconciliation and schema evolution today. Then reconcile schema could behave differently.
   
   When we ingest data, the team producing it does not always inform me about changes, and it would be nice to have a mechanism that can handle this. Currently, most Hudi users I know just create an uber schema and apply it to the DataFrame before writing; sometimes that is hard to do because of how the org we work for functions.
   
   For some context:
   #5899 - for mergeSchema in the MERGE INTO statement
   #5873 and #5452 - issues with reconcile schema
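
   To make the "uber schema" idea concrete, here is a minimal sketch of a union-style merge: table columns keep their order, new batch columns are appended, and columns missing from a record are filled with null rather than dropped. It uses plain Java collections only; the class and method names are hypothetical, and this is not how Hudi, Delta, or Iceberg implement schema merging.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative only: column names as strings, no Hudi or Spark APIs involved.
public class SchemaMergeSketch {

    // Union of column names: table columns first, then any columns only the batch has.
    public static List<String> mergeColumns(List<String> tableColumns, List<String> batchColumns) {
        Set<String> merged = new LinkedHashSet<>(tableColumns);
        merged.addAll(batchColumns);
        return new ArrayList<>(merged);
    }

    // Re-shape one record to the merged schema; columns absent from the record become null.
    public static Map<String, Object> conform(Map<String, Object> record, List<String> mergedColumns) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : mergedColumns) {
            out.put(column, record.get(column));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> table = List.of("id", "name", "price");
        List<String> batch = List.of("id", "price", "currency"); // "name" missing, "currency" new
        List<String> merged = mergeColumns(table, batch);
        System.out.println(merged); // [id, name, price, currency]

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1);
        record.put("price", 9.5);
        record.put("currency", "EUR");
        System.out.println(conform(record, merged)); // {id=1, name=null, price=9.5, currency=EUR}
    }
}
```

   A real implementation would also have to reconcile column types and nested fields, which this sketch ignores.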


