Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/09/02 20:28:01 UTC

[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6196: [HUDI-4071] Enable schema reconciliation by default

alexeykudinkin commented on code in PR #6196:
URL: https://github.com/apache/hudi/pull/6196#discussion_r961984500


##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieCommonConfig.java:
##########
@@ -38,7 +38,7 @@ public class HoodieCommonConfig extends HoodieConfig {
 
   public static final ConfigProperty<Boolean> RECONCILE_SCHEMA = ConfigProperty
       .key("hoodie.datasource.write.reconcile.schema")
-      .defaultValue(false)
+      .defaultValue(true)

Review Comment:
   Initially I was not in favor of this change, but having thought about it a little more, especially in light of https://github.com/apache/hudi/pull/6358, I think this is the right thing to do. For example, after #6358 we'd be allowing writes to go through that might have columns dropped in the new batch. Now, there are 2 scenarios, depending on whether reconciliation is enabled or not:
   
   1. If reconciliation is _enabled_: we favor the table's schema and use it as the _writer-schema_. In that case we rewrite the incoming batch into the table's schema before applying it to the table.
   
   2. If reconciliation is _disabled_: we favor the incoming batch's schema and use it as the _writer-schema_. In this case, for example for COW, we read the table in its existing schema, but the new base files are written in the writer's schema (i.e. with the column dropped).
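   
   To make the two scenarios concrete, here is a toy sketch of the choice. It is not Hudi's actual code: `WriterSchemaSketch`, `pickWriterSchema` and `rewrite` are hypothetical names, schemas are modeled as plain lists of column names, and filling a missing column with `null` is an assumption about what "rewriting into the table's schema" would look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of the two scenarios, not Hudi's implementation.
final class WriterSchemaSketch {

    // Hypothetical helper: picks the writer schema (an ordered list of column names).
    static List<String> pickWriterSchema(List<String> tableSchema,
                                         List<String> batchSchema,
                                         boolean reconcileSchema) {
        // Enabled: favor the table's schema. Disabled: favor the incoming batch's schema.
        return new ArrayList<>(reconcileSchema ? tableSchema : batchSchema);
    }

    // Hypothetical helper: rewrites one record into the writer schema.
    // Assumption: columns missing from the record are filled with null,
    // and columns absent from the writer schema are left out.
    static Map<String, Object> rewrite(Map<String, Object> record, List<String> writerSchema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (String column : writerSchema) {
            out.put(column, record.get(column));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tableSchema = Arrays.asList("id", "name", "price");
        List<String> batchSchema = Arrays.asList("id", "name"); // "price" dropped in the new batch

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("id", 1);
        record.put("name", "a");

        // Scenario 1 (reconciliation enabled): record is rewritten into the table's schema.
        System.out.println(rewrite(record, pickWriterSchema(tableSchema, batchSchema, true)));
        // Scenario 2 (reconciliation disabled): record keeps the batch's schema.
        System.out.println(rewrite(record, pickWriterSchema(tableSchema, batchSchema, false)));
    }
}
```

   In scenario 1 the rewritten record still carries the `price` column (empty), so the table's schema is preserved; in scenario 2 the record, and hence the new base file in this model, has no `price` column at all.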
   
   Both of these approaches are legitimate and could be preferred in different circumstances. What's important here for us is to pick the right default setting that would minimize the _surprise effect_. 
   
   Having reflected on this for some time now, I think that enabling reconciliation by default makes more sense, as it protects the table's schema from accidental mishaps in the incoming batches. And if somebody prefers flow #2, they could easily opt in to it by simply disabling reconciliation.
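   
   For illustration, opting in to flow #2 would amount to setting the key from the diff above to `false`. A minimal sketch, assuming the write options are collected in a plain `java.util.Properties` object (the class and method names are hypothetical, and how the options reach the writer is out of scope here):

```java
import java.util.Properties;

// Illustrative only: shows the opt-out as a plain key/value setting.
final class ReconcileOptOutSketch {

    // Config key taken from the diff under review.
    static final String RECONCILE_SCHEMA_KEY = "hoodie.datasource.write.reconcile.schema";

    // Hypothetical helper: builds write options with reconciliation disabled,
    // so the incoming batch's schema would be used as the writer schema.
    static Properties writeOptionsWithoutReconciliation() {
        Properties props = new Properties();
        props.setProperty(RECONCILE_SCHEMA_KEY, "false");
        return props;
    }
}
```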
   
   WDYT?


