Posted to commits@hudi.apache.org by "kazdy (Jira)" <ji...@apache.org> on 2022/09/17 11:18:00 UTC

[jira] [Created] (HUDI-4872) Support automatic schema evolution for SQL MERGE INTO with UPDATE * / INSERT *

kazdy created HUDI-4872:
---------------------------

             Summary: Support automatic schema evolution for SQL MERGE INTO with UPDATE * / INSERT *
                 Key: HUDI-4872
                 URL: https://issues.apache.org/jira/browse/HUDI-4872
             Project: Apache Hudi
          Issue Type: Improvement
            Reporter: kazdy


I tried using a MERGE INTO statement with UPDATE * and INSERT * clauses, with full schema evolution enabled.
I noticed that during the insert, new columns from the incoming batch (columns that do not yet exist in the target table) are dropped and the target schema is applied. No warning is raised and the write does not fail.

Could users therefore be allowed to evolve the schema automatically on MERGE INTO operations?
I guess this should only be supported when the merge operation uses UPDATE SET * and INSERT *.

*Expected behavior*

When incoming data is missing columns that are already declared in the target table, these should be injected with default/null values.
When incoming data has new columns that are not yet declared in the target table, these should be added to the target table.
When incoming data has both missing columns and new columns, the missing columns should be injected with default/null values and the new columns should be added to the target table.
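
The three cases above can be illustrated with a minimal, Hudi-independent sketch in plain Python. It is not how Hudi implements reconciliation: the names evolve_schema and conform_rows are hypothetical, schemas are modelled as flat name-to-type dicts (no nested types), and rejecting a type change is an assumption, since this issue does not say how type conflicts should be handled.

```python
from typing import Any, Dict, List

# Hypothetical model: a schema is an ordered mapping of column name -> type name.
Schema = Dict[str, str]


def evolve_schema(target: Schema, incoming: Schema) -> Schema:
    """Return the target schema extended with columns found only in incoming."""
    merged = dict(target)  # keep target columns and their order
    for name, dtype in incoming.items():
        if name not in merged:
            merged[name] = dtype  # new column: append it to the table schema
        elif merged[name] != dtype:
            # Assumption: type changes are out of scope for this sketch.
            raise ValueError(f"type conflict on column {name!r}")
    return merged


def conform_rows(rows: List[Dict[str, Any]], schema: Schema) -> List[Dict[str, Any]]:
    """Project rows onto schema, filling columns a row lacks with None (null)."""
    return [{name: row.get(name) for name in schema} for row in rows]


if __name__ == "__main__":
    target = {"id": "int", "name": "string", "price": "double"}
    incoming = {"id": "int", "name": "string", "discount": "double"}
    merged = evolve_schema(target, incoming)
    batch = [{"id": 1, "name": "a", "discount": 0.1}]
    print(merged)
    print(conform_rows(batch, merged))
```

Here the incoming batch lacks price and adds discount, so the evolved schema gains discount while the row receives a null for price.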

New columns should be reflected in the metastore table schema.

Complex types and nested schemas should be supported.

Currently, something similar is supported for DataFrame writes when both the schema reconciliation and schema evolution configs are enabled; see HUDI-4276.
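
As a rough illustration of the DataFrame-side setup, the sketch below collects the two settings into write options. The option keys are my assumption of what "schema reconciliation" and "schema evolution" refer to here (they are not named in this issue) and should be checked against the documentation of the Hudi release in use; with_schema_evolution is a hypothetical helper.

```python
from typing import Dict

# Assumed option keys; verify them against the Hudi version you run.
SCHEMA_EVOLUTION_OPTIONS: Dict[str, str] = {
    "hoodie.datasource.write.reconcile.schema": "true",  # schema reconciliation
    "hoodie.schema.on.read.enable": "true",              # full schema evolution
}


def with_schema_evolution(base_options: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of base_options with the schema evolution options added."""
    options = dict(base_options)  # do not mutate the caller's dict
    options.update(SCHEMA_EVOLUTION_OPTIONS)
    return options
```

The resulting dict could then be passed to a PySpark write such as df.write.format("hudi").options(**options), alongside the usual table name and record key settings.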

From a user-experience perspective, it would be easier to have a _mergeSchema_ config (as for the Parquet Spark datasource) that enables this feature for both Spark SQL and DataFrame writes.

Thread from dev mailing list as a reference:
[https://lists.apache.org/thread/kr59hh7yqr2c1y33kzfv3n97h6ydbz9b]

GH issue: [https://github.com/apache/hudi/issues/5899]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)