Posted to commits@hudi.apache.org by "jhchee (via GitHub)" <gi...@apache.org> on 2023/04/19 14:38:10 UTC

[GitHub] [hudi] jhchee opened a new issue, #8502: [SUPPORT] Does merge into Spark SQL supports schema evolution

jhchee opened a new issue, #8502:
URL: https://github.com/apache/hudi/issues/8502

   **Describe the problem you faced**
   I created a table with two columns, `userId` and `updatedAt`. When I reference a new column in the `MERGE INTO` command, I get an exception.
   
   ```java
   spark.sql("" +
                   "MERGE INTO target USING source ON target.userId = source.userId " +
                   "WHEN MATCHED THEN UPDATE SET target.nested = struct(source.colA), target.updatedAt = source.updatedAt " +
                   "WHEN NOT MATCHED THEN INSERT (userId, nested, updatedAt) " +
                   "VALUES (source.userId, struct(source.colA), source.updatedAt)" +
                   "")
   ```
   
   ```
   Cannot resolve 'target.nested
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Run a `MERGE INTO` command that references a column not present in the target table.
   2. Setting `.config("hoodie.schema.on.read.enable", "true")` does not help.
   
   **Expected behavior**
   The schema should evolve and detect that this is a new column.
   
   **Environment Description**
   
   * Hudi version : 0.12.2
   
   * Spark version : 3.3.1
   
   * Hive version : - 
   
   * Hadoop version : -
   
   * Storage (HDFS/S3/GCS..) : -
   
   * Running on Docker? (yes/no) : -
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] kazdy commented on issue #8502: [SUPPORT] Does spark.sql("MERGE INTO") supports schema evolution write option

Posted by "kazdy (via GitHub)" <gi...@apache.org>.
kazdy commented on issue #8502:
URL: https://github.com/apache/hudi/issues/8502#issuecomment-1523888006

   @ad1happy2go  
   Not sure this is really something blocked by the Spark SQL parser; as an example, Delta Lake supports schema evolution in MERGE INTO (both for partial updates and for `UPDATE *` / `INSERT *`):
   https://docs.delta.io/latest/delta-update.html#-merge-schema-evolution
   
   Would be great to have something similar in Hudi. Currently, Hudi uses the target table schema during MERGE INTO (and, for example, drops incoming columns when the source schema is wider).
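   For reference, the Delta Lake behavior linked above is gated behind a session config (this is Delta's config, not something Hudi currently honors):
   
   ```sql
   -- Delta Lake only: let MERGE INTO evolve the target schema automatically
   SET spark.databricks.delta.schema.autoMerge.enabled = true;
   ```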




[GitHub] [hudi] ad1happy2go commented on issue #8502: [SUPPORT] Does spark.sql("MERGE INTO") supports schema evolution write option

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8502:
URL: https://github.com/apache/hudi/issues/8502#issuecomment-1621375702

   @kazdy @jhchee You are correct, this should be supported for MERGE INTO. I confirmed that master doesn't support it either. Attaching code that should work once it does.
   
   ```sql
   create table test_insert3 (
       id int,
       name string,
       updated_at timestamp
   ) using hudi
   options (
       type = 'cow',
       primaryKey = 'id',
       preCombineField = 'updated_at'
   ) location 'file:///tmp/test_insert3';
   
   merge into test_insert3 as target
   using (
       select 1 as id, 'c' as name, 1 as new_col, current_timestamp as updated_at
       union select 1 as id, 'd' as name, 1 as new_col, current_timestamp as updated_at
       union select 1 as id, 'e' as name, 1 as new_col, current_timestamp as updated_at
   ) source
   on target.id = source.id
   when matched then update set target.new_col = source.new_col
   when not matched then insert *;
   ```
   
   Created a JIRA to track this: https://issues.apache.org/jira/browse/HUDI-6483
   
   Feel free to contribute.




[GitHub] [hudi] ad1happy2go commented on issue #8502: [SUPPORT] Does spark.sql("MERGE INTO") supports schema evolution write option

Posted by "ad1happy2go (via GitHub)" <gi...@apache.org>.
ad1happy2go commented on issue #8502:
URL: https://github.com/apache/hudi/issues/8502#issuecomment-1522911733

   @jhchee The Spark SQL parser doesn't support this, so I'm not sure we can do anything on our end. All configs come into play only during the execution of the SQL.
   
   As a workaround, you can run `ALTER TABLE` first to add the column before calling the merge.
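   A sketch of that workaround, using the `test_insert3` table and `new_col` column from the reproducer above (the column type is assumed to match the source):
   
   ```sql
   -- Evolve the target schema up front, so MERGE INTO can resolve the column
   ALTER TABLE test_insert3 ADD COLUMNS (new_col int);
   
   -- target.new_col now exists, so this no longer fails resolution
   merge into test_insert3 as target
   using (
       select 1 as id, 'c' as name, 1 as new_col, current_timestamp as updated_at
   ) source
   on target.id = source.id
   when matched then update set target.new_col = source.new_col
   when not matched then insert *;
   ```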

