Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/05/13 03:58:17 UTC

[GitHub] [hudi] neerajpadarthi commented on issue #5519: [SUPPORT] Schema Evolution - Error with datatype promotion

neerajpadarthi commented on issue #5519:
URL: https://github.com/apache/hudi/issues/5519#issuecomment-1125632878

   @xiarixiaoyao Thanks for checking. 
   
   I checked the scenario below with an ALTER statement on 0.9, but it ended up with errors; a rough sketch of the three steps follows the list. I have added my observations for your reference.
   
   1. Created the table with bulk_insert parallelism 5 (schema has colx as Int)
   > (S3 - 5 file groups and 1 commit file were created)
   > (Glue - A table was created with colx as the Int datatype)
   > (Accessed via Spark DF - Query succeeded; colx has the integer datatype)
   > (Accessed via Spark SQL - Query succeeded; colx has the integer datatype)
   
   2. Performed an upsert delta with colx as Long
   > (S3 - The 3 impacted file groups were rewritten and 1 commit file was added)
   > (Glue - A new schema version was created and colx was updated to the bigint datatype)
   > (Accessed via Spark DF - Query failed; colx shows the long datatype) - Same error as above
   > (Accessed via Spark SQL - Could query the old records (int) but failed on the upserted records (long); colx still shows the Integer datatype) - Failed with org.apache.parquet.column.Dictionary.decodeToInt
   
   3. Altered the table column to the Long datatype using spark.sql
   > (S3 - 1 commit file was created)
   > (Glue - A new schema version was created but there was no change to the schema attributes)
   > (Accessed via Spark DF - Query failed; colx shows the long datatype) - Same error as above
   > (Accessed via Spark SQL - Could query the upserted records (long) but failed on the old records (Int); colx is now updated to the long datatype) - Failed with org.apache.parquet.column.Dictionary.decodeToLong
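
   For context, here is a rough, minimal sketch of the three steps (table name, S3 path, record key / precombine fields and sample data are placeholders, not my actual job):

   ```python
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import col

   spark = SparkSession.builder.getOrCreate()

   base_path = "s3://my-bucket/hudi_tbl"  # placeholder path
   hudi_opts = {
       "hoodie.table.name": "hudi_tbl",                   # placeholder table name
       "hoodie.datasource.write.recordkey.field": "id",
       "hoodie.datasource.write.precombine.field": "ts",
   }

   # Step 1: bulk_insert with colx as INT and parallelism 5
   df_int = spark.range(0, 100).select(
       col("id"),
       col("id").cast("int").alias("colx"),
       col("id").alias("ts"))
   (df_int.write.format("hudi")
       .options(**hudi_opts)
       .option("hoodie.datasource.write.operation", "bulk_insert")
       .option("hoodie.bulkinsert.shuffle.parallelism", "5")
       .mode("overwrite")
       .save(base_path))

   # Step 2: upsert a delta where colx is LONG (values beyond the int range)
   df_long = spark.range(0, 50).select(
       col("id"),
       (col("id") + 10000000000).alias("colx"),
       (col("id") + 1).alias("ts"))
   (df_long.write.format("hudi")
       .options(**hudi_opts)
       .option("hoodie.datasource.write.operation", "upsert")
       .mode("append")
       .save(base_path))

   # Step 3: alter the column type through Spark SQL on the catalog-synced table
   # (syntax as per the Hudi 0.9 SQL DDL docs; the real job also enables the
   # Glue/Hive sync options so that "hudi_tbl" is visible to spark.sql)
   spark.sql("ALTER TABLE hudi_tbl CHANGE COLUMN colx colx BIGINT")
   ```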
   
   Summary - I observed schema inconsistencies between the spark.sql and Spark DF read paths, and the ALTER statement made no difference. Reading portions of the table succeeds, but reading the complete table fails. I think it fails because not all file groups were rewritten, so some still carry the old int schema.
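
   For clarity, the two read paths I refer to above are roughly the following (same placeholder path/table name as in the sketch above):

   ```python
   # Spark DF read path: load the Hudi table directly from the S3 base path
   df = spark.read.format("hudi").load("s3://my-bucket/hudi_tbl")
   df.printSchema()   # colx shows up as long after the upsert
   df.show()          # a full scan hits both the old (int) and rewritten (long) file groups

   # Spark SQL read path: query the Glue-synced table
   spark.sql("SELECT id, colx FROM hudi_tbl").show()
   ```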
   
   Q. I am currently using 0.9. Is this an issue with that version? Do I need to migrate to 0.11 to validate schema (datatype) promotion?
   
   Please let me know if I am missing something. Thanks in advance.
   
   
   

