Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/15 08:47:58 UTC

[GitHub] [hudi] kazdy opened a new issue, #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

kazdy opened a new issue, #5873:
URL: https://github.com/apache/hudi/issues/5873

   **Describe the problem you faced**
   I'm using schema on read (the full schema evolution feature) together with the reconcile schema feature to evolve a Hudi table schema. The table is copy-on-write (COW) and is synchronized with the Glue Data Catalog.
   
   In one batch (upsert operation) I add a column (col_a) in the middle of the table.
   In the next batch (upsert) I add a new column (col_b) at the end of the table, but col_a is missing from the data frame.
   When I then query the table via Athena or Spark SQL, col_a has been dropped and is not visible.
   
   If I upsert a further batch with a df that contains both col_a and col_b, all data is visible again in Spark and Athena.
   
   I would expect Hudi to handle this case during the schema reconciliation phase and preserve col_a with a null value.
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   Operations, step by step
   | Batch seq | Operation | DF schema                                   | Table Schema                                | Expected Table Schema                                      |
   |-----------|-----------|---------------------------------------------|---------------------------------------------|------------------------------------------------------------|
   | 0         | insert    | col_1: string,col_2: string                 | col_1: string,col_2: string                 | col_1: string,col_2: string                                |
   | 1         | upsert    | col_1: string, col_a: string, col_2: string | col_1: string,col_a: string,col_2: string   | col_1: string,col_a: string,col_2: string                  |
   | 2         | upsert    | col_1: string, col_2: string, col_b: string | col_1: string, col_2: string, col_b: string | col_1: string, col_a: string, col_2: string, col_b: string |
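
For readers who want to replay the batches above, here is a minimal sketch of the write options involved. It is not taken from the report: the table name, record key, and precombine field are hypothetical placeholders, and only the two schema-related keys correspond to features the report says were enabled.

```python
def hudi_upsert_options(table_name: str, record_key: str, precombine: str) -> dict:
    """Build Hudi datasource options for an upsert with schema reconciliation.

    table_name, record_key and precombine are placeholders (assumptions);
    the issue does not state the values that were actually used.
    """
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine,
        # Reconcile the incoming batch schema with the table schema.
        "hoodie.datasource.write.reconcile.schema": "true",
        # Schema on read ("full schema evolution"), available from 0.11.0.
        "hoodie.schema.on.read.enable": "true",
    }


options = hudi_upsert_options("demo_table", "id", "ts")
```

With PySpark these would typically be passed as `df.write.format("hudi").options(**options).mode("append").save(path)`, where `df` and `path` are whatever the job already uses.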
   
   **Expected behavior**
   
   After batch 2 the table should have the schema:
   col_1: string, col_a: string, col_2: string, col_b: string
   
   with col_a preserved and filled with null values where the column is missing.
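
One way to state this expectation precisely is as a merge of column lists. The sketch below is plain Python with no Hudi dependency. `merge_schemas` is a hypothetical helper, and placing each new column right after the column that precedes it in the incoming batch is an assumption chosen so that the result matches the table above.

```python
from __future__ import annotations


def merge_schemas(table: dict[str, str], incoming: dict[str, str]) -> dict[str, str]:
    """Merge an incoming batch schema into the table schema.

    Every existing table column is kept. A column that is new in the
    incoming batch is inserted right after the column that precedes it
    in the incoming batch (or at the front if it comes first).
    """
    merged = list(table.items())
    names = [name for name, _ in merged]
    previous = None
    for name, dtype in incoming.items():
        if name not in names:
            pos = names.index(previous) + 1 if previous is not None else 0
            names.insert(pos, name)
            merged.insert(pos, (name, dtype))
        previous = name
    return dict(merged)


batch_0 = {"col_1": "string", "col_2": "string"}
after_1 = merge_schemas(batch_0, {"col_1": "string", "col_a": "string", "col_2": "string"})
after_2 = merge_schemas(after_1, {"col_1": "string", "col_2": "string", "col_b": "string"})
print(list(after_2))  # ['col_1', 'col_a', 'col_2', 'col_b']
```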
   
   **Environment Description**
   
   * Hudi version : 0.11.0 OSS
   
   * Spark version : 3.2.0-amzn
   
   * Hive version : 3.2.1
   
   * Hadoop version : 3.2.1
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : yes / EMR on EKS 6.6
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] xiarixiaoyao commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1157651544

   @kazdy 
   In the third write (batch 2), col_a is missing. Hudi cannot tell whether the field is simply missing or whether the user intentionally deleted it,
   so by default Hudi treats it as a delete.
   
   If you really need this feature, you can open a JIRA ticket and I will work on it. @codope WDYT?
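
The ambiguity described above can be sketched in a few lines of plain Python. This is only an illustration, not Hudi's implementation: `columns_after_write` and its `absent_means_delete` flag are hypothetical names, and schemas are reduced to ordered lists of column names.

```python
def columns_after_write(table, incoming, absent_means_delete):
    """Return the table's column list after a write, under one interpretation.

    table / incoming: ordered lists of column names.
    absent_means_delete: hypothetical switch between the two readings of
    "a table column is absent from the incoming batch".
    """
    if absent_means_delete:
        # Reading 1: the batch schema wins, so absent columns are dropped.
        return list(incoming)
    # Reading 2: keep every table column and append the new ones.
    added = [c for c in incoming if c not in table]
    return list(table) + added


table = ["col_1", "col_a", "col_2"]
batch_2 = ["col_1", "col_2", "col_b"]

dropped = columns_after_write(table, batch_2, absent_means_delete=True)
kept = columns_after_write(table, batch_2, absent_means_delete=False)
print(dropped)  # ['col_1', 'col_2', 'col_b']
print(kept)     # ['col_1', 'col_a', 'col_2', 'col_b']
```

Both results are consistent with the same incoming batch, which is why the writer needs either a default or an explicit setting to choose between them.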
   




[GitHub] [hudi] codope commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
codope commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1158945262

   > @kazdy OK, thank you for your answer. Let me fix this problem in the next few days.
   
   @xiarixiaoyao Sounds good. Closing this ticket. I've made it a blocker for the 0.12 release.




[GitHub] [hudi] xiarixiaoyao commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1157696476

   @kazdy OK, thank you for your answer. Let me fix this problem in the next few days.




[GitHub] [hudi] xiarixiaoyao commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
xiarixiaoyao commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1158414017

   @kazdy Yes, could you please open a JIRA? Thanks.




[GitHub] [hudi] codope commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
codope commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1156272449

   @xiarixiaoyao Can you please look into this issue?




[GitHub] [hudi] codope closed issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
codope closed issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata
URL: https://github.com/apache/hudi/issues/5873




[GitHub] [hudi] kazdy commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
kazdy commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1158745427

   @xiarixiaoyao I've created https://issues.apache.org/jira/browse/HUDI-4276
   thanks :)




[GitHub] [hudi] kazdy commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
kazdy commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1157669265

   @xiarixiaoyao I was hoping that with schema reconciliation "default values will be injected to missing fields" as per the docs:
   
   > When a new batch of write has records with old schema, but latest table schema got evolved, this config will upgrade the records to leverage latest table schema(default values will be injected to missing fields). If not, the write batch would fail.
   
   The scenario I described does not happen when I have a missing column but no new column in the same batch. Then Hudi injects null to the missing column and the column is not removed from the table in metastore.
   
   The behavior I'm looking for is like this:
   incoming data doesn't contain every column in the table -> those columns will simply be assigned null/default values
   This is what other similar frameworks allow users to do, so I guess Hudi could do the same, possibly as an option guarded by a config in case someone prefers to enforce the schema more strictly.
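
At the record level, the behavior requested here amounts to padding each incoming row out to the table's columns. A minimal sketch in plain Python, under the assumption that a row can be modeled as a dict and null as `None`; `pad_record` is a hypothetical name, not a Hudi API.

```python
def pad_record(record, table_columns, default=None):
    """Return a copy of `record` laid out in table column order.

    Columns the record lacks get `default` (None stands in for null).
    Keys not in table_columns are ignored here; adding new columns to
    the schema is a separate step.
    """
    return {col: record.get(col, default) for col in table_columns}


table_columns = ["col_1", "col_a", "col_2"]
old_style_row = {"col_1": "x", "col_2": "y"}
padded = pad_record(old_style_row, table_columns)
print(padded)  # {'col_1': 'x', 'col_a': None, 'col_2': 'y'}
```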
   
   I also found a comment from another Hudi issue, that makes me think that my scenario should work:
   > @TarunMootala can you upgrade Hudi to 0.10.1? It can reconcile the schema wherever the new field is put in. Spark-SQL still has some problems where the new middle field can't be shown.
   > But I tested on the master branch, and all of the problems above are gone.
   
   _Originally posted by @YannByron in https://github.com/apache/hudi/issues/4914#issuecomment-1063623677_
   
   




[GitHub] [hudi] kazdy commented on issue #5873: [SUPPORT] Reconcile schema - missing field dropped from metadata

Posted by GitBox <gi...@apache.org>.
kazdy commented on issue #5873:
URL: https://github.com/apache/hudi/issues/5873#issuecomment-1158134991

   @xiarixiaoyao that would be great :)
   Do you want me to open a JIRA ticket for this?
   
   I did some more research, and it seems that for schema reconciliation Hudi takes the latest table schema and applies it to the incoming batch.
   
   To be more precise, what I'm looking for is this:
   1. incoming data doesn't contain every column in the table -> those columns will simply be assigned null/default values
   2. incoming data contains new columns that are not in the table -> those columns will be added to the table schema
   
   Do you think this is something that schema reconciliation feature should support, or something new?

