Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/06/08 06:17:58 UTC

[GitHub] [hudi] a0x commented on issue #5792: [SUPPORT] Update hudi table(using SparkSQL) failed when the column contains `null` value in other records

a0x commented on issue #5792:
URL: https://github.com/apache/hudi/issues/5792#issuecomment-1149507752

   Here is my analysis.
   
   The key exception is **`java.lang.RuntimeException: Null-value for required field: note`**, which means the writer treats the field `note` as required (not nullable). But I inserted `null` values into this column in the first place, so that doesn't make any sense.
   
   After digging into the log and the parquet file, I found something interesting.
   
   1. After the last update was triggered, some data was written to storage. (The last update was triggered at **June 08 2022, 12:48:35 PM**, which is not shown in the previous picture.)
       <img width="1333" alt="image" src="https://user-images.githubusercontent.com/3829546/172542368-3ae79dda-4d2a-46bc-a205-f6de90febe8c.png">
       So I checked those files and found that they were exactly as described in the previous paragraph:
       <img width="1627" alt="image" src="https://user-images.githubusercontent.com/3829546/172543858-5c800cdb-a785-4053-a715-5e617908b37f.png">
   2. The stack trace in the log contains the full schema converted by Avro:
       ```json
       {
         "type" : "record",
         "name" : "update_null_test_cow_record",
         "namespace" : "hoodie.update_null_test_cow",
         "fields" : [ {
           "name" : "_hoodie_commit_time",
           "type" : [ "null", "string" ],
           "doc" : "",
           "default" : null
         }, {
           "name" : "_hoodie_commit_seqno",
           "type" : [ "null", "string" ],
           "doc" : "",
           "default" : null
         }, {
           "name" : "_hoodie_record_key",
           "type" : [ "null", "string" ],
           "doc" : "",
           "default" : null
         }, {
           "name" : "_hoodie_partition_path",
           "type" : [ "null", "string" ],
           "doc" : "",
           "default" : null
         }, {
           "name" : "_hoodie_file_name",
           "type" : [ "null", "string" ],
           "doc" : "",
           "default" : null
         }, {
           "name" : "id",
           "type" : [ "null", "long" ],
           "default" : null
         }, {
           "name" : "name",
           "type" : [ "null", "string" ],
           "default" : null
         }, {
           // 'note' column is not union type, and has no default `null` value
           "name" : "note",
           "type" : "string"
         }, {
           "name" : "ts",
           "type" : [ "null", "long" ],
           "default" : null
         }, {
           "name" : "dt",
           "type" : [ "null", "string" ],
           "default" : null
         } ]
       }
       ```
   
   I think this auto-generated schema is the direct cause of the failure.
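
   To make the difference concrete, here is a minimal Python sketch that scans an Avro schema for fields that cannot hold `null`. It is only an illustration of the analysis above, not Hudi's or Avro's actual validation code: the helper names (`is_nullable`, `required_fields`, `null_violations`) and the sample record are made up, and the schema is a trimmed copy of the one in the stack trace with the `_hoodie_*` metadata fields left out.

```python
import json

# Trimmed copy of the schema from the stack trace (metadata fields omitted).
SCHEMA_JSON = """
{
  "type": "record",
  "name": "update_null_test_cow_record",
  "namespace": "hoodie.update_null_test_cow",
  "fields": [
    {"name": "id", "type": ["null", "long"], "default": null},
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "note", "type": "string"},
    {"name": "ts", "type": ["null", "long"], "default": null},
    {"name": "dt", "type": ["null", "string"], "default": null}
  ]
}
"""


def is_nullable(avro_type):
    """Return True if the type is "null" or a union that includes "null"."""
    if isinstance(avro_type, list):  # Avro unions are written as JSON arrays
        return "null" in avro_type
    return avro_type == "null"


def required_fields(schema):
    """Hypothetical helper: names of fields whose type cannot hold null."""
    return [f["name"] for f in schema["fields"] if not is_nullable(f["type"])]


def null_violations(schema, record):
    """Hypothetical helper: required fields that are missing or None in a record."""
    return [name for name in required_fields(schema) if record.get(name) is None]


schema = json.loads(SCHEMA_JSON)

# Made-up record that mimics a row whose `note` column is null.
row = {"id": 1, "name": "a", "note": None, "ts": 1, "dt": "2022-06-08"}

print(required_fields(schema))       # ['note']
print(null_violations(schema, row))  # ['note']
```

   With this schema, `note` is the only field reported as required, which matches the field named in the exception.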
   
   So how can I fix it?

