Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/16 19:57:12 UTC

[GitHub] [hudi] harishchanderramesh opened a new issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

harishchanderramesh opened a new issue #1741:
URL: https://github.com/apache/hudi/issues/1741


   **Describe the problem you faced**
   
   I am using Spark Structured Streaming to upsert into a MoR Hudi table on S3 from Kafka.
   During an upsert, I don't want to update a column if the source value is null.
   How do I do this?
   
   In Delta (delta.io), I used to do something like this:
   
   ```
   from delta.tables import DeltaTable
   from pyspark.sql.functions import coalesce

   # Matched rows: keep the target value wherever the source value is null.
   # Unmatched rows: insert the source values as they are.
   DeltaTable.forPath(spark, S3_DIR).alias("t").merge(
       df8.alias("s"), "s.id = t.id"
   ).whenMatchedUpdate(set={
       "column1": coalesce("s.column1_new", "t.column1"),
       "column2": coalesce("s.column2_new", "t.column2"),
       "column3": coalesce("s.column3_new", "t.column3"),
       "column4": coalesce("s.column4_new", "t.column4"),
   }).whenNotMatchedInsert(values={
       "id": "s.id",
       "column1": "s.column1_new",
       "column2": "s.column2_new",
       "column3": "s.column3_new",
       "column4": "s.column4_new",
   }).execute()
   ```
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Create a Hudi table.
   2. Insert 2 rows with values for all the columns.
   3. Upsert into the Hudi table with a dataframe that has a value for 1 column and nulls for the other columns.
   4. The upsert should ignore the nulls from the source and retain the existing non-null values in the Hudi table.
   
   **Expected behavior**
   I want to ignore nulls from source while doing upsert.
   I want the target field to be not affected if the source is null.
   And I want to do this using Spark Streaming.
   
   **Environment Description**
   
   * Hudi version : 0.5.2
   
   * Spark version : 2.4.5
   
   * Hive version : 2.3.6
   
   * Hadoop version : Amazon 2.8.5
   
   * Storage (HDFS/S3/GCS..) : S3
   
   * Running on Docker? (yes/no) : no
   
   
   **Additional context**
   
   I am trying to move from Delta to Hudi.
   In Delta I was able to do this easily, whereas for Hudi I can't find any documentation online on how to accomplish this.
   
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] bvaradar commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-645539789


   Hudi provides a programmable interface for this. You can plug in your own payload class (a class implementing HoodieRecordPayload) where you can implement custom merge logic. For example, org.apache.hudi.common.model.OverwriteWithLatestAvroPayload is the default one, which always picks the latest record.
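
To make the intended merge rule concrete, here is a minimal sketch in plain Python of "ignore nulls from the source". It is not Hudi code. An actual payload has to be a Java or Scala class implementing HoodieRecordPayload, and the same per-field rule would presumably go in the method that combines the incoming record with the stored one. The function name and the dict-based record shape are assumptions made only for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical record representation: field name -> value (None means null).
Record = Dict[str, Any]


def merge_ignoring_nulls(stored: Optional[Record], incoming: Record) -> Record:
    """Return the record to persist: incoming values win unless they are None."""
    if stored is None:
        # No existing row for this key, so this behaves like a plain insert.
        return dict(incoming)
    merged = dict(stored)
    for field, value in incoming.items():
        if value is not None:
            # Only non-null source values overwrite the stored value.
            merged[field] = value
    return merged


stored = {"id": 1, "column1": "a", "column2": "b"}
incoming = {"id": 1, "column1": None, "column2": "B"}
print(merge_ignoring_nulls(stored, incoming))
# → {'id': 1, 'column1': 'a', 'column2': 'B'}
```

This is the same rule as `coalesce(s.col, t.col)` in the Delta example from the issue description, applied field by field.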





[GitHub] [hudi] bvaradar commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-645566670


   @harishchanderramesh: You can create a jar with just your payload class implementation and include it in the spark-submit command.





[GitHub] [hudi] vinothchandar commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-653264670


   I think you are now able to make progress on this, @harishchanderramesh? Let us know if this can be closed.





[GitHub] [hudi] harishchanderramesh commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
harishchanderramesh commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-645555177


   Thanks @bvaradar for the reply.
   How do I plug in my own payload class?
   
   Are there any examples that I can refer to?
   
   Sorry if I sound like a novice.





[GitHub] [hudi] harishchanderramesh commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
harishchanderramesh commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-647143091


   Yup, give me some time. I will update here on the progress. Thanks.





[GitHub] [hudi] bhasudha commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
bhasudha commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-647139067


   I believe @harishchanderramesh is trying to get this to work. Once he confirms, this can be closed.





[GitHub] [hudi] vinothchandar commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-646993572


   @harishchanderramesh If you are looking for the specific config, it's https://hudi.apache.org/docs/configurations.html#PAYLOAD_CLASS_OPT_KEY
   
   If you are already deploying your app in a jar, all you need to do is write the class and specify its name in the config. Hope that helps.
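
As a rough sketch of what "specify its name in the config" could look like from PySpark, the helper below builds the writer options as a plain dict. The option keys are the string forms of the Hudi datasource write configs as I understand them; the helper name, table name, field names, and the class name `com.example.IgnoreNullsPayload` are hypothetical placeholders, and the jar containing that class is assumed to be on the Spark classpath already.

```python
from typing import Dict


def hudi_upsert_options(
    table_name: str,
    record_key: str,
    precombine_field: str,
    payload_class: str,
) -> Dict[str, str]:
    """Build Hudi datasource write options for an upsert with a custom payload class."""
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        # Fully qualified name of the custom payload class (hypothetical here).
        "hoodie.datasource.write.payload.class": payload_class,
    }


options = hudi_upsert_options(
    table_name="my_table",
    record_key="id",
    precombine_field="ts",
    payload_class="com.example.IgnoreNullsPayload",
)

# With a SparkSession and a dataframe `df` available, the options would be
# passed to the Hudi writer, for example (assumed usage, not run here):
# df.write.format("org.apache.hudi").options(**options).mode("append").save(base_path)
```

The remaining write options (partition path, table type, and so on) are left out because they depend on the table and the Hudi version in use.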





[GitHub] [hudi] harishchanderramesh commented on issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
harishchanderramesh commented on issue #1741:
URL: https://github.com/apache/hudi/issues/1741#issuecomment-653736036


   Resolving this. @vinothchandar 
   Thanks @bhasudha for the support on this request.





[GitHub] [hudi] harishchanderramesh closed issue #1741: How to ignore the null columns in upsert on MoR tables? - Spark streaming

Posted by GitBox <gi...@apache.org>.
harishchanderramesh closed issue #1741:
URL: https://github.com/apache/hudi/issues/1741


   

