Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2021/02/03 01:31:03 UTC

[GitHub] [hudi] sleapfish opened a new issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

sleapfish opened a new issue #2522:
URL: https://github.com/apache/hudi/issues/2522


   **Problem**
   
   When the source data set contains unchanged rows, Hudi upserts them into the target table and includes them in the new commit. With CDC/incremental logic, a batch may contain records identical to a previous insert, new records, and changed records. Hudi upserts all of them (new, changed, and unchanged), and they all become part of the new commit.
   
   As a result, an incremental query returns a lot of unnecessary (unchanged) rows as well. I would like to avoid that. Is there a way to drop unchanged rows from the source?
   
   **To Reproduce**
   
   Steps to reproduce the behavior:
   
   1. Fully load Hudi table
   
   Target example:
   ```
   ---------------------------------------------------------------------
   |     row_key    |     att_1      |      att_2     |    commit      |
   ---------------------------------------------------------------------
   |        1       |      1_1       |       1_2      |        0       |
   ---------------------------------------------------------------------
   |        2       |      2_1       |       2_2      |        0       |
   ---------------------------------------------------------------------
   ```
   2. Incrementally upsert new data set (Incremental data set should include unchanged records)
   
   Incremental data:
   ```
   ----------------------------------------------------
   |     row_key    |     att_1      |      att_2     |  
   ----------------------------------------------------
   |        1       |      1_1       |       1_2      |
   ----------------------------------------------------
   |        2       |      2_1       |    changed     |
   ----------------------------------------------------
   |        3       |      3_1       |       3_2      |
   ----------------------------------------------------
   |        4       |      4_1       |       4_2      |
   ----------------------------------------------------
   ```
   3. Incrementally query Hudi table for the latest commit
   
   Target example:
   ```
   ---------------------------------------------------------------------
   |     row_key    |     att_1      |      att_2     |    commit      |
   ---------------------------------------------------------------------
   |        1       |      1_1       |       1_2      |        1       |
   ---------------------------------------------------------------------
   |        2       |      2_1       |    changed     |        1       |
   ---------------------------------------------------------------------
   |        3       |      3_1       |       3_2      |        1       |
   ---------------------------------------------------------------------
   |        4       |      4_1       |       4_2      |        1       |
   ---------------------------------------------------------------------
   ```
   **Expected behavior**
   
   Target example:
   ```
   ---------------------------------------------------------------------
   |     row_key    |     att_1      |      att_2     |    commit      |
   ---------------------------------------------------------------------
   |        1       |      1_1       |       1_2      |        0       |
   ---------------------------------------------------------------------
   |        2       |      2_1       |    changed     |        1       |
   ---------------------------------------------------------------------
   |        3       |      3_1       |       3_2      |        1       |
   ---------------------------------------------------------------------
   |        4       |      4_1       |       4_2      |        1       |
   ---------------------------------------------------------------------
   ```
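
The expected behavior above can be restated as a small plain-Python sketch. This is only a model of the desired semantics using the example tables from this issue, not Hudi code; the function name `apply_upsert` and the dict-based table layout are assumptions made for illustration.

```python
def apply_upsert(target, incoming, commit):
    """Model of the desired upsert: only new or changed rows get the new commit.

    `target` maps row_key -> attributes plus a "commit" value;
    `incoming` maps row_key -> attributes only.
    """
    result = {key: dict(row) for key, row in target.items()}
    for key, attrs in incoming.items():
        existing = result.get(key)
        if existing is not None and all(existing[c] == v for c, v in attrs.items()):
            continue  # unchanged row: keep the stored version and its old commit
        result[key] = {**attrs, "commit": commit}
    return result


# Tables taken from the reproduction steps in this issue.
target = {
    1: {"att_1": "1_1", "att_2": "1_2", "commit": 0},
    2: {"att_1": "2_1", "att_2": "2_2", "commit": 0},
}
incoming = {
    1: {"att_1": "1_1", "att_2": "1_2"},
    2: {"att_1": "2_1", "att_2": "changed"},
    3: {"att_1": "3_1", "att_2": "3_2"},
    4: {"att_1": "4_1", "att_2": "4_2"},
}

after = apply_upsert(target, incoming, commit=1)
latest = sorted(key for key, row in after.items() if row["commit"] == 1)
print(latest)  # [2, 3, 4]
```

Row 1 keeps commit 0, so an incremental query for commit 1 would return only rows 2, 3 and 4.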
   
   **Environment Description**
   
   * Hudi version : 0.5.3
   * Spark version : 2.4.5
   * Storage (HDFS/S3/GCS..) : S3
   
   Thank you in advance!
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] nsivabalan commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-772662864


   @sleapfish : do you mean that you can't control your source, so it may fetch unchanged records as well when doing upserts to Hudi? And given this, you want to ignore records already in Hudi (matching all values of an incoming row) and upsert only those records that have changes? 
   Would you mind going over this?





[GitHub] [hudi] nsivabalan commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-847301365


   https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-CanIimplementmyownlogicforhowinputrecordsaremergedwithrecordonstorage
   Do you think this would help? If not, let me know. 





[GitHub] [hudi] nsivabalan closed issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan closed issue #2522:
URL: https://github.com/apache/hudi/issues/2522


   





[GitHub] [hudi] nsivabalan commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-775862834


   @vinothchandar : yes, sounds good. If we were to use just one column, then we don't need any new payload impl. The existing `DefaultHoodieRecordPayload` should suffice. If not, we have to add a new impl. 





[GitHub] [hudi] nsivabalan commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-781032755


   Feel free to reach out to us if you need any more info. 





[GitHub] [hudi] vinothchandar commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-774418860


   We can always add a standard record payload implementation for this. Comparing every column value is also expensive, so what we support is comparing based on a certain field as an ordering value. For example, if you were to provide an SCN or something similar to compare the incoming row against the record on disk, then the `DefaultHoodieRecordPayload` could handle that. 
   
   @nsivabalan correct me if I am wrong. 
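
The ordering-value idea above replaces a full column-by-column comparison with a comparison on one field. A minimal plain-Python sketch of that decision follows; it is not the Hudi implementation. The field name `scn` comes from the comment above, while `keep_incoming` and the `ties_update` switch are hypothetical names: how a tie (equal ordering values) is handled is an assumption here and should be checked against the Hudi version in use.

```python
def keep_incoming(stored_order, incoming_order, ties_update=True):
    """Decide whether the incoming row should replace the stored one.

    Only the ordering values are compared, never the other columns.
    `ties_update` is a hypothetical switch for the equal-values case.
    """
    if stored_order is None:
        return True  # nothing on disk yet: treat as an insert
    if incoming_order == stored_order:
        return ties_update
    return incoming_order > stored_order


stored = {"row_key": 2, "att_2": "2_2", "scn": 100}
newer = {"row_key": 2, "att_2": "changed", "scn": 101}
stale = {"row_key": 2, "att_2": "2_2", "scn": 99}

print(keep_incoming(stored["scn"], newer["scn"]))  # True
print(keep_incoming(stored["scn"], stale["scn"]))  # False
```

With `ties_update=False`, a re-sent row carrying the same `scn` as the stored one would be skipped, which is the case this issue is about.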





[GitHub] [hudi] vinothchandar commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-776037981


   or extend `DefaultHoodieRecordPayload` 





[GitHub] [hudi] nmahmood630 commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nmahmood630 commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-847283383


   Can you please provide some example implementations of extending the DefaultHoodieRecordPayload to create our own implementation? I have a similar use case where many of the upserts will be identical to the records already in the Hudi table, and I would like the incremental query to include only the records that actually changed.





[GitHub] [hudi] nsivabalan commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-774139435


   Yes, I could think of an option, but you might have to define your own implementation for [HoodieRecordPayload](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java). 
   Basically, `combineAndGetUpdateValue(oldValue, schema)` in HoodieRecordPayload can compare the old and new values and return `Option.empty()` if they are equal. 
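
The comparison described above can be modelled in plain Python as follows. This is a sketch of the logic only, not a Hudi payload class: `None` stands in for `Option.empty()`, the `ignore` parameter is a hypothetical way to skip bookkeeping columns, and what the Hudi writer does with an empty result (for example, whether it treats it as a delete) is not confirmed in this thread and should be verified before relying on it.

```python
from typing import Optional


def combine_and_get_update_value(
    old: Optional[dict], new: dict, ignore: tuple = ("commit",)
) -> Optional[dict]:
    """Return None when old and new agree on every compared column, else new."""
    if old is None:
        return new  # no stored record: nothing to compare against
    compared = [column for column in new if column not in ignore]
    if all(old.get(column) == new[column] for column in compared):
        return None  # stands in for Option.empty(): nothing changed
    return new


old = {"row_key": 1, "att_1": "1_1", "att_2": "1_2", "commit": 0}
same = {"row_key": 1, "att_1": "1_1", "att_2": "1_2"}
changed = {"row_key": 1, "att_1": "1_1", "att_2": "changed"}

print(combine_and_get_update_value(old, same))     # None
print(combine_and_get_update_value(old, changed))  # the changed row
```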





[GitHub] [hudi] nsivabalan edited a comment on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-774139435


   Yes, I could think of an option, but you might have to define your own implementation for [HoodieRecordPayload](https://github.com/apache/hudi/blob/master/hudi-common/src/main/java/org/apache/hudi/common/model/HoodieRecordPayload.java). 
   Basically, `combineAndGetUpdateValue(oldValue, schema)` in HoodieRecordPayload can compare the old and new values and return `Option.empty()` if they are equal. 
   @bhasudha @n3nash : do we have any other option here?





[GitHub] [hudi] sleapfish edited a comment on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
sleapfish edited a comment on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-772668176


   @nsivabalan That is correct! But even if you do have control over the source, say you extract data from it with a rolling window of -3 days. Some of the records from those 3 days could change, but most of them wouldn't. I want to commit only changed/new records to the target Hudi table.
   
   When I do an incremental query, I don't want all 3 days' worth of data in it when only a small portion of it actually changed.
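
One way to picture the rolling-window case is a pre-filter that runs before the data reaches Hudi. This technique is not proposed anywhere in this thread; it is a plain-Python sketch under the assumption that a fingerprint of each stored row is available by key. The names `fingerprint`, `changed_or_new` and `COLUMNS` are hypothetical, and in Spark the same idea would be a join against the current table snapshot.

```python
import hashlib
import json

COLUMNS = ("att_1", "att_2")  # columns that define "changed" (assumption)


def fingerprint(row, columns=COLUMNS):
    """Stable hash over the compared columns of one row."""
    payload = json.dumps([row[c] for c in columns], default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def changed_or_new(window_rows, stored_fingerprints, key="row_key"):
    """Keep rows whose key is unknown or whose fingerprint differs."""
    return [
        row for row in window_rows
        if stored_fingerprints.get(row[key]) != fingerprint(row)
    ]


stored = {
    1: fingerprint({"att_1": "1_1", "att_2": "1_2"}),
    2: fingerprint({"att_1": "2_1", "att_2": "2_2"}),
}
window = [
    {"row_key": 1, "att_1": "1_1", "att_2": "1_2"},      # unchanged
    {"row_key": 2, "att_1": "2_1", "att_2": "changed"},  # changed
    {"row_key": 3, "att_1": "3_1", "att_2": "3_2"},      # new
]

to_upsert = changed_or_new(window, stored)
print([row["row_key"] for row in to_upsert])  # [2, 3]
```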





[GitHub] [hudi] atharshah-ea edited a comment on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
atharshah-ea edited a comment on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-862002125


   Hi, also looking for an example of how to specify the DefaultHoodieRecordPayload. Setting the following option did not work for us:
    **'hoodie.datasource.write.payload.class': 'org.apache.hudi.DefaultHoodieRecordPayload'**
   
   Output:
   could not create payload for class: org.apache.hudi.default hoodie record payload
   
   @nsivabalan @vinothchandar 
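
For reference, the `HoodieRecordPayload` interface linked elsewhere in this thread lives under `org/apache/hudi/common/model` in `hudi-common`, and `DefaultHoodieRecordPayload` sits in the same package, so the fully qualified name differs from the one tried above. A hedged sketch of writer options follows, assuming a Hudi release that ships this class; the table name and the `scn` ordering column are hypothetical placeholders.

```python
# Sketch of Hudi writer options for a PySpark job (values marked
# "hypothetical" are placeholders, not taken from this thread).
hudi_options = {
    "hoodie.table.name": "my_table",  # hypothetical table name
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "row_key",
    "hoodie.datasource.write.precombine.field": "scn",  # hypothetical column
    "hoodie.payload.ordering.field": "scn",             # hypothetical column
    "hoodie.datasource.write.payload.class":
        "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
}

# With PySpark and the Hudi bundle on the classpath, the write would look like:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
print(hudi_options["hoodie.datasource.write.payload.class"])
```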





[GitHub] [hudi] atharshah-ea commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
atharshah-ea commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-862002125


   Hi, also looking for an example of how to specify the DefaultHoodieRecordPayload. Setting the following option did not work for us:
    **'hoodie.datasource.write.payload.class': 'org.apache.hudi.DefaultHoodieRecordPayload'**
   
   Output:
   could not create payload for class: org.apache.hudi.default hoodie record payload





[GitHub] [hudi] nmahmood630 commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nmahmood630 commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-848381823


   Can you also please provide additional information regarding how we 'could achieve this using existing recordPayload with one column to determine source ordering.'?





[GitHub] [hudi] sleapfish commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
sleapfish commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-772668176


   @nsivabalan That is correct! But even if you do have control over the source, say you extract data from it with a rolling window of -3 days. Some of the records from those 3 days could change, but most of them wouldn't. I want to commit only changed/new records to the target Hudi table.





[GitHub] [hudi] nmahmood630 commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nmahmood630 commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-848077504


   Just seeing the interface definition isn't that helpful. My project is written in python/pyspark and runs on AWS Glue. How would I include this file to provide my own implementation?





[GitHub] [hudi] tooptoop4 commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-774207084


   Would be great if there was a predefined generic way to do this without defining your own implementation 💯 





[GitHub] [hudi] nsivabalan edited a comment on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-772662864


   @sleapfish : do you mean that you can't control your source, so it may fetch unchanged records as well when doing upserts to Hudi? And given this, you want to ignore records already in Hudi (matching all values of an incoming row) and upsert only those records that have changes? 
   Would you mind confirming this?





[GitHub] [hudi] nsivabalan commented on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-781032648


   @sleapfish : closing this out as we could achieve this using existing recordPayload with one column to determine source ordering. We have a tracking ticket for multiple preCombine keys if you are interested https://issues.apache.org/jira/browse/HUDI-1573





[GitHub] [hudi] nmahmood630 edited a comment on issue #2522: [SUPPORT] Avoid UPSERT unchanged records from source

Posted by GitBox <gi...@apache.org>.
nmahmood630 edited a comment on issue #2522:
URL: https://github.com/apache/hudi/issues/2522#issuecomment-848077504


   Just seeing the interface definition isn't that helpful. My project is written in python/pyspark and runs on AWS Glue, where I'm uploading the Hudi JAR to S3 for the Glue job to pull. How would I include this file to provide my own implementation?

