Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/11/26 13:17:21 UTC

[GitHub] [hudi] sanket-khedikar opened a new issue #2284: [Info] : SCD 2 is available in Hudi or not?

sanket-khedikar opened a new issue #2284:
URL: https://github.com/apache/hudi/issues/2284


   Hi Team,
   
   We have started using Hudi for one of our clean-layer frameworks and came across a scenario where we have to maintain the history of data for some tables. But it seems Hudi doesn't provide SCD2 functionality yet.
   
   As per our R&D, Hudi supports SCD1, i.e. overwriting the older record with the latest record. In Hudi's case we are not actually overwriting; rather, with the upsert operation we get only the latest record when fetching.
   
   SCD2 (Slowly Changing Dimension Type 2): here we maintain the history of data.
   
   If SCD2 is possible in Hudi, can someone share how we can achieve it?
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [hudi] n3nash commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
n3nash commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-855192344


   Filed a JIRA for us to document this on Hudi -> https://issues.apache.org/jira/browse/HUDI-1973





[GitHub] [hudi] nsivabalan commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-812713158


   CC @n3nash 





[GitHub] [hudi] vinothchandar commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-736709407


   >But it seems Hudi doesn't provide SCD2 functionality yet.
   
   @sanket-khedikar Hudi does let you control retention of older versions of files using the cleaner configuration.
   https://hudi.apache.org/docs/configurations.html#retainCommits 
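
For illustration only, here is a minimal sketch of how cleaner retention might be expressed as write options. `hoodie.cleaner.policy` and `hoodie.cleaner.commits.retained` are Hudi cleaner settings; the table name and the value `10` are assumptions made for this example, and the exact behavior should be checked against the configuration docs for your Hudi version.

```python
# Hypothetical Hudi write options; all values are illustrative assumptions.
hudi_cleaner_options = {
    "hoodie.table.name": "customer_dim",             # hypothetical table name
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",  # retain file versions by commit count
    "hoodie.cleaner.commits.retained": "10",         # illustrative value, not a recommendation
}


def retained_commits(options):
    """Read the configured number of retained commits back as an int."""
    return int(options["hoodie.cleaner.commits.retained"])


# With Spark, a dict like this would typically be passed along with the other
# write options, e.g. df.write.format("hudi").options(**hudi_cleaner_options)
print(retained_commits(hudi_cleaner_options))
```

Note that this only controls how long older file versions stay queryable; it does not by itself add SCD2-style `effective from` / `effective to` columns.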





[GitHub] [hudi] sleapfish commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
sleapfish commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-771905029


   I would also appreciate such a feature. I believe it's a pretty common use case, and having this would make a lot of difference. @bvaradar do you have any examples, or can you point me to how I can implement custom merging logic with HoodieRecordPayload.java?
   
   Thanks!





[GitHub] [hudi] nsivabalan edited a comment on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-774317208


   thanks @sleapfish for clarifying. 
   If I am not wrong, I don't think any code flow in Hudi updates existing records in place as of today. Everything is like an append that writes a newer version of the record. @vinothchandar @bvaradar @n3nash: your thoughts?





[GitHub] [hudi] n3nash closed issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
n3nash closed issue #2284:
URL: https://github.com/apache/hudi/issues/2284


   





[GitHub] [hudi] nsivabalan commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-766436747


   @sanket-khedikar: can you please let us know whether the suggested approaches work for you, or whether you still need more enhancements from Hudi? If it's solved, we would appreciate it if you could close this ticket.





[GitHub] [hudi] sleapfish commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
sleapfish commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-773903253


   @nsivabalan You are right. 
   
   I just want to add a couple of things to this:
   
   - Ideally this should support specifying the SCD columns that you want to track
     - For example: the data set has row_key, col1 and col2, and you want to track changes for col2 only. If the incoming source data set includes an existing row_key and only col1 has changed, then do a simple UPSERT (no history required). But if col2 has changed, then do an SCD UPSERT.
   - It shouldn't be triggered if none of the columns have changed
   - hudi_commit_time of the ended (historical) record should probably be t5 in your case as well (since the record got updated)
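
The column-tracking rule in the bullets above can be sketched in plain Python. This is only an illustration of the decision logic, not a Hudi feature or API; the function name `classify_change` and the labels it returns are made up for this example.

```python
def classify_change(existing, incoming, tracked_columns, key="row_key"):
    """Decide how an incoming row should be applied against the existing row.

    Returns one of the hypothetical labels:
      "NO_CHANGE"      - nothing differs, so nothing should be triggered
      "SIMPLE_UPSERT"  - only untracked columns differ (no history required)
      "SCD_UPSERT"     - at least one tracked column differs (keep history)
    """
    # Columns whose value differs between the stored row and the incoming row.
    changed = {
        col for col, value in incoming.items()
        if col != key and existing.get(col) != value
    }
    if not changed:
        return "NO_CHANGE"
    if changed & set(tracked_columns):
        return "SCD_UPSERT"
    return "SIMPLE_UPSERT"


existing = {"row_key": 1, "col1": "a", "col2": "x"}
print(classify_change(existing, {"row_key": 1, "col1": "b", "col2": "x"}, ["col2"]))
print(classify_change(existing, {"row_key": 1, "col1": "a", "col2": "y"}, ["col2"]))
print(classify_change(existing, {"row_key": 1, "col1": "a", "col2": "x"}, ["col2"]))
```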





[GitHub] [hudi] sleapfish edited a comment on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
sleapfish edited a comment on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-789167720


   > IIUC, SCD2 requires all versions of a given record to be maintained inside the table? Hudi does allow you to keep a history of changes to the table, up to a certain time in the past (configured via cleaner settings). If we never cleaned the table, then all changes from time 0 to now would be available. I need to think through what exactly the problems would be if we did that. I can think of the file listing time growing over time, but then with 0.7.0+ we have the metadata table to alleviate that. Also, if these dimension tables are typically much smaller, it may not be an issue per se.
   > 
   > The bones are there for this to work. We will have to spend some time to fully declare that a table can store an infinite amount of changes without ever cleaning (i.e., getting rid of) older versions.
   
   You don't necessarily need to keep older versions, since each change to an SCD 2 table will result in updating (upserting) the older version. Hence, it will now be part of the new commit, with updated effectiveTo and isActive fields.
   
   Whenever a change happens you will have:
   - 1 UPDATE (of an older version, but part of new commit - SET effectiveTo and isActive fields)
     - You can get this record by primary key + isActive = True (or effectiveTo = null)
   - 1 INSERT (new version with effectiveTo = null & isActive = True)
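
The "1 UPDATE + 1 INSERT" pattern described above can be sketched on plain Python dicts. This is an in-memory illustration under stated assumptions, not Hudi code: the helper `apply_scd2_change` is hypothetical, and the field names follow the ones used in this comment (`effectiveFrom` is assumed as the counterpart of `effectiveTo`).

```python
def apply_scd2_change(rows, incoming, change_time):
    """Return a new list of rows after applying one SCD2 change.

    rows        : existing versions, each a dict with recId, effectiveFrom,
                  effectiveTo and isActive
    incoming    : the new attribute values for one recId
    change_time : timestamp at which the change takes effect
    """
    result = []
    for row in rows:
        if row["recId"] == incoming["recId"] and row["isActive"]:
            # UPDATE: end the currently active version.
            row = {**row, "effectiveTo": change_time, "isActive": False}
        result.append(row)
    # INSERT: the new version becomes the active one.
    result.append({
        **incoming,
        "effectiveFrom": change_time,
        "effectiveTo": None,
        "isActive": True,
    })
    return result


table = [{"recId": "rec1", "name": "bob", "effectiveFrom": "t1",
          "effectiveTo": None, "isActive": True}]
table = apply_scd2_change(table, {"recId": "rec1", "name": "bob"}, "t5")
for row in table:
    print(row)
```

If both versions are to live in the same Hudi table, they would presumably need distinct record keys (for example, recId combined with effectiveFrom), since an upsert keeps one row per record key; treat that as an assumption to verify rather than a confirmed recipe.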





[GitHub] [hudi] nsivabalan edited a comment on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-774317208


   thanks @sleapfish for clarifying. 
   I don't think any code flow in Hudi updates existing records in place as of today. Everything is like an append that writes a newer version of the record. @vinothchandar @bvaradar @n3nash: your thoughts?








[GitHub] [hudi] nsivabalan commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-773374732


   hey folks, let me try to understand your use-case better. I am not aware of SCD2 and found [this](https://adatis.co.uk/introduction-to-slowly-changing-dimensions-scd-types/) through my friend (google ;) ). I will illustrate w/ an example; let me know if my understanding is right.
   
   At t1 (C1 commit)
   // incoming record
   recId | name | ... all cols ... | effective from | effective to
   ----- | ---- | ---------------- | -------------- | ------------
   rec1  | bob  | ...              | t1             | null
   
   this record will be stored as is in hudi w/ some additional hudi meta fields
   recId | name | ... all cols ... | effective from | effective to | hudi_commit_time | ... other meta fields
   ----- | ---- | ---------------- | -------------- | ------------ | ---------------- | ---------------------
   rec1  | bob  | ...              | t1             | null         | t1               | ...
   
   At t5 (C2 commit)
   // incoming record
   recId | name | ... all cols ... | effective from | effective to
   ----- | ---- | ---------------- | -------------- | ------------
   rec1  | bob  | ...              | t5             | null
   
   // when we merge this w/ hudi, you want to have the following rows in hudi
   recId | name | ... all cols ... | effective from | effective to | hudi_commit_time | ... other meta fields
   ----- | ---- | ---------------- | -------------- | ------------ | ---------------- | ---------------------
   rec1  | bob  | ...              | t1             | t5           | t1               | ...
   rec1  | bob  | ...              | t5             | null         | t5               | ...
   
   Let me know if this is what you are looking for. We can discuss further.
   





[GitHub] [hudi] nsivabalan edited a comment on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan edited a comment on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-773374732


   hey folks, let me try to understand your use-case better. I am not aware of SCD2 and found [this](https://adatis.co.uk/introduction-to-slowly-changing-dimensions-scd-types/) through my friend (google ;) ). I will illustrate w/ an example; let me know if my understanding is right.
   
   At t1 (C1 commit)
   // incoming record
   recId | name | ... all cols ... | effective from | effective to | isActive
   ----- | ---- | ---------------- | -------------- | ------------ | --------
   rec1  | bob  | ...              | t1             | null         | true
   
   this record will be stored as is in hudi w/ some additional hudi meta fields
   recId | name | ... all cols ... | effective from | effective to | isActive | hudi_commit_time | ... other meta fields
   ----- | ---- | ---------------- | -------------- | ------------ | -------- | ---------------- | ---------------------
   rec1  | bob  | ...              | t1             | null         | true     | t1               | ...
   
   At t5 (C2 commit)
   // incoming record
   recId | name | ... all cols ... | effective from | effective to | isActive
   ----- | ---- | ---------------- | -------------- | ------------ | --------
   rec1  | bob  | ...              | t5             | null         | true
   
   // when we merge this w/ hudi, you want to have the following rows in hudi
   recId | name | ... all cols ... | effective from | effective to | isActive | hudi_commit_time | ... other meta fields
   ----- | ---- | ---------------- | -------------- | ------------ | -------- | ---------------- | ---------------------
   rec1  | bob  | ...              | t1             | t5           | false    | t1               | ...
   rec1  | bob  | ...              | t5             | null         | true     | t5               | ...
   
   Let me know if this is what you are looking for. We can discuss further.
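
To show what the two resulting rows buy you, here is a small sketch of a point-in-time ("as of") lookup over them. It is plain Python, not a Hudi query; the helper `version_as_of` is hypothetical, t1 and t5 are modelled as the integers 1 and 5, and treating `effective_to` as an exclusive upper bound is an assumption of this example.

```python
def version_as_of(rows, rec_id, ts):
    """Return the version of rec_id that was effective at time ts, or None."""
    for row in rows:
        if row["recId"] != rec_id:
            continue
        started = row["effective_from"] <= ts
        # A null (None) effective_to means the version is still open.
        not_ended = row["effective_to"] is None or ts < row["effective_to"]
        if started and not_ended:
            return row
    return None


# The two rows from the merged table, with t1 -> 1 and t5 -> 5.
rows = [
    {"recId": "rec1", "name": "bob", "effective_from": 1, "effective_to": 5, "isActive": False},
    {"recId": "rec1", "name": "bob", "effective_from": 5, "effective_to": None, "isActive": True},
]

print(version_as_of(rows, "rec1", 3))  # the historical version
print(version_as_of(rows, "rec1", 7))  # the currently active version
```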
   








[GitHub] [hudi] git-raj commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
git-raj commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-766523668


   Using AWS Glue PySpark and Hudi with S3 as the data store, I'm trying to do the traditional SCD Type 2, where the old record gets updated with the insert datetime in the 'effective to' field and its 'isActive' field becomes 'false', and a new row is inserted with the insert datetime in the 'effective from' field and 'isActive' set to 'true'. Any solution post or pointers on how to solve that would be highly appreciated.





[GitHub] [hudi] tooptoop4 commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
tooptoop4 commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-823247153


   https://aws.amazon.com/blogs/big-data/build-slowly-changing-dimensions-type-2-scd2-with-apache-spark-and-apache-hudi-on-amazon-emr/





[GitHub] [hudi] bvaradar commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
bvaradar commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-735937171


   Hudi provides custom merging semantics. You can plug in your own payload implementation (see HoodieRecordPayload.java) that applies custom merging logic instead of overwriting. Can you explore that and see if it satisfies your requirement?





[GitHub] [hudi] sleapfish commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
sleapfish commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-789167720


   > IIUC, SCD2 requires all versions of a given record to be maintained inside the table? Hudi does allow you to keep a history of changes to the table, up to a certain time in the past (configured via cleaner settings). If we never cleaned the table, then all changes from time 0 to now would be available. I need to think through what exactly the problems would be if we did that. I can think of the file listing time growing over time, but then with 0.7.0+ we have the metadata table to alleviate that. Also, if these dimension tables are typically much smaller, it may not be an issue per se.
   > 
   > The bones are there for this to work. We will have to spend some time to fully declare that a table can store an infinite amount of changes without ever cleaning (i.e., getting rid of) older versions.
   
   You don't necessarily need to keep older versions, since each change to an SCD 2 table will result in updating (upserting) the older version. Hence, it will now be part of the new commit, with updated effectiveTo and isActive fields.
   
   Whenever a change happens you will have:
   - 1 INSERT (new version with effectiveTo = null & isActive = True)
   - 1 UPDATE (of an older version, but part of new commit)





[GitHub] [hudi] nsivabalan commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-786611593


   Unfortunately, I don't think Hudi has support for updating already written records. All we can do is compare the old and the new incoming records based on a field (PreCombine) and construct the new payload (aka row).
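
As a rough conceptual sketch of that compare-and-construct step (plain Python, not Hudi's actual Java payload API): the function `merge_by_precombine` and the field name `ts` are hypothetical, and letting ties go to the incoming record is an assumption of this example.

```python
def merge_by_precombine(stored, incoming, precombine_field):
    """Pick the row that survives for one record key.

    The row with the larger precombine value wins; on a tie the incoming
    row is kept (an assumption of this sketch).
    """
    if stored is None:
        return incoming
    if incoming[precombine_field] >= stored[precombine_field]:
        return incoming
    return stored


stored = {"recId": "rec1", "name": "bob", "ts": 1}
newer = {"recId": "rec1", "name": "bobby", "ts": 5}
late = {"recId": "rec1", "name": "rob", "ts": 0}

print(merge_by_precombine(stored, newer, "ts"))  # newer row replaces the stored one
print(merge_by_precombine(stored, late, "ts"))   # late-arriving older row is ignored
```

Either way, a single row per key comes out of the merge, which is why the SCD2 "end the old row and add a new one" behaviour does not fall out of this mechanism on its own.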
   





[GitHub] [hudi] sleapfish edited a comment on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
sleapfish edited a comment on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-789167720


   > IIUC, SCD2 requires all versions of a given record to be maintained inside the table? Hudi does allow you to keep a history of changes to the table, up to a certain time in the past (configured via cleaner settings). If we never cleaned the table, then all changes from time 0 to now would be available. I need to think through what exactly the problems would be if we did that. I can think of the file listing time growing over time, but then with 0.7.0+ we have the metadata table to alleviate that. Also, if these dimension tables are typically much smaller, it may not be an issue per se.
   > 
   > The bones are there for this to work. We will have to spend some time to fully declare that a table can store an infinite amount of changes without ever cleaning (i.e., getting rid of) older versions.
   
   @vinothchandar You don't necessarily need to keep older versions, since each change to an SCD 2 table will result in updating (upserting) the older version. Hence, it will now be part of the new commit, with updated effectiveTo and isActive fields.
   
   Whenever a change happens you will have:
   - 1 UPDATE (of an older version, but part of new commit - SET effectiveTo and isActive fields)
     - You can get this record by primary key + isActive = True (or effectiveTo = null)
   - 1 INSERT (new version with effectiveTo = null & isActive = True)








[GitHub] [hudi] saumyasuhagiya commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
saumyasuhagiya commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-784979698


   @nsivabalan Hi, just wanted to check whether there is any update on this.











[GitHub] [hudi] nsivabalan commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
nsivabalan commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-774317208


   I don't think any code flow in Hudi updates existing records in place as of today. Everything is like an append that writes a newer version of the record. @vinothchandar @bvaradar @n3nash: your thoughts?





[GitHub] [hudi] vinothchandar commented on issue #2284: [SUPPORT] : Is there a option to achieve SCD 2 in Hudi?

Posted by GitBox <gi...@apache.org>.
vinothchandar commented on issue #2284:
URL: https://github.com/apache/hudi/issues/2284#issuecomment-789132283


   IIUC, SCD2 requires all versions of a given record to be maintained inside the table? Hudi does allow you to keep a history of changes to the table, up to a certain time in the past (configured via cleaner settings). If we never cleaned the table, then all changes from time 0 to now would be available. I need to think through what exactly the problems would be if we did that. I can think of the file listing time growing over time, but then with 0.7.0+ we have the metadata table to alleviate that. Also, if these dimension tables are typically much smaller, it may not be an issue per se.
   
   The bones are there for this to work. We will have to spend some time to fully declare that a table can store an infinite amount of changes without ever cleaning (i.e., getting rid of) older versions.




