Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/08/24 20:36:31 UTC
[GitHub] [hudi] nsivabalan commented on issue #1955: [SUPPORT] DMS partition treated as part of pk

nsivabalan commented on issue #1955:
URL: https://github.com/apache/hudi/issues/1955#issuecomment-679352339


   @tooptoop4: can you clarify what you mean by this?
   ```
   ie for each version_no,group_company combo, i want to get the latest row by TimeCreated (ie the source-ordering-field) and then partition on whatever sys_user that latest row has.
   ```
   But in general, yes: if you use a global index with the update partition path config set, you should not see any duplicates in your entire hoodie dataset.
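
For reference, here is a rough sketch of the write options such a setup might use, written as a Python dict of the kind passed to a Spark datasource write. The column names come from the question above. That `hoodie.bloom.index.update.partition.path` is the "update partition path" config meant here is an assumption on my part, so please check the option names and values against the configuration docs for your Hudi version.

```python
# Sketch only: option names should be verified against your Hudi version.
hudi_options = {
    # Composite record key: version_no + group_company. A composite key
    # also needs a key generator that supports multiple fields (not shown).
    "hoodie.datasource.write.recordkey.field": "version_no,group_company",
    # Partition on sys_user.
    "hoodie.datasource.write.partitionpath.field": "sys_user",
    # Source ordering field: the latest TimeCreated wins.
    "hoodie.datasource.write.precombine.field": "TimeCreated",
    # Global index: record keys are unique across all partitions.
    "hoodie.index.type": "GLOBAL_BLOOM",
    # Assumed to be the "update partition path" config discussed here.
    "hoodie.bloom.index.update.partition.path": "true",
}

for name, value in hudi_options.items():
    print(f"{name}={value}")
```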
   
   I can try to illustrate with an example. Let's say each row consists of only 4 values: v_no (version no), cmp (group_company), time_cr, sys_user. The record key is (v_no, cmp) and the partition path is sys_user.
   In the case of a regular index, the combination of record key and partition path forms the unique key.
   
   If you are using a regular index and ingest:
   v_1, c_1, t_1, u_1
   v_2, c_1, t_1, u_1
   v_1, c_1, t_1, u_2
   v_1, c_1, t_1, u_3
   
   This will result in 2 rows going to partition u_1, 1 row to partition u_2, and one row to u_3. 
   
   In the 2nd batch of updates, let's say you ingest a few more rows:
   v_1, c_1, t_2, u_1
   v_3, c_1, t_2, u_1
   v_1, c_2, t_2, u_2
   v_1, c_3, t_2, u_3
   
   Here is the result
   u_1:
   v_1, c_1, t_2, u_1 (updated with latest value)
   v_2, c_1, t_1, u_1
   v_3, c_1, t_2, u_1 (insert from 2nd batch)
   u_2:
   v_1, c_1, t_1, u_2
   v_1, c_2, t_2, u_2 (insert from 2nd batch)
   u_3:
   v_1, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (insert from 2nd batch)
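
The regular index behaviour above can be mimicked with a small toy model. This is only a sketch and not Hudi code: the function names are made up for illustration, and t_1 and t_2 are represented as the integers 1 and 2. The point it shows is that the lookup key is (partition path, record key), so a record key that is new to a partition is inserted next to the rows already there, and only a match on both parts is treated as an update.

```python
# Toy model of a regular (partition-scoped) index. This is not Hudi code.
# Each row is (v_no, cmp, time_cr, sys_user); times are ints so t_1 < t_2.

def upsert_regular(table, batch):
    """Upsert rows keyed on (partition path, record key); newest time_cr wins."""
    for v_no, cmp, time_cr, sys_user in batch:
        key = (sys_user, (v_no, cmp))
        existing = table.get(key)
        if existing is None or time_cr >= existing[2]:
            table[key] = (v_no, cmp, time_cr, sys_user)
    return table

def by_partition(table):
    """Group the stored rows by partition path for easier reading."""
    grouped = {}
    for row in table.values():
        grouped.setdefault(row[3], []).append(row)
    return {p: sorted(rows) for p, rows in sorted(grouped.items())}

batch1 = [
    ("v_1", "c_1", 1, "u_1"),
    ("v_2", "c_1", 1, "u_1"),
    ("v_1", "c_1", 1, "u_2"),
    ("v_1", "c_1", 1, "u_3"),
]
batch2 = [
    ("v_1", "c_1", 2, "u_1"),
    ("v_3", "c_1", 2, "u_1"),
    ("v_1", "c_2", 2, "u_2"),
    ("v_1", "c_3", 2, "u_3"),
]

regular_table = upsert_regular(upsert_regular({}, batch1), batch2)
for partition, rows in by_partition(regular_table).items():
    print(partition, rows)
```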
   
   In the case of a global index, record keys alone are unique across the whole dataset, regardless of partition.
   Let's see an example with global bloom, but with the update partition path config not set.
   
   If the 1st batch of ingest contains:
   v_1, c_1, t_1, u_1
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2
   v_3, c_1, t_1, u_3
   
   the result will be:
   
   v_1, c_1, t_1, u_1 
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2 
   v_3, c_1, t_1, u_3
   
   And the 2nd batch of ingest contains:
   v_1, c_1, t_2, u_1 (updating with latest time)
   v_1, c_2, t_2, u_2 (moving v_1, c_2 from u_1 to u_2). The expectation is that this will update the record in u_1 only, since the config is not set, and hence the new partition path (u_2) will be ignored.
   v_2, c_2, t_2, u_2 (new insert)
   v_1, c_3, t_2, u_3 (new insert)
   
   So, the result will be
   v_1, c_1, t_2, u_1 (updated with latest time)
   v_1, c_2, t_2, u_1 (updated with latest time even though incoming record was sent to u_2)
   v_2, c_1, t_1, u_2 
   v_2, c_2, t_2, u_2 (new insert)
   v_3, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (new insert)
   
   We can go through the same example with the config value set.
   
   result from first batch:
   v_1, c_1, t_1, u_1 
   v_1, c_2, t_1, u_1
   v_2, c_1, t_1, u_2 
   v_3, c_1, t_1, u_3
   
   And the 2nd batch of ingest contains:
   v_1, c_1, t_2, u_1 (updating with latest time)
   v_1, c_2, t_2, u_2 (moving v_1, c_2 from u_1 to u_2). The expectation is that this will insert a new record into u_2 and delete the corresponding record from u_1, since the config is set.
   v_2, c_2, t_2, u_2 (new insert)
   v_1, c_3, t_2, u_3 (new insert)
   
   So, the result will be
   v_1, c_1, t_2, u_1 (updated with latest time)
   v_1, c_2, t_2, u_2 (updated with latest time and old record is deleted)
   v_2, c_1, t_1, u_2 
   v_2, c_2, t_2, u_2 (new insert)
   v_3, c_1, t_1, u_3
   v_1, c_3, t_2, u_3 (new insert)
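
Both global index cases can be compared side by side with a similar toy model. Again, this is a sketch and not Hudi code: `upsert_global` and its `update_partition_path` flag are hypothetical names standing in for the config discussed above, and t_1 and t_2 are the integers 1 and 2. The record key alone identifies a row, and the flag decides whether an update keeps the old partition or moves to the incoming one.

```python
# Toy model of a global index. This is not Hudi code.
# Each row is (v_no, cmp, time_cr, sys_user); the record key is (v_no, cmp).

def upsert_global(table, batch, update_partition_path):
    """Upsert rows keyed on record key only, across all partitions."""
    for v_no, cmp, time_cr, sys_user in batch:
        key = (v_no, cmp)
        existing = table.get(key)
        if existing is None:
            # Unknown key anywhere in the dataset: plain insert.
            table[key] = (v_no, cmp, time_cr, sys_user)
        elif time_cr >= existing[2]:
            # Known key: keep the old partition unless the flag is set.
            partition = sys_user if update_partition_path else existing[3]
            table[key] = (v_no, cmp, time_cr, partition)
    return table

batch1 = [
    ("v_1", "c_1", 1, "u_1"),
    ("v_1", "c_2", 1, "u_1"),
    ("v_2", "c_1", 1, "u_2"),
    ("v_3", "c_1", 1, "u_3"),
]
batch2 = [
    ("v_1", "c_1", 2, "u_1"),  # update in place
    ("v_1", "c_2", 2, "u_2"),  # same key, new partition
    ("v_2", "c_2", 2, "u_2"),  # new insert
    ("v_1", "c_3", 2, "u_3"),  # new insert
]

def run(update_partition_path):
    """Apply both batches to an empty table with the given flag."""
    table = upsert_global({}, batch1, update_partition_path)
    return upsert_global(table, batch2, update_partition_path)

config_not_set = run(False)
config_set = run(True)
print("not set:", config_not_set[("v_1", "c_2")])
print("set:    ", config_set[("v_1", "c_2")])
```

In both runs the dataset ends up with 6 rows and no duplicate record keys; the only difference is which partition the (v_1, c_2) row lives in.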
   
   
   
   
   

