Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/07/03 13:21:58 UTC

[GitHub] [hudi] nsivabalan commented on pull request #1469: [HUDI-686] Implement BloomIndexV2 that does not depend on memory caching

nsivabalan commented on pull request #1469:
URL: https://github.com/apache/hudi/pull/1469#issuecomment-653546494


   @lamber-ken @vinothchandar : I took a stab at the global bloom index V2. I don't have permissions on lamberken's repo and hence couldn't update his branch. Here are my [branch](https://github.com/nsivabalan/hudi/tree/bloomIndexV2) and [commit](https://github.com/nsivabalan/hudi/commit/7f59a67743bbeee162181e2a2ca725fe9656cb8f) links. Please check them out. I have added and fixed tests for it.
   
   Also, I have two questions/clarifications.
   1: With the regular bloom index V2, why do we need to sort by both partition path and record key? Wouldn't sorting by partition path alone suffice?
   2: Correct me if I am wrong, but there is one corner case that both bloom index V2 and the global version need to handle. The fix might incur an additional left outer join, so I wanted to confirm whether it is feasible.
   Let's say that, for an incoming record, one or more candidate files are returned after the range and bloom lookup, but in the key checker none of those files actually contains the record key. In this scenario, the output of tag location may not contain the record at all.
   
   If this case is possible, the fix I can think of is as follows.
   Do not return empty candidates from LazyRangeAndBloomChecker, so that the result after LazyKeyChecker does not contain such records. With this fix, LazyKeyChecker returns only records that already exist in storage. Once we have the result from LazyKeyChecker, we may have to do a left outer join with the incoming records to find the non-existent records and add them to the final tagged record list.
   
   A similar fix needs to be made in the global version as well.
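
To make the left outer join proposed above concrete, here is a minimal, Spark-free sketch in plain Java. It is not the code in the PR: `TagLocationSketch`, `Tagged`, and `tagWithLeftOuterJoin` are hypothetical names, and the map of confirmed locations only stands in for whatever LazyKeyChecker would return. The point is just that joining from the incoming side keeps every incoming record in the output, tagged or not.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical illustration only; none of these names come from Hudi.
final class TagLocationSketch {

    /** A record key plus the file it was found in, if any. */
    static final class Tagged {
        final String recordKey;
        final Optional<String> fileId;

        Tagged(String recordKey, Optional<String> fileId) {
            this.recordKey = recordKey;
            this.fileId = fileId;
        }
    }

    /**
     * Left outer join: incoming keys are the left side, confirmed
     * key -> fileId matches (assumed output of the key check) are the right.
     * Keys without a confirmed match are kept with an empty location,
     * so they are treated as new records instead of being dropped.
     */
    static List<Tagged> tagWithLeftOuterJoin(List<String> incomingKeys,
                                             Map<String, String> confirmedLocations) {
        List<Tagged> result = new ArrayList<>(incomingKeys.size());
        for (String key : incomingKeys) {
            result.add(new Tagged(key, Optional.ofNullable(confirmedLocations.get(key))));
        }
        return result;
    }

    public static void main(String[] args) {
        // k1 exists in file f1; k2 was a bloom false positive; k3 had no candidates.
        List<String> incoming = List.of("k1", "k2", "k3");
        Map<String, String> confirmed = new HashMap<>();
        confirmed.put("k1", "f1");

        for (Tagged t : tagWithLeftOuterJoin(incoming, confirmed)) {
            System.out.println(t.recordKey + " -> " + t.fileId.orElse("(new record)"));
        }
    }
}
```

In a Spark job the same shape would come from a left outer join of the incoming records with the key-checker output on the record key (plus partition path for the non-global index), which is where the extra join cost mentioned above comes from.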
   
   
   
   

