
Posted to commits@hudi.apache.org by "sivabalan narayanan (Jira)" <ji...@apache.org> on 2020/07/08 14:47:00 UTC

[jira] [Created] (HUDI-1083) Minor optimization in Determining insert bucket location for a given key

sivabalan narayanan created HUDI-1083:
-----------------------------------------

             Summary: Minor optimization in Determining insert bucket location for a given key
                 Key: HUDI-1083
                 URL: https://issues.apache.org/jira/browse/HUDI-1083
             Project: Apache Hudi
          Issue Type: Improvement
          Components: Writer Core
            Reporter: sivabalan narayanan


Currently, the insert bucket for a given key is determined as follows.

For every partition, we find all insert buckets and assign each one a weight.

For example, weights of 0.2, 0.3, and 0.5 for a partition with 100 records to insert mean that 20 records go into B0, 30 into B1, and 50 into B2.

Within getPartition(Object key), we walk linearly through the bucket weights to find the right bucket for the key. For instance, if mod(hash value) works out to 90/100 = 0.9, we keep adding bucket weights until the running total exceeds 0.9.
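
As a rough illustration of the linear walk described above, here is a minimal Java sketch. It is not Hudi's actual code: the class name LinearBucketLookup, the method findBucket, and the parameter r (the normalized hash value, assumed to lie in [0, 1)) are hypothetical names chosen for this example.

```java
// Illustrative sketch only; names are hypothetical and not taken from Hudi.
final class LinearBucketLookup {

    /**
     * Returns the index of the first bucket at which the running total of
     * weights exceeds r, where r is the normalized hash value in [0, 1).
     * Cost is O(N) in the number of buckets.
     */
    static int findBucket(double[] weights, double r) {
        double runningTotal = 0.0;
        for (int i = 0; i < weights.length; i++) {
            runningTotal += weights[i];
            if (runningTotal > r) {
                return i;
            }
        }
        // Guard against floating-point rounding: fall back to the last bucket.
        return weights.length - 1;
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.3, 0.5}; // B0, B1, B2 from the example
        System.out.println(findBucket(weights, 0.9)); // prints 2
    }
}
```

With the example weights, r = 0.9 is only exceeded once all three weights have been added, so the key lands in B2.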

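Instead, we could compute the cumulative weights up front and do a binary search over them within getPartition().

For the example above, the cumulative weights would be 0.2, 0.5, 1.0.

With mod(hash value), a binary search would then find the right bucket, cutting the lookup cost from O(N) to O(log N), where N is the number of insert buckets in the partition.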
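
The proposal could be sketched in Java roughly as follows. This is an assumption-laden illustration, not the eventual Hudi change: the class name CumulativeBucketLookup, the methods cumulativeWeights and findBucket, and the parameter r (the normalized hash value, assumed to lie in [0, 1)) are all hypothetical. The cumulative array would be built once per partition, and each lookup would then search it for the first entry greater than r.

```java
// Illustrative sketch only; names are hypothetical and not taken from Hudi.
final class CumulativeBucketLookup {

    /** Builds the prefix sums once, e.g. {0.2, 0.3, 0.5} becomes {0.2, 0.5, 1.0}. */
    static double[] cumulativeWeights(double[] weights) {
        double[] cumulative = new double[weights.length];
        double runningTotal = 0.0;
        for (int i = 0; i < weights.length; i++) {
            runningTotal += weights[i];
            cumulative[i] = runningTotal;
        }
        return cumulative;
    }

    /**
     * Binary search for the first index whose cumulative weight exceeds r.
     * Cost is O(log N). If no entry exceeds r (possible only through
     * floating-point rounding), the last bucket is returned.
     */
    static int findBucket(double[] cumulative, double r) {
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > r) {
                hi = mid;      // mid may be the answer; keep it in range
            } else {
                lo = mid + 1;  // answer lies strictly to the right
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] cumulative = cumulativeWeights(new double[] {0.2, 0.3, 0.5});
        System.out.println(findBucket(cumulative, 0.9)); // prints 2
    }
}
```

The search keeps the same "first running total that exceeds the value" rule as the linear walk, so both approaches should pick the same bucket for a given key.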

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)