Posted to commits@hudi.apache.org by "leesf (Jira)" <ji...@apache.org> on 2020/08/23 00:19:00 UTC

[jira] [Closed] (HUDI-1083) Minor optimization in Determining insert bucket location for a given key

     [ https://issues.apache.org/jira/browse/HUDI-1083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

leesf closed HUDI-1083.
-----------------------

> Minor optimization in Determining insert bucket location for a given key
> ------------------------------------------------------------------------
>
>                 Key: HUDI-1083
>                 URL: https://issues.apache.org/jira/browse/HUDI-1083
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: Writer Core
>            Reporter: sivabalan narayanan
>            Assignee: shenh062326
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.6.1
>
>
> This is how the bucket for a given key is currently determined.
> In every partition, we find all insert buckets and assign each one a weight.
> For example, weights of 0.2, 0.3, 0.5 for a partition with 100 records to be inserted mean that 20 records go into B0, 30 into B1 and 50 into B2.
> Within getPartition(Object key), we walk linearly through the bucket weights to find the right bucket for a given key. For instance, if mod(hash value) works out to 90/100 = 0.9, we keep adding bucket weights until the running total exceeds 0.9.
> Instead, we could calculate the cumulative weights upfront and do a binary search within getPartition().
> For the weights above, the cumulative weights are 0.2, 0.5, 1.0.
> With mod(hash value), a binary search over the cumulative weights finds the right bucket and cuts the per-key lookup cost from O(N) to O(log N), where N is the number of insert buckets in the partition.
>  
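
The proposal in the description can be illustrated with a small, self-contained Java sketch. This is not Hudi's actual partitioner code: the class name InsertBucketLookup and the methods cumulativeWeights, linearLookup and binaryLookup are hypothetical names chosen for illustration, and the sketch assumes the key's hash has already been reduced to a fraction r in [0, 1). It shows the linear walk next to the binary search over precomputed cumulative weights, using the 0.2, 0.3, 0.5 example from the issue.

```java
public class InsertBucketLookup {

    // Prefix sums of the per-bucket weights, e.g. {0.2, 0.3, 0.5} -> {0.2, 0.5, 1.0}.
    // Computed once per partition, not once per key.
    static double[] cumulativeWeights(double[] weights) {
        double[] cumulative = new double[weights.length];
        double running = 0.0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            cumulative[i] = running;
        }
        return cumulative;
    }

    // Linear walk, O(N) per key: keep adding weights until the
    // running total exceeds r, then return that bucket's index.
    static int linearLookup(double[] weights, double r) {
        double running = 0.0;
        for (int i = 0; i < weights.length; i++) {
            running += weights[i];
            if (r < running) {
                return i;
            }
        }
        // Guard against rounding when the weights sum to slightly less than 1.
        return weights.length - 1;
    }

    // Binary search, O(log N) per key: find the first index whose
    // cumulative weight is strictly greater than r.
    static int binaryLookup(double[] cumulative, double r) {
        int lo = 0;
        int hi = cumulative.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cumulative[mid] > r) {
                hi = mid;      // answer is mid or something to its left
            } else {
                lo = mid + 1;  // answer is strictly to the right of mid
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        double[] weights = {0.2, 0.3, 0.5};            // B0, B1, B2 from the issue
        double[] cumulative = cumulativeWeights(weights);
        double r = 90 / 100.0;                         // the 0.9 example from the issue
        System.out.println("linear: B" + linearLookup(weights, r));
        System.out.println("binary: B" + binaryLookup(cumulative, r));
    }
}
```

Both lookups use the same "strictly greater than r" rule, so they pick the same bucket for any r; only the per-key cost differs. The standard library's Arrays.binarySearch could replace the hand-written loop, but its negative "insertion point" return value for non-exact matches needs extra handling, so the explicit loop is used here for clarity.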



--
This message was sent by Atlassian Jira
(v8.3.4#803005)