Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/05/10 13:28:00 UTC

[jira] [Resolved] (HUDI-432) Benchmark HFile for scan vs seek

     [ https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar resolved HUDI-432.
---------------------------------
    Resolution: Fixed

> Benchmark HFile for scan vs seek
> --------------------------------
>
>                 Key: HUDI-432
>                 URL: https://issues.apache.org/jira/browse/HUDI-432
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Performance, Storage Management
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.6.0
>
>         Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 AM.png
>
>
> We want to benchmark HFile scan vs seek, as we intend to use HFile for record indexing. HFile will be used inline in the Hudi log for index purposes.
> So, as part of benchmarking, we want to see at what point scan outperforms seek.
> This is our experiment setup.
> keysToRead = number of keys to be looked up. // varies across experiment runs, e.g. 100k, 200k, 500k, 1M.
> N = number of iterations
>  
> {code:java}
> // Setup: 1M entries were written to a single HFile as key-value pairs.
> // The keys were also stored in a separate file (key_file).
> keyList = read all keys from key_file
> for N iterations
> {
>     shuffle keyList
>     trim the list to keysToRead
>     start timer
>     run HFile read benchmark (scan or seek)
>     end timer
> }
> compute the average of all captured timer values
> {code}
>  
>  
> Result:
> Scan starts to outperform seek at somewhere around 350k to 400k lookups out of 1M entries, with optimized configs.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
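
The experiment loop described in the issue (shuffle the keys, trim to keysToRead, time one read pass, average over N iterations) could be sketched roughly as below. This is only an illustration, not the benchmark source linked above: an in-memory TreeMap stands in for the HFile, and the class and method names (ScanVsSeekSketch, buildStore, seekLookup, scanLookup, averageMillis) are hypothetical. Because nothing is read from disk or S3, the timings it prints will not reproduce the crossover reported in the issue.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

class ScanVsSeekSketch {

    // Assumption: a sorted in-memory map stands in for the HFile.
    static TreeMap<String, String> buildStore(int entries) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < entries; i++) {
            store.put(String.format("key-%09d", i), "value-" + i);
        }
        return store;
    }

    // "Seek": one point lookup per requested key.
    static int seekLookup(TreeMap<String, String> store, List<String> keys) {
        int found = 0;
        for (String key : keys) {
            if (store.get(key) != null) {
                found++;
            }
        }
        return found;
    }

    // "Scan": a single pass over every entry, keeping only the wanted keys.
    static int scanLookup(TreeMap<String, String> store, List<String> keys) {
        Set<String> wanted = new HashSet<>(keys);
        int found = 0;
        for (Map.Entry<String, String> entry : store.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                found++;
            }
        }
        return found;
    }

    // Mirrors the loop in the issue: shuffle, trim, time the read, average.
    static double averageMillis(TreeMap<String, String> store, List<String> allKeys,
                                int keysToRead, int iterations, boolean scan, Random rng) {
        List<String> keyList = new ArrayList<>(allKeys);
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            Collections.shuffle(keyList, rng);
            List<String> batch = new ArrayList<>(keyList.subList(0, keysToRead));
            long start = System.nanoTime();
            int found = scan ? scanLookup(store, batch) : seekLookup(store, batch);
            totalNanos += System.nanoTime() - start;
            if (found != keysToRead) {
                throw new IllegalStateException("expected every key to be found");
            }
        }
        return totalNanos / 1e6 / iterations;
    }

    public static void main(String[] args) {
        // Sizes are scaled down from the 1M entries used in the issue.
        TreeMap<String, String> store = buildStore(100_000);
        List<String> allKeys = new ArrayList<>(store.keySet());
        Random rng = new Random(42);
        for (int keysToRead : new int[] {10_000, 20_000, 50_000, 100_000}) {
            double seekMs = averageMillis(store, allKeys, keysToRead, 5, false, rng);
            double scanMs = averageMillis(store, allKeys, keysToRead, 5, true, rng);
            System.out.printf("keysToRead=%d seek=%.2f ms scan=%.2f ms%n",
                    keysToRead, seekMs, scanMs);
        }
    }
}
```

To approximate the real experiment, seekLookup and scanLookup would have to be replaced with reads against an actual HFile, since the trade-off being measured is largely about I/O and block access rather than in-memory lookups.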



--
This message was sent by Atlassian Jira
(v8.3.4#803005)