Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/05/10 13:28:00 UTC
[jira] [Resolved] (HUDI-432) Benchmark HFile for scan vs seek
[ https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinoth Chandar resolved HUDI-432.
---------------------------------
Resolution: Fixed
> Benchmark HFile for scan vs seek
> --------------------------------
>
> Key: HUDI-432
> URL: https://issues.apache.org/jira/browse/HUDI-432
> Project: Apache Hudi (incubating)
> Issue Type: Sub-task
> Components: Performance, Storage Management
> Reporter: sivabalan narayanan
> Assignee: sivabalan narayanan
> Priority: Major
> Fix For: 0.6.0
>
> Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 AM.png
>
>
> We want to benchmark HFile scan vs. seek because we intend to use HFile for record indexing. The HFile will be stored inline in the Hudi log for indexing purposes.
> As part of this benchmark, we want to find the point at which scan outperforms seek.
> This is our experiment setup:
> keysToRead = number of keys to look up // varies across experiment runs: 100k, 200k, 500k, 1M
> N = number of iterations
>
> {code:java}
> // Setup: 1M entries were written to a single HFile as key-value pairs.
> // The keys were also stored in a separate file (key_file).
> keyList = read all keys from key_file
> for N iterations
> {
>   shuffle keyList
>   take the first keysToRead keys of the shuffled list
>   start timer
>   run HFile read benchmark (scan or seek)
>   end timer
> }
> compute the average of all captured timings
> {code}
>
>
> Result:
> Scan outperforms seek somewhere between 350k and 400k lookups out of 1M entries with optimized configs.
> !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here:
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
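The benchmark loop described in the issue can be sketched roughly as follows. This is a minimal illustration, not the linked benchmark source: a `TreeMap` stands in for the HFile (an assumption made so the example is self-contained), and the class and method names (`ScanVsSeekSketch`, `seekLookup`, `scanLookup`, `averageMillis`, `buildStore`) are hypothetical. Because everything is in memory, the timings it prints say nothing about where the crossover falls for a real HFile on disk or on S3. It only shows the shuffle, trim, time, and average structure.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

// Sketch of the benchmark loop only. A TreeMap stands in for the HFile
// (assumption), so absolute timings do not reflect real HFile behaviour.
class ScanVsSeekSketch {

    // "Seek": one point lookup per requested key.
    static int seekLookup(TreeMap<String, String> store, List<String> keys) {
        int found = 0;
        for (String key : keys) {
            if (store.get(key) != null) {
                found++;
            }
        }
        return found;
    }

    // "Scan": walk every entry in key order and count those that were requested.
    static int scanLookup(TreeMap<String, String> store, List<String> keys) {
        Set<String> wanted = new HashSet<>(keys);
        int found = 0;
        for (Map.Entry<String, String> entry : store.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                found++;
            }
        }
        return found;
    }

    // Average wall-clock time in milliseconds over `iterations` runs.
    static double averageMillis(TreeMap<String, String> store, List<String> allKeys,
                                int keysToRead, int iterations, boolean useScan, Random rng) {
        List<String> keyList = new ArrayList<>(allKeys);
        long totalNanos = 0;
        for (int i = 0; i < iterations; i++) {
            Collections.shuffle(keyList, rng);                     // shuffle keyList
            List<String> sample = keyList.subList(0, keysToRead);  // trim to keysToRead
            long start = System.nanoTime();                        // start timer
            int found = useScan ? scanLookup(store, sample) : seekLookup(store, sample);
            totalNanos += System.nanoTime() - start;               // end timer
            if (found != keysToRead) {
                throw new IllegalStateException("expected " + keysToRead + " hits, got " + found);
            }
        }
        return totalNanos / 1e6 / iterations;
    }

    // Builds the stand-in store with zero-padded keys so they sort numerically.
    static TreeMap<String, String> buildStore(int entries) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < entries; i++) {
            store.put(String.format("key-%07d", i), "value-" + i);
        }
        return store;
    }

    public static void main(String[] args) {
        // The issue used 1M entries; a smaller store keeps this sketch quick.
        TreeMap<String, String> store = buildStore(100_000);
        List<String> allKeys = new ArrayList<>(store.keySet());
        Random rng = new Random(42);
        for (int keysToRead : new int[] {10_000, 20_000, 50_000, 100_000}) {
            double seekMs = averageMillis(store, allKeys, keysToRead, 5, false, rng);
            double scanMs = averageMillis(store, allKeys, keysToRead, 5, true, rng);
            System.out.printf("keysToRead=%d seek=%.2f ms scan=%.2f ms%n", keysToRead, seekMs, scanMs);
        }
    }
}
```

The sample is taken as a view of the shuffled list rather than by truncating `keyList` itself, so every iteration draws from the full key set. A real run would replace the two lookup methods with HFile reader calls and would likely need warm-up iterations before timing.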
--
This message was sent by Atlassian Jira
(v8.3.4#803005)