Posted to commits@hudi.apache.org by "Vinoth Chandar (Jira)" <ji...@apache.org> on 2020/04/10 14:37:00 UTC

[jira] [Commented] (HUDI-432) Benchmark HFile for scan vs seek

    [ https://issues.apache.org/jira/browse/HUDI-432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080532#comment-17080532 ] 

Vinoth Chandar commented on HUDI-432:
-------------------------------------

On S3, we should also consider the random-read optimizations: https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.5/bk_cloud-data-access/content/s3-get-requests.html

> Benchmark HFile for scan vs seek
> --------------------------------
>
>                 Key: HUDI-432
>                 URL: https://issues.apache.org/jira/browse/HUDI-432
>             Project: Apache Hudi (incubating)
>          Issue Type: Sub-task
>          Components: Performance, Storage Management
>            Reporter: sivabalan narayanan
>            Assignee: sivabalan narayanan
>            Priority: Major
>             Fix For: 0.6.0
>
>         Attachments: HFile benchmark.xlsx, HFile benchmark_withS3.xlsx, Screen Shot 2020-01-03 at 6.44.25 PM.png, Screen Shot 2020-03-09 at 12.22.54 AM.png
>
>
> We want to benchmark HFile scan vs. seek, since we intend to use HFile for record indexing. The HFile will be stored inline in the Hudi log for indexing purposes.
> As part of the benchmark, we want to find the point at which scan outperforms seek.
> This is our experiment setup:
> keysToRead = number of keys to look up // varies across experiment runs: 100k, 200k, 500k, 1M
> N = number of iterations
>  
> {code:java}
> 1M entries were written to a single HFile as key-value pairs.
> The keys were also stored in a separate file (key_file).
> keyList = read all keys from key_file
> for N iterations
> {
>     shuffle keyList
>     trim the list to keysToRead
>     start timer
>     HFile read benchmark (scan/seek)
>     end timer
> }
> average all captured timer values
> {code}
>  
>  
> Result:
> With optimized configs, scan starts to outperform seek at roughly 350k to 400k lookups out of 1M entries.
>   !Screen Shot 2020-01-03 at 6.44.25 PM.png!
> Results can be found here: [^HFile benchmark.xlsx]
> Source for benchmarking can be found here: 
> [https://github.com/nsivabalan/hudi/commit/94bef5ded3d70308e52b98e06b41e2cb999b5301]
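
For anyone who wants to try the loop from the quoted description without an HFile at hand, here is a minimal sketch of the same shuffle / trim / time / average procedure. It is not the benchmark source linked above: a sorted in-memory TreeMap stands in for the HFile (an assumption), and the class and method names (ScanVsSeekSketch, buildStore, scan, seek, benchmark) are hypothetical. Timings from it say nothing about real HFile or S3 behavior; it only illustrates the shape of the experiment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.TreeMap;

final class ScanVsSeekSketch {

    // Stand-in for the HFile: a sorted in-memory map (assumption, not a real HFile reader).
    static TreeMap<String, String> buildStore(int numEntries) {
        TreeMap<String, String> store = new TreeMap<>();
        for (int i = 0; i < numEntries; i++) {
            store.put(String.format("key-%07d", i), "value-" + i);
        }
        return store;
    }

    // "Scan": walk every entry once and count those whose key was requested.
    static int scan(TreeMap<String, String> store, Set<String> wanted) {
        int found = 0;
        for (Map.Entry<String, String> entry : store.entrySet()) {
            if (wanted.contains(entry.getKey())) {
                found++;
            }
        }
        return found;
    }

    // "Seek": one point lookup per requested key.
    static int seek(TreeMap<String, String> store, List<String> wanted) {
        int found = 0;
        for (String key : wanted) {
            if (store.get(key) != null) {
                found++;
            }
        }
        return found;
    }

    // Runs N iterations and returns {average scan nanos, average seek nanos}.
    static double[] benchmark(TreeMap<String, String> store, int keysToRead,
                              int iterations, long seed) {
        List<String> keyList = new ArrayList<>(store.keySet());
        Random random = new Random(seed);
        long scanTotal = 0;
        long seekTotal = 0;
        for (int i = 0; i < iterations; i++) {
            Collections.shuffle(keyList, random);                        // shuffle keyList
            List<String> subset = new ArrayList<>(keyList.subList(0, keysToRead)); // trim
            Set<String> subsetAsSet = new HashSet<>(subset);             // built outside the timer

            long t0 = System.nanoTime();
            int scanned = scan(store, subsetAsSet);
            long t1 = System.nanoTime();
            int sought = seek(store, subset);
            long t2 = System.nanoTime();

            if (scanned != keysToRead || sought != keysToRead) {
                throw new IllegalStateException("lookup missed keys");
            }
            scanTotal += t1 - t0;
            seekTotal += t2 - t1;
        }
        return new double[] {scanTotal / (double) iterations, seekTotal / (double) iterations};
    }

    public static void main(String[] args) {
        TreeMap<String, String> store = buildStore(100_000);
        for (int keysToRead : new int[] {10_000, 50_000, 100_000}) {
            double[] avg = benchmark(store, keysToRead, 5, 42L);
            System.out.printf("keysToRead=%d scan=%.2f ms seek=%.2f ms%n",
                    keysToRead, avg[0] / 1e6, avg[1] / 1e6);
        }
    }
}
```

The sizes in main are scaled down from the 1M entries used in the issue so the sketch runs quickly. Any crossover point it shows reflects TreeMap lookups in memory, not HFile block reads.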


