Posted to commits@hudi.apache.org by "Balajee Nagasubramaniam (Jira)" <ji...@apache.org> on 2019/11/15 19:02:00 UTC
[jira] [Updated] (HUDI-335) Improvements to DiskBasedMap
[ https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Balajee Nagasubramaniam updated HUDI-335:
-----------------------------------------
Attachment: Screen Shot 2019-11-11 at 1.22.44 PM.png
Screen Shot 2019-11-13 at 2.56.53 PM.png
> Improvements to DiskBasedMap
> ----------------------------
>
> Key: HUDI-335
> URL: https://issues.apache.org/jira/browse/HUDI-335
> Project: Apache Hudi (incubating)
> Issue Type: Improvement
> Components: Common Core
> Reporter: Balajee Nagasubramaniam
> Priority: Major
> Labels: Hoodie
> Fix For: 0.5.1
>
> Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 2019-11-13 at 2.56.53 PM.png
>
> Original Estimate: 504h
> Remaining Estimate: 504h
>
> DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a file while
> keeping only (K, fileMetadata) in memory, which reduces the in-memory footprint of records stored on disk.
> This change improves the performance of record get/read operations against the spill file by using
> a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement, spilling/writing 1 million records (record size ~350 bytes) to the file took about 104 seconds.
> After the improvement, the same operation completes in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records (size ~350 bytes) from the spill file took about 23 seconds. After the improvement, the same operation completes in under 4 seconds.
> {{Without read/write performance improvements (times appear to be in milliseconds, judging by the figures quoted above):
> RecordsHandled: 10000 totalTestTime: 3145 writeTime: 1176 readTime: 255
> RecordsHandled: 50000 totalTestTime: 5775 writeTime: 4187 readTime: 1175
> RecordsHandled: 100000 totalTestTime: 10570 writeTime: 7718 readTime: 2203
> RecordsHandled: 500000 totalTestTime: 59723 writeTime: 45618 readTime: 11093
> RecordsHandled: 1000000 totalTestTime: 120022 writeTime: 87918 readTime: 22355
> RecordsHandled: 2000000 totalTestTime: 258627 writeTime: 187185 readTime: 56431}}
> {{With write improvement:
> RecordsHandled: 10000 totalTestTime: 2013 writeTime: 700 readTime: 503
> RecordsHandled: 50000 totalTestTime: 2525 writeTime: 390 readTime: 1247
> RecordsHandled: 100000 totalTestTime: 3583 writeTime: 464 readTime: 2352
> RecordsHandled: 500000 totalTestTime: 22934 writeTime: 3731 readTime: 15778
> RecordsHandled: 1000000 totalTestTime: 42415 writeTime: 4816 readTime: 30332
> RecordsHandled: 2000000 totalTestTime: 74158 writeTime: 10192 readTime: 53195}}
> {{With read improvements:
> RecordsHandled: 10000 totalTestTime: 2473 writeTime: 1562 readTime: 87
> RecordsHandled: 50000 totalTestTime: 6169 writeTime: 5151 readTime: 438
> RecordsHandled: 100000 totalTestTime: 9967 writeTime: 8636 readTime: 252
> RecordsHandled: 500000 totalTestTime: 50889 writeTime: 46766 readTime: 1014
> RecordsHandled: 1000000 totalTestTime: 114482 writeTime: 104353 readTime: 3776
> RecordsHandled: 2000000 totalTestTime: 239251 writeTime: 219041 readTime: 8127}}
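
The layout described in the issue (values appended to a spill file, with only the key and the value's file metadata held in memory) can be illustrated with a minimal sketch. This is not Hudi's DiskBasedMap code: the class name SpillFileSketch, the ValueLocation type, the String key/value types, and the 64 KB buffer size are all hypothetical. It also buffers writes with a BufferedOutputStream and reads through a RandomAccessFile, which is an assumption made for simplicity and differs from the BufferedInputStream read path the issue mentions.

```java
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a disk-backed map; not the actual Hudi implementation. */
final class SpillFileSketch implements AutoCloseable {

    /** In-memory metadata kept per key: where the value sits in the spill file. */
    private static final class ValueLocation {
        final long offset;
        final int length;

        ValueLocation(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final Map<String, ValueLocation> index = new HashMap<>();
    private final OutputStream out;       // buffered, append-only writes
    private final RandomAccessFile in;    // positioned reads
    private long writePosition = 0L;

    SpillFileSketch(File file) throws IOException {
        // Assumed buffer size; batching small writes avoids one syscall per record.
        this.out = new BufferedOutputStream(new FileOutputStream(file), 64 * 1024);
        this.in = new RandomAccessFile(file, "r");
    }

    /** Appends the value to the file and remembers only its offset and length. */
    void put(String key, String value) throws IOException {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        // A repeated key simply points at the newest copy; old bytes stay on disk.
        index.put(key, new ValueLocation(writePosition, bytes.length));
        writePosition += bytes.length;
    }

    /** Reads the value back from the file, or returns null for an unknown key. */
    String get(String key) throws IOException {
        ValueLocation location = index.get(key);
        if (location == null) {
            return null;
        }
        out.flush(); // make buffered bytes visible to the reader
        byte[] bytes = new byte[location.length];
        in.seek(location.offset);
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Number of distinct keys tracked in memory. */
    int size() {
        return index.size();
    }

    @Override
    public void close() throws IOException {
        out.close();
        in.close();
    }
}
```

In this sketch the memory cost per entry is the key plus a fixed-size (offset, length) pair, regardless of how large the value is. Flushing before every read keeps the example correct but gives up some of the write buffering when reads and writes are interleaved; a real implementation would likely track whether unflushed data exists.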