Posted to commits@hudi.apache.org by "Balajee Nagasubramaniam (Jira)" <ji...@apache.org> on 2019/11/15 19:01:00 UTC
[jira] [Created] (HUDI-335) Improvements to DiskBasedMap
Balajee Nagasubramaniam created HUDI-335:
--------------------------------------------
Summary: Improvements to DiskBasedMap
Key: HUDI-335
URL: https://issues.apache.org/jira/browse/HUDI-335
Project: Apache Hudi (incubating)
Issue Type: Improvement
Components: Common Core
Reporter: Balajee Nagasubramaniam
Fix For: 0.5.1
DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a file while
keeping only (K, fileMetadata) in memory, which reduces the memory footprint of records spilled to disk.
This change improves the performance of record get/read operations from disk by using
a BufferedInputStream to cache the data.
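To illustrate the idea described above, here is a minimal sketch, not Hudi's actual DiskBasedMap code. The class SpillSketch, the ValueLocation type (standing in for "fileMetadata"), and the 64 KB buffer size are all hypothetical names and values chosen for illustration. It keeps only keys and value locations in memory, appends values to a spill file, and reads them back through a single BufferedInputStream so that many small reads are served from the buffer.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a disk-backed map; not the Hudi implementation. */
class SpillSketch {
    /** Where a value lives in the spill file (stands in for "fileMetadata"). */
    static final class ValueLocation {
        final long offset;
        final int length;
        ValueLocation(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final Path spillFile;
    // Only keys and locations stay in memory; iteration order matches offset order.
    private final Map<String, ValueLocation> index = new LinkedHashMap<>();
    private final DataOutputStream out;
    private long nextOffset = 0;

    SpillSketch(Path spillFile) throws IOException {
        this.spillFile = spillFile;
        this.out = new DataOutputStream(
            new BufferedOutputStream(new FileOutputStream(spillFile.toFile())));
    }

    /** Appends the value to the spill file and records its location in memory. */
    void put(String key, byte[] value) throws IOException {
        out.write(value);
        index.remove(key); // re-insert so an overwritten key moves to the end
        index.put(key, new ValueLocation(nextOffset, value.length));
        nextOffset += value.length;
    }

    /** Reads every live value back in offset order through one buffered stream. */
    Map<String, byte[]> readAll() throws IOException {
        out.flush(); // make buffered writes visible to the reader
        Map<String, byte[]> result = new LinkedHashMap<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(spillFile.toFile()), 64 * 1024))) {
            long position = 0;
            for (Map.Entry<String, ValueLocation> e : index.entrySet()) {
                ValueLocation loc = e.getValue();
                long toSkip = loc.offset - position; // bytes of overwritten values
                while (toSkip > 0) {
                    long skipped = in.skip(toSkip);
                    if (skipped <= 0) throw new EOFException("spill file truncated");
                    toSkip -= skipped;
                }
                byte[] value = new byte[loc.length];
                in.readFully(value);
                result.put(e.getKey(), value);
                position = loc.offset + loc.length;
            }
        }
        return result;
    }

    void close() throws IOException {
        out.close();
    }
}
```

A caller would create the sketch over a temporary file, call put for each record, and then call readAll. Because BufferedInputStream only moves forward, this sketch reads in offset order; serving random-order get calls would need a seekable reader instead, which is outside what this example shows.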
Results from the POC are promising. Before the write performance improvement, spilling/writing 1 million records (record size ~350 bytes) to the file took about 104 seconds.
After the improvement, the same operation completes in under 5 seconds.
Similarly, before the read performance improvement, reading 1 million records (size ~350 bytes) from the spill file took about 23 seconds. After the improvement, the same operation completes in under 4 seconds.
{{Without read/write performance improvements:
RecordsHandled: 10000 totalTestTime: 3145 writeTime: 1176 readTime: 255
RecordsHandled: 50000 totalTestTime: 5775 writeTime: 4187 readTime: 1175
RecordsHandled: 100000 totalTestTime: 10570 writeTime: 7718 readTime: 2203
RecordsHandled: 500000 totalTestTime: 59723 writeTime: 45618 readTime: 11093
RecordsHandled: 1000000 totalTestTime: 120022 writeTime: 87918 readTime: 22355
RecordsHandled: 2000000 totalTestTime: 258627 writeTime: 187185 readTime: 56431}}
{{With write improvement:
RecordsHandled: 10000 totalTestTime: 2013 writeTime: 700 readTime: 503
RecordsHandled: 50000 totalTestTime: 2525 writeTime: 390 readTime: 1247
RecordsHandled: 100000 totalTestTime: 3583 writeTime: 464 readTime: 2352
RecordsHandled: 500000 totalTestTime: 22934 writeTime: 3731 readTime: 15778
RecordsHandled: 1000000 totalTestTime: 42415 writeTime: 4816 readTime: 30332
RecordsHandled: 2000000 totalTestTime: 74158 writeTime: 10192 readTime: 53195}}
{{With read improvements:
RecordsHandled: 10000 totalTestTime: 2473 writeTime: 1562 readTime: 87
RecordsHandled: 50000 totalTestTime: 6169 writeTime: 5151 readTime: 438
RecordsHandled: 100000 totalTestTime: 9967 writeTime: 8636 readTime: 252
RecordsHandled: 500000 totalTestTime: 50889 writeTime: 46766 readTime: 1014
RecordsHandled: 1000000 totalTestTime: 114482 writeTime: 104353 readTime: 3776
RecordsHandled: 2000000 totalTestTime: 239251 writeTime: 219041 readTime: 8127}}
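As a quick check on the figures above, the speedup at 1,000,000 records can be computed directly from the logged values. The sketch below assumes the times are in milliseconds (consistent with the seconds quoted in the description) and uses the run without improvements as the baseline; the class and method names are hypothetical.

```java
/** Hypothetical helper for comparing the logged timings; values copied from the logs. */
class SpillSpeedup {
    // 1,000,000-record rows: baseline run vs. the run with the matching improvement.
    static final long BASELINE_WRITE = 87918;
    static final long IMPROVED_WRITE = 4816;
    static final long BASELINE_READ = 22355;
    static final long IMPROVED_READ = 3776;

    /** Ratio of baseline time to improved time; both inputs in the same unit. */
    static double speedup(long baseline, long improved) {
        return (double) baseline / improved;
    }

    static double writeSpeedup() {
        return speedup(BASELINE_WRITE, IMPROVED_WRITE);
    }

    static double readSpeedup() {
        return speedup(BASELINE_READ, IMPROVED_READ);
    }
}
```

With these inputs, writeSpeedup() comes out to roughly 18x and readSpeedup() to roughly 6x.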