Posted to commits@hudi.apache.org by "Balajee Nagasubramaniam (Jira)" <ji...@apache.org> on 2019/11/15 19:02:00 UTC

[jira] [Updated] (HUDI-335) Improvements to DiskBasedMap

     [ https://issues.apache.org/jira/browse/HUDI-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Balajee Nagasubramaniam updated HUDI-335:
-----------------------------------------
    Attachment: Screen Shot 2019-11-11 at 1.22.44 PM.png
                Screen Shot 2019-11-13 at 2.56.53 PM.png

> Improvements to DiskBasedMap
> ----------------------------
>
>                 Key: HUDI-335
>                 URL: https://issues.apache.org/jira/browse/HUDI-335
>             Project: Apache Hudi (incubating)
>          Issue Type: Improvement
>          Components: Common Core
>            Reporter: Balajee Nagasubramaniam
>            Priority: Major
>              Labels: Hoodie
>             Fix For: 0.5.1
>
>         Attachments: Screen Shot 2019-11-11 at 1.22.44 PM.png, Screen Shot 2019-11-13 at 2.56.53 PM.png
>
>   Original Estimate: 504h
>  Remaining Estimate: 504h
>
> DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a file
> while keeping only (K, fileMetadata) in memory, which reduces the memory footprint of each record.
> This change improves the performance of the record get/read operation from disk by using
> a BufferedInputStream to cache the data.
> Results from the POC are promising. Before the write performance improvement, spilling/writing 1 million records (record size ~350 bytes) to the file took about 104 seconds.
> After the improvement, the same operation can be performed in under 5 seconds.
> Similarly, before the read performance improvement, reading 1 million records (record size ~350 bytes) from the spill file took about 23 seconds. After the improvement, the same operation can be performed in under 4 seconds.
> Without read/write performance improvements (times in ms):
> RecordsHandled  totalTestTime  writeTime  readTime
>          10000           3145       1176       255
>          50000           5775       4187      1175
>         100000          10570       7718      2203
>         500000          59723      45618     11093
>        1000000         120022      87918     22355
>        2000000         258627     187185     56431
>
> With write improvement (times in ms):
> RecordsHandled  totalTestTime  writeTime  readTime
>          10000           2013        700       503
>          50000           2525        390      1247
>         100000           3583        464      2352
>         500000          22934       3731     15778
>        1000000          42415       4816     30332
>        2000000          74158      10192     53195
>
> With read improvements (times in ms):
> RecordsHandled  totalTestTime  writeTime  readTime
>          10000           2473       1562        87
>          50000           6169       5151       438
>         100000           9967       8636       252
>         500000          50889      46766      1014
>        1000000         114482     104353      3776
>        2000000         239251     219041      8127
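
The 1,000,000-record rows in the results above can be turned into speedup factors. The sketch below is only arithmetic on the reported timings. The class and method names are made up for illustration, and the timings are assumed to be milliseconds, since the 104353 writeTime lines up with the "about 104 seconds" quoted in the description. The "before" and "after" figures come from separate test runs, so the ratios are indicative only.

```java
/** Hypothetical helper for comparing two timings taken from the results above. */
final class SpeedupSketch {

    /** How many times faster the "after" timing is than the "before" timing. */
    static double speedup(long beforeMillis, long afterMillis) {
        if (beforeMillis < 0 || afterMillis <= 0) {
            throw new IllegalArgumentException("timings must be positive");
        }
        return (double) beforeMillis / afterMillis;
    }

    /** Average time spent per record, in microseconds. */
    static double microsPerRecord(long totalMillis, long records) {
        return totalMillis * 1000.0 / records;
    }
}
```

With the reported numbers, speedup(104353, 4816) is roughly 21.7 for writes and speedup(22355, 3776) is roughly 5.9 for reads.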


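The spill mechanism described in the issue (values written to a file, only the key and file metadata kept in memory, reads served through a BufferedInputStream) can be sketched as follows. This is a minimal illustration under stated assumptions, not Hudi's DiskBasedMap: the class name SpillFileSketch, the Location record, the String keys, the raw byte[] values, and the 64 KB buffer size are all hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not Hudi's DiskBasedMap. */
final class SpillFileSketch implements AutoCloseable {

    /** Per-key metadata held in memory: where the value sits in the spill file. */
    private static final class Location {
        final long offset;
        final int length;

        Location(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private static final int BUFFER_BYTES = 64 * 1024; // assumed buffer size

    private final Map<String, Location> index = new HashMap<>();
    private final BufferedOutputStream out;
    private final RandomAccessFile reader;
    private long nextOffset = 0L;

    SpillFileSketch(File spillFile) throws IOException {
        // Appends go through a buffer so small records do not each cost a disk write.
        this.out = new BufferedOutputStream(new FileOutputStream(spillFile), BUFFER_BYTES);
        this.reader = new RandomAccessFile(spillFile, "r");
    }

    /** Appends the value to the file and keeps only its location in memory. */
    void put(String key, byte[] value) throws IOException {
        out.write(value);
        index.put(key, new Location(nextOffset, value.length));
        nextOffset += value.length;
    }

    /** Returns the stored bytes for the key, or null if the key was never written. */
    byte[] get(String key) throws IOException {
        Location loc = index.get(key);
        if (loc == null) {
            return null;
        }
        out.flush(); // make pending buffered writes visible to the reader
        reader.seek(loc.offset);
        // The channel shares the file pointer with the RandomAccessFile, so the
        // buffered stream starts reading at the offset set by seek().
        DataInputStream in = new DataInputStream(
            new BufferedInputStream(Channels.newInputStream(reader.getChannel()), BUFFER_BYTES));
        byte[] value = new byte[loc.length];
        in.readFully(value);
        return value;
    }

    @Override
    public void close() throws IOException {
        out.close();
        reader.close();
    }
}
```

The in-memory map grows with the number of keys rather than with the size of the values, which is the point of spilling. Buffering pays off most when a record is deserialized with many small reads, because those reads are then served from memory instead of each going to the file.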

--
This message was sent by Atlassian Jira
(v8.3.4#803005)