Posted to commits@hudi.apache.org by "Balajee Nagasubramaniam (Jira)" <ji...@apache.org> on 2019/11/15 19:01:00 UTC

[jira] [Created] (HUDI-335) Improvements to DiskBasedMap

Balajee Nagasubramaniam created HUDI-335:
--------------------------------------------

             Summary: Improvements to DiskBasedMap
                 Key: HUDI-335
                 URL: https://issues.apache.org/jira/browse/HUDI-335
             Project: Apache Hudi (incubating)
          Issue Type: Improvement
          Components: Common Core
            Reporter: Balajee Nagasubramaniam
             Fix For: 0.5.1


DiskBasedMap is used by ExternalSpillableMap to write each (K, V) pair to a file
while keeping only (K, fileMetadata) in memory, which reduces the in-memory footprint of each record.
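
To make the layout concrete, here is a minimal sketch of the idea, not Hudi's actual DiskBasedMap. The class name SpillSketch, the ValueLocation type (standing in for "fileMetadata"), and the choice of String keys and values are all assumptions made for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: values live in a file, only key -> location stays in memory. */
class SpillSketch {
    /** In-memory metadata per key: where the value sits in the spill file. */
    static final class ValueLocation {
        final long offset;
        final int length;
        ValueLocation(long offset, int length) {
            this.offset = offset;
            this.length = length;
        }
    }

    private final RandomAccessFile file;
    private final Map<String, ValueLocation> index = new HashMap<>();

    SpillSketch(File spillFile) {
        try {
            this.file = new RandomAccessFile(spillFile, "rw");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Appends the value to the file and records its location under the key. */
    void put(String key, String value) {
        try {
            byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
            long offset = file.length();
            file.seek(offset);
            file.write(bytes);
            // An overwritten key simply points at the newest copy.
            index.put(key, new ValueLocation(offset, bytes.length));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Looks up the location in memory, then reads the value back from disk. */
    String get(String key) {
        ValueLocation loc = index.get(key);
        if (loc == null) {
            return null;
        }
        try {
            byte[] bytes = new byte[loc.length];
            file.seek(loc.offset);
            file.readFully(bytes);
            return new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Small round trip through a temporary spill file. */
    static String demo() {
        try {
            File f = File.createTempFile("spill", ".bin");
            f.deleteOnExit();
            SpillSketch map = new SpillSketch(f);
            map.put("k1", "first");
            map.put("k2", "second");
            map.put("k1", "updated");
            return map.get("k1") + "," + map.get("k2") + "," + map.get("missing");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch each get costs a seek plus a read, and each put costs a separate small write, which is the kind of per-record I/O the improvements described in this issue aim to reduce.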

This change improves the performance of record get/read operations from disk by using
a BufferedInputStream to cache the data.
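
As a rough illustration of the buffering idea, the sketch below reads length-prefixed records either directly from the file stream or through a BufferedInputStream. The record layout, the 64 KB buffer size, and the names BufferedReadSketch, writeRecords, and readRecords are assumptions for this example; the issue does not say how the buffer is wired into DiskBasedMap lookups, and the sketch uses a simple sequential scan.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of buffered vs. unbuffered record reads. */
class BufferedReadSketch {
    /** Writes records as [int length][UTF-8 payload] to a temporary file. */
    static Path writeRecords(List<String> records) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            for (String record : records) {
                byte[] payload = record.getBytes(StandardCharsets.UTF_8);
                out.writeInt(payload.length);
                out.write(payload);
            }
            out.flush();
            Path file = Files.createTempFile("spill", ".bin");
            file.toFile().deleteOnExit();
            Files.write(file, bytes.toByteArray());
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads `count` records; with buffered=true, small reads are served from memory. */
    static List<String> readRecords(Path file, int count, boolean buffered) {
        try (InputStream raw = Files.newInputStream(file);
             DataInputStream in = new DataInputStream(
                 buffered ? new BufferedInputStream(raw, 64 * 1024) : raw)) {
            List<String> result = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                // Without the buffer, each readInt/readFully may hit the file directly.
                byte[] payload = new byte[in.readInt()];
                in.readFully(payload);
                result.add(new String(payload, StandardCharsets.UTF_8));
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Both paths return the same records; the buffered path only changes how many underlying file reads are needed to produce them.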

Results from the POC are promising. Before the write performance improvement, spilling/writing 1 million records (record size ~350 bytes) to the file took about 104 seconds.
After the improvement, the same operation completes in under 5 seconds.

Similarly, before the read performance improvement, reading 1 million records (record size ~350 bytes) from the spill file took about 23 seconds. After the improvement, the same operation completes in under 4 seconds.

{{Without read/write performance improvements (times in ms):
RecordsHandled:	10000	totalTestTime:	3145	writeTime:	1176	readTime:	255
RecordsHandled:	50000	totalTestTime:	5775	writeTime:	4187	readTime:	1175
RecordsHandled:	100000	totalTestTime:	10570	writeTime:	7718	readTime:	2203
RecordsHandled:	500000	totalTestTime:	59723	writeTime:	45618	readTime:	11093
RecordsHandled:	1000000	totalTestTime:	120022	writeTime:	87918	readTime:	22355
RecordsHandled:	2000000	totalTestTime:	258627	writeTime:	187185	readTime:	56431}}

{{With write improvement:
RecordsHandled:	10000	totalTestTime:	2013	writeTime:	700	readTime:	503
RecordsHandled:	50000	totalTestTime:	2525	writeTime:	390	readTime:	1247
RecordsHandled:	100000	totalTestTime:	3583	writeTime:	464	readTime:	2352
RecordsHandled:	500000	totalTestTime:	22934	writeTime:	3731	readTime:	15778
RecordsHandled:	1000000	totalTestTime:	42415	writeTime:	4816	readTime:	30332
RecordsHandled:	2000000	totalTestTime:	74158	writeTime:	10192	readTime:	53195}}

{{With read improvements:
RecordsHandled:	10000	totalTestTime:	2473	writeTime:	1562	readTime:	87
RecordsHandled:	50000	totalTestTime:	6169	writeTime:	5151	readTime:	438
RecordsHandled:	100000	totalTestTime:	9967	writeTime:	8636	readTime:	252
RecordsHandled:	500000	totalTestTime:	50889	writeTime:	46766	readTime:	1014
RecordsHandled:	1000000	totalTestTime:	114482	writeTime:	104353	readTime:	3776
RecordsHandled:	2000000	totalTestTime:	239251	writeTime:	219041	readTime:	8127}}
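
For a quick sense of scale, the sketch below computes speedup ratios from the 1,000,000-record rows, assuming the times are in milliseconds (which matches the seconds quoted in the prose). It takes the write baseline from the first table (87918); the prose's "about 104 seconds" corresponds to the writeTime in the read-improvements table (104353) instead. The class and method names are hypothetical.

```java
/** Hypothetical helper: speedup ratios from the 1,000,000-record rows. */
class SpeedupSketch {
    /** Ratio of the time before an improvement to the time after it. */
    static double speedup(long beforeMs, long afterMs) {
        return (double) beforeMs / afterMs;
    }

    public static void main(String[] args) {
        // writeTime: baseline table vs. "With write improvement" table.
        System.out.printf("write speedup: %.1fx%n", speedup(87918, 4816));
        // readTime: baseline table vs. "With read improvements" table.
        System.out.printf("read speedup: %.1fx%n", speedup(22355, 3776));
    }
}
```

Under these assumptions the ratios come out to roughly 18x for writes and roughly 6x for reads.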
