Posted to dev@trafodion.apache.org by "Liu, Ming (Ming)" <mi...@esgyn.cn> on 2016/02/12 17:36:28 UTC

How does the bulkloader determine the max size of an HFile?

Hi, all,

I am trying to understand the Trafodion bulkloader better. One thing I noticed is that the bulkloader generates HFiles of 10 GB in the staging area and then incrementally adds them into the corresponding HBase regions. My question is: how is this 10 GB determined? Is there any way to change it to a smaller value?
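For comparison: when HFiles are produced through HBase's stock HFileOutputFormat2 path, the writer rolls to a new file once the current one reaches hbase.hregion.max.filesize, whose shipped default is 10 GB, which matches the number I am seeing. I do not know whether Trafodion's HFile writer reads the same property, but a minimal sketch of checking and overriding it, under that assumption, would be:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HConstants;

    public class ShowMaxHFileSize {
        public static void main(String[] args) {
            Configuration conf = HBaseConfiguration.create();
            // HFileOutputFormat2 rolls to a new HFile once the current one
            // reaches hbase.hregion.max.filesize; the shipped default is 10 GB.
            long maxSize = conf.getLong(HConstants.HREGION_MAX_FILESIZE,
                    HConstants.DEFAULT_MAX_FILE_SIZE);
            System.out.println("HFile size cap: " + maxSize + " bytes");
            // Lowering it to 500 MB before the load would shrink the staging
            // files in that code path; whether Trafodion's writer honors the
            // same property is exactly what I am asking.
            conf.setLong(HConstants.HREGION_MAX_FILESIZE, 500L * 1024 * 1024);
        }
    }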

The purpose is this: 10 GB is rather big, so I assume that is why the SORT operator needs to overflow to scratch files. I am wondering: if each HFile were, say, 500 MB, the SORT could be done entirely in RAM, avoiding the write amplification of writing 10 GB into scratch files and then reading the same data back out and writing it into HFiles; that is roughly three passes over the data (write scratch, read scratch, write HFile) instead of one. The scratch content is not compressed either, so the I/O cost is considerable. Wouldn't avoiding the scratch files improve the overall bulkloading speed?

Although the bulkload speed is not bad compared to HBase's importtsv utility in the tests I have run these past few days (Trafodion is up to 3x faster than importtsv), I am wondering if there is still room to improve. So if this 10 GB can be changed, I can run some more tests; that would be very helpful.

Of course, that would create a bunch of small HFiles, and HBase would then need to do a major compaction, so this just postpones the dirty work. But I want to try it out and see if it helps the loading.
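If it does help, the compaction could even be kicked off explicitly right after the load finishes. A minimal sketch with the HBase client Admin API (the table name is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Admin;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;

    public class CompactAfterLoad {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Admin admin = conn.getAdmin()) {
                // Asynchronously requests a major compaction, folding the many
                // small bulk-loaded HFiles back into one file per store.
                admin.majorCompact(TableName.valueOf("TRAFODION.SCH.T"));
            }
        }
    }

majorCompact() only queues the request, so the region servers would do the actual work in the background after the load returns.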

So if there is a CQD I can try, that would be super.
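Purely to illustrate what I am hoping for (the CQD attribute name below is made up; it is exactly the knob I am asking about), something like this through the T4 JDBC driver:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class LoadWithSmallerHFiles {
        public static void main(String[] args) throws Exception {
            Class.forName("org.trafodion.jdbc.t4.T4Driver");
            // Host, port, and credentials are placeholders.
            String url = "jdbc:t4jdbc://myhost:23400/:";
            try (Connection conn = DriverManager.getConnection(url, "user", "pw");
                 Statement stmt = conn.createStatement()) {
                // The attribute name below is invented -- it is the CQD I am
                // asking about. CQDs are session-scoped, so it must be issued
                // on the same connection as the LOAD.
                stmt.execute("CONTROL QUERY DEFAULT SOME_MAX_HFILE_SIZE '500'");
                stmt.execute("LOAD INTO trafodion.sch.t SELECT * FROM hive.hive.src");
            }
        }
    }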

Thanks,
Ming