Posted to dev@iceberg.apache.org by "timmycheng (程力)" <ti...@tencent.com> on 2019/08/12 02:11:21 UTC

Re: Encouraging performance results for Vectorized Iceberg code(Internet mail)

Thanks for broadcasting! Just have a few questions to better understand the awesome work.

Could you give a little more detail on the Score and Error columns? Does Error mean every time the query hits a null?
Shall I assume 5k/10k means the number of rows? What do we learn from comparing to IcebergSourceFlatParquetDataReadBenchmark.readIceberg? Or rather, what numbers are we comparing against?

-Li

From: Anjali Norwood <an...@netflix.com>
Reply-To: "dev@iceberg.apache.org" <de...@iceberg.apache.org>
Date: Saturday, August 10, 2019, 4:47 AM
To: Ryan Blue <rb...@netflix.com>, "dev@iceberg.apache.org" <de...@iceberg.apache.org>
Cc: Gautam <ga...@gmail.com>, "ppadma@apache.org" <pp...@apache.org>, Samarth Jain <sj...@netflix.com>, Daniel Weeks <dw...@netflix.com>
Subject: Re: Encouraging performance results for Vectorized Iceberg code(Internet mail)

Good suggestion Ryan. Added dev@iceberg now.

Dev: Please see the early vectorized Iceberg performance results a couple of emails down. This is WIP.

thanks,
Anjali.

On Thu, Aug 8, 2019 at 10:39 AM Ryan Blue <rb...@netflix.com> wrote:
Hi everyone,

Is it possible to copy the Iceberg dev list when sending these emails? There are other people in the community that are interested, like Palantir. If there isn't anything sensitive then let's try to be more inclusive. Thanks!

rb

On Wed, Aug 7, 2019 at 10:34 PM Anjali Norwood <an...@netflix.com> wrote:
Hi Gautam, Padma,
We wanted to update you before Gautam takes off for vacation.

Samarth and I profiled the code and found the following:
Profiling the IcebergSourceFlatParquetDataReadBenchmark (10 files, 10M rows, a single long column) with VisualVM shows two places where CPU time can be optimized:
1) Iterator abstractions (triple iterators, page iterators, etc.) take up quite a bit of time. Eliminating these iterators, or making them 'batched' iterators and moving the data reads closer to the file, should help ameliorate this problem.
2) The current code goes back and forth between definition-level reads and value reads through the layers of iterators, and quite a bit of CPU time is spent there. Reading a batch of primitive values at once after consulting the definition level should help improve performance.

So we prototyped code that walks over the definition levels and reads the corresponding values in batches (read values until we hit a null, then read nulls until we hit values, and so on) and made the iterators batched iterators. Here are the results:

Benchmark                                                              Mode  Cnt   Score   Error  Units
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceNonVectorized    ss    5  10.247 ± 0.202   s/op
IcebergSourceFlatParquetDataReadBenchmark.readFileSourceVectorized       ss    5   3.747 ± 0.206   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIceberg                    ss    5  11.286 ± 0.457   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized100k      ss    5   6.088 ± 0.324   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized10k       ss    5   5.875 ± 0.378   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized1k        ss    5   6.029 ± 0.387   s/op
IcebergSourceFlatParquetDataReadBenchmark.readIcebergVectorized5k        ss    5   6.106 ± 0.497   s/op
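The run-length walk described above (read values until we hit a null, then nulls until we hit a value, and so on) can be sketched roughly like this. This is an illustrative toy, not the actual Iceberg reader code; the class name, method shape, and array layout are invented for the example:

```java
// Illustrative sketch only: walk a column's definition levels and copy
// values in runs (a run of non-nulls, then a run of nulls, and so on)
// instead of branching per value. Not the actual Iceberg reader API.
public class BatchedDefLevelSketch {

  // defLevels[i] == maxDefLevel means row i holds a value; lower means null.
  // values holds only the non-null entries, in order (as in Parquet).
  public static Long[] readBatch(int[] defLevels, long[] values, int maxDefLevel) {
    Long[] out = new Long[defLevels.length];
    int valueIdx = 0;
    int i = 0;
    while (i < defLevels.length) {
      if (defLevels[i] == maxDefLevel) {
        // run of non-null values: copy them in one tight loop
        int start = i;
        while (i < defLevels.length && defLevels[i] == maxDefLevel) {
          i++;
        }
        for (int j = start; j < i; j++) {
          out[j] = values[valueIdx++];
        }
      } else {
        // run of nulls: advance without touching the value stream
        while (i < defLevels.length && defLevels[i] != maxDefLevel) {
          out[i++] = null;
        }
      }
    }
    return out;
  }
}
```

The point of the run-length structure is that the per-value branch on the definition level is replaced by two inner loops that each do one predictable thing, which is what makes batching cheaper than the triple-iterator walk.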


Moreover, as I mentioned to Gautam on chat, we prototyped reading the string column as a byte array without decoding it into UTF-8 (the above changes were not made at the time) and saw a significant performance improvement there (21.18 s before vs. 13.031 s with the change). Combined with batched iterators, these numbers should get better.
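The idea of skipping eager UTF-8 decoding can be sketched as a toy wrapper that keeps the raw bytes and decodes only on demand (Spark's UTF8String takes a similar raw-bytes approach). The class here is hypothetical, not Iceberg code:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustration only: surface a Parquet binary value as its raw bytes and
// defer UTF-8 decoding until a java.lang.String is actually needed.
public class LazyUtf8Sketch {
  private final byte[] raw;
  private String decoded; // decoded at most once, on demand

  public LazyUtf8Sketch(byte[] raw) {
    this.raw = raw;
  }

  public byte[] bytes() {
    return raw; // cheap: no decoding on the read path
  }

  // Equality can work on raw bytes, skipping decoding entirely.
  public boolean bytesEqual(LazyUtf8Sketch other) {
    return Arrays.equals(raw, other.raw);
  }

  @Override
  public String toString() {
    if (decoded == null) {
      decoded = new String(raw, StandardCharsets.UTF_8);
    }
    return decoded;
  }
}
```

A scan that only filters or shuffles string columns never pays the decode cost under this scheme, which is consistent with the gap measured above.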

Note that we haven't tightened/profiled the new code yet (we will start on that next). Just wanted to share some early positive results.

regards,
Anjali.



--
Ryan Blue
Software Engineer
Netflix

Re: Encouraging performance results for Vectorized Iceberg code(Internet mail)

Posted by Anjali Norwood <an...@netflix.com.INVALID>.
It is merged to the 'vectorized-read' branch now. Thanks Ryan.

-Anjali.


Re: Encouraging performance results for Vectorized Iceberg code(Internet mail)

Posted by Anjali Norwood <an...@netflix.com.INVALID>.
Hi Padma, Gautam, All,

Our (Samarth's and mine) WIP vectorized code is here:
https://github.com/anjalinorwood/incubator-iceberg/pull/1.
Dan, can you please merge it to the 'vectorized-read' branch when you get a
chance? Thanks!

regards,
Anjali.





Re: Encouraging performance results for Vectorized Iceberg code(Internet mail)

Posted by Ryan Blue <rb...@netflix.com.INVALID>.
Li,

You're right that the 10k and similar numbers indicate the batch size.

Scores can be interpreted using the "units" column at the end. In this
case, seconds per operation, so lower is better.

Error is the measurement error. It indicates confidence that the actual
rate of execution is, for example, within 0.378 of the average 5.875
seconds per operation, so between roughly 5.50 and 6.25 seconds per op.
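In other words, the reported interval is simply score ± error. A trivial check of the arithmetic above (the class name is invented for the example):

```java
// Lower/upper bound implied by a JMH "Score ± Error" pair,
// e.g. 5.875 ± 0.378 s/op -> roughly [5.497, 6.253] s/op.
public class ErrorInterval {
  public static double[] interval(double score, double error) {
    return new double[] {score - error, score + error};
  }
}
```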


-- 
Ryan Blue
Software Engineer
Netflix