Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2019/12/17 15:37:05 UTC
[GitHub] [flink] shuttie edited a comment on issue #10529: [FLINK-15171] [serialization] fix performance regression caused by too many buffer allocations on string serialization
URL: https://github.com/apache/flink/pull/10529#issuecomment-566545321
@pnowojski @rkhachatryan So it looks like I was able to figure out what's going on with this performance regression. To reproduce the issue I had to write yet another benchmark mimicking the one in `SerializationFrameworkMiniBenchmarks.serializerTuple`, but without the job setup overhead (which may take almost 50% of the benchmark runtime). This benchmark just round-trips a `Tuple8` to a byte array and back, using different serializer implementations (I will make a PR to `flink-benchmarks` later).
```
Benchmark       Mode   Cnt    Score    Error   Units   Variant
roundtripTuple  thrpt   50  623.194  ± 4.601  ops/ms  pre-FLINK-14346
roundtripTuple  thrpt   50  590.663  ± 3.570  ops/ms  FLINK-14346
```
Compared to the original pre-1.10 implementation, we do much more work in the new version:
1. compute exact buffer size
2. allocate the buffer itself
3. encode chars to the buffer (and not to the stream directly)
4. flush the buffer to the stream
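The four steps above can be sketched roughly as follows (hypothetical class and method names, and a simplified encoding; this is not Flink's actual `StringValue` format, just an illustration of the shape of the work):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ExactSizeWriter {
    // Step 1: compute the exact encoded size by scanning the string once.
    // Surrogates are counted as 3 bytes each (CESU-8 style) for brevity.
    static int utf8Length(String s) {
        int len = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) len += 1;
            else if (c < 0x800) len += 2;
            else len += 3;
        }
        return len;
    }

    static void writeString(String s, OutputStream out) throws IOException {
        int size = utf8Length(s);          // 1. compute exact buffer size
        byte[] buffer = new byte[size];    // 2. allocate the buffer
        int pos = 0;
        for (int i = 0; i < s.length(); i++) {  // 3. encode chars into the buffer
            char c = s.charAt(i);
            if (c < 0x80) {
                buffer[pos++] = (byte) c;
            } else if (c < 0x800) {
                buffer[pos++] = (byte) (0xC0 | (c >> 6));
                buffer[pos++] = (byte) (0x80 | (c & 0x3F));
            } else {
                buffer[pos++] = (byte) (0xE0 | (c >> 12));
                buffer[pos++] = (byte) (0x80 | ((c >> 6) & 0x3F));
                buffer[pos++] = (byte) (0x80 | (c & 0x3F));
            }
        }
        out.write(buffer, 0, pos);         // 4. flush the buffer to the stream
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString("héllo", out);
        System.out.println(out.toByteArray().length); // prints 6 (1 + 2 + 1 + 1 + 1)
    }
}
```

Note that step 1 costs a full extra pass over the string before any bytes are written.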
As @pnowojski suggested, there is no need to compute the exact buffer size in step 1; we can allocate a small buffer and flush it whenever it is exhausted. Also, since this small buffer has a fixed size, a simple idea was to allocate it once in a ThreadLocal and reuse it later:
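A minimal sketch of the ThreadLocal variant (hypothetical names, ASCII-only for brevity, not the actual Flink code): one small per-thread buffer, flushed whenever it fills up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class ThreadLocalBufferWriter {
    private static final int BUFFER_SIZE = 16;

    // Allocated once per thread and reused across serialization calls.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[BUFFER_SIZE]);

    static void writeAscii(String s, OutputStream out) throws IOException {
        byte[] buffer = BUFFER.get(); // the per-call ThreadLocal lookup
        int pos = 0;
        for (int i = 0; i < s.length(); i++) {
            if (pos == buffer.length) {   // buffer exhausted: flush and refill
                out.write(buffer, 0, pos);
                pos = 0;
            }
            buffer[pos++] = (byte) s.charAt(i); // ASCII-only for brevity
        }
        out.write(buffer, 0, pos);        // flush the tail
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeAscii("a string longer than sixteen bytes", out);
        System.out.println(out.toString()); // prints the input string
    }
}
```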
```
Benchmark       Mode   Cnt    Score    Error   Units   Variant
roundtripTuple  thrpt   50  623.194  ± 4.601  ops/ms  pre-FLINK-14346
roundtripTuple  thrpt   50  590.663  ± 3.570  ops/ms  FLINK-14346
roundtripTuple  thrpt   50  613.878  ± 2.498  ops/ms  FLINK-15171 with ThreadLocal
```
The results were a bit better: we no longer scanned the string twice, and we significantly reduced the number of allocations. But the results were still not as good as before. Surprisingly, the ThreadLocal manipulations took most of the time:
![image](https://user-images.githubusercontent.com/999061/70998376-2340f480-20cf-11ea-8094-6ff41833286f.png)
Then we ran a test without the ThreadLocal, using a plain fixed-size buffer of 8 and then 16 bytes:
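The fixed-size variant can be sketched like this (hypothetical names, ASCII-only, not the actual Flink code); the only difference from the ThreadLocal version is that the buffer is a local array with a compile-time-constant size, allocated on every call:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

public class FixedBufferWriter {
    static void writeAscii(String s, OutputStream out) throws IOException {
        // A 16-byte local buffer with a constant size; if the JIT proves the
        // array does not escape, it may elide the heap allocation entirely.
        byte[] buffer = new byte[16];
        int pos = 0;
        for (int i = 0; i < s.length(); i++) {
            if (pos == buffer.length) { // exhausted: flush and refill
                out.write(buffer, 0, pos);
                pos = 0;
            }
            buffer[pos++] = (byte) s.charAt(i); // ASCII-only for brevity
        }
        out.write(buffer, 0, pos);      // flush the tail
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeAscii("hello, fixed-size buffer", out);
        System.out.println(out.toString()); // prints the input string
    }
}
```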
```
Benchmark       Mode   Cnt    Score    Error   Units   Variant
roundtripTuple  thrpt   50  623.194  ± 4.601  ops/ms  pre-FLINK-14346
roundtripTuple  thrpt   50  590.663  ± 3.570  ops/ms  FLINK-14346
roundtripTuple  thrpt   50  613.878  ± 2.498  ops/ms  FLINK-15171 with ThreadLocal
roundtripTuple  thrpt   50  622.679  ± 4.347  ops/ms  FLINK-15171 buffer[8]
roundtripTuple  thrpt   50  631.729  ± 4.937  ops/ms  FLINK-15171 buffer[16]
```
This improved the situation quite dramatically. We also ran a set of benchmarks to observe the impact of buffer size on read/write performance for strings of different lengths: usually, the larger the buffer, the better the performance.
```
[info] Benchmark            (bufferSize)  (length)  (stringType)  Mode  Cnt    Score     Error  Units
[info] deserializeImproved             8         1         ascii  avgt   10   50.854  ±  1.187  ns/op
[info] deserializeImproved             8         4         ascii  avgt   10   51.080  ±  1.235  ns/op
[info] deserializeImproved             8        16         ascii  avgt   10   67.324  ±  2.230  ns/op
[info] deserializeImproved             8       256         ascii  avgt   10  568.002  ± 16.895  ns/op
[info] deserializeImproved            16         1         ascii  avgt   10   51.080  ±  1.846  ns/op
[info] deserializeImproved            16         4         ascii  avgt   10   51.044  ±  1.862  ns/op
[info] deserializeImproved            16        16         ascii  avgt   10   55.298  ±  0.419  ns/op
[info] deserializeImproved            16       256         ascii  avgt   10  400.586  ±  1.440  ns/op
[info] deserializeImproved            32         1         ascii  avgt   10   51.196  ±  0.443  ns/op
[info] deserializeImproved            32         4         ascii  avgt   10   51.587  ±  0.881  ns/op
[info] deserializeImproved            32        16         ascii  avgt   10   56.025  ±  0.159  ns/op
[info] deserializeImproved            32       256         ascii  avgt   10  335.878  ±  1.231  ns/op
[info] serializeImproved               8         1         ascii  avgt   50   30.823  ±  1.189  ns/op
[info] serializeImproved               8         4         ascii  avgt   50   31.606  ±  0.479  ns/op
[info] serializeImproved               8        16         ascii  avgt   50   87.594  ±  0.700  ns/op
[info] serializeImproved               8       256         ascii  avgt   50  861.818  ±  3.382  ns/op
[info] serializeImproved              16         1         ascii  avgt   50   30.295  ±  0.429  ns/op
[info] serializeImproved              16         4         ascii  avgt   50   32.123  ±  0.406  ns/op
[info] serializeImproved              16        16         ascii  avgt   50   70.481  ±  0.830  ns/op
[info] serializeImproved              16       256         ascii  avgt   50  522.778  ±  3.020  ns/op
[info] serializeImproved              32         1         ascii  avgt   50   30.973  ±  0.284  ns/op
[info] serializeImproved              32         4         ascii  avgt   50   32.353  ±  0.313  ns/op
[info] serializeImproved              32        16         ascii  avgt   50   38.090  ±  0.383  ns/op
[info] serializeImproved              32       256         ascii  avgt   50  418.664  ±  4.335  ns/op
```
But allocating a large buffer for a short string seems a bit wasteful, so we tried a flexible buffer-sizing implementation (like `min(32, max(8, 1 + strlen))`), and surprisingly the results degraded quite significantly:
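The sizing rule above can be sketched as (hypothetical method name; note that the resulting allocation size is no longer a compile-time constant):

```java
public class DynamicBufferSize {
    // Clamp (1 + strlen) into the range [8, 32].
    static int bufferSize(String s) {
        return Math.min(32, Math.max(8, 1 + s.length()));
    }

    public static void main(String[] args) {
        System.out.println(bufferSize(""));                     // prints 8  (floor for short strings)
        System.out.println(bufferSize("twenty--characters--")); // prints 21 (1 + 20)
        System.out.println(bufferSize("x".repeat(100)));        // prints 32 (capped at the ceiling)
    }
}
```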
```
Benchmark       Mode   Cnt    Score    Error   Units   Variant
roundtripTuple  thrpt   50  623.194  ± 4.601  ops/ms  pre-FLINK-14346
roundtripTuple  thrpt   50  590.663  ± 3.570  ops/ms  FLINK-14346
roundtripTuple  thrpt   50  613.878  ± 2.498  ops/ms  FLINK-15171 with ThreadLocal
roundtripTuple  thrpt   50  622.679  ± 4.347  ops/ms  FLINK-15171 buffer[8]
roundtripTuple  thrpt   50  631.729  ± 4.937  ops/ms  FLINK-15171 buffer[16]
roundtripTuple  thrpt   50  547.097  ± 2.687  ops/ms  FLINK-15171 buffer[dynamic]
```
If you compare the perfasm profiler output for the buffer[16] and buffer[dynamic] variants, you will notice that when a small byte array is allocated with a size known at JIT compile time, the JVM can apply scalar replacement: it skips the heap allocation and places the 16 bytes right on the stack.
When the buffer size is dynamic, the allocation always goes to the heap, with a significant performance penalty.
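A minimal functional illustration of the two allocation shapes (not a benchmark; whether the JIT really elides the first allocation depends on escape analysis and inlining, so the comments below are hedged, and all names are hypothetical):

```java
public class AllocationShapes {
    // Constant-size local array: a candidate for scalar replacement / stack
    // allocation when the JIT proves it never escapes the compiled scope.
    static int checksumFixed(byte[] input) {
        byte[] buffer = new byte[16];
        int sum = 0;
        for (int i = 0; i < input.length; i++) {
            buffer[i % 16] = input[i];
            sum += buffer[i % 16];
        }
        return sum;
    }

    // Size only known at run time: the allocation stays on the heap.
    static int checksumDynamic(byte[] input, int bufferSize) {
        byte[] buffer = new byte[bufferSize];
        int sum = 0;
        for (int i = 0; i < input.length; i++) {
            buffer[i % bufferSize] = input[i];
            sum += buffer[i % bufferSize];
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] data = "some input bytes".getBytes();
        // Functionally identical; only the allocation behavior differs.
        System.out.println(checksumFixed(data) == checksumDynamic(data, 16)); // prints true
    }
}
```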
As for the increased number of char[] allocations, it appears to be an artifact of the benchmarking process itself: as throughput goes up, the serializer produces more Strings, which in turn increases the number of char[] arrays allocated.
So, to sum up the current status of this PR:
1. ~The current version of the regression fix is ready for review.~ Not ready yet: I need to fix the issue with `HybridMemorySegment`, see the next comment.
2. The `flink-benchmarks` PR with the narrower reproducer used here for the performance regression will be created tomorrow.
3. I would also like to make a PR to `flink-benchmarks` that extracts all the Flink job setup code in `SerializationFrameworkMiniBenchmarks` out of the main benchmark code, so the results will be much more reproducible and representative. Currently the job setup code is highly sensitive to system RAM throughput, which is why I was not able to see this regression: my RAM is ~15% faster than the benchmark machine's.