Posted to reviews@spark.apache.org by SongYadong <gi...@git.apache.org> on 2018/09/06 08:32:22 UTC
[GitHub] spark pull request #22348: Reduce unneeded operation in nextKeyValue process...
GitHub user SongYadong opened a pull request:
https://github.com/apache/spark/pull/22348
Reduce unneeded operation in nextKeyValue process of parquet vectorized record reader
## What changes were proposed in this pull request?
This PR does the following in the `VectorizedParquetRecordReader` class:
1. In `nextKeyValue()`, `resultBatch()` was called only to initialize the columnar batch if it was not yet initialized, not to return the batch. So we move the `initBatch()` call into `nextBatch()`.
2. In `nextBatch()`, we need not reset the column vectors on every call. When `rowsReturned >= totalRowCount` the method returns immediately, so the reset is wasted work. We therefore move `if (rowsReturned >= totalRowCount) return false;` before the column-vector reset for performance.
3. In `nextBatch()`, we need not call `checkEndOfRowGroup()` on every invocation. When `rowsReturned != totalCountLoadedSoFar`, `checkEndOfRowGroup()` does nothing but return, so we call it only when `rowsReturned == totalCountLoadedSoFar`, reducing function-call overhead.
4. In `checkEndOfRowGroup()`, we need not fetch the columns of `requestedSchema` on every call. We fetch them only on the first call and cache them for later use.
According to an analysis of Spark applications with the JMC tool, the Parquet vectorized record reader calls `nextKeyValue()` and its downstream functions very frequently, so optimizing this path is worthwhile.
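The restructured control flow described in items 1-3 above can be sketched, in simplified form, as follows. This is an illustrative toy class, not the actual `VectorizedParquetRecordReader`; the name `ToyBatchReader` and its fields are hypothetical stand-ins for Spark's real state.

```java
// Illustrative sketch only: a toy reader showing the restructured control
// flow (lazy batch init, early return, guarded checkEndOfRowGroup).
public class ToyBatchReader {
    private static final int BATCH_SIZE = 4;
    private boolean batchInitialized = false;
    private long rowsReturned = 0;
    private long totalCountLoadedSoFar = 0;
    private final long totalRowCount;

    public ToyBatchReader(long totalRowCount) {
        this.totalRowCount = totalRowCount;
    }

    private void initBatch() {
        // Stands in for allocating the columnar batch / column vectors.
        batchInitialized = true;
    }

    private void checkEndOfRowGroup() {
        // Stands in for loading the column readers of the next row group.
        totalCountLoadedSoFar =
            Math.min(totalCountLoadedSoFar + BATCH_SIZE * 2, totalRowCount);
    }

    public boolean nextBatch() {
        // Item 1: initialize lazily here instead of in nextKeyValue().
        if (!batchInitialized) initBatch();
        // Item 2: return before resetting column vectors when no rows remain.
        if (rowsReturned >= totalRowCount) return false;
        // Item 3: probe for a new row group only when the current one is exhausted.
        if (rowsReturned == totalCountLoadedSoFar) checkEndOfRowGroup();
        // ... reset column vectors and decode up to BATCH_SIZE rows here ...
        rowsReturned = Math.min(rowsReturned + BATCH_SIZE, totalCountLoadedSoFar);
        return true;
    }

    public boolean nextKeyValue() {
        // With the lazy init above, nextKeyValue() no longer calls resultBatch().
        return nextBatch();
    }
}
```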
## How was this patch tested?
1. Existing unit tests
2. A test running 4885 Spark applications.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/SongYadong/spark parquet_vectorized_read
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22348.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22348
----
commit db5799b3eed27103bdaf50ac10b1da6758987619
Author: SongYadong <so...@...>
Date: 2018-09-06T08:16:34Z
Reduce unneeded operation in nextKeyValue process of parquet vectorized record reader
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r215860694
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
I like this direction to reduce overhead. I confirmed that the current code also calls `initBatch()` before calling `nextKeyValue()`.
My question is whether this calling sequence is mandatory or optional. In the future, I wonder whether a developer might call this method through `hasNext()` before calling `initBatch()`.
How about moving the `if` statement in here, like the following? Would that also help the performance improvement?
```
if (columnarBatch == null) resultBatch();
```
I am curious how the ORC reader handles this.
cc @dongjoon-hyun @cloud-fan
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
Sounds reasonable. I will close this PR. Thank you!
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
Could I ask you to review this PR if you have the time? cc @dongjoon-hyun @cloud-fan
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22348
Can one of the admins verify this patch?
---
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r215870108
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
Thanks for your review.
The calling sequence is optional. `initBatch()` was moved to `nextBatch()`; if `hasNext()` is called before `initBatch()`, then `initBatch()` will eventually be called first of all inside `nextBatch()`:
```
public boolean nextBatch() throws IOException {
if (columnarBatch == null) initBatch();
```
ORC takes a much simpler approach:
```
public boolean nextKeyValue() throws IOException {
return nextBatch();
}
```
---
[GitHub] spark issue #22348: Reduce unneeded operation in nextKeyValue process of par...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22348
Can one of the admins verify this patch?
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
@dongjoon-hyun . You are right, the DataSourceReadBenchmark results show that the benefit is too small, and in some cases it is covered up by run-to-run fluctuation.
Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15 on Windows 7 6.1
Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
Before:
```
Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 330 / 334 47.7 21.0 1.0X
ParquetReader Vectorized -> Row 213 / 301 73.7 13.6 1.5X
```
After:
```
Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 292 / 366 53.8 18.6 1.0X
ParquetReader Vectorized -> Row 254 / 286 62.0 16.1 1.2X
```
Before:
```
Parquet Reader Single SMALLINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 391 / 425 40.2 24.9 1.0X
ParquetReader Vectorized -> Row 371 / 407 42.4 23.6 1.1X
```
After:
```
Parquet Reader Single SMALLINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 435 / 485 36.1 27.7 1.0X
ParquetReader Vectorized -> Row 398 / 440 39.5 25.3 1.1X
```
Before:
```
Parquet Reader Single INT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 453 / 516 34.7 28.8 1.0X
ParquetReader Vectorized -> Row 542 / 563 29.0 34.5 0.8X
```
After:
```
Parquet Reader Single INT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 533 / 602 29.5 33.9 1.0X
ParquetReader Vectorized -> Row 549 / 570 28.6 34.9 1.0X
```
Before:
```
Parquet Reader Single BIGINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 800 / 817 19.7 50.9 1.0X
ParquetReader Vectorized -> Row 530 / 686 29.7 33.7 1.5X
```
After:
```
Parquet Reader Single BIGINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 692 / 847 22.7 44.0 1.0X
ParquetReader Vectorized -> Row 580 / 610 27.1 36.9 1.2X
```
Before:
```
Parquet Reader Single FLOAT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 467 / 543 33.7 29.7 1.0X
ParquetReader Vectorized -> Row 457 / 507 34.4 29.1 1.0X
```
After:
```
Parquet Reader Single FLOAT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 584 / 600 26.9 37.1 1.0X
ParquetReader Vectorized -> Row 546 / 555 28.8 34.7 1.1X
```
Before:
```
Parquet Reader Single DOUBLE Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 801 / 808 19.6 50.9 1.0X
ParquetReader Vectorized -> Row 590 / 668 26.7 37.5 1.4X
```
After:
```
Parquet Reader Single DOUBLE Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 833 / 858 18.9 53.0 1.0X
ParquetReader Vectorized -> Row 672 / 722 23.4 42.7 1.2X
```
Before:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 2098 / 2263 5.0 200.1 13.3X
```
After:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 1943 / 2084 5.4 185.3 14.0X
```
Before:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 1930 / 1980 5.4 184.0 15.0X
```
After:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 1873 / 1875 5.6 178.6 16.7X
```
Before:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 301 / 356 34.9 28.7 83.1X
```
After:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 506 / 542 20.7 48.3 53.0X
```
---
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r216405506
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
Ah, I see. Thank you for the explanation.
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22348
@SongYadong . `0.6%` seems too small to justify this change. Do you have another benchmark result that shows the benefit more clearly?
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
In my test, the total execution time of 4885 Spark applications was:
original: 447213 seconds
after: 444584 seconds
time saved: about 0.6%
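As a quick sanity check on the arithmetic, using the two totals quoted above (the class name `SavingsCheck` is just an illustrative wrapper):

```java
public class SavingsCheck {
    public static void main(String[] args) {
        double before = 447213.0;  // seconds, original total from the test above
        double after = 444584.0;   // seconds, total after the patch
        double savedPct = (before - after) / before * 100.0;
        // Prints the relative saving, which rounds to roughly 0.59%.
        System.out.printf("time saved: %.2f%%%n", savedPct);
    }
}
```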
---
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong closed the pull request at:
https://github.com/apache/spark/pull/22348
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22348
Thank you for the experiment, @SongYadong . IMO, if the gain is not obvious, we had better keep the current, proven code, because the new version could introduce future regressions (or even bugs).
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22348
Can one of the admins verify this patch?
---