Posted to reviews@spark.apache.org by SongYadong <gi...@git.apache.org> on 2018/09/06 08:32:22 UTC
[GitHub] spark pull request #22348: Reduce unneeded operation in nextKeyValue process...
GitHub user SongYadong opened a pull request:
https://github.com/apache/spark/pull/22348
Reduce unneeded operation in nextKeyValue process of parquet vectorized record reader
## What changes were proposed in this pull request?
This PR does the following in the `VectorizedParquetRecordReader` class:
1. In `nextKeyValue()`, `resultBatch()` was called only to initialize the columnar batch if it was not yet initialized, not to return the batch. So we move the `initBatch()` call into `nextBatch()`.
2. In `nextBatch()`, we need not reset the column vectors on every call. When `rowsReturned >= totalRowCount` the method returns immediately, so the reset is wasted work. We therefore move `if (rowsReturned >= totalRowCount) return false;` before the column-vector reset for performance.
3. In `nextBatch()`, we need not call `checkEndOfRowGroup()` on every invocation. When `rowsReturned != totalCountLoadedSoFar`, `checkEndOfRowGroup()` does nothing but return, so we call it only when `rowsReturned == totalCountLoadedSoFar`, reducing function-call overhead.
4. In `checkEndOfRowGroup()`, we need not fetch the columns of `requestedSchema` on every call. We fetch them only on the first call and cache them for later use.
According to an analysis of Spark applications with the JMC tool, the Parquet vectorized record reader calls `nextKeyValue()` and its downstream functions very frequently, so optimizing this path is worthwhile.
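The restructured control flow described in items 1-3 above can be sketched, in simplified form, as follows. This is an illustrative toy class, not the actual `VectorizedParquetRecordReader`; the name `ToyBatchReader` and its fields are hypothetical stand-ins for Spark's real state.

```java
// Illustrative sketch only: a toy reader showing the restructured control
// flow (lazy batch init, early return, guarded checkEndOfRowGroup).
public class ToyBatchReader {
    private static final int BATCH_SIZE = 4;
    private boolean batchInitialized = false;
    private long rowsReturned = 0;
    private long totalCountLoadedSoFar = 0;
    private final long totalRowCount;

    public ToyBatchReader(long totalRowCount) {
        this.totalRowCount = totalRowCount;
    }

    private void initBatch() {
        // Stands in for allocating the columnar batch / column vectors.
        batchInitialized = true;
    }

    private void checkEndOfRowGroup() {
        // Stands in for loading the column readers of the next row group.
        totalCountLoadedSoFar =
            Math.min(totalCountLoadedSoFar + BATCH_SIZE * 2, totalRowCount);
    }

    public boolean nextBatch() {
        // Item 1: initialize lazily here instead of in nextKeyValue().
        if (!batchInitialized) initBatch();
        // Item 2: return before resetting column vectors when no rows remain.
        if (rowsReturned >= totalRowCount) return false;
        // Item 3: probe for a new row group only when the current one is exhausted.
        if (rowsReturned == totalCountLoadedSoFar) checkEndOfRowGroup();
        // ... reset column vectors and decode up to BATCH_SIZE rows here ...
        rowsReturned = Math.min(rowsReturned + BATCH_SIZE, totalCountLoadedSoFar);
        return true;
    }

    public boolean nextKeyValue() {
        // With the lazy init above, nextKeyValue() no longer calls resultBatch().
        return nextBatch();
    }
}
```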
## How was this patch tested?
1. Existing unit tests
2. A test running 4885 Spark applications.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/SongYadong/spark parquet_vectorized_read
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22348.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22348
----
commit db5799b3eed27103bdaf50ac10b1da6758987619
Author: SongYadong <so...@...>
Date: 2018-09-06T08:16:34Z
Reduce unneeded operation in nextKeyValue process of parquet vectorized record reader
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r215860694
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
I like this direction to reduce overhead. I confirmed that the current code also calls `initBatch()` before calling `nextKeyValue()`.
My question is whether this calling sequence is mandatory or optional. In the future, I wonder whether a developer might call this method through `hasNext()` before calling `initBatch()`.
How about moving the `if` statement in here, like the following? Would that also help the performance improvement?
```
if (columnarBatch == null) resultBatch();
```
I am curious how the ORC reader handles this.
cc @dongjoon-hyun @cloud-fan
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
Sounds reasonable. I will close this PR. Thank you!
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
Could I ask you to review this PR if you have the time? cc @dongjoon-hyun @cloud-fan
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22348
Can one of the admins verify this patch?
---
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r215870108
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
Thanks for your review.
The calling sequence is optional. `initBatch()` was moved to `nextBatch()`; if `hasNext()` is called before `initBatch()`, then `initBatch()` will eventually be called first of all inside `nextBatch()`:
```
public boolean nextBatch() throws IOException {
if (columnarBatch == null) initBatch();
```
ORC takes a much simpler approach:
```
public boolean nextKeyValue() throws IOException {
return nextBatch();
}
```
---
[GitHub] spark issue #22348: Reduce unneeded operation in nextKeyValue process of par...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22348
Can one of the admins verify this patch?
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
@dongjoon-hyun . You are right, the DataSourceReadBenchmark results show that the benefit is too small, and in some cases it is covered up by run-to-run fluctuation.
Java HotSpot(TM) 64-Bit Server VM 1.8.0_141-b15 on Windows 7 6.1
Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
Before:
```
Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 330 / 334 47.7 21.0 1.0X
ParquetReader Vectorized -> Row 213 / 301 73.7 13.6 1.5X
```
After:
```
Parquet Reader Single TINYINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 292 / 366 53.8 18.6 1.0X
ParquetReader Vectorized -> Row 254 / 286 62.0 16.1 1.2X
```
Before:
```
Parquet Reader Single SMALLINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 391 / 425 40.2 24.9 1.0X
ParquetReader Vectorized -> Row 371 / 407 42.4 23.6 1.1X
```
After:
```
Parquet Reader Single SMALLINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 435 / 485 36.1 27.7 1.0X
ParquetReader Vectorized -> Row 398 / 440 39.5 25.3 1.1X
```
Before:
```
Parquet Reader Single INT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 453 / 516 34.7 28.8 1.0X
ParquetReader Vectorized -> Row 542 / 563 29.0 34.5 0.8X
```
After:
```
Parquet Reader Single INT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 533 / 602 29.5 33.9 1.0X
ParquetReader Vectorized -> Row 549 / 570 28.6 34.9 1.0X
```
Before:
```
Parquet Reader Single BIGINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 800 / 817 19.7 50.9 1.0X
ParquetReader Vectorized -> Row 530 / 686 29.7 33.7 1.5X
```
After:
```
Parquet Reader Single BIGINT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 692 / 847 22.7 44.0 1.0X
ParquetReader Vectorized -> Row 580 / 610 27.1 36.9 1.2X
```
Before:
```
Parquet Reader Single FLOAT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 467 / 543 33.7 29.7 1.0X
ParquetReader Vectorized -> Row 457 / 507 34.4 29.1 1.0X
```
After:
```
Parquet Reader Single FLOAT Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 584 / 600 26.9 37.1 1.0X
ParquetReader Vectorized -> Row 546 / 555 28.8 34.7 1.1X
```
Before:
```
Parquet Reader Single DOUBLE Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 801 / 808 19.6 50.9 1.0X
ParquetReader Vectorized -> Row 590 / 668 26.7 37.5 1.4X
```
After:
```
Parquet Reader Single DOUBLE Column Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 833 / 858 18.9 53.0 1.0X
ParquetReader Vectorized -> Row 672 / 722 23.4 42.7 1.2X
```
Before:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 2098 / 2263 5.0 200.1 13.3X
```
After:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 1943 / 2084 5.4 185.3 14.0X
```
Before:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 1930 / 1980 5.4 184.0 15.0X
```
After:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 1873 / 1875 5.6 178.6 16.7X
```
Before:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 301 / 356 34.9 28.7 83.1X
```
After:
```
String with Nulls Scan: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
ParquetReader Vectorized 506 / 542 20.7 48.3 53.0X
```
---
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by kiszk <gi...@git.apache.org>.
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/22348#discussion_r216405506
--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java ---
@@ -154,8 +159,6 @@ public void close() throws IOException {
@Override
public boolean nextKeyValue() throws IOException {
- resultBatch();
--- End diff --
Ah, I see. Thank you for the explanation.
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22348
@SongYadong . `0.6%` seems too small to justify this change. Do you have another benchmark result that shows the benefit more clearly?
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong commented on the issue:
https://github.com/apache/spark/pull/22348
In my test, the total execution time of 4885 Spark applications was:
original: 447213 seconds
after: 444584 seconds
time saved: about 0.6%
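As a quick sanity check on the arithmetic, using the two totals quoted above (the class name `SavingsCheck` is just an illustrative wrapper):

```java
public class SavingsCheck {
    public static void main(String[] args) {
        double before = 447213.0;  // seconds, original total from the test above
        double after = 444584.0;   // seconds, total after the patch
        double savedPct = (before - after) / before * 100.0;
        // Prints the relative saving, which rounds to roughly 0.59%.
        System.out.printf("time saved: %.2f%%%n", savedPct);
    }
}
```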
---
[GitHub] spark pull request #22348: [SPARK-25354][SQL] Reduce unneeded operation in n...
Posted by SongYadong <gi...@git.apache.org>.
Github user SongYadong closed the pull request at:
https://github.com/apache/spark/pull/22348
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:
https://github.com/apache/spark/pull/22348
Thank you for the experiment, @SongYadong . IMO, if the gain is not obvious, we had better keep the current, proven code, because the new version could introduce future regressions (or even bugs).
---
[GitHub] spark issue #22348: [SPARK-25354][SQL] Reduce unneeded operation in nextKeyV...
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/22348
Can one of the admins verify this patch?
---