You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by gatorsmile <gi...@git.apache.org> on 2018/10/15 17:51:11 UTC

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

GitHub user gatorsmile opened a pull request:

    https://github.com/apache/spark/pull/22731

    [SPARK-25674][FOLLOW-UP] Update the stats for each ColumnarBatch

    ## What changes were proposed in this pull request?
    This PR is a follow-up of https://github.com/apache/spark/pull/22594 . This alternative can avoid the unneeded computation in the hot code path. 
    
    - For row-based scan, we keep the original way. 
    - For the columnar scan, we just need to update the stats after each batch.
    
    ## How was this patch tested?
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gatorsmile/spark udpateStatsFileScanRDD

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22731.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22731
    
----
commit 8731588d28f302e51095f7ed1a4331edc5233958
Author: gatorsmile <ga...@...>
Date:   2018-10-15T17:44:39Z

    fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22731#discussion_r225333576
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -106,15 +106,16 @@ class FileScanRDD(
             // don't need to run this `if` for every record.
             val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
    +          incTaskInputMetricsBytesRead()
    --- End diff --
    
    4096 is the default number of the batch reader in both ORC and Parquet. If the users set the conf to a much smaller number, they will face the perf regression due to the the extra overhead in many places. I do not think end users will do this. 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    I'm going to merge this back to 2.3, as I had merged the original change back to 2.3


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3998/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by gatorsmile <gi...@git.apache.org>.

Github user gatorsmile commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    cc @10110346 @srowen @cloud-fan 


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22731#discussion_r225262559
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -106,15 +106,16 @@ class FileScanRDD(
             // don't need to run this `if` for every record.
             val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
    +          incTaskInputMetricsBytesRead()
    --- End diff --
    
    I see, so always update when processing `ColumnarBatch`, but use the previous logic otherwise. That seems OK. It should still address the original problem.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/3997/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97405/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by cloud-fan <gi...@git.apache.org>.

Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    LGTM, merging to master!


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/97406/
    Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

Posted by 10110346 <gi...@git.apache.org>.

Github user 10110346 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22731#discussion_r225362470
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -106,15 +106,16 @@ class FileScanRDD(
             // don't need to run this `if` for every record.
             val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
    +          incTaskInputMetricsBytesRead()
    --- End diff --
    
    Considering that the default value of `"spark.sql.parquet.columnarReaderBatchSize` is 4096, this change is better .


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22731#discussion_r225325506
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -106,15 +106,16 @@ class FileScanRDD(
             // don't need to run this `if` for every record.
             val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
    +          incTaskInputMetricsBytesRead()
    --- End diff --
    
    ... I guess the only possible drawback is that if the number of records in a ColumnarBatch is pretty small, then this could cause it to update bytes read a lot more frequently than before. Bu if the number of records is large (>100) then this won't matter.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    **[Test build #97406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97406/testReport)** for PR 22731 at commit [`7c3fd54`](https://github.com/apache/spark/commit/7c3fd54feabd69320579626819e8083e025e6429).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

Posted by srowen <gi...@git.apache.org>.

Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22731#discussion_r225337806
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala ---
    @@ -106,15 +106,16 @@ class FileScanRDD(
             // don't need to run this `if` for every record.
             val preNumRecordsRead = inputMetrics.recordsRead
             if (nextElement.isInstanceOf[ColumnarBatch]) {
    +          incTaskInputMetricsBytesRead()
    --- End diff --
    
    Makes sense. In this case the behavior should be the same before and after this change, but it's therefore fine, too.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    **[Test build #97405 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97405/testReport)** for PR 22731 at commit [`8731588`](https://github.com/apache/spark/commit/8731588d28f302e51095f7ed1a4331edc5233958).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by AmplabJenkins <gi...@git.apache.org>.

Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    Merged build finished. Test PASSed.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark pull request #22731: [SPARK-25674][FOLLOW-UP] Update the stats for eac...

Posted by asfgit <gi...@git.apache.org>.

Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/22731


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    **[Test build #97405 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97405/testReport)** for PR 22731 at commit [`8731588`](https://github.com/apache/spark/commit/8731588d28f302e51095f7ed1a4331edc5233958).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] spark issue #22731: [SPARK-25674][FOLLOW-UP] Update the stats for each Colum...

Posted by SparkQA <gi...@git.apache.org>.

Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/22731
  
    **[Test build #97406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/97406/testReport)** for PR 22731 at commit [`7c3fd54`](https://github.com/apache/spark/commit/7c3fd54feabd69320579626819e8083e025e6429).


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org