Posted to reviews@spark.apache.org by maropu <gi...@git.apache.org> on 2018/06/24 03:14:46 UTC

[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

GitHub user maropu opened a pull request:

    https://github.com/apache/spark/pull/21625

    [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBenchmark benchmark results 

    ## What changes were proposed in this pull request?
    This PR corrects the default configuration (`spark.master=local[1]`) for benchmarks. It also updates the performance results, measured on an AWS `r3.xlarge` instance.
    
    ## How was this patch tested?
    N/A

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/maropu/spark FixDataSourceReadBenchmark

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21625.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21625
    
----
commit 23528200f833f236a83d6b891388b6ec698bcac7
Author: Takeshi Yamamuro <ya...@...>
Date:   2018-06-16T01:48:15Z

    Fix

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    **[Test build #92264 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92264/testReport)** for PR 21625 at commit [`2352820`](https://github.com/apache/spark/commit/23528200f833f236a83d6b891388b6ec698bcac7).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/460/
    Test PASSed.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    **[Test build #92294 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92294/testReport)** for PR 21625 at commit [`4e76ffd`](https://github.com/apache/spark/commit/4e76ffde10f175fe71c7a4db544941e2bfad1132).


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/21625


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197678386
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    @maropu, if the JIRA blocks this PR, please feel free to set the configuration to false and proceed. Technically, it looks like that's what the benchmark originally covered at the time it was merged. Setting it to true can be done separately in the JIRA you opened.


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197709175
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    Anyway, I updated the results by applying #21631


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197679212
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    @HyukjinKwon I'm fixing this now. But it seems this bug is similar to SPARK-24645. So, would it be better to merge this fix with SPARK-24645?


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197676313
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    I filed a JIRA: https://issues.apache.org/jira/browse/SPARK-24645


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197627610
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    Seems this part was missed in the update.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92294/
    Test PASSed.


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197652431
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -39,9 +39,11 @@ import org.apache.spark.util.{Benchmark, Utils}
     object DataSourceReadBenchmark {
       val conf = new SparkConf()
         .setAppName("DataSourceReadBenchmark")
    -    .setIfMissing("spark.master", "local[1]")
    +    // Since `spark.master` always exists, overrides this value
    +    .set("spark.master", "local[1]")
    --- End diff --
    
    Thank you for fixing this and updating the result, @maropu .
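    
    The diff above swaps `setIfMissing` for `set`. A minimal sketch of why that matters, assuming spark-core is on the classpath: `setIfMissing` is a no-op once the key is present, whereas `set` always overwrites. The object name `MasterOverrideSketch` and the pre-set value `local[4]` are hypothetical; the pre-set value only stands in for whatever master the launcher has already supplied (the diff comment says `spark.master` always exists).
    
    ```scala
    import org.apache.spark.SparkConf
    
    object MasterOverrideSketch {
      // loadDefaults = false keeps the sketch deterministic: no system properties are read.
      // The explicit set(...) imitates a launcher that has already supplied spark.master
      // (the value local[4] is an assumption made for illustration).
      def preset(): SparkConf =
        new SparkConf(false).set("spark.master", "local[4]")
    
      // setIfMissing leaves the existing value untouched.
      def withSetIfMissing(): String =
        preset().setIfMissing("spark.master", "local[1]").get("spark.master")
    
      // set overwrites the existing value.
      def withSet(): String =
        preset().set("spark.master", "local[1]").get("spark.master")
    
      def main(args: Array[String]): Unit = {
        println(withSetIfMissing()) // prints local[4]
        println(withSet())          // prints local[1]
      }
    }
    ```
    
    Under these assumptions, only the `set` variant guarantees that the benchmark runs with `local[1]`, regardless of how it was launched.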


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197628635
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    oh, thanks. I'll update soon.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    This pr is a follow-up of https://github.com/apache/spark/pull/21288#event-1697426905


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    **[Test build #92294 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92294/testReport)** for PR 21625 at commit [`4e76ffd`](https://github.com/apache/spark/commit/4e76ffde10f175fe71c7a4db544941e2bfad1132).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/92264/
    Test PASSed.


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197676056
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    Oh, I hit a bug in CSV parsing when updating this benchmark...
    ```
    scala> val dir = "/tmp/spark-csv/csv"
    scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
    18/06/25 13:12:51 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 5)
    java.lang.NullPointerException
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.org$apache$spark$sql$execution$datasources$csv$UnivocityParser$$convert(UnivocityParser.scala:197)  
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:190)
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
            at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:309)
            at org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:61)
            ...
    ```


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    **[Test build #92264 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/92264/testReport)** for PR 21625 at commit [`2352820`](https://github.com/apache/spark/commit/23528200f833f236a83d6b891388b6ec698bcac7).


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/441/
    Test PASSed.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by HyukjinKwon <gi...@git.apache.org>.
Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    LGTM too
    
    Merged to master.


---


[GitHub] spark pull request #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceRe...

Posted by maropu <gi...@git.apache.org>.
Github user maropu commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21625#discussion_r197679077
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/DataSourceReadBenchmark.scala ---
    @@ -573,32 +578,6 @@ object DataSourceReadBenchmark {
               }
             }
     
    -        /*
    -        Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
    -        Partitioned Table:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
    -        --------------------------------------------------------------------------------------------
    --- End diff --
    
    Yeah, I thought I would do so first, but I couldn't because I hit another bug when column pruning is disabled:
    ```
    ./bin/spark-shell --conf spark.sql.csv.parser.columnPruning.enabled=false
    scala> val dir = "/tmp/spark-csv/csv"
    scala> spark.range(10).selectExpr("id % 2 AS p", "id").write.mode("overwrite").partitionBy("p").csv(dir)
    scala> spark.read.csv(dir).selectExpr("sum(p)").collect()
    18/06/25 13:48:46 ERROR Executor: Exception in task 2.0 in stage 2.0 (TID 7)
    java.lang.ClassCastException: org.apache.spark.unsafe.types.UTF8String cannot be cast to java.lang.Integer
            at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:101)
            at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getInt(rows.scala:41)
            ...
    ```


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Merged build finished. Test PASSed.


---


[GitHub] spark issue #21625: [SPARK-24206][SQL][FOLLOW-UP] Update DataSourceReadBench...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21625
  
    Merged build finished. Test PASSed.


---
