Posted to reviews@spark.apache.org by dongjoon-hyun <gi...@git.apache.org> on 2016/07/04 07:36:52 UTC

[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

GitHub user dongjoon-hyun opened a pull request:

    https://github.com/apache/spark/pull/14044

    [SPARK-16360][SQL] Speed up SQL query performance by removing redundant analysis in `Dataset`

    ## What changes were proposed in this pull request?
    
    Currently, there are a few reports of Spark 2.0 query performance regressions for large queries.
    
    This PR speeds up SQL query processing by removing redundant consecutive analysis in the `Dataset.ofRows` function and in `Dataset` instantiation. Specifically, this PR aims to reduce the overhead of SQL query analysis, not query execution, so the result is not visible in the Spark Web UI. Please use the following query script.
    
    **Before**
    ```scala
    scala> :pa
    // Entering paste mode (ctrl-D to finish)
    
    val n = 4000
    val values = (1 to n).map(_.toString).mkString(", ")
    val columns = (1 to n).map("column" + _).mkString(", ")
    val query =
      s"""
         |SELECT $columns
         |FROM VALUES ($values) T($columns)
         |WHERE 1=2 AND 1 IN ($columns)
         |GROUP BY $columns
         |ORDER BY $columns
         |""".stripMargin
    
    def time[R](block: => R): R = {
      val t0 = System.nanoTime()
      val result = block
      println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
      result
    }
    
    time(sql(query))
    time(sql(query))
    
    // Exiting paste mode, now interpreting.
    
    Elapsed time: 30.138142577s
    Elapsed time: 25.787751452s
    ```
    
    **After**
    ```scala
    Elapsed time: 17.500279659s
    Elapsed time: 12.364812255s
    ```
    
    ## How was this patch tested?
    
    Manually, using the above script.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dongjoon-hyun/spark SPARK-16360

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14044.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14044
    
----
commit 1402a9d21cdce66158858560902571a9d91ac2fa
Author: Dongjoon Hyun <do...@apache.org>
Date:   2016-07-04T07:32:22Z

    [SPARK-16360][SQL] Speed up SQL query performance by removing redundant analysis in `Dataset`

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by naliazheli <gi...@git.apache.org>.
Github user naliazheli commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    LGTM




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    **[Test build #61714 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61714/consoleFull)** for PR 14044 at commit [`1402a9d`](https://github.com/apache/spark/commit/1402a9d21cdce66158858560902571a9d91ac2fa).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    `LogicalPlan.resolve(...)` uses a linear search to resolve a column. This is pretty bad if you are trying to look up 4000 columns 4 times (filter, project, aggregate, sort): 4000 * (4000 / 2) * 4 = 32,000,000 lookups.
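    
    A rough way to check that estimate: the sketch below is plain Scala, not Spark's resolver, and `ColumnLookupSketch`, `linearLookup`, `totalLinearComparisons`, and `buildIndex` are hypothetical names. It counts the comparisons a front-to-back scan makes when each of `n` column names is looked up once, and shows a name-to-position map as one possible alternative.
    ```scala
    // Illustrative sketch only: plain Scala, not Spark's actual resolver.
    object ColumnLookupSketch {
      // Linear search: returns the position of `name` and the number of comparisons made.
      def linearLookup(columns: Vector[String], name: String): (Int, Long) = {
        var comparisons = 0L
        var i = 0
        while (i < columns.length) {
          comparisons += 1
          if (columns(i) == name) return (i, comparisons)
          i += 1
        }
        (-1, comparisons)
      }

      // Total comparisons needed to resolve every column once with linear search.
      def totalLinearComparisons(n: Int): Long = {
        val columns = (1 to n).map("column" + _).toVector
        columns.map(c => linearLookup(columns, c)._2).sum
      }

      // Alternative: build a name -> position index once; each lookup is then one hash probe.
      def buildIndex(columns: Vector[String]): Map[String, Int] =
        columns.zipWithIndex.toMap
    }
    ```
    For `n = 4000`, `totalLinearComparisons` gives 4000 * 4001 / 2 = 8,002,000 comparisons per pass, so roughly 32 million over four passes, in line with the estimate above. With the index, each lookup is a single hash probe.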




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Yep. I agree.
    Could you make a PR for that? I think there is also some room for optimization there.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    **[Test build #61744 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61744/consoleFull)** for PR 14044 at commit [`45eb28a`](https://github.com/apache/spark/commit/45eb28af51203a97c22c8b9022cb38ac0451d401).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Merged build finished. Test PASSed.




[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69498744
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    Can we test how much we can speed up by avoiding the duplicated check analysis? I think it's necessary to avoid duplicated analysis, but it seems the check analysis is not a big deal?




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    LGTM




[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69503621
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    Oh, I misunderstood your point.
    You mean 1) changing `logicalPlan`, but 2) keeping `skipAnalysis = false`.
    Okay. I'll report soon.




[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69502213
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    I think I wrote the result in the PR description. Is that not what you meant?




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    LGTM pending Jenkins.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Merged to master. Thanks!




[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69481251
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    It is used due to `RowEncoder(qe.analyzed.schema)`.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by liancheng <gi...@git.apache.org>.
Github user liancheng commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Agree with @hvanhovell. Analysis should never take this long for such a simple query. We should avoid duplicated analysis work, but fixing the performance issue(s) within the analyzer seems likely to be more fruitful.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Hi, @cloud-fan , @hvanhovell , @liancheng .
    
    Following @cloud-fan 's advice, I made the change below, and it turns out that the difference is not noticeable.
    ```
    -    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
    ```
    
    Exactly as you said, the second call of `qe.assertAnalyzed()` is not the root cause. The only difference lies in `sparkSession.sessionState.executePlan(logicalPlan)`.
    
    I'll update the PR.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    @dongjoon-hyun my point is that analysis should not be taking 12 seconds at all. You can see how much time is spent in a rule, if you add the following lines of code to your example:
    ```scala
    import org.apache.spark.sql.catalyst.rules.RuleExecutor
    println(RuleExecutor.dumpTimeSpent)
    ```
    This yields the following result (timing in ns):
    ```
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences                 18784486408
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions         505619796
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$PropagateTypes                195027905
    org.apache.spark.sql.catalyst.analysis.Analyzer$FixNullability                    118882430
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences          74401505
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics          40068476
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer               32929965
    org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator                  30524660
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$ImplicitTypeCasts             30453770
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions                  28383135
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowFrame                26168955
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveWindowOrder                25736499
    org.apache.spark.sql.catalyst.analysis.TimeWindowing                              24807670
    org.apache.spark.sql.catalyst.analysis.DecimalPrecision                           24000260
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubquery                   21653219
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$InConversion                  20830229
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$PromoteStrings                19183636
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$FunctionArgumentConversion    17849664
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$BooleanEquality               15186886
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$IfCoercion                    13994296
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$Division                      13929023
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$DateTimeOperations            13468710
    org.apache.spark.sql.catalyst.analysis.CleanupAliases                             13210810
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$StringToIntegralCasts         13191046
    org.apache.spark.sql.catalyst.analysis.Analyzer$PullOutNondeterministic           11310837
    org.apache.spark.sql.catalyst.analysis.Analyzer$HandleNullInputsForUDF            10712897
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$CaseWhenCoercion              10589030
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases                    7172334
    org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions          5994564
    org.apache.spark.sql.catalyst.analysis.Analyzer$CTESubstitution                   5914136
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy 5303578
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNaturalAndUsingJoin        4060244
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot                      3174805
    org.apache.spark.sql.catalyst.analysis.EliminateUnions                            2787433
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate                   2731683
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations                  2624228
    org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates                  2417768
    org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution               2368503
    org.apache.spark.sql.execution.datasources.PreprocessTableInsertion               2126155
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance                2059795
    org.apache.spark.sql.execution.datasources.DataSourceAnalysis                     1944978
    org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast                     1912039
    org.apache.spark.sql.execution.datasources.ResolveDataSource                      1896232
    org.apache.spark.sql.catalyst.analysis.TypeCoercion$WidenSetOperationTypes        1623414
    org.apache.spark.sql.execution.datasources.FindDataSourceTable                    1623004
    ```
    I think we should take a look at `ResolveReferences`. I do think your PR has merit; we really shouldn't be analyzing the same plan twice.




[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/14044




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61744/
    Test PASSed.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Interesting result. We definitely need to take a look at `ResolveReferences`-related stuff.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    **[Test build #61714 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61714/consoleFull)** for PR 14044 at commit [`1402a9d`](https://github.com/apache/spark/commit/1402a9d21cdce66158858560902571a9d91ac2fa).




[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69498260
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    we can make the `encoder` a by-name parameter in `Dataset`, then the `qe.assertAnalyzed()` will be called first.
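    
    To illustrate the evaluation-order point, here is a minimal sketch in plain Scala. It uses toy classes rather than Spark's `Dataset`; `ByNameSketch`, `EagerHolder`, `LazyHolder`, and `makeEncoder` are hypothetical names, and the `"check ran"` event merely stands in for a call such as `qe.assertAnalyzed()` in the constructor body.
    ```scala
    import scala.collection.mutable.ArrayBuffer

    // Illustrative sketch only: toy classes, not Spark's Dataset or QueryExecution.
    object ByNameSketch {
      // Records the order in which things happen.
      val events = ArrayBuffer.empty[String]

      // Stand-in for building an encoder from an analyzed schema.
      def makeEncoder(): String = { events += "encoder built"; "encoder" }

      // By-value: the argument is evaluated before the constructor body runs.
      class EagerHolder(encoder: String) {
        events += "check ran"
        val enc: String = encoder
      }

      // By-name: the argument is evaluated only when `encoder` is first used.
      class LazyHolder(encoder: => String) {
        events += "check ran"
        lazy val enc: String = encoder
      }
    }
    ```
    Constructing `new ByNameSketch.EagerHolder(ByNameSketch.makeEncoder())` records "encoder built" before "check ran". With `LazyHolder`, "check ran" is recorded first, and "encoder built" appears only once `enc` is accessed.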




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Merged build finished. Test PASSed.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Oh, thank you for the advice about `dumpTimeSpent`. I hadn't looked at it that way.
    These days, I'm investigating how large queries behave.
    This analysis is very helpful for me. Thank you so much, @hvanhovell .




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    cc @cloud-fan , too.




[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Thank you for the review and merge, @liancheng , @cloud-fan , @hvanhovell , and @naliazheli !


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    I have now updated the title and description of the PR/JIRA.
    The only patch in this PR is the following one-word change.
    ```
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema))
    ```
    Thank you all for the fast review and advice. In the first commit, I thought it was important to remove all the repeated logic. Now only the minimal meaningful code change remains.


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Thank you for the review, @liancheng .
    I'm sure that the performance of the Analyzer needs to be improved. But in any case, the cost of the analyzer cannot be zero.
    We should skip the redundant analysis. IMO, that idea sounds orthogonal to this PR, so I asked @hvanhovell to make a PR for that.


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Thank you for the review, @naliazheli .


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by hvanhovell <gi...@git.apache.org>.
Github user hvanhovell commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Any idea what causes the regression? 5 seconds seems way too long for analysis...


[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69418538
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    Here are two optimizations.
    - By using `qe`, `sparkSession.sessionState.executePlan(logicalPlan)` is not called again.
    - By using `skipAnalysis = true`, `qe.assertAnalyzed()` is not called again.
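    
    As a rough illustration of the first point, here is a minimal sketch with toy stand-ins. `Plan`, `QueryExec`, `ToyDataset`, `ofRowsBefore`, `ofRowsAfter` and the `analysisRuns` counter are hypothetical names for this example, not Spark's real classes. It assumes only that analysis is computed lazily and cached per query-execution object, and it does not model the `skipAnalysis` flag.
    
    ```scala
    // Toy stand-ins for illustration only; these are NOT Spark's real classes.
    object RedundantAnalysisSketch {
      // Counts how many times the (pretend) expensive analysis step runs.
      var analysisRuns = 0
    
      final case class Plan(sql: String)
    
      // Analysis is lazy and cached per instance, so it runs at most once per QueryExec.
      final class QueryExec(val plan: Plan) {
        lazy val analyzed: Plan = { analysisRuns += 1; plan }
        // Toy check that forces `analyzed` to be computed.
        def assertAnalyzed(): Unit = require(analyzed.sql.nonEmpty, "plan must not be empty")
      }
    
      final class ToyDataset(val qe: QueryExec) {
        // Wrapping a bare plan creates a fresh QueryExec with nothing cached yet.
        def this(plan: Plan) = this(new QueryExec(plan))
        qe.assertAnalyzed()
      }
    
      // Before: the checked `qe` is discarded and the plan is analyzed again.
      def ofRowsBefore(plan: Plan): ToyDataset = {
        val qe = new QueryExec(plan)
        qe.assertAnalyzed()
        new ToyDataset(plan)
      }
    
      // After: the same `qe` is passed on, so its cached result is reused.
      def ofRowsAfter(plan: Plan): ToyDataset = {
        val qe = new QueryExec(plan)
        qe.assertAnalyzed()
        new ToyDataset(qe)
      }
    
      def main(args: Array[String]): Unit = {
        analysisRuns = 0
        ofRowsBefore(Plan("SELECT 1"))
        println(s"before: $analysisRuns analysis run(s)")
        analysisRuns = 0
        ofRowsAfter(Plan("SELECT 1"))
        println(s"after: $analysisRuns analysis run(s)")
      }
    }
    ```
    
    With these toy classes, the "before" path runs the counted step twice and the "after" path runs it once, because the second `assertAnalyzed()` call hits the value already cached on the shared `qe`.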


[GitHub] spark pull request #14044: [SPARK-16360][SQL] Speed up SQL query performance...

Posted by cloud-fan <gi...@git.apache.org>.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/14044#discussion_r69467305
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -62,7 +62,7 @@ private[sql] object Dataset {
       def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame = {
         val qe = sparkSession.sessionState.executePlan(logicalPlan)
         qe.assertAnalyzed()
    -    new Dataset[Row](sparkSession, logicalPlan, RowEncoder(qe.analyzed.schema))
    +    new Dataset[Row](sparkSession, qe, RowEncoder(qe.analyzed.schema), skipAnalysis = true)
    --- End diff --
    
    How about we remove the `qe.assertAnalyzed()` call in `ofRows`? Then we don't need the `skipAnalysis` flag.


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Thank you for the review, @hvanhovell . BTW, a single analysis takes over 12 seconds.
    
    Elapsed time: 25.787751452s  --> Elapsed time: 12.364812255s.
    
    The reason I executed `time(sql(query))` twice is the SQL parser and other overhead.


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by dongjoon-hyun <gi...@git.apache.org>.
Github user dongjoon-hyun commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Hi, @liancheng and @rxin .
    Could you review this PR?
    This code path occurs during Dataset/DataFrame merging.


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/61714/


[GitHub] spark issue #14044: [SPARK-16360][SQL] Speed up SQL query performance by rem...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/14044
  
    **[Test build #61744 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/61744/consoleFull)** for PR 14044 at commit [`45eb28a`](https://github.com/apache/spark/commit/45eb28af51203a97c22c8b9022cb38ac0451d401).

