Posted to issues@spark.apache.org by "Cheng Lian (JIRA)" <ji...@apache.org> on 2016/07/05 08:22:11 UTC

[jira] [Resolved] (SPARK-16360) Speed up SQL query performance by removing redundant `executePlan` call in `Dataset`

     [ https://issues.apache.org/jira/browse/SPARK-16360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Cheng Lian resolved SPARK-16360.
--------------------------------
       Resolution: Fixed
    Fix Version/s: 2.1.0

Issue resolved by pull request 14044
[https://github.com/apache/spark/pull/14044]

> Speed up SQL query performance by removing redundant `executePlan` call in `Dataset`
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-16360
>                 URL: https://issues.apache.org/jira/browse/SPARK-16360
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Dongjoon Hyun
>             Fix For: 2.1.0
>
>
> There have been a few reports of a query performance regression in Spark 2.0 for large queries.
> This issue speeds up SQL query processing by removing a redundant consecutive `executePlan` call in the `Dataset.ofRows` function and `Dataset` instantiation. Specifically, it reduces the overhead of generating the SQL query execution plan, not the execution of the query itself, so the improvement is not visible in the Spark Web UI. Use the following script to measure it.
> **Before**
> {code}
> scala> :pa
> // Entering paste mode (ctrl-D to finish)
> val n = 4000
> val values = (1 to n).map(_.toString).mkString(", ")
> val columns = (1 to n).map("column" + _).mkString(", ")
> val query =
>   s"""
>      |SELECT $columns
>      |FROM VALUES ($values) T($columns)
>      |WHERE 1=2 AND 1 IN ($columns)
>      |GROUP BY $columns
>      |ORDER BY $columns
>      |""".stripMargin
> def time[R](block: => R): R = {
>   val t0 = System.nanoTime()
>   val result = block
>   println("Elapsed time: " + ((System.nanoTime - t0) / 1e9) + "s")
>   result
> }
> time(sql(query))
> time(sql(query))
> // Exiting paste mode, now interpreting.
> Elapsed time: 30.138142577s
> Elapsed time: 25.787751452s
> {code}
> **After**
> {code}
> Elapsed time: 17.500279659s  // The first query includes some initialization overhead.
> Elapsed time: 12.364812255s  // This shows the real difference: about a 2x speedup.
> {code}
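
The effect described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `LogicalPlan`, `QueryExecution`, `Dataset`, `executePlan`, `ofRowsBefore` and `ofRowsAfter` below are hypothetical stand-ins written for this example, not Spark's real classes or the code changed by the pull request. It shows why creating a second execution object for the same plan repeats the analysis work, while reusing the first one does not.

```scala
// Toy model only: every name here is a hypothetical stand-in, not Spark's API.
object RedundantPlanningSketch {
  // Counts how often the (expensive) analysis step actually runs.
  var analysisRuns = 0

  final case class LogicalPlan(columns: Seq[String])

  // Analysis is lazy and cached, but only per QueryExecution instance.
  final class QueryExecution(val logical: LogicalPlan) {
    lazy val analyzed: LogicalPlan = {
      analysisRuns += 1
      logical.copy(columns = logical.columns.map(_.toLowerCase))
    }
  }

  def executePlan(plan: LogicalPlan): QueryExecution = new QueryExecution(plan)

  // Building a Dataset needs the analyzed plan to derive its schema.
  final class Dataset(val queryExecution: QueryExecution) {
    val schema: Seq[String] = queryExecution.analyzed.columns
  }

  // "Before": check the plan with one execution object, then plan again.
  def ofRowsBefore(plan: LogicalPlan): Dataset = {
    val qe = executePlan(plan)
    require(qe.analyzed.columns.nonEmpty, "plan must have columns")
    new Dataset(executePlan(plan)) // second, redundant planning pass
  }

  // "After": reuse the execution object that was already analyzed.
  def ofRowsAfter(plan: LogicalPlan): Dataset = {
    val qe = executePlan(plan)
    require(qe.analyzed.columns.nonEmpty, "plan must have columns")
    new Dataset(qe)
  }

  def main(args: Array[String]): Unit = {
    val plan = LogicalPlan(Seq("Column1", "Column2"))

    analysisRuns = 0
    ofRowsBefore(plan)
    println(s"before: $analysisRuns analysis runs") // before: 2 analysis runs

    analysisRuns = 0
    ofRowsAfter(plan)
    println(s"after: $analysisRuns analysis runs") // after: 1 analysis runs
  }
}
```

In this model the saving is exactly one analysis pass per query, which would be roughly in line with the reported speedup of about two times if analysis dominates the planning cost for very wide queries. That last point is an assumption of the sketch, not a measurement.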


