Posted to issues@spark.apache.org by "Herman van Hovell (JIRA)" <ji...@apache.org> on 2018/08/24 02:14:00 UTC

[jira] [Resolved] (SPARK-25209) Optimization in Dataset.apply for DataFrames

     [ https://issues.apache.org/jira/browse/SPARK-25209?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Herman van Hovell resolved SPARK-25209.
---------------------------------------
       Resolution: Fixed
         Assignee: Bogdan Raducanu
    Fix Version/s: 2.4.0

> Optimization in Dataset.apply for DataFrames
> --------------------------------------------
>
>                 Key: SPARK-25209
>                 URL: https://issues.apache.org/jira/browse/SPARK-25209
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: Bogdan Raducanu
>            Assignee: Bogdan Raducanu
>            Priority: Major
>             Fix For: 2.4.0
>
>
> {{Dataset.apply}} calls {{dataset.deserializer}} (to surface encoder errors early), which ends up running the full {{Analyzer}} on the deserializer expression. This can take tens of milliseconds, depending on how big the plan is.
> Since {{Dataset.apply}} is called by many {{Dataset}} operations, such as {{Dataset.where}}, this can add significant overhead to short queries.
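> For context, the relevant code path looks roughly like this (a simplified sketch of the Spark 2.3-era {{Dataset.apply}}, not the exact source):
> {code}
> object Dataset {
>   def apply[T: Encoder](
>       sparkSession: SparkSession,
>       logicalPlan: LogicalPlan): Dataset[T] = {
>     val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
>     // Eagerly resolve the deserializer so broken encoders fail fast.
>     // Resolving it runs the full Analyzer over the deserializer
>     // expression, which is the overhead measured below.
>     dataset.deserializer
>     dataset
>   }
> }
> {code}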
> In the following benchmark, {{durationMs}} is *17 ms* in current Spark vs *1 ms*
> if the {{dataset.deserializer}} line is removed.
> It seems the resulting {{deserializer}} is particularly big in the case of a nested schema, but the same overhead can be observed with a very wide flat schema.
> According to a comment in the PR that introduced this check, we can at least remove the check for {{DataFrames}} (a sketch of that change follows the benchmark below): [https://github.com/apache/spark/pull/20402#discussion_r164338267]
> {code}
> import org.apache.spark.sql.functions.lit
>
> // A DataFrame whose single column is a 100-field struct (nested schema).
> val col = "named_struct(" +
>   (0 until 100).map { i => s"'col$i', id" }.mkString(",") + ")"
> val df = spark.range(10).selectExpr(col)
> val TRUE = lit(true)
> val numIter = 1000
> val startTime = System.nanoTime()
> // Each where() goes through Dataset.apply and hence the Analyzer.
> for (_ <- 0 until numIter) {
>   df.where(TRUE)
> }
> val durationMs = (System.nanoTime() - startTime) / numIter / 1000000
> println(s"duration $durationMs")
> {code}
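> The suggested change for {{DataFrames}} could look something like the following inside {{Dataset.apply}}. This is a hedged sketch of the idea from the PR comment, not the actual patch; the exact condition on the encoder is an assumption:
> {code}
> // Hypothetical: only pay for the eager deserializer check on typed
> // Datasets; DataFrames (Dataset[Row]) can skip it, per the PR discussion.
> if (dataset.exprEnc.clsTag.runtimeClass != classOf[Row]) {
>   dataset.deserializer  // still fail fast for typed Datasets
> }
> dataset
> {code}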



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org