You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/04/22 20:04:59 UTC
[jira] [Commented] (SPARK-7058) Task deserialization time metric does not include time to deserialize broadcasted RDDs

    [ https://issues.apache.org/jira/browse/SPARK-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507565#comment-14507565 ] 

Apache Spark commented on SPARK-7058:
-------------------------------------

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5635

> Task deserialization time metric does not include time to deserialize broadcasted RDDs
> --------------------------------------------------------------------------------------
>
>                 Key: SPARK-7058
>                 URL: https://issues.apache.org/jira/browse/SPARK-7058
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Critical
>
> The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.  Currently, this statistic is measured in {{Executor.run}} by measuring the time to deserialize the {{Task}} instance: https://github.com/apache/spark/blob/bdc5c16e76c5d0bc147408353b2ba4faa8e914fc/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193
> As of Spark 1.1.0, we transfer RDDs using broadcast variables rather than sending them directly as part of the {{Task}} object (see SPARK-2521 for more details).  As a result, the deserialization of the RDD is performed outside of this block and is not accounted for in this statistic.  As a result, the reported task deserialization time may be a severe underestimate of the actual time.
> To measure actual RDD deserialization time, I hacked the following change into ShuffleMapTask:
> {code}
> diff --git a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> index 6c7d000..adab574 100644
> --- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> +++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> @@ -57,8 +57,10 @@ private[spark] class ShuffleMapTask(
>    override def runTask(context: TaskContext): MapStatus = {
>      // Deserialize the RDD using the broadcast variable.
>      val ser = SparkEnv.get.closureSerializer.newInstance()
> +    val deserializeStartTime = System.currentTimeMillis()
>      val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
>        ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
> +    println(s"Deserialized a shuffle map task in ${System.currentTimeMillis() - deserializeStartTime} ms")
> {code}
> For one of my benchmark jobs (a SQL aggregation query that used code generation), the actual deserialization time was ~150ms per task even though the UI only reported 1ms.
> I think that this should be pretty easy to fix by simply adding additional calls in ShuffleMapTask and ResultTask to increment the deserialization time metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org