You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/04/22 20:04:59 UTC
[jira] [Commented] (SPARK-7058) Task deserialization time metric
does not include time to deserialize broadcasted RDDs
[ https://issues.apache.org/jira/browse/SPARK-7058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507565#comment-14507565 ]
Apache Spark commented on SPARK-7058:
-------------------------------------
User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/5635
> Task deserialization time metric does not include time to deserialize broadcasted RDDs
> --------------------------------------------------------------------------------------
>
> Key: SPARK-7058
> URL: https://issues.apache.org/jira/browse/SPARK-7058
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Reporter: Josh Rosen
> Assignee: Josh Rosen
> Priority: Critical
>
> The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD. Currently, this statistic is measured in {{Executor.run}} by measuring the time to deserialize the {{Task}} instance: https://github.com/apache/spark/blob/bdc5c16e76c5d0bc147408353b2ba4faa8e914fc/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193
> As of Spark 1.1.0, we transfer RDDs using broadcast variables rather than sending them directly as part of the {{Task}} object (see SPARK-2521 for more details). As a result, the deserialization of the RDD is performed outside of this block and is not accounted for in this statistic. As a result, the reported task deserialization time may be a severe underestimate of the actual time.
> To measure actual RDD deserialization time, I hacked the following change into ShuffleMapTask:
> {code}
> diff --git a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> index 6c7d000..adab574 100644
> --- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> +++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
> @@ -57,8 +57,10 @@ private[spark] class ShuffleMapTask(
> override def runTask(context: TaskContext): MapStatus = {
> // Deserialize the RDD using the broadcast variable.
> val ser = SparkEnv.get.closureSerializer.newInstance()
> + val deserializeStartTime = System.currentTimeMillis()
> val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
> ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
> + println(s"Deserialized a shuffle map task in ${System.currentTimeMillis() - deserializeStartTime} ms")
> {code}
> For one of my benchmark jobs (a SQL aggregation query that used code generation), the actual deserialization time was ~150ms per task even though the UI only reported 1ms.
> I think that this should be pretty easy to fix by simply adding additional calls in ShuffleMapTask and ResultTask to increment the deserialization time metric.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org