You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Josh Rosen (JIRA)" <ji...@apache.org> on 2015/04/22 19:41:01 UTC

[jira] [Created] (SPARK-7058) Task deserialization time metric does not include time to deserialize broadcasted RDDs

Josh Rosen created SPARK-7058:
---------------------------------

             Summary: Task deserialization time metric does not include time to deserialize broadcasted RDDs
                 Key: SPARK-7058
                 URL: https://issues.apache.org/jira/browse/SPARK-7058
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Josh Rosen
            Priority: Critical


The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.  Currently, this statistic is measured in {{Executor.run}} by measuring the time to deserialize the {{Task}} instance: https://github.com/apache/spark/blob/bdc5c16e76c5d0bc147408353b2ba4faa8e914fc/core/src/main/scala/org/apache/spark/executor/Executor.scala#L193

As of Spark 1.1.0, we transfer RDDs using broadcast variables rather than sending them directly as part of the {{Task}} object (see SPARK-2521 for more details).  As a result, the deserialization of the RDD is performed outside of this block and is not accounted for in this statistic.  As a result, the reported task deserialization time may be a severe underestimate of the actual time.

To measure actual RDD deserialization time, I hacked the following change into ShuffleMapTask:

{code}
diff --git a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
index 6c7d000..adab574 100644
--- a/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
+++ b/core/src/main/scala/org/apache/spark/scheduler/ShuffleMapTask.scala
@@ -57,8 +57,10 @@ private[spark] class ShuffleMapTask(
   override def runTask(context: TaskContext): MapStatus = {
     // Deserialize the RDD using the broadcast variable.
     val ser = SparkEnv.get.closureSerializer.newInstance()
+    val deserializeStartTime = System.currentTimeMillis()
     val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
       ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
+    println(s"Deserialized a shuffle map task in ${System.currentTimeMillis() - deserializeStartTime} ms")
{code}

For one of my benchmark jobs (a SQL aggregation query that used code generation), the actual deserialization time was ~150ms per task even though the UI only reported 1ms.

I think that this should be pretty easy to fix by simply adding additional calls in ShuffleMapTask and ResultTask to increment the deserialization time metric.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org