Posted to user@spark.apache.org by Tom Hubregtsen <th...@gmail.com> on 2015/07/15 20:14:46 UTC

Info from the event timeline appears to contradict dstat info

I am trying to analyze my program, in particular to see what the bottleneck
is (IO, CPU, network), and started using the event timeline for this. 

When looking at my Job 0, Stage 0 (the sampler function taking up 5.6
minutes of my 40-minute program), I see in the event timeline that all time
is spent in "Executor Computing Time." I am not quite sure what this means.
I first thought that, based on this metric, I could immediately conclude
that I was CPU bound, but this does not line up with my dstat log. Looking
at dstat, I see that I spend 65% in CPU wait, 17% in CPU system and only 18%
in CPU user, with disk IO fully utilized for the entire duration of the
stage. From this data, I would conclude that I am actually disk bound.
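
For reference, a minimal sketch of the kind of job I am running (paths and
names here are placeholders, and I am assuming the sampler stage comes from
sortByKey's range-partitioner sampling pass, which reads the whole input):

    import org.apache.spark.{SparkConf, SparkContext}

    object SamplerStageSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("sampler-stage"))
        // Placeholder path; any large on-disk input shows the effect.
        val pairs = sc.textFile("hdfs:///data/input")
          .map(line => (line.hashCode, line))
        // sortByKey's RangePartitioner first runs a sampling job over the
        // input ("the sampler"). That job is dominated by reading the input
        // from disk, yet the timeline shows it as "Executor Computing Time".
        pairs.sortByKey().saveAsTextFile("hdfs:///data/output")
        sc.stop()
      }
    }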

My question based on this is: how do I interpret the label "Executor
Computing Time," and what conclusions can I draw from it?
As I do not see read input/write output among the 7 labels, is IO meant to
be part of "Executor Computing Time" (even though shuffle IO seems to be
tracked separately)? Can I use information from the event timeline as a
basis for any conclusions about my bottleneck (IO, CPU or network)? Is
network included in any of these 7 labels?

Thanks in advance,

Tom Hubregtsen





Re: Info from the event timeline appears to contradict dstat info

Posted by Tom Hubregtsen <th...@gmail.com>.
I think I found my answer at https://github.com/kayousterhout/trace-analysis:

"One thing to keep in mind is that Spark does not currently include
instrumentation to measure the time spent reading input data from disk or
writing job output to disk (the `Output write wait'' shown in the waterfall
is time to write shuffle output to disk, which Spark does have
instrumentation for); as a result, the time shown asCompute' may include
time using the disk. We have a custom Hadoop branch that measures the time
Hadoop spends transferring data to/from disk, and we are hopeful that
similar timing metrics will someday be included in the Hadoop FileStatistics
API. In the meantime, it is not currently possible to understand how much of
a Spark task's time is spent reading from disk via HDFS."
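
To see the numbers behind the timeline labels, the per-task metrics can be
pulled with a SparkListener. A rough sketch (field names follow the newer
TaskMetrics API and may differ between Spark versions; the "compute" figure
here is simply run time minus the separately instrumented components):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

    class ComputeTimeListener extends SparkListener {
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
        val m = taskEnd.taskMetrics
        if (m != null) {
          // Shuffle write time is reported in nanoseconds; the others in ms.
          val shuffleWriteMs = m.shuffleWriteMetrics.writeTime / 1000000
          // Everything not separately instrumented, including reading input
          // data from disk, ends up in this remainder.
          val computeMs = m.executorRunTime - m.jvmGCTime - shuffleWriteMs
          println(s"task ${taskEnd.taskInfo.taskId}: " +
            s"run=${m.executorRunTime}ms gc=${m.jvmGCTime}ms " +
            s"shuffleWrite=${shuffleWriteMs}ms compute~=${computeMs}ms")
        }
      }
    }

    // Register before running the job:
    // sc.addSparkListener(new ComputeTimeListener)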

That said, it might help to add this as a footnote to the event timeline to
avoid confusion :)

Best regards,

Tom Hubregtsen


