Posted to dev@spark.apache.org by Kay Ousterhout <ke...@eecs.berkeley.edu> on 2014/07/09 21:23:17 UTC

CPU/Disk/network performance instrumentation

Hi all,

I've been doing a bunch of performance measurement of Spark and, as part of
doing this, added metrics that record the average CPU utilization, disk
throughput and utilization for each block device, and network throughput
while each task is running. These metrics are collected by reading the
/proc filesystem, so they work only on Linux. I'm happy to submit a pull
request with the appropriate changes, but first wanted to see if
sufficiently many people think this would be useful. I know the metrics
reported by Spark (and in the UI) are already overwhelming to some folks,
so I don't want to add more instrumentation if it's not widely useful.
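
To make the collection mechanism concrete, here is a minimal sketch of the
kind of /proc/stat reading involved for the CPU metric (the class and method
names below are just for illustration; they are not the ones on my branch):

import scala.io.Source

// Rough sketch of the /proc/stat parsing the CPU metric is based on.
// (Names are illustrative, not the ones on the branch.)
case class CpuSnapshot(totalJiffies: Long, idleJiffies: Long)

object ProcCpu {
  // Parse the aggregate "cpu" line of /proc/stat (Linux only).
  def snapshot(): CpuSnapshot = {
    val source = Source.fromFile("/proc/stat")
    val cpuLine = try source.getLines().next() finally source.close()
    // Fields after "cpu": user, nice, system, idle, iowait, irq, softirq, ...
    val fields = cpuLine.trim.split("\\s+").drop(1).map(_.toLong)
    CpuSnapshot(totalJiffies = fields.sum, idleJiffies = fields(3))
  }

  // Average utilization (0.0-1.0) between two snapshots, e.g. taken at
  // task start and task end.
  def utilization(start: CpuSnapshot, end: CpuSnapshot): Double = {
    val total = (end.totalJiffies - start.totalJiffies).toDouble
    if (total <= 0) 0.0
    else 1.0 - (end.idleJiffies - start.idleJiffies) / total
  }
}

The disk and network metrics come from the analogous per-device counters
(/proc/diskstats and /proc/net/dev), sampled the same way.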

These metrics are slightly more difficult to interpret for Spark than
similar metrics reported by Hadoop because, with Spark, multiple tasks run
in the same JVM and therefore as part of the same process. This means
that, for example, the CPU utilization metrics reflect the CPU use across
all tasks in the JVM, rather than only the CPU time used by the particular
task. This is both a pro and a con: it makes it harder to determine why
utilization is high (the load may come from a different task), but it also
makes the metrics useful for diagnosing straggler problems. I just wanted
to clarify this before asking folks to weigh in on whether the added
metrics would be useful.
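
As a concrete illustration of why the numbers are per-process rather than
per-task: a reading of /proc/self/stat covers the whole executor JVM, so two
tasks sampling it concurrently each see the other's CPU time as well as
their own. Roughly (again, a sketch, not the branch code):

import scala.io.Source

object ProcessCpu {
  // Total user + system jiffies consumed so far by this JVM -- i.e. by
  // every task currently running in the executor, not just the one asking.
  def processJiffies(): Long = {
    val source = Source.fromFile("/proc/self/stat")
    val statLine = try source.getLines().next() finally source.close()
    // The command name (field 2) is wrapped in parentheses and may contain
    // spaces, so parse the fields that follow the closing ')'.
    val fields = statLine.substring(statLine.lastIndexOf(')') + 2).split(" ")
    // utime and stime are fields 14 and 15 of /proc/[pid]/stat, which land
    // at indices 11 and 12 once counting starts from field 3 (state).
    fields(11).toLong + fields(12).toLong
  }
}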

-Kay

(if you're curious, the instrumentation code is on a very messy branch
here:
https://github.com/kayousterhout/spark-1/tree/proc_logging_perf_minimal_temp/core/src/main/scala/org/apache/spark/performance_logging
)

Re: CPU/Disk/network performance instrumentation

Posted by Surendranauth Hiraman <su...@velos.io>.
+1 on advanced tab.



On Wed, Jul 9, 2014 at 5:20 PM, Mridul Muralidharan <mr...@gmail.com>
wrote:

> +1 on advanced mode !
>
> Regards.
> Mridul



-- 

SUREN HIRAMAN, VP TECHNOLOGY
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR
NEW YORK, NY 10001
O: (917) 525-2466 ext. 105
F: 646.349.4063
E: suren.hiraman@velos.io
W: www.velos.io

Re: CPU/Disk/network performance instrumentation

Posted by Mridul Muralidharan <mr...@gmail.com>.
+1 on advanced mode !

Regards.
Mridul

On Thu, Jul 10, 2014 at 12:55 AM, Reynold Xin <rx...@databricks.com> wrote:
> Maybe it's time to create an advanced mode in the ui.

Re: CPU/Disk/network performance instrumentation

Posted by Shivaram Venkataraman <sh...@eecs.berkeley.edu>.
I think it would be very useful to have this. We could put the UI display
either behind a flag or a URL parameter.
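
For example, something along these lines (the flag name below is made up,
just to sketch the idea):

import org.apache.spark.SparkConf

// Sketch: only render the extra CPU/disk/network columns when asked for,
// either via a (hypothetical) conf flag or a "?advanced=true" URL parameter.
object AdvancedMetricsToggle {
  def showAdvanced(conf: SparkConf, urlParam: Option[String]): Boolean =
    conf.getBoolean("spark.ui.showAdvancedMetrics", false) ||
      urlParam.exists(_ == "true")
}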

Shivaram


On Wed, Jul 9, 2014 at 12:25 PM, Reynold Xin <rx...@databricks.com> wrote:

> Maybe it's time to create an advanced mode in the ui.

Re: CPU/Disk/network performance instrumentation

Posted by Reynold Xin <rx...@databricks.com>.
Maybe it's time to create an advanced mode in the ui.

