Posted to dev@spark.apache.org by Ryan Williams <ry...@gmail.com> on 2015/01/09 21:16:46 UTC

Present/Future of monitoring spark jobs, "MetricsSystem" vs. Web UI, etc.

I've long wished the web UI gave me a better sense of how the metrics it
reports are changing over time, so I was intrigued to stumble across the
MetricsSystem
<https://github.com/apache/spark/blob/b6aa557300275b835cce7baa7bc8a80eb5425cbb/core/src/main/scala/org/apache/spark/metrics/MetricsSystem.scala>
infrastructure the other day.

I've set up a very basic Graphite instance and had dummy Spark jobs report
to it, but that process was a little bumpy (and the docs sparse
<https://spark.apache.org/docs/latest/monitoring.html#metrics>) so I wanted
to come up for air and ask a few questions about the present/future plans
for monitoring Spark jobs.
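
For concreteness, here is roughly the shape of the metrics.properties file that
got my driver and executors reporting to Graphite; the host, port, and prefix
values are placeholders, and the keys are the ones from Spark's
metrics.properties.template:

```properties
# $SPARK_HOME/conf/metrics.properties — or point at it explicitly with
#   --conf spark.metrics.conf=/path/to/metrics.properties
# Host/port/prefix below are placeholders for your own Graphite instance.

# Report all instances' metrics to a Graphite carbon-cache every 10 seconds.
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
*.sink.graphite.unit=seconds
*.sink.graphite.prefix=spark

# Also expose JVM source metrics (heap, GC, etc.) from each instance.
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```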

In rough order of increasing scope:

   - Do most people monitor their Spark jobs in realtime by repeatedly
   refreshing the web UI (cf. SPARK-5106
   <https://issues.apache.org/jira/browse/SPARK-5106>), or is there a
   better way?
   - Does anyone use or rely on the GraphiteSink? Quick googling turned up
   no evidence of anyone using it.
      - Likewise the other Sinks? GangliaSink?
   - Do people have custom Sink subclasses and dashboards that they've
   built to monitor Spark jobs, as was suggested by the appearance of a
   mysterious Ooyala "DatadogSink" gist
   <https://gist.github.com/ibuenros/9b94736c2bad2f4b8e23#file-sparkutils-scala-L336>
   in the recent thread on this list about custom metrics
   <http://apache-spark-developers-list.1001551.n3.nabble.com/Registering-custom-metrics-tp9030p10041.html>
   ?
   - What is the longer-term plan for how people should monitor / diagnose
   problems at runtime?
      - Will the official Spark web UI remain the main way that the average
      user will monitor their jobs?
      - Or, will SPARK-3644
      <https://issues.apache.org/jira/browse/SPARK-3644> usher in an era of
      many external implementations of Spark web UIs, so that the average user
      will take one of those "off the shelf" that they like best (because its
      graphs are prettier or it emphasizes / pivots around certain metrics that
      others do not)?
      - Is the MetricsSystem infrastructure redundant with the REST API
      discussed in SPARK-3644
      <https://issues.apache.org/jira/browse/SPARK-3644>?
         - Would more robust versions of each start to be redundant in the
         future?
         - I feel like the answers are "somewhat yes" and "yes", and would
         like to hear other perspectives.
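
(For anyone curious what a custom sink entails: MetricsSystem instantiates
sinks reflectively against Spark's private[spark] Sink trait, which is just
start/stop/report over a Codahale MetricRegistry. The sketch below mirrors
those two shapes locally so it's self-contained — the trait and registry here
are simplified stand-ins, not Spark's actual classes; a real sink would depend
on spark-core and implement org.apache.spark.metrics.sink.Sink.)

```scala
import java.util.Properties

// Hypothetical local mirror of Spark's private[spark] Sink trait.
trait Sink {
  def start(): Unit
  def stop(): Unit
  def report(): Unit
}

// Simplified stand-in for com.codahale.metrics.MetricRegistry:
// named gauges that produce a Long on demand.
class MetricRegistry {
  private val gauges = scala.collection.mutable.Map[String, () => Long]()
  def register(name: String, gauge: () => Long): Unit = gauges(name) = gauge
  def snapshot(): Map[String, Long] =
    gauges.map { case (name, g) => (name, g()) }.toMap
}

// A console sink: on each report, print every gauge value, prefixed with
// a key read from the sink's Properties (as Spark's bundled sinks do).
class ConsoleSink(property: Properties, registry: MetricRegistry) extends Sink {
  private val prefix = Option(property.getProperty("prefix")).getOrElse("spark")
  override def start(): Unit = ()  // a real sink would start a poll thread here
  override def stop(): Unit = ()
  override def report(): Unit =
    registry.snapshot().foreach { case (name, v) => println(s"$prefix.$name=$v") }
}
```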

Basically, I want to live in a world where:

   - I can see all of the stats currently exposed on the Web UI,
   - as well as others that aren't there yet,
      - number of records assigned to each task,
      - number of records completed by each task in realtime,
      - gc stats in realtime,
      - # of spill events,
      - size of spill events,
   - and all kinds of derivatives of the above,
      - latencies/histograms for everything
         - records per second per task,
         - records per second per executor,
         - top N slowest/worst of any metric,
         - avg spill size,
         - etc.
      - over time,
   - at scale <https://issues.apache.org/jira/browse/SPARK-2017>


Are we going to get to this world by improving the web UI that ships with
Spark? I am pessimistic about that approach:

   - It may be impossible to do in a way that satisfies all stakeholders'
   aesthetic sensibilities and preferences for what stats/views are important.
   - It would be a monumental undertaking relative to the amount of
   attention that seems to have been directed at improving the web UI in the
   last few quarters.

OTOH, if the space of derivative stats and slices thereof that we want to
support is as complex as the outline I gave above suggests it might be,
then Graphite (or some equivalent) could be well suited to the task.
However, this is at odds with the relative obscurity that the MetricsSystem
seems to reside in and my impression that it is not something that core
developers think about or are focused on.

Finally, while the existence of SPARK-3644 (and Josh et al.'s great work on
it thus far) implies that the REST API / "let 1000 [web UIs] bloom" vision
is at least nominally being pursued, it seems like it's still a long way
from fostering a world where my dream use-cases above are realized, and
it's not clear from the outside whether fulfilling that vision is a
priority.

So I'm interested to hear people's thoughts on the above questions and what
the plan is / should be going forward. Even having learned a lot about how
Spark works, I still find the process of figuring out "Why My Spark Jobs Are
Failing" daunting (at best) with the tools I've come across; we need to do a
better job of empowering people to figure these things out.