Posted to user@spark.apache.org by Bryan Jeffrey <br...@gmail.com> on 2020/01/21 20:29:28 UTC

Accumulator v2

Hello.

We're currently using Spark Streaming (Spark 2.3) for a number of
applications.  One pattern we've used successfully is to generate an
accumulator inside a DStream transform statement and accumulate values
associated with the RDD as we process the data.  A stage-completion
listener then listens for stage-complete events, retrieves the
AccumulableInfo for our custom classes, and exports the statistics to our
back-end.
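
For reference, the pattern looks roughly like this (a simplified sketch;
the metric name, the error check and reportToBackend are illustrative
stand-ins rather than our actual code):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}
import org.apache.spark.streaming.dstream.DStream

// Register a fresh accumulator per batch inside transform and add to it
// while processing the RDD.
def countErrors(stream: DStream[String]): DStream[String] =
  stream.transform { rdd =>
    val errorCount = rdd.sparkContext.longAccumulator("custom.errorCount")  // illustrative name
    rdd.map { record =>
      if (record.contains("ERROR")) errorCount.add(1)  // hypothetical error check
      record
    }
  }

// Listener that picks the values up again when each stage completes.
class StatsListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    stageCompleted.stageInfo.accumulables.values
      .filter(_.name.exists(_.startsWith("custom.")))
      .foreach(info => reportToBackend(info.name.get, info.update, info.value))
  }

  // Stand-in for the call that ships statistics to our back-end.
  private def reportToBackend(name: String, update: Option[Any], value: Option[Any]): Unit =
    println(s"$name update=$update value=$value")
}

// Registered once on the driver, e.g. ssc.sparkContext.addSparkListener(new StatsListener())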

We're trying to move more of our applications to Structured Streaming.
However, the accumulator pattern does not obviously fit Structured
Streaming.  In many cases we're able to see basic statistics (e.g. the
number of input and output events) from the built-in statistics, but we
need a pattern for more complex statistics (number of errors, number of
internal records, etc.).  If we define an accumulator on startup and add
statistics to it, we can see the values - but only as running totals
rather than per-trigger counts - so if we read 10 records in the first
trigger and 15 in the second trigger, we see accumulated values of 10 and
25.
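
To make the behaviour concrete, here is a minimal sketch of what we're
doing (the rate source, the metric name and the console output are just
for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

val spark = SparkSession.builder.appName("accumulator-demo").getOrCreate()
import spark.implicits._

// One accumulator registered once at startup, shared by every trigger.
val recordCount = spark.sparkContext.longAccumulator("custom.recordCount")

val counted = spark.readStream.format("rate").load()
  .select($"value").as[Long]
  .map { v => recordCount.add(1); v }  // adds on every micro-batch

// recordCount.value read here is the running total (10, then 25, ...),
// not the count for the trigger that just finished.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(s"batch ${event.progress.batchId}: total so far = ${recordCount.value}")
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()
})

// counted.writeStream.format("console").start()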

There are several options that might allow us to move ahead:
1. We could have the AccumulableInfo contain both previous and current
counts
2. We could maintain current and previous counts separately and report the
difference (a rough sketch of this follows the list)
3. We could maintain a map from ID to AccumulatorV2 and call
accumulator.reset() once we've read the data
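
For example, option 2 might look roughly like this in a progress listener
(again only a sketch; reportToBackend is a stand-in for our back-end call):

import scala.collection.mutable
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._
import org.apache.spark.util.LongAccumulator

// Option 2, roughly: remember the totals we reported last time and emit deltas.
class DeltaReportingListener(accumulators: Map[String, LongAccumulator])
    extends StreamingQueryListener {

  private val previous = mutable.Map.empty[String, Long].withDefaultValue(0L)

  override def onQueryStarted(event: QueryStartedEvent): Unit = ()
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = ()

  override def onQueryProgress(event: QueryProgressEvent): Unit =
    accumulators.foreach { case (name, acc) =>
      val total = acc.sum                 // running total across all triggers
      val delta = total - previous(name)  // count for this trigger only
      previous(name) = total
      reportToBackend(name, delta)
    }

  // Stand-in for the call that ships statistics to our back-end.
  private def reportToBackend(name: String, delta: Long): Unit =
    println(s"$name: $delta this trigger")
}

// spark.streams.addListener(new DeltaReportingListener(Map("custom.recordCount" -> recordCount)))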

All of these options seem a little bit like hacky workarounds.  Has anyone
encountered this use case?  Is there a good pattern to follow?

Regards,

Bryan Jeffrey