Posted to user@spark.apache.org by Leigh Stewart <ag...@gmail.com> on 2018/10/21 20:12:58 UTC

[Spark streaming]

I have a question about use cases.

We have an iterative workload. Data arrives on a Kinesis stream,
partitioned by 'session', and consists of records which represent
changes to session state. In this case the session is a sensor data
ingestion session - sensor data arrives from customers and needs to
be accumulated and transformed.

Broadly speaking we need to perform a few operations on this data:
1. Accumulate - just aggregate data which has arrived, per session,
into a state variable. This variable can be quite large, say 100MB.
2. Streaming processing - examine the state as records arrive and
compute some intermediate results. Most of these computations are
compute intensive, and can easily benefit from parallelization.
3. Optimization - this operates on the state variable mentioned
earlier. It is not parallelizable (except at the level of session id
partitioning as described earlier). The optimization is a non-linear
least squares optimization; it basically must run sequentially, and
must run on the entirety of the data accumulated so far.

And then the cycle repeats, as the output of optimization is an input
to the next optimization cycle.

Currently we have a pretty bespoke solution for this, and I'm
beginning to explore whether we could express it in Spark.

This problem isn't exactly embarrassingly parallel. Pieces of it are,
but the sequential optimization step seems like it might be a pain to
work with. So does the need to accumulate a very large state variable
(which is ultimately related to the optimization step). However, it
seems like it would be great to express the highly parallel parts,
plus the overall workflow, in a concise functional Scala syntax, not
have to worry about fault tolerance, communication, scheduling,
etc., and support ad-hoc query and manipulation of datasets to boot.
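
To make that concrete, here is a rough sketch of how I imagine steps
1 and 2 might map onto Structured Streaming, keyed by session id with
flatMapGroupsWithState. SessionEvent, SessionState, IntermediateResult
and updateSession are just made-up placeholder names, and I've left
the source as a ??? since I'm not sure yet which Kinesis connector we
would use:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Placeholder types standing in for the real sensor records and state.
case class SessionEvent(sessionId: String, values: Seq[Double])
case class SessionState(sessionId: String, accumulated: Seq[Double])
case class IntermediateResult(sessionId: String, metric: Double)

// Step 1 + 2: accumulate arriving records per session, then emit an
// intermediate result computed from the accumulated state.
def updateSession(
    sessionId: String,
    events: Iterator[SessionEvent],
    state: GroupState[SessionState]): Iterator[IntermediateResult] = {
  val previous = state.getOption.getOrElse(SessionState(sessionId, Seq.empty))
  val updated = previous.copy(
    accumulated = previous.accumulated ++ events.flatMap(_.values))
  state.update(updated)
  // Placeholder metric; the real computation is the compute-heavy part.
  Iterator(IntermediateResult(sessionId, updated.accumulated.sum))
}

val spark = SparkSession.builder.appName("session-accumulation").getOrCreate()
import spark.implicits._

// However the Kinesis records end up mapped into SessionEvent.
val events: Dataset[SessionEvent] = ???

val results = events
  .groupByKey(_.sessionId)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout)(updateSession)

The optimization step (3) presumably would not fit into this streaming
part at all, which is what the questions below are getting at.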

A couple of questions (apologies if these are too open ended):
1. Does this seem like a use case that could fit well with Spark? Are
the large state variables and non-parallelizable/partitions=1
processing steps red flags? Is there benefit in using Spark as a sort
of data flow manager even when some of the most important steps are
compute intensive and not very parallelizable?
2. I could imagine accumulating the state in the driver, and then,
when it's time to run optimization, 'shipping' it to a worker by
turning it into an RDD and reducing it to the optimization output.
Would something like this work? Does this pattern ever show up?
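
Concretely, the pattern I'm imagining for question 2 is something
like the snippet below. runOptimization and OptimizationResult are
just placeholders for the sequential NLLS step and its output, and
SessionState is the same made-up state type as in the earlier sketch:

// State accumulated on the driver (can be ~100MB per session).
val state: SessionState = ???

// Placeholder for the sequential non-linear least squares step.
case class OptimizationResult(params: Seq[Double])
def runOptimization(s: SessionState): OptimizationResult = ???

// Ship the state to a single task: a one-element RDD with one
// partition, so the optimization runs sequentially on one executor.
val optimized: OptimizationResult = spark.sparkContext
  .parallelize(Seq(state), numSlices = 1)
  .map(runOptimization)
  .first()

// 'optimized' would then feed the next accumulation/optimization cycle.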
