Posted to user@flink.apache.org by Juan Rodríguez Hortalá <ju...@gmail.com> on 2016/02/24 07:03:16 UTC

DataStream transformation isolation in Flink Streaming

Hi,

I was thinking about a problem and how to solve it with Flink Streaming.
Imagine you have a stream of data to which you want to apply several
transformations, where some transformations depend on previous
transformations and there is a final set of actions. This is naturally
modeled as a DAG, which could be implemented in Flink Streaming, Storm or
Spark Streaming. So far this is a typical problem, but now imagine I have
the additional requirement that each of the paths in the graph must be able
to fail without affecting the other paths. For example, given the following
DAG

I -f1-> A -f2-> B -> a1
 \-g1-> C -g2-> D -> a2
 \-h1-> E --/
          \-h2-> F -> a3

here I have:
- a single input DataStream I
- several derived DataStreams A, B, C, D, E and F
- several DataStream transformations f1, f2, g1, g2, h1, h2, each of arity
  1 except for g2, which has arity 2 and defines D = g2(C, E)
- final data sinks a1, a2, a3
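
To make the wiring concrete, here is roughly how I imagine expressing this
DAG with the DataStream API (a sketch only: the source, the transformation
bodies and the print sinks are placeholders I made up, and connect plus a
CoMapFunction is just my guess at how to express the arity-2 g2):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;

public class DagSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // Placeholder input source I
    DataStream<String> I = env.socketTextStream("localhost", 9999);

    // Path 1: I -f1-> A -f2-> B -> a1
    DataStream<String> A = I.map(x -> "f1(" + x + ")").returns(String.class);
    DataStream<String> B = A.map(x -> "f2(" + x + ")").returns(String.class);
    B.print(); // stands in for sink a1

    // Shared stream E: I -h1-> E
    DataStream<String> E = I.map(x -> "h1(" + x + ")").returns(String.class);

    // Path 2: D = g2(C, E); connect + CoMapFunction models the arity 2
    DataStream<String> C = I.map(x -> "g1(" + x + ")").returns(String.class);
    DataStream<String> D = C.connect(E)
        .map(new CoMapFunction<String, String, String>() {
          @Override public String map1(String c) { return "g2(" + c + ")"; }
          @Override public String map2(String e) { return "g2(" + e + ")"; }
        });
    D.print(); // stands in for sink a2

    // Path 3: E -h2-> F -> a3
    DataStream<String> F = E.map(x -> "h2(" + x + ")").returns(String.class);
    F.print(); // stands in for sink a3

    env.execute("dag-isolation-example");
  }
}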

I don't know much about Flink, but I assume that if one of the
transformations starts failing temporarily, then the whole program will
temporarily fail until that transformation goes back to normal. For
example, if h2 starts failing to compute F = h2(E) because h2 has some
dependency that is temporarily unavailable (a database, a service, ...),
then there is no guarantee that B will keep being computed correctly and
sending records to the sink a1, even though no path from I to B goes
through h2, so nothing in the dataflow itself prevents those records from
being computed. At least that is what I would expect to happen with Spark
Streaming. My first question is whether this would in fact happen with
Flink Streaming too.
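
To make the failure mode concrete, I imagine h2 implemented as something
like the following, where ExternalServiceClient is a made-up stand-in for
the flaky dependency:

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Hypothetical implementation of h2 that enriches each record through an
// external service which may become temporarily unavailable.
public class H2 extends RichMapFunction<String, String> {
  private transient ExternalServiceClient client; // made-up client type

  @Override
  public void open(Configuration parameters) {
    client = new ExternalServiceClient("db-host:5432"); // hypothetical address
  }

  @Override
  public String map(String value) throws Exception {
    // If the service is down this call throws, the task fails, and (as far
    // as I understand) the whole job is restarted, which would also
    // interrupt the unrelated path I -> A -> B.
    return client.enrich(value);
  }
}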

It would also be nice to be able to update the code of each DataStream
transformation independently, which I guess you can't do because you have a
single Flink program. Hence if you want to modify the implementation of a
single transformation, even while still respecting its input-output
interface, you have to stop and restart the whole topology.
You could approximate that topology by defining a micro service for each
DataStream transformation and connecting the services with queues. You
could also make this more scalable by running several server instances for
each service behind a per-service load balancer. Or you could use Akka
actors or something like that, which is basically equivalent to these
groups of processes communicating through queues. But then you lose the
high-level programming interface and all the other benefits of Flink, and
the system infrastructure also gets far more complicated than a single YARN
cluster. I was wondering whether it would be possible to split the DAG into
several sub-DAGs, implement each of those sub-DAGs as a Flink Streaming
program, and then connect those programs without having to use an
intermediate external queue like Kafka, relying instead on the internal
queues used by Flink. In other words, is it possible to connect several
Flink Streaming programs without using an external queue? That could be an
interesting compromise that would allow for different types of modularity
(in functions, in different physical components) and isolation levels.
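
For concreteness, the Kafka-based compromise I would like to avoid would
look roughly like this, with a topic decoupling two separate jobs
(fragments only, reusing I and env from the sketch above; the classes
assume the Kafka 0.9 connector, and the topic and broker names are made
up):

import java.util.Properties;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

// Job 1: compute the shared stream E and publish it to a Kafka topic.
DataStream<String> E = I.map(x -> "h1(" + x + ")").returns(String.class);
E.addSink(new FlinkKafkaProducer09<>(
    "localhost:9092", "stream-E", new SimpleStringSchema()));

// Job 2 (a separate program): consume E and run h2 -> F -> a3; it can
// fail and be redeployed without stopping Job 1.
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "sub-dag-h2");
DataStream<String> E2 = env.addSource(
    new FlinkKafkaConsumer09<>("stream-E", new SimpleStringSchema(), props));
DataStream<String> F = E2.map(x -> "h2(" + x + ")").returns(String.class);
F.print(); // stands in for sink a3

Each job could then fail or be redeployed independently, but at the cost of
operating Kafka and of losing the single high-level view of the whole DAG.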
This is a rather speculative problem, but I think situations like this are
not uncommon in practice.
Thanks for your help in any case.

Greetings,

Juan