You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Urs Schoenenberger <ur...@tngtech.com> on 2018/07/17 12:50:36 UTC

Object reuse in DataStreams

Hi all,

we came across some interesting behaviour today.
We enabled object reuse on a streaming job that looks like this:

stream = env.addSource(source)
stream.map(mapFnA).addSink(sinkA)
stream.map(mapFnB).addSink(sinkB)

Operator chaining is enabled, so the optimizer fuses all operations into
a single slot.
The same object reference gets passed to both mapFnA and mapFnB. This
makes sense when I think about the internal implementation, but it still
came as a bit of a surprise since the object reuse docs (for batch -
there are no official ones for streaming, right?) don't really deal with
splitting the DataSet/DataStream. I guess my case is *technically*
covered by the documented warning that it is unsafe to reuse an object
that has already been collected, only in this case this reuse is
"hidden" behind the stream definition DSL.

Is this the expected behaviour? Is object reuse for DataStreams
encouraged at all or is it more of a "hidden beta" feature until FLIP-21
is officially finished?

Best,
Urs

-- 
Urs Schönenberger - urs.schoenenberger@tngtech.com

TNG Technology Consulting GmbH, Beta-Straße 13a, 85774 Unterföhring
Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
Sitz: Unterföhring * Amtsgericht München * HRB 135082

Re: Object reuse in DataStreams

Posted by vino yang <ya...@gmail.com>.
Hi Urs,

I think Flink does not encourage to use "object reuse" feature, because in
the documentation, it warn the user it may course bug when the user-code
function of an operation is not aware of this behavior[1].

The "object reuse" is runtime behavior and it's configuration item belongs
`ExecutionConfig` (this class both for batch and streaming), so it takes
efforts for both batch and streaming[1].

"Object reuse" feature is not perfect, the main use case is for batch[2]
and FLIP-21 tries to promote this feature gracefully, but the current state
is "*Under discussion*".

[1]:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/execution_configuration.html#execution-configuration
[2]:
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/batch/#operating-on-data-objects-in-functions

Thanks, vino.

2018-07-17 20:50 GMT+08:00 Urs Schoenenberger <
urs.schoenenberger@tngtech.com>:

> Hi all,
>
> we came across some interesting behaviour today.
> We enabled object reuse on a streaming job that looks like this:
>
> stream = env.addSource(source)
> stream.map(mapFnA).addSink(sinkA)
> stream.map(mapFnB).addSink(sinkB)
>
> Operator chaining is enabled, so the optimizer fuses all operations into
> a single slot.
> The same object reference gets passed to both mapFnA and mapFnB. This
> makes sense when I think about the internal implementation, but it still
> came as a bit of a surprise since the object reuse docs (for batch -
> there are no official ones for streaming, right?) don't really deal with
> splitting the DataSet/DataStream. I guess my case is *technically*
> covered by the documented warning that it is unsafe to reuse an object
> that has already been collected, only in this case this reuse is
> "hidden" behind the stream definition DSL.
>
> Is this the expected behaviour? Is object reuse for DataStreams
> encouraged at all or is it more of a "hidden beta" feature until FLIP-21
> is officially finished?
>
> Best,
> Urs
>
> --
> Urs Schönenberger - urs.schoenenberger@tngtech.com
>
> TNG Technology Consulting GmbH, Beta-Straße 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
>