You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Pieter-Jan Van Aeken <pi...@euranova.eu> on 2018/10/09 12:52:27 UTC

BroadcastStream vs Broadcasted DataStream

Hi,


I am not sure I fully understand the differences between doing something
like

dataStreamX.connect(dataStreamY.broadcast()).process(new
CoProcessFunction{})

and this

dataStreamX.connect(dataStreamY.broadcast(*descriptor*).process(new
BroadcastProcessFunction)

Couldn't I manage the state exactly the same way using the first option?
The only difference I could find was in read/write permissions for
different states. But other than that, the difference in possibilities
escapes me and I was hoping someone here could clarify.


Kind regards,

Pieter-Jan

___________________________

*Pieter-Jan Van Aeken*

Consultant - Data Engineer
(M) +32 (0) 474 06 64 48

*EURA NOVA*

Rue Emile Francqui, 4

1435 Mont-Saint-Guibert

(T) +32 10 75 02 00

*euranova.eu <http://euranova.eu>*

*research.euranova.eu* <http://research.euranova.eu>

-- 
♻ Be green, keep it on the screen

Re: BroadcastStream vs Broadcasted DataStream

Posted by Kostas Kloudas <k....@data-artisans.com>.
Hi Pieter-Jan,

The second variant stores the elements of the broadcasted stream in operator (thus non-keyed) state.

On the differences: 

The Broadcast stream is not a keyed stream, so you are not in a keyed context, thus you have no access to keyed state.
Given this, and assuming that you are implementing a CoProcessFunction, then your function should also implement the 
ListCheckpointed interface and store the broadcasted data as ListState.

From a systems perspective, in case of rescaling, then your function should take care of removing duplicated manually
as in the case of scaling down for example, the state of 2 nodes may end up on one. In addition, given that elements are
re-distributed randomly, you may even end up with some elements missing one some nodes. This would be more evident 
in case of scaling up. In this case, nodes will be missing data.

From an API perspective, assuming that the other stream is keyed and it has state, accessing it in order to do some computation
would be impossible.

Cheers,
Kostas

> On Oct 9, 2018, at 2:52 PM, Pieter-Jan Van Aeken <pi...@euranova.eu> wrote:
> 
> Hi,
> 
> 
> I am not sure I fully understand the differences between doing something like
> 
> dataStreamX.connect(dataStreamY.broadcast()).process(new CoProcessFunction{})
> 
> and this
> 
> dataStreamX.connect(dataStreamY.broadcast(descriptor).process(new BroadcastProcessFunction)
> 
> Couldn't I manage the state exactly the same way using the first option? The only difference I could find was in read/write permissions for different states. But other than that, the difference in possibilities escapes me and I was hoping someone here could clarify.
> 
> 
> Kind regards,
> 
> Pieter-Jan
> ___________________________
> 
> Pieter-Jan Van Aeken
> 
> 
> Consultant - Data Engineer
> (M) +32 (0) 474 06 64 48
> 
> 
> EURA NOVA
> 
> Rue Emile Francqui, 4
> 
> 1435 Mont-Saint-Guibert
> 
> (T) +32 10 75 02 00
> 
> 
> euranova.eu <http://euranova.eu/>
> research.euranova.eu <http://research.euranova.eu/>
> ♻ Be green, keep it on the screen