Posted to user@flink.apache.org by Manas Kale <ma...@gmail.com> on 2020/05/11 06:06:45 UTC

Broadcast state vs data enrichment

Hi,
I have a single broadcast message that contains configuration data consumed
by different operators. For example:
config = {
"config1" : 1,
"config2" : 2,
"config3" : 3
}

Operator 1 will consume config1 only, operator 2 will consume config2 only,
and so on.


   - Right now, in my implementation, the config message is broadcast to
   operators 1, 2, and 3, and each operator stores only what it needs.


   - A different approach would be to broadcast the config message to a
   single root operator. This operator would then enrich the event data
   flowing through it with config1, config2, and config3, and each downstream
   operator would "strip off" the config parameter that it needs.


*I was wondering which approach would be best performance-wise.* I don't
really have the time to implement both and compare, so perhaps someone here
already knows whether one approach is better or both provide similar
performance.

FWIW, the config stream is very sporadic compared to the event stream.

Thank you,
Manas Kale

Re: Broadcast state vs data enrichment

Posted by Manas Kale <ma...@gmail.com>.
I see, thank you Roman!


Re: Broadcast state vs data enrichment

Posted by Khachatryan Roman <kh...@gmail.com>.
Thanks for the clarification.

The second option (with the enricher) creates more load because it adds the
configuration to every event. Unless events are much bigger than the
configuration, this will significantly increase network, memory, and CPU
usage.
By the way, I think you don't need a broadcast in the second option, since
the interested subtask will receive the configuration anyway.
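
A rough way to see the per-event cost described above is to compare the serialized size of an event with and without the config attached. This is only a sketch in plain Python, not Flink code: the event payload and the `wire_size` helper are made up for illustration, and JSON length merely stands in for whatever serializer the job actually uses.

```python
import json

config = {"config1": 1, "config2": 2, "config3": 3}
event = {"id": 42, "value": 10}  # hypothetical event payload


def wire_size(obj):
    # JSON length is only a stand-in for the job's real serializer.
    return len(json.dumps(obj).encode("utf-8"))


plain = wire_size(event)                           # option 1: event travels alone
enriched = wire_size({**event, "config": config})  # option 2: config rides along
overhead_ratio = enriched / plain                  # > 1 means extra bytes per event
```

With a sporadic config stream and a busy event stream, option 2 pays these extra bytes on every event, whereas option 1 pays the config's size only when a config update is broadcast.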

Regards,
Roman



Re: Broadcast state vs data enrichment

Posted by Manas Kale <ma...@gmail.com>.
Sure. Apologies for not making this clear enough.

> each operator only stores what it needs.
Let's imagine this setup:

BROADCAST STREAM
config-stream --------------------------------------------------------------------
                            |                           |                      |
event-stream----------> operator1------------------> operator2-------------> operator3


In this scenario, all 3 operators will be BroadcastProcessFunctions. Each
of them will receive the whole config message in their
processBroadcastElement method, but each one will only store what it needs
in their state store. So even though operator1 will receive
 config = {
"config1" : 1,
"config2" : 2,
"config3" : 3
}
it will only store config1.
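
The setup above can be modelled in a few lines of plain Python. This is a toy model, not Flink code: `KeepOwnKeyOperator` and its methods are hypothetical names that only mirror processBroadcastElement and the per-event processing method, and adding the setting to a `value` field is an invented stand-in for whatever each operator really does with its parameter.

```python
class KeepOwnKeyOperator:
    """Toy stand-in for one BroadcastProcessFunction (hypothetical, not Flink API)."""

    def __init__(self, wanted_key):
        self.wanted_key = wanted_key
        self.state = {}  # models this operator's stored config

    def process_broadcast_element(self, config):
        # The whole config message arrives, but only the needed entry is stored.
        if self.wanted_key in config:
            self.state[self.wanted_key] = config[self.wanted_key]

    def process_element(self, event):
        # The event carries no config; the setting is looked up locally.
        setting = self.state.get(self.wanted_key, 0)
        return {"value": event["value"] + setting}


config = {"config1": 1, "config2": 2, "config3": 3}
operators = [KeepOwnKeyOperator(k) for k in ("config1", "config2", "config3")]

for op in operators:  # the broadcast delivers the same message to every operator
    op.process_broadcast_element(config)

event = {"value": 10}
for op in operators:  # event-stream -> operator1 -> operator2 -> operator3
    event = op.process_element(event)
```

In this model the events passed between operators never grow; the config is paid for once per operator, each time a config message arrives.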

> each downstream operator will "strip off" the config parameter that it
> needs.

BROADCAST STREAM
config-stream -----------------
                              |
event-stream---------->  enricher --------------> operator1------------------> operator2-------------> operator3

In this case, the enricher operator will store the whole config message.
When an event message arrives, this operator will append config1, config2
and config3 to it. Operator 1 will extract and use config1, and output a
message that has config1 stripped off.
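
A matching toy model of the enricher variant, again in plain Python rather than Flink code. The `enrich` and `make_operator` helpers and the `value` arithmetic are hypothetical; the point is only to show how much config each event carries on each hop.

```python
def enrich(event, config):
    # The enricher attaches the full config to every event it forwards.
    return {**event, "config": dict(config)}


def make_operator(wanted_key):
    def operator(event):
        remaining = dict(event["config"])
        setting = remaining.pop(wanted_key)  # "strip off" this operator's parameter
        return {"value": event["value"] + setting, "config": remaining}
    return operator


config = {"config1": 1, "config2": 2, "config3": 3}
pipeline = [make_operator(k) for k in ("config1", "config2", "config3")]

event = enrich({"value": 10}, config)
sizes = [len(event["config"])]  # config entries carried on each hop
for operator in pipeline:
    event = operator(event)
    sizes.append(len(event["config"]))
```

Here `sizes` records the number of config entries travelling with the event after the enricher and after each operator, so the extra payload shrinks hop by hop but is paid again for every single event.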

I hope that helps!

Perhaps I am being too pedantic, but I would like to know whether these two
methods have comparable performance and, if not, which one would be
preferred.





Re: Broadcast state vs data enrichment

Posted by Khachatryan Roman <kh...@gmail.com>.
Hi Manas,

The approaches you described look the same:
> each operator only stores what it needs.
> each downstream operator will "strip off" the config parameter that it
> needs.

Can you please explain the difference?

Regards,
Roman

