Posted to user@flink.apache.org by Lasse Nedergaard <la...@gmail.com> on 2018/03/21 20:21:53 UTC

Out of memory when catching up

Hi. 

When our jobs are catching up they read at 10-20 times the normal rate, but then we lose our task managers with OOM. We could increase the memory allocation, but is there a way to figure out what rate we can consume with the current memory and slot allocation, and a way to limit the input to avoid OOM?

Med venlig hilsen / Best regards
Lasse Nedergaard


Re: Out of memory when catching up

Posted by Lasse Nedergaard <la...@gmail.com>.
Hi

For sure I can share more info. We run on Flink 1.4.2 (but have the same
problems on 1.3.2) on an AWS EMR cluster, with 6 task managers on each
m4.xlarge slave and the task manager heap set to 1850 MB. We use the
RocksDBStateBackend. We have set akka.ask.timeout to 60 s in case GC pauses
delay heartbeats, and yarn.maximum-failed-containers to 10000 to have some
buffer before we lose our YARN session.
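
For reference, the relevant flink-conf.yaml entries look roughly like this
(taskmanager.heap.mb is the Flink 1.4 key for the heap size; values as
described above):

    taskmanager.heap.mb: 1850
    state.backend: rocksdb
    akka.ask.timeout: 60 s
    yarn.maximum-failed-containers: 10000
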
One of our jobs reads data from Kinesis as a JSON string and maps it into an
object. Then we do some enrichment via a CoProcessFunction. If we can't
find the data in the co-process stream, we make a lookup through
AsyncDataStream. Then we merge the two streams so that we have one stream
where enrichment has taken place. We then parse the binary data, create new
objects, and output one main stream and four side-output streams; the
mapping is 1-to-1 in the number of objects.
For some of the side-output streams we do additional enrichment before all
five streams are written to Kinesis.
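
To make the topology concrete, here is a minimal sketch of the wiring in the
Flink 1.4 DataStream API. Event, Reference, Enriched, Parsed, the function
classes, the schemas, and the tag names are placeholders for our real code;
only the overall shape is meant to be accurate:

    // Kinesis source: JSON strings mapped 1-to-1 into Event objects.
    DataStream<Event> events = env
        .addSource(new FlinkKinesisConsumer<>("input", new SimpleStringSchema(), kinesisProps))
        .map(new JsonToEventMapper());

    // Enrichment via CoProcessFunction: matches are enriched from keyed
    // state, misses go to a side output for the async lookup below.
    final OutputTag<Event> missTag = new OutputTag<Event>("miss") {};
    SingleOutputStreamOperator<Enriched> fromState = events
        .connect(referenceStream)
        .keyBy(new EventKeySelector(), new ReferenceKeySelector())
        .process(new EnrichFromStateFunction(missTag));

    // Misses are enriched through an external lookup via AsyncDataStream.
    DataStream<Enriched> fromLookup = AsyncDataStream.unorderedWait(
        fromState.getSideOutput(missTag), new AsyncEnrichFunction(),
        30, TimeUnit.SECONDS, 100);

    // Merge the two enriched streams, parse the binary payload, and fan
    // out 1-to-1 into one main stream plus four side-output streams.
    final OutputTag<Parsed> sideTagA = new OutputTag<Parsed>("side-a") {};
    SingleOutputStreamOperator<Parsed> main = fromState.union(fromLookup)
        .process(new ParseAndSplitFunction(sideTagA /* plus three more tags */));

    // All five streams end up in Kinesis sinks.
    main.addSink(new FlinkKinesisProducer<>(mainSchema, kinesisProps));
    main.getSideOutput(sideTagA).addSink(new FlinkKinesisProducer<>(sideSchema, kinesisProps));
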
I have now implemented a maximum number of records read from Kinesis, and by
doing that I can avoid losing my task managers, but now I can't catch up as
fast as I would like. I have only seen back pressure once, and that was for
another job that uses iterations and never returned from that state.
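
In case it is useful to others: one way to cap the catch-up read rate with
the stock Kinesis connector is through the consumer's getRecords settings.
A sketch, with illustrative values (not necessarily what we run):

    Properties kinesisProps = new Properties();
    kinesisProps.setProperty(ConsumerConfigConstants.AWS_REGION, "eu-west-1");
    // Fetch fewer records per GetRecords call (default 10000) ...
    kinesisProps.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "1000");
    // ... and wait longer between calls per shard (default 200 ms).
    kinesisProps.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_INTERVAL_MILLIS, "1000");
    DataStream<String> raw = env.addSource(
        new FlinkKinesisConsumer<>("input", new SimpleStringSchema(), kinesisProps));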

So yes, we create objects. I guess we create around 10-20 objects for each
input object, and I would like to understand what's going on so I can make
an implementation that takes care of it.
But is there a way to configure Flink so it will spill to disk instead of
OOMing? I would prefer a slow system to a dead system.
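
One allocation-related knob that may help with the per-record object churn
mentioned above is object reuse, which stops Flink from defensively copying
records between chained operators. It is only safe if no user function
mutates or holds on to records after emitting them:

    // Only safe when user functions do not keep references to input records.
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.getConfig().enableObjectReuse();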

Please let me know if you need additional information or if something
doesn't make sense.

Lasse Nedergaard


2018-03-26 12:29 GMT+02:00 Timo Walther <tw...@apache.org>:

> Hi Lasse,
>
> in order to avoid OOM exceptions you should analyze your Flink job
> implementation. Are you creating a lot of objects within your Flink
> functions? Which state backend are you using? Maybe you can tell us a
> little bit more about your pipeline?
>
> Usually, there should be enough memory for the network buffers and state.
> Once the processing is not fast enough and the network buffers are filled
> up, the input is limited anyway, which results in back-pressure.
>
> Regards,
> Timo
>
>
> On 21.03.18 at 21:21, Lasse Nedergaard wrote:
>
>> Hi.
>>
>> When our jobs are catching up they read at 10-20 times the normal
>> rate, but then we lose our task managers with OOM. We could increase the
>> memory allocation, but is there a way to figure out what rate we can
>> consume with the current memory and slot allocation, and a way to limit
>> the input to avoid OOM?
>>
>> Med venlig hilsen / Best regards
>> Lasse Nedergaard
>>
>
>
>

Re: Out of memory when catching up

Posted by Timo Walther <tw...@apache.org>.
Hi Lasse,

in order to avoid OOM exceptions you should analyze your Flink job
implementation. Are you creating a lot of objects within your Flink
functions? Which state backend are you using? Maybe you can tell us a
little bit more about your pipeline?

Usually, there should be enough memory for the network buffers and
state. Once the processing is not fast enough and the network buffers
are filled up, the input is limited anyway, which results in back-pressure.
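
The network buffer budget that drives this back-pressure behaviour is
configured in flink-conf.yaml; roughly, with the Flink 1.4 keys and their
defaults:

    taskmanager.network.memory.fraction: 0.1   # share of JVM memory for network buffers
    taskmanager.network.memory.min: 67108864   # lower bound, 64 MB
    taskmanager.network.memory.max: 1073741824 # upper bound, 1 GB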

Regards,
Timo


On 21.03.18 at 21:21, Lasse Nedergaard wrote:
> Hi.
>
> When our jobs are catching up they read at 10-20 times the normal rate, but then we lose our task managers with OOM. We could increase the memory allocation, but is there a way to figure out what rate we can consume with the current memory and slot allocation, and a way to limit the input to avoid OOM?
>
> Med venlig hilsen / Best regards
> Lasse Nedergaard