Posted to user@flink.apache.org by Piotr Nowojski <pn...@apache.org> on 2021/02/01 14:53:16 UTC

Re: Flink and Amazon EMR

Hi,

Yes, it's working. You would need to analyse what's working slower than
expected. Checkpointing times? (Async duration? Sync duration? Start
delay/back pressure?) Throughput? Recovery/startup? Are you being rate
limited by Amazon?
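
One way to act on the questions above is to compare the per-subtask checkpoint timings and see which phase dominates. The following is only a minimal sketch: the `SubtaskCheckpointTimes` class, the `dominant_phase` helper, and all numbers are hypothetical and are not part of Flink. It assumes you have copied the start delay, sync duration, and async duration of each subtask from the checkpoint details in the Flink web UI.

```python
from dataclasses import dataclass


@dataclass
class SubtaskCheckpointTimes:
    """Per-subtask timings in milliseconds (hypothetical container, not a Flink class)."""
    start_delay_ms: int  # time until the first checkpoint barrier reached the subtask
    sync_ms: int         # synchronous part of the snapshot
    async_ms: int        # asynchronous part of the snapshot (e.g. uploading files)


def dominant_phase(subtasks):
    """Return (phase name, worst-case duration in ms) for the slowest phase."""
    worst = {
        "start_delay": max(s.start_delay_ms for s in subtasks),
        "sync": max(s.sync_ms for s in subtasks),
        "async": max(s.async_ms for s in subtasks),
    }
    phase = max(worst, key=worst.get)
    return phase, worst[phase]


# Made-up numbers, for illustration only.
samples = [
    SubtaskCheckpointTimes(start_delay_ms=540_000, sync_ms=120, async_ms=9_000),
    SubtaskCheckpointTimes(start_delay_ms=610_000, sync_ms=95, async_ms=11_500),
]
phase, worst_ms = dominant_phase(samples)
print(f"slowest phase: {phase} ({worst_ms} ms)")
```

As a rough reading of the result: a dominant start delay would point towards back pressure, while a dominant async duration would point towards the upload to the checkpoint storage.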

Piotrek

On Thu, 28 Jan 2021 at 03:46, Marco Villalobos <mv...@kineteque.com>
wrote:

> Just curious, has anybody had success with Amazon EMR with RocksDB and
> checkpointing in S3?
>
> That's the configuration I am trying to set up, but my system is running
> more slowly than expected.
>

Re: Flink and Amazon EMR

Posted by Piotr Nowojski <pn...@apache.org>.

Hi Marco,

>  Is this assumption correct?
Yes. More or less: each operator first creates a local copy of its state
and then uploads that whole file to S3 at once.

Please first take a look at which part of checkpointing is taking so long.

Re backpressure: keep in mind that Checkpoint Barriers need to travel
through the job graph. If your job is very heavily back pressured with low
record throughput, checkpoints might be timing out because the Checkpoint
Barriers cannot propagate through the job graph quickly enough.
For example, take a look at my response earlier today. [1]
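
A back-of-the-envelope estimate can make the effect concrete. The sketch below is a deliberately simplified model, not how Flink computes anything: it assumes that a barrier at each hop has to wait behind the records already queued in front of it, and that every hop drains records at the same rate. The function name, the hop count, the queue sizes, and the throughput are all hypothetical; only the 20-minute timeout comes from this thread.

```python
def barrier_travel_seconds(queued_records_per_hop, records_per_second):
    """Estimate how long a barrier needs to pass all hops.

    Simplified model: at each hop the barrier waits until the records
    queued in front of it have been processed.
    """
    if records_per_second <= 0:
        raise ValueError("records_per_second must be positive")
    return sum(queued / records_per_second for queued in queued_records_per_hop)


CHECKPOINT_TIMEOUT_S = 20 * 60  # the 20-minute timeout mentioned in this thread

# Hypothetical job: 4 hops, 50,000 records queued ahead of the barrier at
# each hop, and a back-pressured throughput of 100 records per second.
travel_s = barrier_travel_seconds([50_000] * 4, records_per_second=100)
print(f"estimated barrier travel time: {travel_s:.0f} s")
print(f"exceeds checkpoint timeout: {travel_s > CHECKPOINT_TIMEOUT_S}")
```

With these made-up numbers the barrier would need about 2000 seconds, longer than the 1200-second timeout, before any state has been uploaded at the last operator.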

Best,
Piotrek

[1]
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Checkpoint-Failures-from-Backpressure-possibly-due-to-overloaded-network-buffers-td41124.html

On Mon, 1 Feb 2021 at 17:16, Marco Villalobos <mv...@kineteque.com>
wrote:

> Thank you.
>
> Checkpoints time out often, even though the timeout limit is 20 minutes.
> The volume of records in our processing window that require checkpointing
> is large (between 200,000 and 2 million). I made the assumption that Flink
> would batch a blob of bytes to S3, and not create an S3 call per record. Is
> this assumption correct?
>
> I need to look into whether I am being rate-limited by Amazon. I assumed
> that a rate limiting error would have bubbled up as an error in the logs. I
> will find a way to ensure that error is logged or captured somehow.
>
> How would backpressure come into play during checkpointing? I would expect
> Amazon to have enough resources. When I turn my sink (the next operator)
> into a print, it fails during checkpointing as well.
>
> I will explore what you mentioned though. Thank you.

Re: Flink and Amazon EMR

Posted by Marco Villalobos <mv...@kineteque.com>.

Thank you.

Checkpoints time out often, even though the timeout limit is 20 minutes. The
volume of records in our processing window that require checkpointing is
large (between 200,000 and 2 million). I made the assumption that Flink
would batch a blob of bytes to S3, and not create an S3 call per record. Is
this assumption correct?

I need to look into whether I am being rate-limited by Amazon. I assumed
that a rate limiting error would have bubbled up as an error in the logs. I
will find a way to ensure that error is logged or captured somehow.

How would backpressure come into play during checkpointing? I would expect
Amazon to have enough resources. When I turn my sink (the next operator)
into a print, it fails during checkpointing as well.

I will explore what you mentioned though. Thank you.
