You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Josh <jo...@gmail.com> on 2016/08/26 11:20:55 UTC

Kinesis connector - Iterator expired exception

Hi all,

I guess this is probably a question for Gordon - I've been using the
Flink-Kinesis connector for a while now and seen this exception a couple of
times:

com.amazonaws.services.kinesis.model.ExpiredIteratorException:
Iterator expired. The iterator was created at time Fri Aug 26 10:47:47
UTC 2016 while right now it is Fri Aug 26 11:05:40 UTC 2016 which is
further in the future than the tolerated delay of 300000 milliseconds.
(Service: AmazonKinesis; Status Code: 400; Error Code:
ExpiredIteratorException; Request ID:
d3db1d90-df97-912b-83e1-3954e766bbe0)


It happens when my Flink job goes down for a couple of hours, then I
restart from the existing state and it needs to catch up on all the
data that has been put in Kinesis stream in the hours where the job
was down. The job then runs for ~15 mins and fails with this exception
(and this happens repeatedly - meaning I can't restore the job from
the existing state).


Any ideas what's causing this? It's possible that it's been fixed in
recent commits, as the version of the Kinesis connector I'm using is
behind master - I'm not sure exactly what commit I'm using (doh!) but
it was built around mid June.


Thanks,

Josh

Re: Kinesis connector - Iterator expired exception

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.

Hi Josh,

Thanks for the description. From your description and a check into the code, I’m suspecting what could be happening is that before the consumer caught up to the head of the stream, Kinesis was somehow returning the same shard iterator on consecutive fetch calls, and the consumer kept on using the same one until it eventually timed out.

This is actually suggesting the cause is due to Kinesis-side unexpected behaviour, so I probably need to run some long-running tests to clarify / reproduce this. The constant "15 minute" fail is suggesting this too, because the expire time for shard iterators is actually 5 minutes (from the Kinesis docs) …

Either way, it should be possible to handle this in the consumer so that it doesn’t fail on such situations. I’ve filed up a JIRA for this: https://issues.apache.org/jira/browse/FLINK-4514 .
I’ll get back to you after I figure out the root cause ;)

Regards,
Gordon


On August 26, 2016 at 10:43:02 PM, Josh (jofo90@gmail.com) wrote:

Hi Gordon,

My job only went down for around 2-3 hours, and I'm using the default Kinesis retention of 24 hours. When I restored the job, it got this exception after around 15 minutes (and then restarted again, and got the same exception 15 minutes later etc) - but actually I found that after this happened around 5 times the job fully caught up to the head of the stream and started running smoothly again.

Thanks for looking into this!

Best,
Josh


On Fri, Aug 26, 2016 at 1:57 PM, Tzu-Li (Gordon) Tai <tz...@apache.org> wrote:
Hi Josh,

Thank you for reporting this, I’m looking into it. There was some major changes to the Kinesis connector after mid June, but the changes don’t seem to be related to the iterator timeout, so it may be a bug that had always been there.

I’m not sure yet if it may be related, but may I ask how long was your Flink job down before restarting it again from the existing state? Was it longer than the retention duration of the Kinesis records (default is 24 hours)?

Regards,
Gordon


On August 26, 2016 at 7:20:59 PM, Josh (jofo90@gmail.com) wrote:

Hi all,

I guess this is probably a question for Gordon - I've been using the Flink-Kinesis connector for a while now and seen this exception a couple of times:

com.amazonaws.services.kinesis.model.ExpiredIteratorException: Iterator expired. The iterator was created at time Fri Aug 26 10:47:47 UTC 2016 while right now it is Fri Aug 26 11:05:40 UTC 2016 which is further in the future than the tolerated delay of 300000 milliseconds. (Service: AmazonKinesis; Status Code: 400; Error Code: ExpiredIteratorException; Request ID: d3db1d90-df97-912b-83e1-3954e766bbe0)

It happens when my Flink job goes down for a couple of hours, then I restart from the existing state and it needs to catch up on all the data that has been put in Kinesis stream in the hours where the job was down. The job then runs for ~15 mins and fails with this exception (and this happens repeatedly - meaning I can't restore the job from the existing state).

Any ideas what's causing this? It's possible that it's been fixed in recent commits, as the version of the Kinesis connector I'm using is behind master - I'm not sure exactly what commit I'm using (doh!) but it was built around mid June.

Thanks,
Josh

Re: Kinesis connector - Iterator expired exception

Posted by Josh <jo...@gmail.com>.

Hi Gordon,

My job only went down for around 2-3 hours, and I'm using the default
Kinesis retention of 24 hours. When I restored the job, it got this
exception after around 15 minutes (and then restarted again, and got the
same exception 15 minutes later etc) - but actually I found that after this
happened around 5 times the job fully caught up to the head of the stream
and started running smoothly again.

Thanks for looking into this!

Best,
Josh


On Fri, Aug 26, 2016 at 1:57 PM, Tzu-Li (Gordon) Tai <tz...@apache.org>
wrote:

> Hi Josh,
>
> Thank you for reporting this, I’m looking into it. There was some major
> changes to the Kinesis connector after mid June, but the changes don’t seem
> to be related to the iterator timeout, so it may be a bug that had always
> been there.
>
> I’m not sure yet if it may be related, but may I ask how long was your
> Flink job down before restarting it again from the existing state? Was it
> longer than the retention duration of the Kinesis records (default is 24
> hours)?
>
> Regards,
> Gordon
>
>
> On August 26, 2016 at 7:20:59 PM, Josh (jofo90@gmail.com) wrote:
>
> Hi all,
>
> I guess this is probably a question for Gordon - I've been using the
> Flink-Kinesis connector for a while now and seen this exception a couple of
> times:
>
> com.amazonaws.services.kinesis.model.ExpiredIteratorException: Iterator expired. The iterator was created at time Fri Aug 26 10:47:47 UTC 2016 while right now it is Fri Aug 26 11:05:40 UTC 2016 which is further in the future than the tolerated delay of 300000 milliseconds. (Service: AmazonKinesis; Status Code: 400; Error Code: ExpiredIteratorException; Request ID: d3db1d90-df97-912b-83e1-3954e766bbe0)
>
>
> It happens when my Flink job goes down for a couple of hours, then I restart from the existing state and it needs to catch up on all the data that has been put in Kinesis stream in the hours where the job was down. The job then runs for ~15 mins and fails with this exception (and this happens repeatedly - meaning I can't restore the job from the existing state).
>
>
> Any ideas what's causing this? It's possible that it's been fixed in recent commits, as the version of the Kinesis connector I'm using is behind master - I'm not sure exactly what commit I'm using (doh!) but it was built around mid June.
>
>
> Thanks,
>
> Josh
>
>

Re: Kinesis connector - Iterator expired exception

Posted by "Tzu-Li (Gordon) Tai" <tz...@apache.org>.

Hi Josh,

Thank you for reporting this, I’m looking into it. There was some major changes to the Kinesis connector after mid June, but the changes don’t seem to be related to the iterator timeout, so it may be a bug that had always been there.

I’m not sure yet if it may be related, but may I ask how long was your Flink job down before restarting it again from the existing state? Was it longer than the retention duration of the Kinesis records (default is 24 hours)?

Regards,
Gordon

On August 26, 2016 at 7:20:59 PM, Josh (jofo90@gmail.com) wrote:

Hi all,

I guess this is probably a question for Gordon - I've been using the Flink-Kinesis connector for a while now and seen this exception a couple of times:

com.amazonaws.services.kinesis.model.ExpiredIteratorException: Iterator expired. The iterator was created at time Fri Aug 26 10:47:47 UTC 2016 while right now it is Fri Aug 26 11:05:40 UTC 2016 which is further in the future than the tolerated delay of 300000 milliseconds. (Service: AmazonKinesis; Status Code: 400; Error Code: ExpiredIteratorException; Request ID: d3db1d90-df97-912b-83e1-3954e766bbe0)

It happens when my Flink job goes down for a couple of hours, then I restart from the existing state and it needs to catch up on all the data that has been put in Kinesis stream in the hours where the job was down. The job then runs for ~15 mins and fails with this exception (and this happens repeatedly - meaning I can't restore the job from the existing state).

Any ideas what's causing this? It's possible that it's been fixed in recent commits, as the version of the Kinesis connector I'm using is behind master - I'm not sure exactly what commit I'm using (doh!) but it was built around mid June.

Thanks,
Josh