You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@flink.apache.org by Seth Wiesman <sw...@mediamath.com> on 2018/03/09 16:53:20 UTC

PartitionNotFoundException when restarting from checkpoint

Hi,

We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks dB and incremental checkpointing, last night a job failed and became stuck in a restart cycle with a PartitionNotFound. We tried restarting the checkpoint on a fresh Flink session with no luck. Looking through the logs we can see that the specified partition is never registered with the ResultPartitionManager.

My questions are:

1)       Are partitions a part of state or are the ephemeral to the job

2)       If they are not part of state, where would the task managers be getting that partition id to begin with

3)       Right now we are logging everything under org.apache.flink.runtime.io.network, is there anywhere else to look

Thank you,

[cid:image001.png@01D3B79D.36E45B00]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

Re: PartitionNotFoundException when restarting from checkpoint

Posted by Seth Wiesman <sw...@mediamath.com>.

Yes that exception. Attached are logs.

[cid:image001.png@01D3BC59.48A91C70]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

From: Stephan Ewen <se...@apache.org>
Date: Thursday, March 15, 2018 at 11:21 AM
To: Seth Wiesman <sw...@mediamath.com>
Cc: Fabian Hueske <fh...@gmail.com>, Stefan Richter <s....@data-artisans.com>, "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Just to double check: We are talking about a Flink PartitionNotFoundException<https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/PartitionNotFoundException.java>, I assume?

The split brain situation is a good hint - the minority partition should stop its work, though, and the TaskManager should cleanly re-join the majority once the split brain is resolved.

This sounds almost like the TaskManagers rejoining where still assuming a stale version of the ExecutionGraph (ExecutionGraph should have new IDs when recovery of the lost nodes into majority partition happened).

Logs would definitely be helpful to debug this further...

On Wed, Mar 14, 2018 at 3:25 PM, Seth Wiesman <sw...@mediamath.com>> wrote:
Hit send too soon.

Having spent some more time with this, it appears that zookeeper being in a bad state was unable to track a downed kafka broker. This investigation has been very much trial and error up to this point please let me know if I seem way off base ☺

[cid:image002.png@01D3BC59.48A91C70]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

From: Seth Wiesman <sw...@mediamath.com>>
Date: Wednesday, March 14, 2018 at 10:14 AM
To: Fabian Hueske <fh...@gmail.com>>, Stefan Richter <s....@data-artisans.com>>

Cc: "user@flink.apache.org<ma...@flink.apache.org>" <us...@flink.apache.org>>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Unfortunately the stack trace was swallowed by the java timer in the LocalInputChannel[1], the real error is forwarded out to the main thread but I couldn’t figure out how to see that in my logs.

However, I believe I am close to having a reproducible example. Run a 1.4 DataStream, sinking to kafka 0.11 and cancel with a savepoint. If you then shut down the kafka daemon on a single broker but keep the rest proxy up you should see this error when you resume.

[1] https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1d4a03f555e/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/LocalInputChannel.java#L151

[cid:image003.png@01D3BC59.48A91C70]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

From: Fabian Hueske <fh...@gmail.com>>
Date: Tuesday, March 13, 2018 at 8:02 PM
To: Seth Wiesman <sw...@mediamath.com>>, Stefan Richter <s....@data-artisans.com>>
Cc: "user@flink.apache.org<ma...@flink.apache.org>" <us...@flink.apache.org>>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Hi Seth,
Thanks for sharing how you resolved the problem!
The problem might have been related to Flink's key groups which are used to assign key ranges to tasks.
Not sure why this would be related to ZooKeeper being in a bad state. Maybe Stefan (in CC) has an idea about the cause.
Also, it would be helpful if you could share the stacktrace of the exception (in case you still have it).
Best, Fabian

2018-03-13 14:35 GMT+01:00 Seth Wiesman <sw...@mediamath.com>>:
It turns out the issue was due to our zookeeper installation being in a bad state. I am not clear enough on flink’s networking internals to explain how this manifested as a partition not found exception, but hopefully this can serve as a starting point for other’s who run into the same issue.

[cid:image004.png@01D3BC59.48A91C70]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

From: Seth Wiesman <sw...@mediamath.com>>
Date: Friday, March 9, 2018 at 11:53 AM
To: "user@flink.apache.org<ma...@flink.apache.org>" <us...@flink.apache.org>>
Subject: PartitionNotFoundException when restarting from checkpoint

Hi,

We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks dB and incremental checkpointing, last night a job failed and became stuck in a restart cycle with a PartitionNotFound. We tried restarting the checkpoint on a fresh Flink session with no luck. Looking through the logs we can see that the specified partition is never registered with the ResultPartitionManager.

My questions are:

1)      Are partitions a part of state or are the ephemeral to the job

2)      If they are not part of state, where would the task managers be getting that partition id to begin with

3)      Right now we are logging everything under org.apache.flink.runtime.io<http://org.apache.flink.runtime.io>.network, is there anywhere else to look

Thank you,

[cid:image005.png@01D3BC59.48A91C70]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

Re: PartitionNotFoundException when restarting from checkpoint

Posted by Stephan Ewen <se...@apache.org>.

Just to double check: We are talking about a Flink
PartitionNotFoundException
<https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/PartitionNotFoundException.java>,
I assume?

The split brain situation is a good hint - the minority partition should
stop its work, though, and the TaskManager should cleanly re-join the
majority once the split brain is resolved.

This sounds almost like the TaskManagers rejoining where still assuming a
stale version of the ExecutionGraph (ExecutionGraph should have new IDs
when recovery of the lost nodes into majority partition happened).

Logs would definitely be helpful to debug this further...


On Wed, Mar 14, 2018 at 3:25 PM, Seth Wiesman <sw...@mediamath.com>
wrote:

> Hit send too soon.
>
>
>
> Having spent some more time with this, it appears that zookeeper being in
> a bad state was unable to track a downed kafka broker. This investigation
> has been very much trial and error up to this point please let me know if I
> seem way off base ☺
>
>
>
> *Seth Wiesman*| Software Engineer 4 World Trade Center, 46th Floor, New
> York, NY 10007 swiesman@mediamath.com <fl...@mediamath.com>
>
>
>
>
>
> *From: *Seth Wiesman <sw...@mediamath.com>
> *Date: *Wednesday, March 14, 2018 at 10:14 AM
> *To: *Fabian Hueske <fh...@gmail.com>, Stefan Richter <
> s.richter@data-artisans.com>
>
> *Cc: *"user@flink.apache.org" <us...@flink.apache.org>
> *Subject: *Re: PartitionNotFoundException when restarting from checkpoint
>
>
>
> Unfortunately the stack trace was swallowed by the java timer in the
> LocalInputChannel[1], the real error is forwarded out to the main thread
> but I couldn’t figure out how to see that in my logs.
>
>
>
> However, I believe I am close to having a reproducible example. Run a 1.4
> DataStream, sinking to kafka 0.11 and cancel with a savepoint. If you then
> shut down the kafka daemon on a single broker but keep the rest proxy up
> you should see this error when you resume.
>
>
>
> [1] https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1
> d4a03f555e/flink-runtime/src/main/java/org/apache/flink/
> runtime/io/network/partition/consumer/LocalInputChannel.java#L151
>
>
>
> *Seth Wiesman*| Software Engineer 4 World Trade Center, 46th Floor, New
> York, NY 10007 swiesman@mediamath.com <fl...@mediamath.com>
>
>
>
>
>
> *From: *Fabian Hueske <fh...@gmail.com>
> *Date: *Tuesday, March 13, 2018 at 8:02 PM
> *To: *Seth Wiesman <sw...@mediamath.com>, Stefan Richter <
> s.richter@data-artisans.com>
> *Cc: *"user@flink.apache.org" <us...@flink.apache.org>
> *Subject: *Re: PartitionNotFoundException when restarting from checkpoint
>
>
>
> Hi Seth,
>
> Thanks for sharing how you resolved the problem!
>
> The problem might have been related to Flink's key groups which are used
> to assign key ranges to tasks.
>
> Not sure why this would be related to ZooKeeper being in a bad state.
> Maybe Stefan (in CC) has an idea about the cause.
>
> Also, it would be helpful if you could share the stacktrace of the
> exception (in case you still have it).
>
> Best, Fabian
>
>
>
> 2018-03-13 14:35 GMT+01:00 Seth Wiesman <sw...@mediamath.com>:
>
> It turns out the issue was due to our zookeeper installation being in a
> bad state. I am not clear enough on flink’s networking internals to explain
> how this manifested as a partition not found exception, but hopefully this
> can serve as a starting point for other’s who run into the same issue.
>
>
>
> *Seth Wiesman*| Software Engineer 4 World Trade Center, 46th Floor, New
> York, NY 10007 swiesman@mediamath.com <fl...@mediamath.com>
>
>
>
>
>
> *From: *Seth Wiesman <sw...@mediamath.com>
> *Date: *Friday, March 9, 2018 at 11:53 AM
> *To: *"user@flink.apache.org" <us...@flink.apache.org>
> *Subject: *PartitionNotFoundException when restarting from checkpoint
>
>
>
> Hi,
>
>
>
> We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks
> dB and incremental checkpointing, last night a job failed and became stuck
> in a restart cycle with a PartitionNotFound. We tried restarting the
> checkpoint on a fresh Flink session with no luck. Looking through the logs
> we can see that the specified partition is never registered with the
> ResultPartitionManager.
>
>
>
> My questions are:
>
> 1)      Are partitions a part of state or are the ephemeral to the job
>
> 2)      If they are not part of state, where would the task managers be
> getting that partition id to begin with
>
> 3)      Right now we are logging everything under
> org.apache.flink.runtime.io.network, is there anywhere else to look
>
>
>
> Thank you,
>
>
>
> *Seth Wiesman*| Software Engineer 4 World Trade Center, 46th Floor, New
> York, NY 10007 swiesman@mediamath.com <fl...@mediamath.com>
>
>
>
>
>

Re: PartitionNotFoundException when restarting from checkpoint

Posted by Seth Wiesman <sw...@mediamath.com>.

Hit send too soon.

Having spent some more time with this, it appears that zookeeper being in a bad state was unable to track a downed kafka broker. This investigation has been very much trial and error up to this point please let me know if I seem way off base ☺

[cid:image001.png@01D3BB7E.BAFAAC20]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>



From: Seth Wiesman <sw...@mediamath.com>
Date: Wednesday, March 14, 2018 at 10:14 AM
To: Fabian Hueske <fh...@gmail.com>, Stefan Richter <s....@data-artisans.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Unfortunately the stack trace was swallowed by the java timer in the LocalInputChannel[1], the real error is forwarded out to the main thread but I couldn’t figure out how to see that in my logs.

However, I believe I am close to having a reproducible example. Run a 1.4 DataStream, sinking to kafka 0.11 and cancel with a savepoint. If you then shut down the kafka daemon on a single broker but keep the rest proxy up you should see this error when you resume.

[1] https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1d4a03f555e/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/LocalInputChannel.java#L151

[cid:image002.png@01D3BB7E.BAFAAC20]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>



From: Fabian Hueske <fh...@gmail.com>
Date: Tuesday, March 13, 2018 at 8:02 PM
To: Seth Wiesman <sw...@mediamath.com>, Stefan Richter <s....@data-artisans.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Hi Seth,
Thanks for sharing how you resolved the problem!
The problem might have been related to Flink's key groups which are used to assign key ranges to tasks.
Not sure why this would be related to ZooKeeper being in a bad state. Maybe Stefan (in CC) has an idea about the cause.
Also, it would be helpful if you could share the stacktrace of the exception (in case you still have it).
Best, Fabian

2018-03-13 14:35 GMT+01:00 Seth Wiesman <sw...@mediamath.com>>:
It turns out the issue was due to our zookeeper installation being in a bad state. I am not clear enough on flink’s networking internals to explain how this manifested as a partition not found exception, but hopefully this can serve as a starting point for other’s who run into the same issue.

[cid:image003.png@01D3BB7E.BAFAAC20]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>



From: Seth Wiesman <sw...@mediamath.com>>
Date: Friday, March 9, 2018 at 11:53 AM
To: "user@flink.apache.org<ma...@flink.apache.org>" <us...@flink.apache.org>>
Subject: PartitionNotFoundException when restarting from checkpoint

Hi,

We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks dB and incremental checkpointing, last night a job failed and became stuck in a restart cycle with a PartitionNotFound. We tried restarting the checkpoint on a fresh Flink session with no luck. Looking through the logs we can see that the specified partition is never registered with the ResultPartitionManager.

My questions are:

1)      Are partitions a part of state or are the ephemeral to the job

2)      If they are not part of state, where would the task managers be getting that partition id to begin with

3)      Right now we are logging everything under org.apache.flink.runtime.io<http://org.apache.flink.runtime.io>.network, is there anywhere else to look

Thank you,

[cid:image004.png@01D3BB7E.BAFAAC20]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

Re: PartitionNotFoundException when restarting from checkpoint

Posted by Seth Wiesman <sw...@mediamath.com>.

Unfortunately the stack trace was swallowed by the java timer in the LocalInputChannel[1], the real error is forwarded out to the main thread but I couldn’t figure out how to see that in my logs.

However, I believe I am close to having a reproducible example. Run a 1.4 DataStream, sinking to kafka 0.11 and cancel with a savepoint. If you then shut down the kafka daemon on a single broker but keep the rest proxy up you should see this error when you resume.

[1] https://github.com/apache/flink/blob/fa024726bb801fc71cec5cc303cac1d4a03f555e/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/LocalInputChannel.java#L151

[cid:image001.png@01D3BB7D.472CF0B0]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>



From: Fabian Hueske <fh...@gmail.com>
Date: Tuesday, March 13, 2018 at 8:02 PM
To: Seth Wiesman <sw...@mediamath.com>, Stefan Richter <s....@data-artisans.com>
Cc: "user@flink.apache.org" <us...@flink.apache.org>
Subject: Re: PartitionNotFoundException when restarting from checkpoint

Hi Seth,
Thanks for sharing how you resolved the problem!
The problem might have been related to Flink's key groups which are used to assign key ranges to tasks.
Not sure why this would be related to ZooKeeper being in a bad state. Maybe Stefan (in CC) has an idea about the cause.
Also, it would be helpful if you could share the stacktrace of the exception (in case you still have it).
Best, Fabian

2018-03-13 14:35 GMT+01:00 Seth Wiesman <sw...@mediamath.com>>:
It turns out the issue was due to our zookeeper installation being in a bad state. I am not clear enough on flink’s networking internals to explain how this manifested as a partition not found exception, but hopefully this can serve as a starting point for other’s who run into the same issue.

[cid:image002.png@01D3BB7D.472CF0B0]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>



From: Seth Wiesman <sw...@mediamath.com>>
Date: Friday, March 9, 2018 at 11:53 AM
To: "user@flink.apache.org<ma...@flink.apache.org>" <us...@flink.apache.org>>
Subject: PartitionNotFoundException when restarting from checkpoint

Hi,

We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks dB and incremental checkpointing, last night a job failed and became stuck in a restart cycle with a PartitionNotFound. We tried restarting the checkpoint on a fresh Flink session with no luck. Looking through the logs we can see that the specified partition is never registered with the ResultPartitionManager.

My questions are:

1)      Are partitions a part of state or are the ephemeral to the job

2)      If they are not part of state, where would the task managers be getting that partition id to begin with

3)      Right now we are logging everything under org.apache.flink.runtime.io<http://org.apache.flink.runtime.io>.network, is there anywhere else to look

Thank you,

[cid:image003.png@01D3BB7D.472CF0B0]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

Re: PartitionNotFoundException when restarting from checkpoint

Posted by Fabian Hueske <fh...@gmail.com>.

Hi Seth,

Thanks for sharing how you resolved the problem!

The problem might have been related to Flink's key groups which are used to
assign key ranges to tasks.
Not sure why this would be related to ZooKeeper being in a bad state. Maybe
Stefan (in CC) has an idea about the cause.

Also, it would be helpful if you could share the stacktrace of the
exception (in case you still have it).

Best, Fabian

2018-03-13 14:35 GMT+01:00 Seth Wiesman <sw...@mediamath.com>:

> It turns out the issue was due to our zookeeper installation being in a
> bad state. I am not clear enough on flink’s networking internals to explain
> how this manifested as a partition not found exception, but hopefully this
> can serve as a starting point for other’s who run into the same issue.
>
>
>
> *Seth Wiesman*| Software Engineer 4 World Trade Center, 46th Floor, New
> York, NY 10007 swiesman@mediamath.com <fl...@mediamath.com>
>
>
>
>
>
> *From: *Seth Wiesman <sw...@mediamath.com>
> *Date: *Friday, March 9, 2018 at 11:53 AM
> *To: *"user@flink.apache.org" <us...@flink.apache.org>
> *Subject: *PartitionNotFoundException when restarting from checkpoint
>
>
>
> Hi,
>
>
>
> We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks
> dB and incremental checkpointing, last night a job failed and became stuck
> in a restart cycle with a PartitionNotFound. We tried restarting the
> checkpoint on a fresh Flink session with no luck. Looking through the logs
> we can see that the specified partition is never registered with the
> ResultPartitionManager.
>
>
>
> My questions are:
>
> 1)      Are partitions a part of state or are the ephemeral to the job
>
> 2)      If they are not part of state, where would the task managers be
> getting that partition id to begin with
>
> 3)      Right now we are logging everything under
> org.apache.flink.runtime.io.network, is there anywhere else to look
>
>
>
> Thank you,
>
>
>
> *Seth Wiesman*| Software Engineer 4 World Trade Center, 46th Floor, New
> York, NY 10007 swiesman@mediamath.com <fl...@mediamath.com>
>
>
>

Re: PartitionNotFoundException when restarting from checkpoint

Posted by Seth Wiesman <sw...@mediamath.com>.

It turns out the issue was due to our zookeeper installation being in a bad state. I am not clear enough on flink’s networking internals to explain how this manifested as a partition not found exception, but hopefully this can serve as a starting point for other’s who run into the same issue.

[cid:image001.png@01D3BAAE.915F15C0]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>

From: Seth Wiesman <sw...@mediamath.com>
Date: Friday, March 9, 2018 at 11:53 AM
To: "user@flink.apache.org" <us...@flink.apache.org>
Subject: PartitionNotFoundException when restarting from checkpoint

Hi,

We are running Flink 1.4.0 with a yarn deployment on ec2 instances, rocks dB and incremental checkpointing, last night a job failed and became stuck in a restart cycle with a PartitionNotFound. We tried restarting the checkpoint on a fresh Flink session with no luck. Looking through the logs we can see that the specified partition is never registered with the ResultPartitionManager.

My questions are:

1)      Are partitions a part of state or are the ephemeral to the job

2)      If they are not part of state, where would the task managers be getting that partition id to begin with

3)      Right now we are logging everything under org.apache.flink.runtime.io.network, is there anywhere else to look

Thank you,

[cid:image002.png@01D3BAAE.915F15C0]

Seth Wiesman| Software Engineer 4 World Trade Center, 46th Floor, New York, NY 10007 swiesman@mediamath.com<ma...@mediamath.com>