Posted to user@flink.apache.org by Richard Deurwaarder <ri...@xeli.eu> on 2019/07/17 17:49:43 UTC

Flink Zookeeper HA: FileNotFoundException blob - Jobmanager not starting up

Hello,

I've got a problem with our Flink cluster: the JobManager is not
starting up anymore, because it tries to download a non-existent (blob) file
from the ZooKeeper HA storage dir.

We're running Flink 1.8.0 on a Kubernetes cluster and use the Google Cloud
Storage connector [1] to store checkpoints, savepoints and ZooKeeper data.

When I noticed the JobManager was having problems, it was in a crash loop
throwing FileNotFoundExceptions [2]:
Caused by: java.io.FileNotFoundException: Item not found:
some-project-flink-state/recovery/hunch/blob/job_e6ad857af7f09b56594e95fe273e9eff/blob_p-486d68fa98fa05665f341d79302c40566b81034e-306d493f5aa810b5f4f7d8d63f5b18b5.
If you enabled STRICT generation consistency, it is possible that the live
version is still available but the intended generation is deleted.

I looked in the blob directory and can only find
/recovery/hunch/blob/job_1dccee15d84e1d2cededf89758ac2482. I've tried to
poke around in ZooKeeper to see if I could find anything [3], but I do
not really know what to look for.

How could this have happened and how should I recover the job from this
situation?

Thanks,

Richard

[1]
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/connectors.html#using-hadoop-file-system-implementations
[2] https://gist.github.com/Xeli/0321031655e47006f00d38fc4bc08e16
[3] https://gist.github.com/Xeli/04f6d861c5478071521ac6d2c582832a
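
For context, the setup described above roughly corresponds to an HA configuration like the following sketch (the quorum hosts are placeholders, and the bucket paths are inferred from the error message, not our exact config):

```yaml
# flink-conf.yaml (Flink 1.8) - ZooKeeper HA with a GCS storage dir
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: /hunch
# Blobs, submitted job graphs and checkpoint metadata live under this dir;
# the FileNotFoundException above points at a blob below it.
high-availability.storageDir: gs://some-project-flink-state/recovery
state.backend: filesystem
state.checkpoints.dir: gs://some-project-flink-state/checkpoints
state.savepoints.dir: gs://some-project-flink-state/savepoints
```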

Re: Flink Zookeeper HA: FileNotFoundException blob - Jobmanager not starting up

Posted by Till Rohrmann <tr...@apache.org>.
Hi Richard,

it looks as if the zNode of a completed job has not been properly removed.
Without the logs of the respective JobMaster, it is hard to debug any
further. However, I suspect that this is an instance of FLINK-11665 [1]. I am
currently working on a fix for it.

[1] https://issues.apache.org/jira/browse/FLINK-11665

Cheers,
Till


Re: Flink Zookeeper HA: FileNotFoundException blob - Jobmanager not starting up

Posted by Fabian Hueske <fh...@gmail.com>.
Good to know that you were able to fix the issue!

I definitely agree that it would be good to know why this situation
occurred.


Re: Flink Zookeeper HA: FileNotFoundException blob - Jobmanager not starting up

Posted by Richard Deurwaarder <ri...@xeli.eu>.
Hi Fabian,

I followed the advice of another Flink user who mailed me directly; he had
the same problem and told me to use something like
rmr /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482
in the ZooKeeper CLI, which allowed us to start the job again.
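
For anyone hitting the same thing, the recovery roughly looks like this in the ZooKeeper CLI (a sketch: the quorum host is a placeholder, and the paths assume the default high-availability.zookeeper.path.root of /flink with cluster id hunch). Stop the JobManager first, and only remove entries for jobs you do not need recovered:

```
# Connect to the ZooKeeper quorum (host is a placeholder)
bin/zkCli.sh -server zk-1:2181

# List the job graphs Flink still wants to recover on startup
ls /flink/hunch/jobgraphs

# Remove the entry that can no longer be recovered
# (ZooKeeper 3.4 syntax; newer CLIs call this "deleteall")
rmr /flink/hunch/jobgraphs/1dccee15d84e1d2cededf89758ac2482
```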

It would be good to investigate what went wrong, as it didn't feel good to
have our production cluster crippled like this.

Richard


Re: Flink Zookeeper HA: FileNotFoundException blob - Jobmanager not starting up

Posted by Fabian Hueske <fh...@gmail.com>.
Hi Richard,

I hope you could resolve the problem in the meantime.

Nonetheless, maybe Till (in CC) has an idea what could have gone wrong.

Best, Fabian
