Posted to user@flink.apache.org by Abhishek Rai <ab...@netspring.io> on 2021/08/16 21:16:41 UTC

Flink taskmanager in crash loop

Hello,

In our production environment we run Flink 1.13 (Scala 2.11). It had been
working without issues for a while with a dozen or so jobs, but the
taskmanager has now started crash looping, roughly once every 4 minutes.
The stack trace (below) is not very informative, so I am reaching out for
help.

The only other unusual thing is that, possibly because of an issue in our
own product (custom job code running on Flink), some or all of our tasks
are also in a crash loop.  Still, I wasn't expecting the taskmanager itself
to die.  Does the taskmanager have a built-in feature that makes it crash
if all or most tasks are crashing?

2021-08-16 15:58:23.984 [main] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Terminating TaskManagerRunner with exit code 1.
org.apache.flink.util.FlinkException: Unexpected failure during runtime of TaskManagerRunner.
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManager(TaskManagerRunner.java:382)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.lambda$runTaskManagerProcessSecurely$3(TaskManagerRunner.java:413)
  at java.base/java.security.AccessController.doPrivileged(Native Method)
  at java.base/javax.security.auth.Subject.doAs(Unknown Source)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
  at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:413)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.runTaskManagerProcessSecurely(TaskManagerRunner.java:396)
  at org.apache.flink.runtime.taskexecutor.TaskManagerRunner.main(TaskManagerRunner.java:354)
Caused by: java.util.concurrent.TimeoutException: null
  at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:1255)
  at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:217)
  at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$15(FutureUtils.java:582)
  at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
  at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
  at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.base/java.lang.Thread.run(Unknown Source)
2021-08-16 15:58:23.986 [TaskExecutorLocalStateStoresManager shutdown hook] INFO  o.a.flink.runtime.state.TaskExecutorLocalStateStoresManager  - Shutting down TaskExecutorLocalStateStoresManager.


Thanks very much!

Abhishek

Re: Flink taskmanager in crash loop

Posted by Yangze Guo <ka...@gmail.com>.
> 2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.

It seems the task 'MASKED' could not be terminated within the timeout. I
think this is the root cause of the TaskManager's termination. We
need to find out why task 'MASKED' was canceled. Can you provide some
logs related to it? You could search for "CANCELING" in the JM and TM
logs.
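
As a rough illustration of that search, the Python sketch below scans log
lines for the keyword "CANCELING" and for the cancellation messages quoted in
this thread. The function names and the `*.log` directory layout are
assumptions made for the example; they are not part of Flink.

```python
from pathlib import Path

# Substrings taken from the log excerpts in this thread, plus the
# "CANCELING" keyword suggested above.
PATTERNS = [
    "CANCELING",
    "did not react to cancelling signal",
    "Task did not exit gracefully",
    "Fatal error occurred while executing the TaskManager",
]


def find_cancellation_lines(lines):
    """Return (line_number, text) pairs for lines that mention cancellation."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(pattern in line for pattern in PATTERNS):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log_dir(log_dir):
    """Scan every *.log file directly under log_dir (hypothetical layout)."""
    results = {}
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            hits = find_cancellation_lines(handle)
        if hits:
            results[path.name] = hits
    return results
```

Calling `scan_log_dir` on the directory that holds the JM and TM logs returns,
per file, the matching lines with their line numbers, which makes it easier to
see what happened just before the first match.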

Best,
Yangze Guo

On Wed, Aug 18, 2021 at 1:20 AM Abhishek Rai <ab...@netspring.io> wrote:
>
> Before these messages, there is the following message in the log:
>
> 2021-08-12 23:02:58.015 [Canceler/Interrupts for Source: MASKED]) (1/1)#29103' did not react to cancelling signal for 30 seconds, but is stuck in method:
>  java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11.0.11/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
> java.base@11.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take(TaskMailboxImpl.java:149)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:341)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
> app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
> app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.base@11.0.11/java.lang.Thread.run(Unknown Source)

Re: Flink taskmanager in crash loop

Posted by Abhishek Rai <ab...@netspring.io>.
Before these messages, there is the following message in the log:

2021-08-12 23:02:58.015 [Canceler/Interrupts for Source: MASKED]) (1/1)#29103' did not react to cancelling signal for 30 seconds, but is stuck in method:
 java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.11/java.util.concurrent.locks.LockSupport.parkNanos(Unknown Source)
java.base@11.0.11/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.mailbox.TaskMailboxImpl.take(TaskMailboxImpl.java:149)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsWhenDefaultActionUnavailable(MailboxProcessor.java:341)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:330)
app//org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:661)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:623)
app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.base@11.0.11/java.lang.Thread.run(Unknown Source)
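
When comparing several of these "stuck in method" excerpts, it can help to
look past the JDK parking frames (prefixed `java.base@...`) to the first
frame prefixed `app//`, which is where Flink or user code is actually
waiting. The helper below is a small sketch of that idea; `first_app_frame`
is a hypothetical name, and the input is assumed to be the frame lines
exactly as they appear in the log, optionally with mail-quote markers.

```python
def first_app_frame(stack_lines):
    """Return the first frame that starts with "app//", or None.

    Leading whitespace and mail-quote markers ("> ") are ignored, so frames
    copied from a quoted reply are handled too.
    """
    for line in stack_lines:
        frame = line.strip().lstrip("> ").strip()
        if frame.startswith("app//"):
            return frame[len("app//"):]
    return None
```

For the excerpt above this yields the `TaskMailboxImpl.take` frame, i.e. the
task thread is parked waiting for mail, whereas the 2021-08-16 excerpt
elsewhere in this thread is parked in `StreamTask.cleanUpInvoke`.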

On Tue, Aug 17, 2021 at 9:22 AM Abhishek Rai <ab...@netspring.io> wrote:

> Thanks Yangze, indeed, I see the following in the log about 10s before the
> final crash (masked some sensitive data using `MASKED`):
>
> 2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN org.apache.flink.runtime.taskmanager.Task  - Task 'MASKED' did not react to cancelling signal for 30 seconds, but is stuck in method:
>  java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
> java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
> java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
> java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
> app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
> app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
> app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
> app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
> java.base@11.0.11/java.lang.Thread.run(Unknown Source)
>
> 2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
>   at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
>   at java.base/java.lang.Thread.run(Unknown Source)

Re: Flink taskmanager in crash loop

Posted by Abhishek Rai <ab...@netspring.io>.
Thanks Yangze, indeed, I see the following in the log about 10s before the
final crash (masked some sensitive data using `MASKED`):

2021-08-16 15:58:13.985 [Canceler/Interrupts for Source: MASKED] WARN org.apache.flink.runtime.taskmanager.Task  - Task 'MASKED' did not react to cancelling signal for 30 seconds, but is stuck in method:
 java.base@11.0.11/jdk.internal.misc.Unsafe.park(Native Method)
java.base@11.0.11/java.util.concurrent.locks.LockSupport.park(Unknown Source)
java.base@11.0.11/java.util.concurrent.CompletableFuture$Signaller.block(Unknown Source)
java.base@11.0.11/java.util.concurrent.ForkJoinPool.managedBlock(Unknown Source)
java.base@11.0.11/java.util.concurrent.CompletableFuture.waitingGet(Unknown Source)
java.base@11.0.11/java.util.concurrent.CompletableFuture.join(Unknown Source)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.cleanUpInvoke(StreamTask.java:705)
app//org.apache.flink.streaming.runtime.tasks.SourceStreamTask.cleanUpInvoke(SourceStreamTask.java:186)
app//org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:637)
app//org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:776)
app//org.apache.flink.runtime.taskmanager.Task.run(Task.java:563)
java.base@11.0.11/java.lang.Thread.run(Unknown Source)

2021-08-16 15:58:13.986 [Cancellation Watchdog for Source: MASKED] ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - Fatal error occurred while executing the TaskManager. Shutting it down...
org.apache.flink.util.FlinkRuntimeException: Task did not exit gracefully within 180 + seconds.
  at org.apache.flink.runtime.taskmanager.Task$TaskCancelerWatchDog.run(Task.java:1718)
  at java.base/java.lang.Thread.run(Unknown Source)
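
To put the timestamps in context, here is a small worked calculation in
Python. It uses only values that appear in the logs in this thread: the
watchdog message at 15:58:13.986, the TaskManagerRunner exit at 15:58:23.984,
and the 180 seconds from "Task did not exit gracefully within 180 + seconds".
As far as I know, the 180 s and the 30 s in the warning correspond to the
defaults of Flink's `task.cancellation.timeout` and
`task.cancellation.interval` options, but treat that mapping as an assumption
to verify against your configuration. The variable names are made up for the
example.

```python
from datetime import datetime, timedelta

LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_log_time(text):
    """Parse a timestamp in the format used by the log excerpts."""
    return datetime.strptime(text, LOG_TIME_FORMAT)


# Timestamps copied from the log lines in this thread.
watchdog_fired = parse_log_time("2021-08-16 15:58:13.986")
runner_exited = parse_log_time("2021-08-16 15:58:23.984")

# From the message "Task did not exit gracefully within 180 + seconds."
cancellation_timeout = timedelta(seconds=180)

# Time between the fatal watchdog error and the process exit.
shutdown_gap = runner_exited - watchdog_fired

# The cancellation must have been requested roughly this long before the
# watchdog fired (an estimate, assuming the watchdog starts at cancel time).
cancel_requested_by = watchdog_fired - cancellation_timeout

# Minimum time one crash cycle takes if every cycle ends this way.
cycle_lower_bound = cancellation_timeout + shutdown_gap

print(f"watchdog -> exit:      {shutdown_gap.total_seconds():.3f} s")
print(f"cancel requested near: {cancel_requested_by}")
print(f"cycle lower bound:     {cycle_lower_bound.total_seconds():.3f} s")
```

The lower bound of about 190 s is in the same range as the observed crash
period of ~4 minutes, which would leave somewhat under a minute per cycle for
the restart and for the task to fail again. That is only a consistency check
on the numbers, not proof of the cause.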



On Mon, Aug 16, 2021 at 7:05 PM Yangze Guo <ka...@gmail.com> wrote:

> Hi, Abhishek,
>
> Do you see something like "Fatal error occurred while executing the
> TaskManager" in your log or would you like to provide the whole task
> manager log?
>
> Best,
> Yangze Guo

Re: Flink taskmanager in crash loop

Posted by Yangze Guo <ka...@gmail.com>.
Hi, Abhishek,

Do you see something like "Fatal error occurred while executing the
TaskManager" in your log or would you like to provide the whole task
manager log?

Best,
Yangze Guo
