Posted to user@flink.apache.org by Malte Schwarzer <im...@mieo.de> on 2017/01/27 15:13:28 UTC

TaskManager randomly dies

Hi all,

When running a Flink batch job, a TaskManager occasionally dies at
random, which causes the whole job to fail. All other nodes then throw
the following exception:

Error obtaining the sorted input: Thread 'SortMerger Reading Thread'
terminated due to an exception: Connection unexpectedly closed by remote
task manager 'dyingnode' ...

However, there are no error messages in the log of 'dyingnode'.

But in the PID thread dump of 'dyingnode' I found this:

# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x00003fff701afa4c, pid=1119228,
tid=0x00003ff38a3ff1b0
#
# JRE version: OpenJDK Runtime Environment (8.0_101-b14) (build
1.8.0_101-b14)
# Java VM: OpenJDK 64-Bit Server VM (25.101-b14 mixed mode linux-ppc64 )
# Problematic frame:
# J 433 C2 org.apache.flink.runtime.util.DataOutputSerializer.write(I)V
(40 bytes) @ 0x00003fff701afa4c [0x00003fff701afa00+0x4c]
# ...

What could cause this, and is it Flink-related?


Best regards,
Malte

Re: TaskManager randomly dies

Posted by Robert Metzger <rm...@apache.org>.
Hi,
Which Flink version are you using?

This issue occurred quite frequently in 1.2.0 RC0 and should be fixed in
later RCs.
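
For anyone triaging a similar crash, the two lines of the fatal error header that matter most are the signal line and the "Problematic frame" line. In a HotSpot header, a frame starting with "J" is JIT-compiled Java code, and "C2" names the compiler that produced it, so the crash here happened inside compiled Java code rather than in a native library. The following is only a minimal sketch of pulling those two facts out of such a header; the class name HsErrHeader and the method summarize are hypothetical helpers invented for illustration, not part of Flink or the JDK, and the patterns only cover the line shapes quoted in this thread.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Flink or the JDK): summarizes the
// signal and the JIT-compiled problematic frame from a HotSpot fatal
// error header like the one quoted in this thread.
public class HsErrHeader {

    // Matches e.g. "#  SIGBUS (0x7) at pc=0x00003fff701afa4c, pid=1119228,"
    private static final Pattern SIGNAL =
            Pattern.compile("^#\\s+(SIG[A-Z]+) \\(");

    // Matches e.g. "# J 433 C2 some.package.Class.method(I)V"
    // ("J" = JIT-compiled Java frame, C1/C2 = the compiler tier).
    private static final Pattern JIT_FRAME =
            Pattern.compile("^#\\s+J\\s+\\d+\\s+(C1|C2)\\s+(\\S+)");

    public static String summarize(String header) {
        String signal = "unknown signal";
        String frame = "unknown frame";
        // Examine the header line by line; the last match of each kind wins.
        for (String line : header.split("\\R")) {
            Matcher s = SIGNAL.matcher(line);
            if (s.find()) {
                signal = s.group(1);
            }
            Matcher f = JIT_FRAME.matcher(line);
            if (f.find()) {
                frame = f.group(1) + "-compiled " + f.group(2);
            }
        }
        return signal + " in " + frame;
    }

    public static void main(String[] args) {
        // Header lines copied from the original post (as wrapped by the mailer).
        String header = String.join("\n",
                "# A fatal error has been detected by the Java Runtime Environment:",
                "#",
                "#  SIGBUS (0x7) at pc=0x00003fff701afa4c, pid=1119228,",
                "tid=0x00003ff38a3ff1b0",
                "# Problematic frame:",
                "# J 433 C2 org.apache.flink.runtime.util.DataOutputSerializer.write(I)V");
        System.out.println(summarize(header));
    }
}
```

Interpreted frames ("j"), VM frames ("V") and native frames ("C") are deliberately not handled; for those the sketch reports "unknown frame".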