Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2018/05/14 12:12:33 UTC

Taskmanager JVM crash

Hi to all,
I have a Flink 1.3.1 job that runs multiple times.
Everything goes well for some time (e.g. 10 jobs). Then one or more TMs
suddenly die.

In the .out file I find something like this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x7f12f]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /home/user/hs_err_pid18794.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#


I have attached the generated error report. Do you see anything useful in it?
I can also send you the job's jar with the data, but it is about 200 MB.

Best,
Flavio
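
For triaging crashes like the one above, the key fields of the JVM crash banner can be pulled out mechanically. The following is a minimal Python sketch, not part of the original message. The function name `parse_crash_header` and the dictionary keys are illustrative assumptions, and the regular expressions are based only on the banner lines quoted above, so other JVM versions may need adjusted patterns.

```python
import re

# Patterns are derived from the crash banner quoted in this message;
# other JVM versions may format these lines differently.
SIGNAL_RE = re.compile(
    r"#\s+(?P<signal>SIG[A-Z]+) \((?P<code>0x[0-9a-f]+)\) at "
    r"pc=(?P<pc>0x[0-9a-f]+), pid=(?P<pid>\d+), tid=(?P<tid>\d+)"
)
FRAME_RE = re.compile(
    r"#\s+(?P<kind>[A-Za-z])\s+\[(?P<lib>[^+\]]+)\+(?P<offset>0x[0-9a-f]+)\]"
)


def parse_crash_header(text):
    """Extract signal, pc, pid, tid and the problematic frame, if present."""
    result = {}
    signal = SIGNAL_RE.search(text)
    if signal:
        result.update(signal.groupdict())
        result["pid"] = int(result["pid"])
        result["tid"] = int(result["tid"])
    frame = FRAME_RE.search(text)
    if frame:
        result["frame_kind"] = frame.group("kind")  # "C" marks a native frame
        result["frame_lib"] = frame.group("lib")
        result["frame_offset"] = frame.group("offset")
    return result


if __name__ == "__main__":
    banner = (
        "#  SIGSEGV (0xb) at pc=0x00007f6f3897712f, pid=18794, tid=140110535448320\n"
        "# Problematic frame:\n"
        "# C  [libc.so.6+0x7f12f]\n"
    )
    print(parse_crash_header(banner))
```

Here the problematic frame is a native frame inside libc.so.6, so the crash happened in native code rather than in compiled or interpreted Java code.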

Re: Taskmanager JVM crash

Posted by Stefan Richter <s....@data-artisans.com>.

No, the problem I mentioned does not affect batch jobs. It must be something different then, but unfortunately the dump does not look very helpful to me because of the "error occurred during error reporting (printing native stack)" message.
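
To check quickly which parts of an error report the JVM failed to write, one could scan the log for that marker. This is a minimal Python sketch under stated assumptions: the marker text is taken from the message above, while the function name `failed_report_steps` and the exact surrounding format of the marker line are assumptions.

```python
import re
import sys

# Assumed marker format: "error occurred during error reporting (<step>)".
MARKER_RE = re.compile(r"error occurred during error reporting \(([^)]*)\)")


def failed_report_steps(log_text):
    """Return the reporting steps the JVM could not complete, in file order."""
    return MARKER_RE.findall(log_text)


if __name__ == "__main__":
    # Usage: python failed_steps.py /path/to/hs_err_pidNNNN.log
    with open(sys.argv[1], errors="replace") as handle:
        for step in failed_report_steps(handle.read()):
            print("missing section:", step)
```

If "printing native stack" is among the reported steps, the native stack trace is absent from the report, which is why the dump gives little to go on.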

> On 14.05.2018 at 14:26, Flavio Pompermaier <po...@okkam.it> wrote:
> 
> My job is a batch one, not a streaming job. Is it possible that the cause is the one you mentioned?
> 
> On Mon, 14 May 2018, 14:23 Stefan Richter, <s.richter@data-artisans.com> wrote:
> Hi,
> 
> that looks like a known issue where Flink did not wait for the shutdown of the timer service before disposing state backends. This problem is fixed in the 1.4 and later branches.
> 
> Best,
> Stefan 

Re: Taskmanager JVM crash

Posted by Flavio Pompermaier <po...@okkam.it>.

My job is a batch one, not a streaming job. Is it possible that the cause
is the one you mentioned?

On Mon, 14 May 2018, 14:23 Stefan Richter, <s....@data-artisans.com>
wrote:

> Hi,
>
> that looks like a known issue where Flink did not wait for the shutdown of
> the timer service before disposing state backends. This problem is fixed in
> the 1.4 and later branches.
>
> Best,
> Stefan

Re: Taskmanager JVM crash

Posted by Stefan Richter <s....@data-artisans.com>.

Hi,

that looks like a known issue where Flink did not wait for the shutdown of the timer service before disposing state backends. This problem is fixed in the 1.4 and later branches.

Best,
Stefan 
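
The ordering described above (wait for timer callbacks to finish, then dispose the resource they use) can be illustrated generically. The following Python sketch is not Flink code and does not show Flink's actual fix. The `Backend` class and `run_and_shut_down` function are hypothetical stand-ins that only demonstrate the shutdown order.

```python
from concurrent.futures import ThreadPoolExecutor
import threading
import time


class Backend:
    """Hypothetical resource that must not be used after disposal."""

    def __init__(self):
        self.disposed = False
        self.use_after_dispose = False
        self._lock = threading.Lock()

    def access(self):
        with self._lock:
            if self.disposed:
                # Record the ordering violation instead of crashing.
                self.use_after_dispose = True

    def dispose(self):
        with self._lock:
            self.disposed = True


def run_and_shut_down(num_timers=4):
    backend = Backend()
    timers = ThreadPoolExecutor(max_workers=2)

    def fire():
        time.sleep(0.05)  # simulate a timer callback that is still in flight
        backend.access()

    for _ in range(num_timers):
        timers.submit(fire)

    # Wait for all pending callbacks to finish before releasing the resource.
    timers.shutdown(wait=True)
    backend.dispose()
    return backend


if __name__ == "__main__":
    result = run_and_shut_down()
    print("use after dispose:", result.use_after_dispose)
```

Calling `dispose()` before `shutdown(wait=True)` would let in-flight callbacks reach an already disposed backend, which is the kind of ordering problem the reply describes.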
