Posted to dev@flink.apache.org by Robert Metzger <rm...@apache.org> on 2016/12/15 15:05:13 UTC

Re: Task manager processes crashing one after the other

I experienced quite a similar issue with RocksDB on my cluster, also after
some retries (with the Flink 1.1.4 RC3):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f1611829f4e, pid=3545, tid=139732543575808
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [ld-linux-x86-64.so.2+0x9f4e]  _dl_rtld_di_serinfo+0x86e
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /yarn/nm/usercache/robert/appcache/application_1481291289979_0024/container_1481291289979_0024_01_008775/hs_err_pid3545.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
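
The problematic frame is in the dynamic loader, so my guess (and it is only a
guess) is that this happened while a native library, presumably the RocksDB
JNI library, was being loaded or resolved. For reference, a minimal sketch of
the call that triggers that load, assuming the plain RocksDB Java API:

import org.rocksdb.RocksDB;

public class NativeLoad {
    public static void main(String[] args) {
        // illustrative only: extracts and loads the bundled librocksdbjni
        // through the dynamic loader
        RocksDB.loadLibrary();
        System.out.println("RocksDB native library loaded");
    }
}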




On Fri, Aug 26, 2016 at 9:32 AM, Gyula Fóra <gy...@gmail.com> wrote:

> Some additional info:
>
> It doesn't seem to happen the first time I start the jobs / restore them
> from a savepoint. It happens as jobs are failing over after a task manager
> failure.
>
> This could be an issue caused by a non-empty RocksDB directory (that was
> somehow left in an inconsistent state), but that should not happen, as the
> instanceDbPath is deleted before opening.
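>
> For reference, this is roughly the kind of "wipe, then open" logic I mean
> (a minimal sketch with hypothetical names, not the actual backend code):
>
> import org.rocksdb.Options;
> import org.rocksdb.RocksDB;
> import org.rocksdb.RocksDBException;
>
> import java.io.File;
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Path;
> import java.util.Comparator;
> import java.util.stream.Stream;
>
> public class FreshRocksOpen {
>
>     // hypothetical helper: wipe any leftover local state, then open a brand
>     // new instance; the passed options are expected to have setCreateIfMissing(true)
>     static RocksDB openFresh(File instanceDbPath, Options options)
>             throws IOException, RocksDBException {
>         if (instanceDbPath.exists()) {
>             try (Stream<Path> paths = Files.walk(instanceDbPath.toPath())) {
>                 paths.sorted(Comparator.reverseOrder())
>                      .map(Path::toFile)
>                      .forEach(File::delete);
>             }
>         }
>         instanceDbPath.mkdirs();
>         return RocksDB.open(options, instanceDbPath.getAbsolutePath());
>     }
> }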
>
> Gyula
>
> Gyula Fóra <gy...@gmail.com> wrote (on Thu, Aug 25, 2016, at 23:28):
>
> > Stephan,
> >
> > I ported the fix for the concurrency issue from the Flink commit, so now
> > that should be fine. I ran some fail/restore tests and that specific issue
> > hasn't appeared again.
> >
> > However, I now get many segfaults in the initializeForJob method where the
> > RocksDB instance is opened. Just for the record, this is exactly the same
> > code as we have in Flink now:
> >
> > #
> > # A fatal error has been detected by the Java Runtime Environment:
> > #
> > #  SIGSEGV (0xb) at pc=0x00007f12b018f51f, pid=12576, tid=139668190197504
> > #
> > # JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
> > # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 )
> > # Problematic frame:
> > # C  [libc.so.6+0x7b51f]
> > ...
> > Stack: [0x00007f0708ccf000,0x00007f0708dd0000],  sp=0x00007f0708dccd20,  free space=1015k
> > Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native
> > code)
> > C  [libc.so.6+0x7b51f]
> >
> > Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
> > j  org.rocksdb.RocksDB.open(JLjava/lang/String;Ljava/util/List;I)Ljava/util/List;+0
> > j  org.rocksdb.RocksDB.open(Lorg/rocksdb/DBOptions;Ljava/lang/String;Ljava/util/List;Ljava/util/List;)Lorg/rocksdb/RocksDB;+23
> > j  com.king.rbea.backend.state.rocksdb.RocksDBStateBackend.initializeForJob...
> >
> > And this happens fairly frequently when the jobs are restarting after
> > failure.
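> >
> > For context, the frames above correspond to a call roughly like the one
> > below (just an illustrative sketch with made-up names, not our actual RBEA
> > code); the segfault is raised from inside the native open call:
> >
> > import org.rocksdb.ColumnFamilyDescriptor;
> > import org.rocksdb.ColumnFamilyHandle;
> > import org.rocksdb.ColumnFamilyOptions;
> > import org.rocksdb.DBOptions;
> > import org.rocksdb.RocksDB;
> > import org.rocksdb.RocksDBException;
> >
> > import java.util.ArrayList;
> > import java.util.Arrays;
> > import java.util.List;
> >
> > public class OpenWithColumnFamilies {
> >
> >     // hypothetical helper, mirroring only the open(...) signature from the stack trace
> >     static RocksDB open(String dbPath) throws RocksDBException {
> >         DBOptions dbOptions = new DBOptions()
> >             .setCreateIfMissing(true)
> >             .setCreateMissingColumnFamilies(true);
> >         List<ColumnFamilyDescriptor> descriptors = Arrays.asList(
> >             new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, new ColumnFamilyOptions()));
> >         List<ColumnFamilyHandle> handles = new ArrayList<>();
> >         // this is the native call the "j  org.rocksdb.RocksDB.open(...)" frames point at
> >         return RocksDB.open(dbOptions, dbPath, descriptors, handles);
> >     }
> > }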
> >
> > Cheers,
> > Gyula
> >
> > Gyula Fóra <gy...@gmail.com> wrote (on Thu, Aug 25, 2016, at 19:07):
> >
> >> Yes, it seems like that; I remember the fix in Flink. I apparently made a
> >> mistake somewhere in our code :)
> >>
> >> Thanks,
> >> Gyula
> >>
> >> On Thu, Aug 25, 2016, 18:59 Stephan Ewen <se...@apache.org> wrote:
> >>
> >>> We saw some crashes in earlier versions when native handles in RocksDB
> >>> (even for config option objects) were manually and too eagerly released.
> >>>
> >>> Maybe you have a similar issue here?
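> >>>
> >>> Something along these lines (purely an illustrative sketch, not your
> >>> actual code) is what I mean by a too eagerly released handle:
> >>>
> >>> import org.rocksdb.Options;
> >>> import org.rocksdb.RocksDB;
> >>> import org.rocksdb.RocksDBException;
> >>>
> >>> public class EagerDispose {
> >>>
> >>>     // hypothetical example, not taken from any real backend
> >>>     static void example(String dbPath) throws RocksDBException {
> >>>         Options options = new Options().setCreateIfMissing(true);
> >>>         RocksDB db = RocksDB.open(options, dbPath);
> >>>         options.dispose(); // native handle of the options released too eagerly ...
> >>>         db.put("key".getBytes(), "value".getBytes()); // ... while the DB is still in use -> possible native crash
> >>>         db.dispose();
> >>>     }
> >>> }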
> >>>
> >>> On Thu, Aug 25, 2016 at 6:27 PM, Gyula Fóra <gy...@gmail.com>
> >>> wrote:
> >>>
> >>> > Hi,
> >>> > This seems to be a sneaky concurrency issue in our custom state backend
> >>> > implementation.
> >>> >
> >>> > I made some changes, will keep you posted.
> >>> >
> >>> > Cheers,
> >>> > Gyula
> >>> >
> >>> > On Thu, Aug 25, 2016, 10:54 Gyula Fóra <gy...@gmail.com> wrote:
> >>> >
> >>> > > Hi,
> >>> > >
> >>> > > Sure I am sending the TM logs in priv.
> >>> > >
> >>> > > For now, what I did was to bump the RocksDB version to 4.9.0; let's
> >>> > > see if that helps.
> >>> > >
> >>> > > Cheers,
> >>> > > Gyula
> >>> > >
> >>> > > Till Rohrmann <tr...@apache.org> wrote (on Thu, Aug 25, 2016, at 10:35):
> >>> > >
> >>> > >> Hi Gyula,
> >>> > >>
> >>> > >> I haven't seen this problem before. Do you have the logs of the
> >>> > >> failed TMs so that we have some more context on what was going on?
> >>> > >>
> >>> > >> Cheers,
> >>> > >> Till
> >>> > >>
> >>> > >> On Thu, Aug 25, 2016 at 9:40 AM, Gyula Fóra <gy...@apache.org> wrote:
> >>> > >>
> >>> > >> > Hi guys,
> >>> > >> >
> >>> > >> > For quite some time now, we fairly frequently experience task manager
> >>> > >> > crashes around the time new streaming jobs are deployed. We use the
> >>> > >> > RocksDB backend, so this might be related.
> >>> > >> >
> >>> > >> > We tried changing the GC from G1 to CMS, but that didn't help.
> >>> > >> >
> >>> > >> > Yesterday, for instance, 6 task managers crashed one after the other
> >>> > >> > with similar errors:
> >>> > >> >
> >>> > >> > *** Error in `java': double free or corruption (!prev): 0x00007fac0414d760 ***
> >>> > >> > *** Error in `java': free(): invalid pointer: 0x00007f8dcc0026c0 ***
> >>> > >> > *** Error in `java': double free or corruption (!prev): 0x00007f15247f9a90 ***
> >>> > >> > ...
> >>> > >> >
> >>> > >> > Does anyone have any clue what might cause this or how to debug it?
> >>> > >> > This is a very critical issue :(
> >>> > >> >
> >>> > >> > Cheers,
> >>> > >> > Gyula
> >>> > >> >
> >>> > >>
> >>> > >
> >>> >
> >>>
> >>
>