Posted to issues@hawq.apache.org by "Lin Wen (JIRA)" <ji...@apache.org> on 2017/01/20 06:54:26 UTC

[jira] [Assigned] (HAWQ-1284) HAWQ master is coredump when kill all process on master and standby

     [ https://issues.apache.org/jira/browse/HAWQ-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lin Wen reassigned HAWQ-1284:
-----------------------------

    Assignee: Lin Wen  (was: Ed Espino)

> HAWQ master is coredump when kill all process on master and standby
> -------------------------------------------------------------------
>
>                 Key: HAWQ-1284
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1284
>             Project: Apache HAWQ
>          Issue Type: Bug
>            Reporter: Lin Wen
>            Assignee: Lin Wen
>         Attachments: hawq-2017-01-17_054054.csv
>
>
> When a HAWQ cluster is running with no active queries, killing all postgres processes on the master (with "killall postgres") and then all processes on the standby (with "killall gpsyncmaster") intermittently causes the HAWQ master to generate a core dump.
> The call stack is:
> #0  0x00000032214325e5 in raise () from /lib64/libc.so.6
> #1  0x0000003221433dc5 in abort () from /lib64/libc.so.6
> #2  0x00000000008cce7f in errfinish (dummy=Unhandled dwarf expression opcode 0xf3
> ) at elog.c:686
> #3  0x00000000008cf032 in elog_finish (elevel=Unhandled dwarf expression opcode 0xf3
> ) at elog.c:1463
> #4  0x00000000007d4912 in proc_exit_prepare (code=1) at ipc.c:153
> #5  0x00000000007d4a38 in proc_exit (code=1) at ipc.c:93
> #6  0x00000000008ccc7e in errfinish (dummy=Unhandled dwarf expression opcode 0xf3
> ) at elog.c:670
> #7  0x000000000078dea1 in ServiceDoConnect (listenerPort=64556, complain=Unhandled dwarf expression opcode 0xf3
> ) at service.c:165
> #8  0x00000000004efd5a in XLogQDMirrorWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:1981
> #9  XLogWrite (WriteRqst=<value optimized out>, flexible=0 '\000', xlog_switch=0 '\000') at xlog.c:2354
> #10 0x00000000004f2242 in XLogFlush (record=...) at xlog.c:2572
> #11 0x00000000004f7288 in CreateCheckPoint (shutdown=Unhandled dwarf expression opcode 0xf3
> ) at xlog.c:8136
> #12 0x00000000004f9f72 in ShutdownXLOG (code=Unhandled dwarf expression opcode 0xf3
> ) at xlog.c:7865
> #13 0x000000000078b2b0 in BackgroundWriterMain () at bgwriter.c:318
> #14 0x000000000055a870 in AuxiliaryProcessMain (argc=<value optimized out>, argv=0x7fff02330850) at bootstrap.c:467
> #15 0x000000000079b4f0 in StartChildProcess (type=Unhandled dwarf expression opcode 0xf3
> ) at postmaster.c:6836
> #16 0x000000000079b7aa in CommenceNormalOperations () at postmaster.c:3618
> #17 0x000000000079fee4 in do_reaper () at postmaster.c:3831
> #18 ServerLoop () at postmaster.c:2136
> #19 0x00000000007a2179 in PostmasterMain (argc=Unhandled dwarf expression opcode 0xf3
> ) at postmaster.c:1454
> #20 0x00000000004a4f99 in main (argc=9, argv=0x2a4f010) at main.c:226
> The cause is that the "WAL Send Server process" is killed first. When the writer process receives a shutdown request, it begins to create a checkpoint and sync the xlog to the standby master, but by then the WAL send server process has already been killed. The writer process therefore fails to connect to the WAL send server process and reports an ERROR:
> 				ereport(ERROR, (errcode(ERRCODE_GP_INTERCONNECTION_ERROR),
> 								errmsg("Could not connect to '%s': %s",
> 									   serviceConfig->title,
> 									   strerror(saved_err))));
> line:165, service.c
> The call stack shows that once ereport() is called, proc_exit_prepare() is called in turn. At line 152 (ipc.c), CritSectionCount is greater than 0, so a PANIC occurs and a core dump is generated. CritSectionCount is incremented when the writer process calls XLogFlush(). The check is:
> 	if (CritSectionCount > 0)
> 		elog(PANIC, "process is dying from critical section");
>  
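
The escalation described above can be modeled in a few lines of standalone C. This is only a sketch under stated assumptions: `crit_section_count`, `start_crit_section()`, `end_crit_section()`, `exit_outcome()` and `flush_with_failed_connect()` are hypothetical stand-ins, not HAWQ functions, and the only behavior modeled is the one reported in this issue (an exit attempted while the counter is above zero becomes a PANIC).

```c
#include <assert.h>

/* Hypothetical severity levels for this model only. */
enum severity { SEV_FATAL, SEV_PANIC };

/* Stand-in for CritSectionCount. */
static int crit_section_count = 0;

static void start_crit_section(void)
{
    crit_section_count++;
}

static void end_crit_section(void)
{
    assert(crit_section_count > 0);
    crit_section_count--;
}

/* Models the check quoted above: a process that tries to exit
 * while inside a critical section escalates to PANIC. */
static enum severity exit_outcome(void)
{
    if (crit_section_count > 0)
        return SEV_PANIC;
    return SEV_FATAL;
}

/* Models the flush path: the connection attempt fails while the
 * critical section is still open, so the exit path sees a
 * non-zero counter. */
static enum severity flush_with_failed_connect(void)
{
    enum severity outcome;

    start_crit_section();
    outcome = exit_outcome();   /* connect failed, error path runs here */
    end_crit_section();
    return outcome;
}
```

In this model the same failure outside the critical section would end as an ordinary fatal exit; it is the open critical section that turns it into a PANIC and hence a core dump.
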
> A possible solution is for the writer process to check whether the WAL send server process exists before writing the log to the standby. If it does not, skip the call to WalSendServerClientConnect() that connects to the WAL send server process.
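
A minimal sketch of such a pre-check follows, assuming the writer process can learn the PID of the WAL send server process. The issue does not say how that PID would be obtained, so `process_is_alive()` and `should_attempt_mirror_write()` are hypothetical helpers, not HAWQ APIs. The liveness probe uses the POSIX call `kill(pid, 0)`, which performs error checking without sending a signal.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a process with this PID exists,
 * 0 otherwise. Signal 0 only checks for existence and permission. */
static int process_is_alive(pid_t pid)
{
    if (pid <= 0)
        return 0;               /* no known PID to probe */
    if (kill(pid, 0) == 0)
        return 1;               /* process exists and is signalable */
    return errno == EPERM;      /* exists, but owned by another user */
}

/* Hypothetical guard: decide whether the writer should try to
 * connect to the WAL send server before entering the flush path. */
static int should_attempt_mirror_write(pid_t wal_send_server_pid)
{
    return process_is_alive(wal_send_server_pid);
}
```

A check like this narrows the window but does not close it: the WAL send server process can still exit between the check and the connection attempt, so the connect failure path would also need to avoid raising an ERROR inside the critical section.
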



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)