Posted to issues@hawq.apache.org by "Kuien Liu (JIRA)" <ji...@apache.org> on 2017/09/25 06:53:02 UTC

[jira] [Comment Edited] (HAWQ-1529) "segment resource manager" will NOT exit when postmaster died

    [ https://issues.apache.org/jira/browse/HAWQ-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16178608#comment-16178608 ] 

Kuien Liu edited comment on HAWQ-1529 at 9/25/17 6:52 AM:
----------------------------------------------------------

Here is a possible patch. It looks strange, but it does work.

{code:diff}
--- a/src/backend/resourcemanager/resourcemanager_RMSEG.c
+++ b/src/backend/resourcemanager/resourcemanager_RMSEG.c
@@ -26,6 +26,7 @@
 #include "communication/rmcomm_MessageServer.h"
 #include "communication/rmcomm_RMSEG2RM.h"
 #include "resourceenforcer/resourceenforcer.h"
+#include "storage/pmsignal.h" /* PostmasterIsAlive */
 #include "cdb/cdbtmpdir.h"

 int ResManagerMainSegment2ndPhase(void)
@@ -156,7 +157,7 @@ int MainHandlerLoop_RMSEG(void)
        DRMGlobalInstance->ResourceManagerStartTime = gettime_microsec();
        while( DRMGlobalInstance->ResManagerMainKeepRun ) {

-               if (!PostmasterIsAlive(true)) {
+               if (0 == PostmasterIsAlive(true)) {
                        DRMGlobalInstance->ResManagerMainKeepRun = false;
                        elog(LOG, "Postmaster is not alive, resource manager exits");
                        break;
{code}
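
For readers unfamiliar with the check inside the loop in the patch above, here is a minimal standalone sketch of the same pattern. It is not HAWQ code: `parent_is_alive` and `run_loop` are hypothetical names, and comparing `getppid()` against an expected parent PID is only one common way to notice that a parent process has died (an orphaned child is re-parented, so `getppid()` changes). How `PostmasterIsAlive` is actually implemented in HAWQ is not shown in this message.

```c
#include <poll.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of HAWQ): returns true while this process
 * is still a child of expected_parent. Once the parent dies, the kernel
 * re-parents the child and getppid() no longer matches.
 */
static bool parent_is_alive(pid_t expected_parent)
{
    return getppid() == expected_parent;
}

/*
 * Hypothetical main loop (not part of HAWQ): checks the parent at the top
 * of every iteration, then waits up to timeout_ms for activity on fds.
 * A finite timeout matters: with an infinite one, the loop would never
 * get back to the liveness check while the descriptors stay idle.
 * Returns the number of completed iterations.
 */
static int run_loop(pid_t expected_parent, struct pollfd *fds, nfds_t nfds,
                    int timeout_ms, int max_iterations)
{
    int iterations = 0;

    while (iterations < max_iterations) {
        if (!parent_is_alive(expected_parent))
            break;              /* parent is gone: stop instead of waiting */

        (void) poll(fds, nfds, timeout_ms);   /* handle I/O here in real code */
        iterations++;
    }
    return iterations;
}
```

With no descriptors to watch, `run_loop(getppid(), NULL, 0, 1000, 10)` simply wakes once per second and re-checks the parent, which is the behavior the resource manager loop needs in order to exit after the postmaster is killed.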



> "segment resource manager" will NOT exit when postmaster died
> -------------------------------------------------------------
>
>                 Key: HAWQ-1529
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1529
>             Project: Apache HAWQ
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Kuien Liu
>            Assignee: Radar Lei
>
> If I send SIGKILL to the segment's postmaster with 'kill -9', the postmaster dies, BUT the "segment resource manager" and the "logger process" stay alive and emit a "WARNING" every 30s.
> To my understanding, the "logger process" is waiting for the "segment resource manager", but the resource manager does not detect that the postmaster is gone and keeps waiting. Does that make sense? Why not quit once the postmaster is gone?
> The call stack of RM when postmaster is killed:
> #0  0x00007f19023ccab6 in poll () from /lib64/libc.so.6
> #1  0x0000000000a48c9e in processAllCommFileDescs () at rmcomm_AsyncComm.c:156
> #2  0x0000000000a8ce5e in MainHandlerLoop_RMSEG () at resourcemanager_RMSEG.c:166
> #3  0x0000000000a8cba3 in ResManagerMainSegment2ndPhase () at resourcemanager_RMSEG.c:71
> #4  0x0000000000a8d966 in ResManagerMain (argc=0x3, argv=0x7fffa018b890) at resourcemanager.c:346
> #5  0x0000000000a8db45 in ResManagerProcessStartup () at resourcemanager.c:411
> #6  0x0000000000899b89 in CommenceNormalOperations () at postmaster.c:3673
> #7  0x000000000089a562 in do_reaper () at postmaster.c:4021
> #8  0x00000000008969bb in ServerLoop () at postmaster.c:2136
> #9  0x0000000000895a78 in PostmasterMain (argc=0xc, argv=0x229a730) at postmaster.c:1454
> #10 0x00000000007b185d in main (argc=0xc, argv=0x229a730) at main.c:226
> #11 0x00007f190231e994 in __libc_start_main () from /lib64/libc.so.6
> #12 0x00000000004bde89 in _start ()


