Posted to issues@hawq.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/01/13 11:40:39 UTC
[jira] [Commented] (HAWQ-272) Segment status will not be down after killing postmaster process of segment

    [ https://issues.apache.org/jira/browse/HAWQ-272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095988#comment-15095988 ] 

ASF GitHub Bot commented on HAWQ-272:
-------------------------------------

GitHub user linwen opened a pull request:

    https://github.com/apache/incubator-hawq/pull/264

    HAWQ-272. Quit resource manager process on segment if postmaster is n…

    This commit fixes a bug: when a segment's postmaster process is killed, the segment's resource manager stays alive and keeps sending IMAlive messages to the master resource manager, even though the segment can no longer work.
    Please review. Thanks!
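
The idea behind this change can be illustrated with a minimal sketch. It is written in Python rather than HAWQ's C and is not taken from the actual patch; the names postmaster_is_alive and heartbeat_loop, and the injected send_heartbeat callback, are hypothetical. It assumes a Unix-like system, where a child whose parent dies is reparented and getppid() no longer returns the original parent PID.

```python
import os


def postmaster_is_alive(expected_ppid, getppid=os.getppid):
    """Return True while this process is still a child of the postmaster.

    When the parent is killed, the child is reparented (usually to PID 1),
    so getppid() stops returning the original parent PID.
    """
    return getppid() == expected_ppid


def heartbeat_loop(expected_ppid, send_heartbeat, max_iterations,
                   getppid=os.getppid):
    """Send heartbeats until the postmaster disappears.

    Returns the number of heartbeats sent. `send_heartbeat` stands in for
    whatever reports liveness to the master (hypothetical callback).
    """
    sent = 0
    for _ in range(max_iterations):
        if not postmaster_is_alive(expected_ppid, getppid):
            # A real background process would exit here instead of
            # continuing to report itself as alive.
            break
        send_heartbeat()
        sent += 1
    return sent
```

Checking before every heartbeat means a segment whose postmaster is gone stops reporting itself as alive within one heartbeat interval.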


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/linwen/incubator-hawq HAWQ-272

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/incubator-hawq/pull/264.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #264
    
----
commit ca9cdbdd582caf0b1014e4cac369be5da4a488fb
Author: Wen Lin <wl...@pivotal.io>
Date:   2016-01-13T10:29:34Z

    HAWQ-272. Quit resource manager process on segment if postmaster is not alive

----


> Segment status will not be down after killing postmaster process of segment 
> ----------------------------------------------------------------------------
>
>                 Key: HAWQ-272
>                 URL: https://issues.apache.org/jira/browse/HAWQ-272
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Fault Tolerance
>            Reporter: Dong Li
>            Assignee: Lin Wen
>
> On a cluster that already has running QEs (query executors), if you kill the postmaster process of the segment (pid=59335), queries still work and the segment's status in gp_segment_configuration stays up.
> {code}
> ps -ef |grep postgres
>   502 59309     1   0 10:07AM ??         0:05.39 /Users/intern/work/code/main/hawq-db-devel/bin/postgres -D /Users/intern/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
>   502 59310 59309   0 10:07AM ??         0:00.38 postgres: port  5432, master logger process
>   502 59313 59309   0 10:07AM ??         0:00.16 postgres: port  5432, stats collector process
>   502 59314 59309   0 10:07AM ??         0:01.89 postgres: port  5432, writer process
>   502 59315 59309   0 10:07AM ??         0:00.27 postgres: port  5432, checkpoint process
>   502 59316 59309   0 10:07AM ??         0:00.09 postgres: port  5432, seqserver process
>   502 59317 59309   0 10:07AM ??         0:00.29 postgres: port  5432, WAL Send Server process
>   502 59318 59309   0 10:07AM ??         0:00.01 postgres: port  5432, DFS Metadata Cache process
>   502 59319 59309   0 10:07AM ??         0:10.02 postgres: port  5432, master resource manager
>   502 59335     1   0 10:07AM ??         0:12.94 /Users/intern/work/code/main/hawq-db-devel/bin/postgres -D /Users/intern/hawq-data-directory/segmentdd -i -M segment -p 40000 --silent-mode=true
>   502 59336 59335   0 10:07AM ??         0:00.61 postgres: port 40000, logger process
>   502 59403 59309   0 10:07AM ??         0:02.28 postgres: port  5432, intern intern [local] con11 cmd63 idle [local]
>   502 63451 59335   0 10:25AM ??         0:00.12 postgres: port 40000, stats collector process
>   502 63452 59335   0 10:25AM ??         0:01.43 postgres: port 40000, writer process
>   502 63453 59335   0 10:25AM ??         0:00.20 postgres: port 40000, checkpoint process
>   502 63454 59335   0 10:25AM ??         0:03.64 postgres: port 40000, segment resource manager
>   502 63966 59335   0 10:27AM ??         0:04.88 postgres: port 40000, intern intern 127.0.0.1(56871) con11 seg0 idle
>   502 63967 59335   0 10:27AM ??         0:04.90 postgres: port 40000, intern intern 127.0.0.1(56873) con11 seg1 idle
>   502 63968 59335   0 10:27AM ??         0:07.12 postgres: port 40000, intern intern 127.0.0.1(56875) con11 seg2 idle
>   502 63969 59335   0 10:27AM ??         0:07.12 postgres: port 40000, intern intern 127.0.0.1(56877) con11 seg3 idle
>   502 63970 59335   0 10:27AM ??         0:04.89 postgres: port 40000, intern intern 127.0.0.1(56879) con11 seg4 idle
>   502 63971 59335   0 10:27AM ??         0:04.86 postgres: port 40000, intern intern 127.0.0.1(56881) con11 seg5 idle
> kill -9 59335
> ps -ef |grep postgres
>   502 59309     1   0 10:07AM ??         0:05.64 /Users/intern/work/code/main/hawq-db-devel/bin/postgres -D /Users/intern/hawq-data-directory/masterdd -i -M master -p 5432 --silent-mode=true
>   502 59310 59309   0 10:07AM ??         0:00.40 postgres: port  5432, master logger process
>   502 59313 59309   0 10:07AM ??         0:00.17 postgres: port  5432, stats collector process
>   502 59314 59309   0 10:07AM ??         0:02.01 postgres: port  5432, writer process
>   502 59315 59309   0 10:07AM ??         0:00.28 postgres: port  5432, checkpoint process
>   502 59316 59309   0 10:07AM ??         0:00.09 postgres: port  5432, seqserver process
>   502 59317 59309   0 10:07AM ??         0:00.31 postgres: port  5432, WAL Send Server process
>   502 59318 59309   0 10:07AM ??         0:00.01 postgres: port  5432, DFS Metadata Cache process
>   502 59319 59309   0 10:07AM ??         0:10.64 postgres: port  5432, master resource manager
>   502 59336     1   0 10:07AM ??         0:00.64 postgres: port 40000, logger process
>   502 59403 59309   0 10:07AM ??         0:02.40 postgres: port  5432, intern intern [local] con11 cmd67 idle [local]
>   502 63454     1   0 10:25AM ??         0:03.96 postgres: port 40000, segment resource manager
>   502 63966     1   0 10:27AM ??         0:04.96 postgres: port 40000, intern intern 127.0.0.1(56871) con11 seg0 idle
>   502 63967     1   0 10:27AM ??         0:04.98 postgres: port 40000, intern intern 127.0.0.1(56873) con11 seg1 idle
>   502 63968     1   0 10:27AM ??         0:07.20 postgres: port 40000, intern intern 127.0.0.1(56875) con11 seg2 idle
>   502 63969     1   0 10:27AM ??         0:07.21 postgres: port 40000, intern intern 127.0.0.1(56877) con11 seg3 idle
>   502 63970     1   0 10:27AM ??         0:04.98 postgres: port 40000, intern intern 127.0.0.1(56879) con11 seg4 idle
>   502 63971     1   0 10:27AM ??         0:04.94 postgres: port 40000, intern intern 127.0.0.1(56881) con11 seg5 idle
> {code}
> Then we execute an INSERT statement.
> {code}
> intern=# select count(*) from b;
>   count
> ----------
>  41058000
> (1 row)
> intern=# insert into b VALUES (1);
> INSERT 0 1
> intern=# select count(*) from b;
>   count
> ----------
>  41058001
> (1 row)
> intern=# select * from gp_segment_configuration ;
>  registration_order | role | status | port  |  hostname  |  address
> --------------------+------+--------+-------+------------+------------
>                   0 | m    | u      |  5432 | doli.local | doli.local
>                   1 | p    | u      | 40000 | localhost  | 127.0.0.1
> (2 rows)
> {code}
> If the existing QEs are enough to execute the query, it succeeds. Otherwise a request goes to the postmaster to create a new QE, the postmaster is found not to be alive, and the segment is marked as down.
> The problem is that the liveness of the segment's postmaster process should be checked.
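
The symptom in the second ps listing (segment processes on port 40000 whose PPID has become 1) can be detected mechanically. The following is a rough sketch in Python, not part of HAWQ; the function name orphaned_children is hypothetical. It assumes the `ps -ef` column layout shown in the report (UID PID PPID C STIME TTY TIME CMD) and that orphans are reparented to PID 1, which may not hold on systems that use a subreaper.

```python
import re

# Child processes in the listing are titled "postgres: port <N>, ...".
CHILD_RE = re.compile(r"postgres: port\s+(\d+),")


def orphaned_children(ps_lines, port):
    """Return PIDs of postgres child processes on `port` whose parent is PID 1."""
    orphans = []
    for line in ps_lines:
        fields = line.split(None, 7)
        # Skip headers and anything that is not a full ps -ef row.
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        pid, ppid, cmd = int(fields[1]), int(fields[2]), fields[7]
        match = CHILD_RE.match(cmd)
        if match and int(match.group(1)) == port and ppid == 1:
            orphans.append(pid)
    return orphans


# Two rows copied from the listing taken after `kill -9 59335`.
after_kill = [
    "  502 59319 59309   0 10:07AM ??         0:10.64 postgres: port  5432, master resource manager",
    "  502 63454     1   0 10:25AM ??         0:03.96 postgres: port 40000, segment resource manager",
]
print(orphaned_children(after_kill, 40000))
```

A non-empty result for the segment port means the segment's child processes have outlived their postmaster, which is the situation this issue describes.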



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)