Posted to issues@hawq.apache.org by "Yi Jin (JIRA)" <ji...@apache.org> on 2016/04/06 07:48:25 UTC
[jira] [Closed] (HAWQ-564) QD hangs when connecting to resource manager

     [ https://issues.apache.org/jira/browse/HAWQ-564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Jin closed HAWQ-564.
-----------------------

> QD hangs when connecting to resource manager
> --------------------------------------------
>
>                 Key: HAWQ-564
>                 URL: https://issues.apache.org/jira/browse/HAWQ-564
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Resource Manager
>    Affects Versions: 2.0.0
>            Reporter: Chunling Wang
>            Assignee: Yi Jin
>             Fix For: 2.0.0
>
>
> We first inject a panic into the QE process, then run a query, and the segment goes down. After the segment comes back up, we run another query and get the correct answer. Then we inject the same panic a second time. After the segment goes down and comes back up again, we run a query and find that the QD process hangs when connecting to the resource manager. Here is the backtrace when the QD hangs:
> {code}
> * thread #1: tid = 0x21d8be, 0x00007fff890355be libsystem_kernel.dylib`poll + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
>   * frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x0000000101daeafe postgres`processAllCommFileDescs + 158 at rmcomm_AsyncComm.c:156
>     frame #2: 0x0000000101db85f5 postgres`callSyncRPCRemote(hostname=0x00007f9c19e00cd0, port=5437, sendbuff=0x00007f9c1b918f50, sendbuffsize=80, sendmsgid=259, exprecvmsgid=2307, recvsmb=<unavailable>, errorbuf=0x000000010230c1a0, errorbufsize=<unavailable>) + 645 at rmcomm_SyncComm.c:122
>     frame #3: 0x0000000101db2d85 postgres`acquireResourceFromRM [inlined] callSyncRPCToRM(sendbuff=0x00007f9c1b918f50, sendbuffsize=<unavailable>, sendmsgid=259, exprecvmsgid=2307, recvsmb=0x00007f9c1b918e70, errorbuf=<unavailable>, errorbufsize=1024) + 73 at rmcomm_QD2RM.c:2780
>     frame #4: 0x0000000101db2d3c postgres`acquireResourceFromRM(index=<unavailable>, sessionid=12, slice_size=462524016, iobytes=134217728, preferred_nodes=0x00007f9c1a02d398, preferred_nodes_size=<unavailable>, max_seg_count_fix=<unavailable>, min_seg_count_fix=<unavailable>, errorbuf=<unavailable>, errorbufsize=<unavailable>) + 572 at rmcomm_QD2RM.c:742
>     frame #5: 0x0000000101c979e7 postgres`AllocateResource(life=QRL_ONCE, slice_size=5, iobytes=134217728, max_target_segment_num=1, min_target_segment_num=1, vol_info=0x00007f9c1a02d398, vol_info_size=1) + 631 at pquery.c:796
>     frame #6: 0x0000000101e8c60f postgres`calculate_planner_segment_num(query=<unavailable>, resourceLife=QRL_ONCE, fullRangeTable=<unavailable>, intoPolicy=<unavailable>, sliceNum=5) + 14287 at cdbdatalocality.c:4207
>     frame #7: 0x0000000101c0f671 postgres`planner + 106 at planner.c:496
>     frame #8: 0x0000000101c0f607 postgres`planner(parse=0x00007f9c1a02a140, cursorOptions=<unavailable>, boundParams=0x0000000000000000, resourceLife=QRL_ONCE) + 311 at planner.c:310
>     frame #9: 0x0000000101c8eb33 postgres`pg_plan_query(querytree=0x00007f9c1a02a140, boundParams=0x0000000000000000, resource_life=QRL_ONCE) + 99 at postgres.c:837
>     frame #10: 0x0000000101c956ae postgres`exec_simple_query + 21 at postgres.c:911
>     frame #11: 0x0000000101c95699 postgres`exec_simple_query(query_string=0x00007f9c1a028a30, seqServerHost=0x0000000000000000, seqServerPort=-1) + 1577 at postgres.c:1671
>     frame #12: 0x0000000101c91a4c postgres`PostgresMain(argc=<unavailable>, argv=<unavailable>, username=0x00007f9c1b808cf0) + 9404 at postgres.c:4754
>     frame #13: 0x0000000101c4ae02 postgres`ServerLoop [inlined] BackendRun + 105 at postmaster.c:5889
>     frame #14: 0x0000000101c4ad99 postgres`ServerLoop at postmaster.c:5484
>     frame #15: 0x0000000101c4ad99 postgres`ServerLoop + 9593 at postmaster.c:2163
>     frame #16: 0x0000000101c47d3b postgres`PostmasterMain(argc=<unavailable>, argv=<unavailable>) + 5019 at postmaster.c:1454
>     frame #17: 0x0000000101bb1aa9 postgres`main(argc=9, argv=0x00007f9c19c1eef0) + 1433 at main.c:209
>     frame #18: 0x00007fff95e8c5c9 libdyld.dylib`start + 1
>   thread #2: tid = 0x21d8bf, 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #0: 0x00007fff890355be libsystem_kernel.dylib`poll + 10
>     frame #1: 0x0000000101dfe723 postgres`rxThreadFunc(arg=<unavailable>) + 2163 at ic_udp.c:6251
>     frame #2: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #3: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #4: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
>   thread #3: tid = 0x21d9c2, 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #0: 0x00007fff890343f6 libsystem_kernel.dylib`__select + 10
>     frame #1: 0x0000000101e9d42e postgres`pg_usleep(microsec=<unavailable>) + 78 at pgsleep.c:43
>     frame #2: 0x0000000101db1a66 postgres`generateResourceRefreshHeartBeat(arg=0x00007f9c19f02480) + 166 at rmcomm_QD2RM.c:1519
>     frame #3: 0x00007fff95e822fc libsystem_pthread.dylib`_pthread_body + 131
>     frame #4: 0x00007fff95e82279 libsystem_pthread.dylib`_pthread_start + 176
>     frame #5: 0x00007fff95e804b1 libsystem_pthread.dylib`thread_start + 13
> {code}
> Here are the operations:
> 1. Before injection, the query returns the correct answer.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> 2. Inject the panic; the fault is triggered and the segment goes down.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' (faultinjector.c:656)  (seg0 localhost:40000 pid=26936)
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  failed to acquire resource from resource manager, 1 of 1 segments is unavailable (pquery.c:807)
> {code}
> 3. After a while, once the segment is up, the query returns the correct answer.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
>  count
> -------
>   3725
> (1 row)
> {code}
> 4. Inject the panic again; the fault is triggered and the segment goes down.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  fault triggered, fault name:'fail_qe_when_do_query' fault type:'panic' (faultinjector.c:656)  (seg0 localhost:40000 pid=26994)
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> ERROR:  failed to acquire resource from resource manager, 1 of 1 segments is unavailable (pquery.c:807)
> {code}
> 5. After a while, run the query and find that the QD hangs.
> {code}
> dispatch=# select count(*) from test_dispatch as t1, test_dispatch as t2, test_dispatch as t3 where t1.id *2 = t2.id and t1.id < t3.id;
> {code}
> 6. Open another terminal and find that the segment is already up.
> {code}
> dispatch=# select * from gp_segment_configuration;
>  registration_order | role | status | port  |          hostname           |           address           | description
> --------------------+------+--------+-------+-----------------------------+-----------------------------+-------------
>                   0 | m    | u      |  5432 | ChunlingdeMacBook-Pro.local | ChunlingdeMacBook-Pro.local |
>                   1 | p    | u      | 40000 | localhost                   | 127.0.0.1                   |
> (2 rows)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)