Posted to issues@hawq.apache.org by "Ming LI (JIRA)" <ji...@apache.org> on 2017/02/23 06:33:44 UTC

[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node

    [ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879968#comment-15879968 ] 

Ming LI commented on HAWQ-1342:
-------------------------------

The basic idea for this kind of hang problem is:
(1) The segment that throws the error rolls back the whole transaction, and all related fds are closed when the transaction ends.
(2) The other segments behave as before: while waiting in select(), they keep looping until the specific fd is closed; after that, execution continues until the process is interrupted again elsewhere (the rolling-back transaction sends a cancel signal).

So some previous fixes (HAWQ-166, HAWQ-1282) will be changed accordingly.
(1) HAWQ-166: we no longer need to skip sending the info.
(2) HAWQ-1282:
  - we don't need to close the fd explicitly; it will be closed automatically when the transaction ends.
  - we just exit the loop once we find that the related fd has already been closed (a rough sketch of such a wait loop is below).
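Here is a minimal, hedged sketch of the reader-side wait loop described above, assuming the reader and writer synchronize over a FIFO fd. The names wait_writer_ready and fifo_fd are illustrative only, not the actual identifiers in nodeShareInputScan.c, and this is not the real patch. The reader selects with a timeout so pending interrupts can still be noticed, and it stops waiting as soon as it detects that the fd has been closed (either its own end, e.g. EBADF, or EOF when the writer's end goes away during transaction abort).

{code}
/* Illustrative sketch only -- not the actual HAWQ fix. */
#include <errno.h>
#include <stdbool.h>
#include <sys/select.h>
#include <unistd.h>

static bool
wait_writer_ready(int fifo_fd)
{
    for (;;)
    {
        fd_set         rset;
        struct timeval tv = {1, 0};   /* 1-second timeout per iteration */
        int            n;

        FD_ZERO(&rset);
        FD_SET(fifo_fd, &rset);

        n = select(fifo_fd + 1, &rset, NULL, NULL, &tv);
        if (n < 0)
        {
            if (errno == EINTR)
                continue;             /* interrupted by a signal, retry */
            return false;             /* e.g. EBADF: our fd was closed under us */
        }

        if (n > 0 && FD_ISSET(fifo_fd, &rset))
        {
            char    c;
            ssize_t r = read(fifo_fd, &c, 1);

            if (r > 0)
                return true;          /* writer signalled "ready" */
            if (r == 0)
                return false;         /* EOF: writer closed its end (xact abort) */
            if (errno != EINTR && errno != EAGAIN)
                return false;         /* unexpected read error, stop waiting */
        }

        /*
         * Timeout: in real backend code this is where CHECK_FOR_INTERRUPTS()
         * would let a pending cancel from the rolled-back transaction fire.
         */
    }
}
{code}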

> QE process hang in shared input scan on segment node
> ----------------------------------------------------
>
>                 Key: HAWQ-1342
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1342
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>    Affects Versions: 2.0.0.0-incubating
>            Reporter: Amy
>            Assignee: Ming LI
>             Fix For: backlog
>
>
> QE processes hang on some segment nodes while the QD and the QEs on other segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1   master   secondary namenode
> test2   segment   datanode
> test3   segment   datanode
> test4   segment   datanode
> test5   segment   namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin   41877      1  0 05:35 ?        00:01:04 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin   41878  41877  0 05:35 ?        00:00:02 postgres: port 20100, logger process
> gpadmin   41881  41877  0 05:35 ?        00:00:00 postgres: port 20100, stats collector process
> gpadmin   41882  41877  0 05:35 ?        00:00:07 postgres: port 20100, writer process
> gpadmin   41883  41877  0 05:35 ?        00:00:01 postgres: port 20100, checkpoint process
> gpadmin   41884  41877  0 05:35 ?        00:00:11 postgres: port 20100, segment resource manager
> gpadmin   42108  41877  0 05:35 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin   42416  41877  0 05:35 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin   44807  41877  0 05:36 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT
> gpadmin   44819  41877  0 05:36 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin   44821  41877  0 05:36 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT
> gpadmin   45447  41877  0 05:36 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin   49859  41877  0 05:38 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC SELECT
> gpadmin   49881  41877  0 05:38 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin   51937  41877  0 05:39 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin   51939  41877  0 05:39 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin   51941  41877  0 05:39 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin   51943  41877  0 05:39 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC SELECT
> gpadmin   51953  41877  0 05:39 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC SELECT
> gpadmin   53436  41877  0 05:40 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin   57095  41877  0 05:41 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin   57097  41877  0 05:41 ?        00:00:04 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin   63159  41877  0 05:43 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin   64018  41877  0 05:44 ?        00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC SELECT
> {code}
> The stack info is below; it seems that the QE hangs in the shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
>   2 Thread 0x7f4f6b335700 (LWP 42109)  0x00000032214df283 in poll () from /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108)  0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER)
>     at nodeShareInputScan.c:760
> 760	in nodeShareInputScan.c
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2  0x0000000000718c68 in ExecSliceDependencyShareInputScan (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3  0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at execProcnode.c:774
> #4  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at execProcnode.c:797
> #5  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at execProcnode.c:797
> #6  0x00000000006dee81 in ExecutePlan (estate=0x3462b50, planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f4f6b229118)
>     at execMain.c:3178
> #7  0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8  0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9  0x000000000088e58e in PortalRun (portal=0x3467c40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, altdest=0x7f4f6b229118,
>     completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
>     query_string=0x348fa92 "SELECT sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0, serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061,
>     serializedParams=0x0, serializedParamslen=0, serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, serializedResource=0x349232c "(",
>     serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f4f6b335700 (LWP 42109))]#0  0x00000032214df283 in poll () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000032214df283 in poll () from /lib64/libc.so.6
> #1  0x0000000000a29d03 in rxThreadFunc (arg=0x0) at ic_udp.c:6278
> #2  0x0000003221807aa1 in start_thread () from /lib64/libpthread.so.0
> #3  0x00000032214e8aad in clone () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#0  0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) bt
> #0  0x00000032214e1523 in select () from /lib64/libc.so.6
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2  0x0000000000718c68 in ExecSliceDependencyShareInputScan (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3  0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at execProcnode.c:774
> #4  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at execProcnode.c:797
> #5  0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at execProcnode.c:797
> #6  0x00000000006dee81 in ExecutePlan (estate=0x3462b50, planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f4f6b229118)
>     at execMain.c:3178
> #7  0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8  0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9  0x000000000088e58e in PortalRun (portal=0x3467c40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, altdest=0x7f4f6b229118,
>     completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
>     query_string=0x348fa92 "SELECT sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0, serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061,
>     serializedParams=0x0, serializedParamslen=0, serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, serializedResource=0x349232c "(",
>     serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) f 1
> #1  0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> 760	in nodeShareInputScan.c
> (gdb) p n
> $1 = 0
> (gdb) p errno
> $2 = 17
> (gdb) p InterruptPending
> $3 = 0 '\000'
> (gdb) p QueryCancelPending
> $4 = 0 '\000'
> (gdb) p ProcDiePending
> $5 = 0 '\000'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)