Posted to issues@hawq.apache.org by "Ming LI (JIRA)" <ji...@apache.org> on 2017/02/23 06:33:44 UTC
[jira] [Commented] (HAWQ-1342) QE process hang in shared input scan on segment node
[ https://issues.apache.org/jira/browse/HAWQ-1342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879968#comment-15879968 ]
Ming LI commented on HAWQ-1342:
-------------------------------
The basic idea for this kind of hang problem is:
(1) The segment that throws the error rolls back the whole transaction, and all related fds are closed at transaction end.
(2) The other segments behave as before: while waiting in select(), they loop until the specific fd is closed, after which execution continues until the process is interrupted again elsewhere (the rolling-back transaction sends a cancel signal).
So some previous fixes (HAWQ-166, HAWQ-1282) will be changed accordingly:
(1) HAWQ-166: we don't need to skip sending info.
(2) HAWQ-1282:
- we don't need to close the fd; it is closed automatically at transaction end.
- we just end the loop if we find the related fd has already been closed.
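The reader-side behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual nodeShareInputScan.c code: the function name and the timeout value are made up, and the real server loop checks QueryCancelPending/ProcDiePending on each timeout. The key point is that when the writer side is closed during the aborting transaction's cleanup, select() reports the fd readable and read() returns 0 (EOF), so the reader leaves the loop without closing the fd itself (transaction end will close it).

{code}
#include <stdio.h>
#include <unistd.h>
#include <sys/select.h>

/* Hypothetical sketch: loop in select() on a fifo fd until the peer
 * (writer) closes its end.  Returns 0 on peer close, -1 on error. */
static int wait_until_peer_closed(int fd)
{
    for (;;)
    {
        fd_set rset;
        struct timeval tv = {1, 0};   /* periodic wakeup; the real code
                                         would check interrupt flags here */
        FD_ZERO(&rset);
        FD_SET(fd, &rset);

        int n = select(fd + 1, &rset, NULL, NULL, &tv);
        if (n < 0)
            return -1;                /* real error (EINTR handling omitted) */
        if (n == 0)
            continue;                 /* timeout: keep waiting */

        char buf[64];
        ssize_t r = read(fd, buf, sizeof(buf));
        if (r == 0)
            return 0;                 /* peer closed: end the loop, but do
                                         NOT close fd -- txn end does that */
        /* r > 0: consume notification bytes and keep waiting */
    }
}

int main(void)
{
    int p[2];
    if (pipe(p) != 0)
        return 1;
    close(p[1]);                      /* simulate writer-side close at abort */
    if (wait_until_peer_closed(p[0]) == 0)
        printf("peer closed, reader loop ended\n");
    return 0;
}
{code}

With this shape, the previously hung QE no longer spins forever: once the aborting segment's transaction cleanup closes the writer fd, the reader's select() wakes up, sees EOF, and falls through to where the cancel signal can be processed.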
> QE process hang in shared input scan on segment node
> ----------------------------------------------------
>
> Key: HAWQ-1342
> URL: https://issues.apache.org/jira/browse/HAWQ-1342
> Project: Apache HAWQ
> Issue Type: Bug
> Components: Query Execution
> Affects Versions: 2.0.0.0-incubating
> Reporter: Amy
> Assignee: Ming LI
> Fix For: backlog
>
>
> QE processes hang on one segment node while the QD and the QEs on other segment nodes have terminated.
> {code}
> [gpadmin@test1 ~]$ cat hostfile
> test1 master secondary namenode
> test2 segment datanode
> test3 segment datanode
> test4 segment datanode
> test5 segment namenode
> [gpadmin@test3 ~]$ ps -ef | grep postgres | grep -v grep
> gpadmin 41877 1 0 05:35 ? 00:01:04 /usr/local/hawq_2_1_0_0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-opt-Multinode-parallel/product/segmentdd -i -M segment -p 20100 --silent-mode=true
> gpadmin 41878 41877 0 05:35 ? 00:00:02 postgres: port 20100, logger process
> gpadmin 41881 41877 0 05:35 ? 00:00:00 postgres: port 20100, stats collector process
> gpadmin 41882 41877 0 05:35 ? 00:00:07 postgres: port 20100, writer process
> gpadmin 41883 41877 0 05:35 ? 00:00:01 postgres: port 20100, checkpoint process
> gpadmin 41884 41877 0 05:35 ? 00:00:11 postgres: port 20100, segment resource manager
> gpadmin 42108 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65193) con35 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 42416 41877 0 05:35 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(65359) con53 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 44807 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2272) con183 seg0 cmd2 slice31 MPPEXEC SELECT
> gpadmin 44819 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2278) con183 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 44821 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2279) con183 seg0 cmd2 slice25 MPPEXEC SELECT
> gpadmin 45447 41877 0 05:36 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(2605) con207 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 49859 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4805) con432 seg0 cmd2 slice20 MPPEXEC SELECT
> gpadmin 49881 41877 0 05:38 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(4816) con432 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51937 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5877) con517 seg0 cmd2 slice7 MPPEXEC SELECT
> gpadmin 51939 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5878) con517 seg0 cmd2 slice9 MPPEXEC SELECT
> gpadmin 51941 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5879) con517 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 51943 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5880) con517 seg0 cmd2 slice13 MPPEXEC SELECT
> gpadmin 51953 41877 0 05:39 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(5885) con517 seg0 cmd2 slice26 MPPEXEC SELECT
> gpadmin 53436 41877 0 05:40 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(6634) con602 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 57095 41877 0 05:41 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8450) con782 seg0 cmd2 slice10 MPPEXEC SELECT
> gpadmin 57097 41877 0 05:41 ? 00:00:04 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(8451) con782 seg0 cmd2 slice11 MPPEXEC SELECT
> gpadmin 63159 41877 0 05:43 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11474) con1082 seg0 cmd2 slice15 MPPEXEC SELECT
> gpadmin 64018 41877 0 05:44 ? 00:00:03 postgres: port 20100, hawqsuperuser olap_group 10.32.35.192(11905) con1121 seg0 cmd2 slice5 MPPEXEC SELECT
> {code}
> The stack info is as below, and it seems that the QE hangs in shared input scan.
> {code}
> [gpadmin@test3 ~]$ gdb -p 42108
> (gdb) info threads
> 2 Thread 0x7f4f6b335700 (LWP 42109) 0x00000032214df283 in poll () from /lib64/libc.so.6
> * 1 Thread 0x7f4f9041c920 (LWP 42108) 0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#1 0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER)
> at nodeShareInputScan.c:760
> 760 in nodeShareInputScan.c
> (gdb) bt
> #0 0x00000032214e1523 in select () from /lib64/libc.so.6
> #1 0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2 0x0000000000718c68 in ExecSliceDependencyShareInputScan (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3 0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at execProcnode.c:774
> #4 0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at execProcnode.c:797
> #5 0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at execProcnode.c:797
> #6 0x00000000006dee81 in ExecutePlan (estate=0x3462b50, planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f4f6b229118)
> at execMain.c:3178
> #7 0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8 0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9 0x000000000088e58e in PortalRun (portal=0x3467c40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, altdest=0x7f4f6b229118,
> completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
> query_string=0x348fa92 "SELECT sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0, serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061,
> serializedParams=0x0, serializedParamslen=0, serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, serializedResource=0x349232c "(",
> serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) thread 2
> [Switching to thread 2 (Thread 0x7f4f6b335700 (LWP 42109))]#0 0x00000032214df283 in poll () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00000032214df283 in poll () from /lib64/libc.so.6
> #1 0x0000000000a29d03 in rxThreadFunc (arg=0x0) at ic_udp.c:6278
> #2 0x0000003221807aa1 in start_thread () from /lib64/libpthread.so.0
> #3 0x00000032214e8aad in clone () from /lib64/libc.so.6
> (gdb) thread 1
> [Switching to thread 1 (Thread 0x7f4f9041c920 (LWP 42108))]#0 0x00000032214e1523 in select () from /lib64/libc.so.6
> (gdb) bt
> #0 0x00000032214e1523 in select () from /lib64/libc.so.6
> #1 0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> #2 0x0000000000718c68 in ExecSliceDependencyShareInputScan (node=0x7f4f6aeb1d68) at nodeShareInputScan.c:344
> #3 0x00000000006e490f in ExecSliceDependencyNode (node=0x7f4f6aeb1d68) at execProcnode.c:774
> #4 0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb17c0) at execProcnode.c:797
> #5 0x00000000006e49fb in ExecSliceDependencyNode (node=0x7f4f6aeb1230) at execProcnode.c:797
> #6 0x00000000006dee81 in ExecutePlan (estate=0x3462b50, planstate=0x7f4f6aeb1230, operation=CMD_SELECT, numberTuples=0, direction=ForwardScanDirection, dest=0x7f4f6b229118)
> at execMain.c:3178
> #7 0x00000000006dbba1 in ExecutorRun (queryDesc=0x346b570, direction=ForwardScanDirection, count=0) at execMain.c:1197
> #8 0x000000000088e973 in PortalRunSelect (portal=0x3467c40, forward=1 '\001', count=0, dest=0x7f4f6b229118) at pquery.c:1731
> #9 0x000000000088e58e in PortalRun (portal=0x3467c40, count=9223372036854775807, isTopLevel=1 '\001', dest=0x7f4f6b229118, altdest=0x7f4f6b229118,
> completionTag=0x7fff4286c090 "") at pquery.c:1553
> #10 0x0000000000883e32 in exec_mpp_query (
> query_string=0x348fa92 "SELECT sale.qty,sale.pn,sale.pn,GROUPING(sale.pn,sale.qty,sale.pn), TO_CHAR(COALESCE(COVAR_POP(floor(sale.vn*sale.qty),floor(sale.pn/sale.pn)),0),'99999999.9999999'),TO_CHAR(COALESCE(AVG(DISTINCT floo"..., serializedQuerytree=0x0, serializedQuerytreelen=0, serializedPlantree=0x348fe94 "y6\002", serializedPlantreelen=9061,
> serializedParams=0x0, serializedParamslen=0, serializedSliceInfo=0x34921f9 "\355\002", serializedSliceInfolen=230, serializedResource=0x349232c "(",
> serializedResourceLen=37, seqServerHost=0x3492351 "10.32.35.192", seqServerPort=44013, localSlice=9) at postgres.c:1487
> #11 0x00000000008898fb in PostgresMain (argc=254, argv=0x3311360, username=0x32f2f10 "hawqsuperuser") at postgres.c:5051
> #12 0x000000000083c198 in BackendRun (port=0x32af5f0) at postmaster.c:5915
> #13 0x000000000083b5e3 in BackendStartup (port=0x32af5f0) at postmaster.c:5484
> #14 0x0000000000835df0 in ServerLoop () at postmaster.c:2163
> #15 0x0000000000834ebb in PostmasterMain (argc=9, argv=0x32b7d10) at postmaster.c:1454
> #16 0x000000000076115b in main (argc=9, argv=0x32b7d10) at main.c:226
> (gdb) f 1
> #1 0x00000000007199be in shareinput_reader_waitready (share_id=1, planGen=PLANGEN_PLANNER) at nodeShareInputScan.c:760
> 760 in nodeShareInputScan.c
> (gdb) p n
> $1 = 0
> (gdb) p errno
> $2 = 17
> (gdb) p InterruptPending
> $3 = 0 '\000'
> (gdb) p QueryCancelPending
> $4 = 0 '\000'
> (gdb) p ProcDiePending
> $5 = 0 '\000'
> {code}
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)