Posted to issues@hawq.apache.org by "Ruilong Huo (JIRA)" <ji...@apache.org> on 2015/11/17 03:17:11 UTC

[jira] [Commented] (HAWQ-166) Orphaned share input scan QE process after OLAP grouping query

    [ https://issues.apache.org/jira/browse/HAWQ-166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15007850#comment-15007850 ] 

Ruilong Huo commented on HAWQ-166:
----------------------------------

Root cause analysis: 

A share input scan is performed by one writer QE process and several reader QE processes. First, communication between the writer and the readers is established. Then the readers start reading data, and each reader notifies the writer once all data has been read.

There are two situations in which a reader can fail to send the done notification to the writer, causing the writer to hang forever waiting for it:

1. The transaction is aborted due to an interrupt or exception, e.g., the user cancels the query, or a division by zero occurs on some segment. In this case, the QD sends a cancel signal to all QEs (both the writer QE and the reader QEs), and a reader QE may clean up and exit without sending a done notification to the writer QE.

2. There is a logic error in a reader, e.g., a segmentation fault, so the reader exits unexpectedly without cleanup and without sending a done notification to the writer QE.

> Orphaned share input scan QE process after OLAP grouping query
> --------------------------------------------------------------
>
>                 Key: HAWQ-166
>                 URL: https://issues.apache.org/jira/browse/HAWQ-166
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Query Execution
>            Reporter: Ruilong Huo
>            Assignee: George Caragea
>         Attachments: olap_setup.sql
>
>
> Some OLAP grouping queries may error out with "division by zero" and then leave orphaned share input scan QE processes on the segment hosts. This causes HAWQ to fail to shut down, restart, etc.
> {code}
> [gpadmin@test2 ~]$ ps -ef | grep postgres
> gpadmin  115703      1  0 Nov15 ?        00:02:32 /data/pulse-agent-data/HAWQ-main-FeatureTest-dbg-mutilnode-huor/product/hawq-2.0.0.0/bin/postgres -D /data/pulse-agent-data/HAWQ-main-FeatureTest-dbg-mutilnode-huor/product/segmentdd -i -M segment -p 21100 --silent-mode=true
> gpadmin  115704 115703  0 Nov15 ?        00:06:10 postgres: port 21100, logger process
> gpadmin  115707 115703  0 Nov15 ?        00:00:01 postgres: port 21100, stats collector process
> gpadmin  115708 115703  0 Nov15 ?        00:00:17 postgres: port 21100, writer process
> gpadmin  115709 115703  0 Nov15 ?        00:00:03 postgres: port 21100, checkpoint process
> gpadmin  115710 115703  0 Nov15 ?        00:00:38 postgres: port 21100, segment resource manager
> gpadmin  118013 115703  0 Nov15 ?        00:00:28 postgres: port 21100, gpadmin olap_group 10.32.35.205(46459) con58 seg1 cmd2 slice10 MPPEXEC SELECT
> gpadmin  118611 115703  0 Nov15 ?        00:00:28 postgres: port 21100, gpadmin olap_group 10.32.35.205(47395) con63 seg5 cmd2 slice36 MPPEXEC SELECT
> gpadmin  173633 115703  0 Nov15 ?        00:00:28 postgres: port 21100, gpadmin olap_group 10.32.35.205(3398) con616 seg3 cmd2 slice11 MPPEXEC SELECT
> gpadmin  180033 115703  0 Nov15 ?        00:00:28 postgres: port 21100, gpadmin olap_group 10.32.35.205(13067) con676 seg1 cmd2 slice16 MPPEXEC SELECT
> gpadmin  223009 115703  0 Nov15 ?        00:00:27 postgres: port 21100, gpadmin olap_group 10.32.35.205(13860) con1098 seg7 cmd2 slice21 MPPEXEC SELECT
> gpadmin  414471 414449  0 17:46 pts/1    00:00:00 grep postgres
> [gpadmin@test2 ~]$ sudo gdb -p 173633
> (gdb) bt
> #0  0x0000003c1e8e15e3 in select () from /lib64/libc.so.6
> #1  0x000000000076724d in shareinput_writer_waitdone (ctxt=0x2b6cc10, share_id=0, nsharer_xslice=4) at nodeShareInputScan.c:1011
> #2  0x000000000075b241 in ExecEndMaterial (node=0x2a98700) at nodeMaterial.c:496
> #3  0x00000000007250b0 in ExecEndNode (node=0x2a98700) at execProcnode.c:1690
> #4  0x000000000076539c in ExecEndShareInputScan (node=0x2a98250) at nodeShareInputScan.c:367
> #5  0x000000000072506a in ExecEndNode (node=0x2a98250) at execProcnode.c:1681
> #6  0x00000000007684e0 in ExecEndSort (node=0x2a97c20) at nodeSort.c:588
> #7  0x00000000007250f6 in ExecEndNode (node=0x2a97c20) at execProcnode.c:1696
> #8  0x000000000074b0c1 in ExecEndAgg (node=0x2a61df8) at nodeAgg.c:2644
> #9  0x0000000000725107 in ExecEndNode (node=0x2a61df8) at execProcnode.c:1700
> #10 0x00000000007639c1 in ExecEndRepeat (node=0x2a608e0) at nodeRepeat.c:218
> #11 0x000000000072517e in ExecEndNode (node=0x2a608e0) at execProcnode.c:1728
> #12 0x0000000000745ad5 in ExecEndAppend (node=0x2a603b8) at nodeAppend.c:411
> #13 0x0000000000724ec9 in ExecEndNode (node=0x2a603b8) at execProcnode.c:1582
> #14 0x0000000000745ad5 in ExecEndAppend (node=0x2a5ff08) at nodeAppend.c:411
> #15 0x0000000000724ec9 in ExecEndNode (node=0x2a5ff08) at execProcnode.c:1582
> #16 0x0000000000760997 in ExecEndMotion (node=0x2a79950) at nodeMotion.c:1232
> #17 0x000000000072516d in ExecEndNode (node=0x2a79950) at execProcnode.c:1724
> #18 0x000000000071c9e9 in ExecEndPlan (planstate=0x2a79950, estate=0x29d4318) at execMain.c:2891
> #19 0x0000000000719952 in ExecutorEnd (queryDesc=0x29cc208) at execMain.c:1365
> #20 0x00000000006b452b in PortalCleanupHelper (portal=0x29e2528) at portalcmds.c:347
> #21 0x00000000006b4454 in PortalCleanup (portal=0x29e2528) at portalcmds.c:299
> #22 0x0000000000a1c8ca in AtAbort_Portals () at portalmem.c:677
> #23 0x0000000000521b9b in AbortTransaction () at xact.c:2819
> #24 0x0000000000522753 in AbortCurrentTransaction () at xact.c:3332
> #25 0x00000000008fc43d in PostgresMain (argc=276, argv=0x2912038, username=0x28f57f8 "gpadmin") at postgres.c:4469
> #26 0x000000000089f48a in BackendRun (port=0x28bde30) at postmaster.c:5844
> #27 0x000000000089e914 in BackendStartup (port=0x28bde30) at postmaster.c:5437
> #28 0x0000000000899032 in ServerLoop () at postmaster.c:2139
> #29 0x0000000000898108 in PostmasterMain (argc=9, argv=0x28a6130) at postmaster.c:1431
> #30 0x00000000007b199a in main (argc=9, argv=0x28a6130) at main.c:226
> {code}
> The reproduction steps are:
> {code}
> Step 1: Prepare the schema and data using the attached olap_setup.sql
> Step 2: Run the OLAP grouping query below
> -- OLAP query involving MAX() function
> SELECT sale.vn,sale.cn,sale.dt,GROUPING(sale.vn), TO_CHAR(COALESCE(MAX(DISTINCT floor(sale.vn+sale.qty)),0),'99999999.9999999'),TO_CHAR(COALESCE(VAR_SAMP(floor(sale.pn/sale.prc)),0),'99999999.9999999'),TO_CHAR(COALESCE(COUNT(floor(sale.qty+sale.prc)),0),'99999999.9999999')
> FROM sale,customer,vendor
> WHERE sale.cn=customer.cn AND sale.vn=vendor.vn
> GROUP BY ROLLUP((sale.prc),(sale.vn,sale.vn),(sale.pn,sale.pn),(sale.dt),(sale.qty,sale.vn,sale.qty)),ROLLUP((sale.pn),(sale.vn,sale.pn),(sale.qty)),(),sale.cn HAVING COALESCE(VAR_POP(sale.cn),0) >= 45.5839785564113;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)