Posted to issues@hawq.apache.org by "Xiang Sheng (JIRA)" <ji...@apache.org> on 2016/11/23 14:01:05 UTC
[jira] [Comment Edited] (HAWQ-1170) Crash at cleanup_allocation_algorithm() when enable '--enable-cassert' option

    [ https://issues.apache.org/jira/browse/HAWQ-1170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15689282#comment-15689282 ] 

Xiang Sheng edited comment on HAWQ-1170 at 11/23/16 2:00 PM:
-------------------------------------------------------------

RCA:
When HAWQ is configured with '--enable-cassert', it crashes while running a gpload test case. In this case, the default bucket number is 12 and the expected virtual segment number is 24 when an external table exists.
As a result, it logs an ERROR and runs cleanup_allocation_algorithm() several times. In the current code, the function is called a first time, then elog(ERROR) is raised, and the function is called a second time in the PG_CATCH block.
cleanup_allocation_algorithm() calls MemoryContextResetAndDeleteChildren. When HAWQ is configured with --enable-cassert, this does a memset of the memory to 0x7F. The crash therefore happens when cleanup_allocation_algorithm() is called the second time, because it reads memory that has already been wiped.
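
The double-cleanup hazard described in this RCA can be illustrated with a minimal, self-contained C sketch. It does not use HAWQ code: DemoContext, DemoAllocState, demo_context_create, demo_context_reset_and_delete and demo_cleanup are hypothetical names invented for illustration, and the 0x7F memset only imitates the behaviour described above for an assert-enabled build. The sketch shows one possible way to make a cleanup routine safe to call twice (clear the pointer after the first cleanup); it is not necessarily the fix applied for HAWQ-1170.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a memory context; NOT HAWQ's MemoryContext. */
typedef struct DemoContext {
    unsigned char *block;
    size_t size;
} DemoContext;

/* Hypothetical stand-in for the allocation-algorithm state. */
typedef struct DemoAllocState {
    DemoContext *ctx;   /* NULL once cleanup has run */
    int cleanup_runs;   /* number of cleanups that actually did work */
} DemoAllocState;

/* Allocate a zero-filled demo context; returns NULL on allocation failure. */
static DemoContext *demo_context_create(size_t size) {
    DemoContext *ctx = malloc(sizeof(*ctx));
    if (ctx == NULL) {
        return NULL;
    }
    ctx->block = calloc(1, size);
    if (ctx->block == NULL) {
        free(ctx);
        return NULL;
    }
    ctx->size = size;
    return ctx;
}

/* Imitates an assert-enabled reset: poison the block with 0x7F, then free.
 * After this call, any access through the old pointer is invalid. */
static void demo_context_reset_and_delete(DemoContext *ctx) {
    assert(ctx != NULL);
    memset(ctx->block, 0x7F, ctx->size);
    free(ctx->block);
    free(ctx);
}

/* Idempotent cleanup: the first call releases the context and clears the
 * pointer; later calls see NULL and return without touching freed memory.
 * Returns true only when this call actually performed the cleanup. */
static bool demo_cleanup(DemoAllocState *state) {
    if (state == NULL || state->ctx == NULL) {
        return false;
    }
    demo_context_reset_and_delete(state->ctx);
    state->ctx = NULL;
    state->cleanup_runs++;
    return true;
}
```

With this guard, a first call to demo_cleanup() releases the context and returns true; a second call, for example from an error handler, sees the NULL pointer and returns false instead of reading the wiped memory.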


was (Author: xsheng):
RCA:
When HAWQ is configured with '--enable-cassert', it crashes while running a gpload test case. In this case, the default bucket number is 12 and the expected virtual segment number is 24 when an external table exists.
As a result, it logs an ERROR and runs cleanup_allocation_algorithm() several times. In the current code, the function is called a first time, then elog(ERROR) is raised, the function is called a second time in the PG_CATCH block, and it is called a third time after PG_END_TRY.
cleanup_allocation_algorithm() calls MemoryContextResetAndDeleteChildren. When HAWQ is configured with --enable-cassert, this does a memset of the memory to 0x7F. The crash therefore happens when cleanup_allocation_algorithm() is called the second time, because it reads memory that has already been wiped.

> Crash at cleanup_allocation_algorithm()  when enable '--enable-cassert' option
> ------------------------------------------------------------------------------
>
>                 Key: HAWQ-1170
>                 URL: https://issues.apache.org/jira/browse/HAWQ-1170
>             Project: Apache HAWQ
>          Issue Type: Bug
>          Components: Core
>            Reporter: Xiang Sheng
>            Assignee: Xiang Sheng
>             Fix For: 2.0.1.0-incubating
>
>         Attachments: config_file, create.sql, lineitem.tbl.small
>
>
> When HAWQ is configured with '--enable-cassert', it crashes while running a gpload test case. In this case, the default bucket number is 12 and the expected virtual segment number is 24 when an external table exists.
> Steps to reproduce:
> # hawq config -c default_hash_table_bucket_number -v 12 --skipvalidation
> # restart hawq
> # create table lineitem with create.sql
> # update config_file, replacing every $VAR with the correct value for your environment.
> # gpload -f config_file
> THE LOG:
> 2016-11-15 06:43:45.635469 PST,,,p177585,th0,,,2016-11-15 06:43:38 PST,0,con5182,cmd6,seg-10000,,,,,"PANIC","XX000","Unexpected internal error: Master process received signal SIGSEGV",,,,,,,0,,,,"1    0x9ce6f2 postgres <symbol not found> (elog.c:4510)
> 2    0x9ce969 postgres StandardHandlerForSigillSigsegvSigbus_OnMainThread (elog.c:4597)
> 3    0x8e1c36 postgres CdbProgramErrorHandler (postgres.c:3513)
> 4    0x3e4380f7e0 libpthread.so.0 <symbol not found> (??:0)
> 5    0xb66a11 postgres calculate_planner_segment_num (cdbdatalocality.c:4431)
> 6    0x836555 postgres <symbol not found> (planner.c:667)
> 7    0x835d5e postgres planner (planner.c:475)
> 8    0x8dd67b postgres pg_plan_query (postgres.c:908)
> 9    0x8dd786 postgres pg_plan_queries (postgres.c:982)
> 10   0x8ded29 postgres <symbol not found> (postgres.c:1742)
> "
> 2016-11-15 06:44:26.162520 PST,,,p78695,th-1193023200,,,,0,,,seg-10000,,,,,"LOG","00000","server process (PID 177585) was terminated by signal 11: Segmentation fault",,,,,,,0,,"postmaster.c",4748,
> 2016-11-15 06:44:26.162587 PST,,,p78695,th-1193023200,,,,0,,,seg-10000,,,,,"LOG","00000","terminating any other active server processes",,,,,,,0,,"postmaster.c",4486,
> CORE DUMPED:
> (gdb) bt
> #0  0x0000003e4380f6ab in raise () from /lib64/libpthread.so.0
> #1  0x00000000009ce73f in SafeHandlerForSegvBusIll (processName=0xd27713 "Master process", postgres_signal_arg=11) at elog.c:4519
> #2  0x00000000009ce969 in StandardHandlerForSigillSigsegvSigbus_OnMainThread (processName=0xd27713 "Master process", postgres_signal_arg=11) at elog.c:4597
> #3  0x00000000008e1c36 in CdbProgramErrorHandler (postgres_signal_arg=11) at postgres.c:3512
> #4  <signal handler called>
> #5  0x0000000000b65a16 in cleanup_allocation_algorithm (context=0x7fffbd549cd0) at cdbdatalocality.c:3980
> #6  0x0000000000b66a11 in calculate_planner_segment_num (query=0x307d5d0, resourceLife=QRL_ONCE, fullRangeTable=0x2ff2480, intoPolicy=0x0, sliceNum=2,
>     fixedVsegNum=-1) at cdbdatalocality.c:4430
> #7  0x0000000000836555 in resource_negotiator (parse=0x307d708, cursorOptions=0, boundParams=0x0, resourceLife=QRL_ONCE, result=0x7fffbd54a018)
>     at planner.c:667
> #8  0x0000000000835d5e in planner (parse=0x307d708, cursorOptions=0, boundParams=0x0, resourceLife=QRL_ONCE) at planner.c:473
> #9  0x00000000008dd67b in pg_plan_query (querytree=0x307d708, boundParams=0x0, resource_life=QRL_ONCE) at postgres.c:908
> #10 0x00000000008dd786 in pg_plan_queries (querytrees=0x2ed7960, boundParams=0x0, needSnapshot=0 '\000', resource_life=QRL_ONCE) at postgres.c:982
> #11 0x00000000008ded29 in exec_simple_query (
>     query_string=0x2ed0898 "INSERT INTO public.\"lineitem\" (\"l_orderkey\",\"l_partkey\",\"l_suppkey\",\"l_linenumber\",\"l_quantity\",\"l_extendedprice\",\"l_discount\",\"l_tax\",\"l_returnflag\",\"l_linestatus\",\"l_shipdate\",\"l_commitdate\",\"l_rece"..., seqServerHost=0x0, seqServerPort=-1) at postgres.c:1742
> #12 0x00000000008e3b44 in PostgresMain (argc=4, argv=0x2d38e60, username=0x2d38de0 "gpadmin") at postgres.c:4840
> #13 0x000000000088d9eb in BackendRun (port=0x2d0a5f0) at postmaster.c:5915
> #14 0x000000000088ce0a in BackendStartup (port=0x2d0a5f0) at postmaster.c:5484
> #15 0x0000000000886e86 in ServerLoop () at postmaster.c:2163
> #16 0x0000000000885e93 in PostmasterMain (argc=9, argv=0x2d13010) at postmaster.c:1454
> #17 0x00000000007a3097 in main (argc=9, argv=0x2d13010) at main.c:226
> (gdb)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)