Posted to dev@trafodion.apache.org by "Zhu, Wen-Jun" <we...@esgyn.cn> on 2018/09/26 10:09:17 UTC

command `shell -c node info` blocks

Hi,

Recently I found that the `shell` program blocks.

When I run
       sqcheck
which invokes
       shell -c node info
it blocks.

After some debugging, I found that there are two threads within `shell`.
The stack of one thread looks like this:
#0  0x0000007fb7e292fc in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x000000000042ab54 in Local_IO_To_Monitor::wait_on_cv (this=0x600b40) at clio.cxx:2240
#2  0x0000000000429888 in Local_IO_To_Monitor::send_recv (this=0x600b40, pp_msg=0x7fb6bc95bc, pv_nw=false) at clio.cxx:1675
#3  0x000000000040aeec in attach (nid=0, name=0x5d0240 "SHELL", program=0x4e84c8 "shell") at shell.cxx:995
#4  0x0000000000421a58 in main (argc=4, argv=0x7fffff2d58) at shell.cxx:8849
which is waiting on `iv_sr_cv`.

The stack of the other thread:
#0  local_monitor_reader (pp_arg=0x63e3) at clio.cxx:285
#1  0x0000007fb7e22fb4 in start_thread () from /lib/aarch64-linux-gnu/libpthread.so.0
which is waiting for `monitor` to send the signal SQ_LIO_SIGNAL_REQUEST_REPLY.

If `monitor` sent the signal, `shell` would receive it, continue, and finish its job.
But `monitor` does not send the signal.
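The wait described above can be sketched in a few lines of C++. This is only an illustration of the general pattern (a requester blocked on a condition variable, woken by a reader thread that sits in `sigwait()`), not the actual clio.cxx code. The names `reply_signal`, `ReplyState`, `reader_thread` and `send_and_wait` are hypothetical, the signal number is a stand-in for SQ_LIO_SIGNAL_REQUEST_REPLY, and the "peer" is simulated inside the same process, whereas in the real system it would be the separate `monitor` process.

```cpp
#include <chrono>
#include <condition_variable>
#include <csignal>
#include <mutex>
#include <thread>
#include <pthread.h>

// Hypothetical stand-in for SQ_LIO_SIGNAL_REQUEST_REPLY (real value not shown in this thread).
static int reply_signal() { return SIGRTMIN + 1; }

struct ReplyState {
    std::mutex mu;
    std::condition_variable cv;  // plays the role of iv_sr_cv
    bool replied = false;
};

static sigset_t reply_set() {
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, reply_signal());
    return set;
}

// Reader thread: blocks in sigwait() until the reply signal arrives,
// then records the reply and wakes the requester.
static void reader_thread(ReplyState* st) {
    sigset_t set = reply_set();
    int signo = 0;
    if (sigwait(&set, &signo) == 0) {
        std::lock_guard<std::mutex> lock(st->mu);
        st->replied = true;
    }
    st->cv.notify_one();
}

// Returns true if the reply arrived within timeout_ms. With an untimed wait,
// the "false" case is exactly the hang seen in the backtrace above.
bool send_and_wait(bool peer_replies, int timeout_ms) {
    // Block the signal before starting the reader so the new thread inherits
    // the mask and the signal is only ever consumed by sigwait().
    sigset_t set = reply_set();
    pthread_sigmask(SIG_BLOCK, &set, nullptr);

    ReplyState st;
    std::thread reader(reader_thread, &st);
    if (peer_replies)  // simulated peer; an assumption made for this sketch
        pthread_kill(reader.native_handle(), reply_signal());

    bool got_reply;
    {
        std::unique_lock<std::mutex> lock(st.mu);
        got_reply = st.cv.wait_for(lock, std::chrono::milliseconds(timeout_ms),
                                   [&st] { return st.replied; });
    }
    if (!got_reply)  // cleanup only: release the reader so it can be joined
        pthread_kill(reader.native_handle(), reply_signal());
    reader.join();
    return got_reply;
}
```

Calling `send_and_wait(false, 100)` models the situation reported here: nobody sends the signal, so the condition variable is never notified. The sketch uses a timed wait only so that it can return instead of blocking forever.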



After some searching, I found only one piece of code that sends the signal, at line 513 of core/sqf/monitor/linux/clio.cxx:
       pthread_kill(iv_worker_thread_id, SQ_LIO_SIGNAL_REQUEST_REPLY);
in the function Local_IO_To_Monitor::~Local_IO_To_Monitor().

As I understand it, this function should be invoked in the `monitor` program, but when I attach to that `monitor` (whose pid I got from the function `local_monitor_reader()`) and set a breakpoint on `Local_IO_To_Monitor::~Local_IO_To_Monitor()`,
it does not break there.
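One point that may help narrow the search: `pthread_kill()` can only target a thread inside the calling process, so a `pthread_kill()` call cannot be how a separate `monitor` process would wake a thread in `shell`. A signal that crosses a process boundary has to be sent with something like `kill()` or `sigqueue()`. Whether `monitor` actually uses one of those is an assumption here, not something confirmed in this thread. The sketch below shows the cross-process variant in isolation; `receive_from_child` and the signal number are hypothetical, and a forked child stands in for the sending process.

```cpp
#include <csignal>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that sends a queued realtime signal carrying `payload`
// to the parent. Returns the payload received, or -1 on error.
// Assumes the caller is single-threaded (true for this small sketch).
int receive_from_child(int payload) {
    const int signo = SIGRTMIN + 2;  // hypothetical signal number

    // Block the signal before forking so it stays pending until we
    // explicitly collect it with sigwaitinfo().
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, signo);
    if (pthread_sigmask(SIG_BLOCK, &set, nullptr) != 0) return -1;

    const pid_t parent = getpid();
    const pid_t child = fork();
    if (child < 0) return -1;

    if (child == 0) {
        // Sender side: a different process, so pthread_kill() is not an
        // option; sigqueue() addresses the target by pid.
        union sigval value;
        value.sival_int = payload;
        sigqueue(parent, signo, value);
        _exit(0);
    }

    // Receiver side: wait synchronously for the queued signal.
    siginfo_t info;
    const int got = sigwaitinfo(&set, &info);
    int status = 0;
    waitpid(child, &status, 0);
    if (got != signo) return -1;
    return info.si_value.sival_int;
}
```

If the real sender follows this shape, a breakpoint on `kill` or `sigqueue` in `monitor` would be a more promising place to look than the destructor.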


So, what should the normal procedure be? Is it incorrect for `monitor` not to invoke Local_IO_To_Monitor::~Local_IO_To_Monitor()?


Thank you.

Wenjun Zhu

RE: command `shell -c node info` blocks

Posted by Selva Govindarajan <se...@esgyn.com>.
The backup command can block any SQL query that would make changes to the database to ensure that the database is backed up in a consistent manner.

My guess is that both the drop and the create have been blocked and are waiting for the backup to complete.

Selva
-----Original Message-----
From: Zhu, Wen-Jun <we...@esgyn.cn> 
Sent: Sunday, September 30, 2018 2:58 AM
To: dev@trafodion.apache.org
Subject: Re: command `shell -c node info` blocks

Hi,

There is another block:

trafodion@kylin:~$ offender -s active
EsgynDB Advanced Conversational Interface 2.4.5
Copyright (c) 2015-2018 Esgyn Corporation
Interpreter has not been linked in.EXITING FROM layoutNativeCode() -could not create function !!
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>+>+>+>+>+>+>+>+>+>+>+>+>+>Interpreter has not been linked in.EXITING FROM layoutNativeCode() -could not create function !!

CURRENT_TIMESTAMP:   2018-09-30 17:33:57.186697
LAST_ACTIVITY_SECS:  94105
QUERY_ID:            MXID11000005682212404965134575421000000000206U3333308T150000000_1450_SQL_CUR_29
EXECUTE_STATE:       EXECUTE
SOURCE_TEXT:         "drop table TESTBIGINT_SIGNED

CURRENT_TIMESTAMP:   2018-09-30 17:33:57.186697
LAST_ACTIVITY_SECS:  83196
QUERY_ID:            MXID11000002449212404945321083089000000000606U3333308T150000000_1210_SQL_CUR_28
EXECUTE_STATE:       EXECUTE
SOURCE_TEXT:         "backup trafodion, tag 'test3_incbk2', table(tab3), incremental,override

CURRENT_TIMESTAMP:   2018-09-30 17:33:57.186697
LAST_ACTIVITY_SECS:  83152
QUERY_ID:            MXID11000002449212404945321083089000000000606U3333308T150000000_1228_1077
EXECUTE_STATE:       EXECUTE
SOURCE_TEXT:         "create table if not exists TRAFODION."_BACKUP_test3_incbk2_".TABLE_CONSTRAINTS like TRAFODION."_MD_".TABLE_CONSTRAINTS;

CURRENT_TIMESTAMP:   2018-09-30 17:33:57.186697
LAST_ACTIVITY_SECS:  83151
QUERY_ID:            MXID11000006502212404976246567184000000000106U3333308T150000000_1871_1871
EXECUTE_STATE:       EXECUTE
SOURCE_TEXT:         "create table TRAFODION."_BACKUP_test3_incbk2_".TABLE_CONSTRAINTS ( "TABLE_UID" LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE NOT SERIALIZED , "CONSTRAINT_UID" LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE

--- 4 row(s) selected.





And the compiler processes:
trafodion@kylin:~$ ps aux|grep tdm_arkcmp
trafodi+  6253  0.1  0.9 1097316 314744 ?      SNl  Sep29   2:53 tdm_arkcmp SQMON1.1 00000 00000 006253 $Z00053N 172.16.20.18:47936 00004 00000 00235 00001 -guardian
trafodi+  6454  0.1  0.9 1090060 323544 ?      SNl  Sep29   2:29 tdm_arkcmp SQMON1.1 00000 00000 006454 $Z00059E 172.16.20.18:47936 00004 00000 00236 00001 -guardian
trafodi+  6502  0.2  0.9 1094252 317376 ?      SNl  Sep29   2:49 tdm_arkcmp SQMON1.1 00000 00000 006502 $Z0005AS 172.16.20.18:47936 00004 00000 00246 00001 -guardian
trafodi+  6616  0.1  0.8 1052036 265712 ?      SNl  Sep29   1:57 tdm_arkcmp SQMON1.1 00000 00000 006616 $Z0005E1 172.16.20.18:47936 00004 00000 00237 00001 -guardian
trafodi+  6692  0.2  0.9 1098312 303212 ?      SNl  Sep29   3:04 tdm_arkcmp SQMON1.1 00000 00000 006692 $Z0005G7 172.16.20.18:47936 00004 00000 00247 00001 -guardian
trafodi+  7252  0.1  0.9 1092792 299088 ?      SNl  Sep29   2:42 tdm_arkcmp SQMON1.1 00000 00000 007252 $Z0005X7 172.16.20.18:47936 00004 00000 00248 00001 -guardian
trafodi+ 10234  0.1  0.7 1054868 259388 ?      SNl  Sep29   1:53 tdm_arkcmp SQMON1.1 00000 00000 010234 $Z0008CE 172.16.20.18:47936 00004 00000 00239 00001 -guardian
trafodi+ 13950  0.1  0.7 1051052 254552 ?      SNl  Sep29   1:39 tdm_arkcmp SQMON1.1 00000 00000 013950 $Z000BDK 172.16.20.18:47936 00004 00000 00249 00001 -guardian
trafodi+ 17307  0.0  0.0  10980   592 pts/13   S+   17:39   0:00 grep tdm_arkcmp
trafodi+ 19708  1.3  0.9 1107432 308124 ?      SNl  16:14   1:10 tdm_arkcmp SQMON1.1 00000 00000 019708 $Z000G33 172.16.20.18:47936 00004 00000 00263 00001 -guardian
trafodi+ 20169  1.2  0.9 1092104 301936 ?      SNl  16:14   1:04 tdm_arkcmp SQMON1.1 00000 00000 020169 $Z000GG9 172.16.20.18:47936 00004 00000 00264 00001 -guardian
trafodi+ 20314  0.4  0.7 1052780 258584 ?      SNl  16:15   0:25 tdm_arkcmp SQMON1.1 00000 00000 020314 $Z000GKE 172.16.20.18:47936 00004 00000 00265 00001 -guardian
trafodi+ 22355  0.3  0.7 1052176 253904 ?      SNl  16:19   0:19 tdm_arkcmp SQMON1.1 00000 00000 022355 $Z000I8Q 172.16.20.18:47936 00004 00000 00266 00001 -guardian





After attaching to process 22355:
(gdb) bt
#0  0x0000007fa7f55120 in pthread_join () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000000000408f0c in main (argc=2, argv=0x7ffdcbc178) at ../bin/arkcmp.cpp:388
(gdb) f 1
#1  0x0000000000408f0c in main (argc=2, argv=0x7ffdcbc178) at ../bin/arkcmp.cpp:388
388       s = pthread_join(gv_main_thread_id, &res);
(gdb) p /x gv_main_thread_id
$1 = 0x7fa1d1a320
(gdb) thread 5
[Switching to thread 5 (Thread 0x7fa1d1a320 (LWP 22359))]
#0  0x0000007fa7f5a2fc in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
(gdb) bt
#0  0x0000007fa7f5a2fc in pthread_cond_wait@@GLIBC_2.17 () from /lib/aarch64-linux-gnu/libpthread.so.0
#1  0x0000007fa6afcf84 in SB_Thread::CV::wait (this=0x3283ec68) at /work/esgyndb/core/sqf/export/include/seabed/int/thread.inl:555
#2  0x0000007fa6afd018 in SB_Thread::CV::wait (this=0x3283ec68, pv_lock=true) at /work/esgyndb/core/sqf/export/include/seabed/int/thread.inl:590
#3  0x0000007fa6fa99a8 in SB_Ms_Event_Mgr::wait (this=0x3283eb70, pv_us=-1) at mseventmgr.inl:346
#4  0x0000007fa6fd088c in XWAIT_com (pv_mask=256, pv_time=-1, pv_residual=true) at pctl.cpp:982
#5  0x0000007fa6fd05dc in XWAIT (pv_mask=256, pv_time=-1) at pctl.cpp:878
#6  0x0000007fa6f32258 in fs_int_fs_file_awaitiox (pp_filenum=0x3283fdbc, ppp_buf=0x7fa1d18878, pp_xfercount=0x7fa1d18874, pp_tag=0x7fa1d18880, pv_timeout=-1, pp_segid=0x7fa1d1886e, pv_int=false, pv_ts=false) at fsi.cpp:1426
#7  0x0000007fa6f2afdc in BAWAITIOX (pp_filenum=0x3283fdbc, ppp_buf=0x7fa1d18b40, pp_xfercount=0x7fa1d18b04, pp_tag=0x7fa1d18b48, pv_timeout=-1, pp_segid=0x0) at fs.cpp:563
#8  0x0000007fab6e5698 in GuaReceiveControlConnection::wait (this=0x3283fda0, timeout=-1, eventConsumed=0x0, ipcAwaitiox=0x0) at ../common/IpcGuardian.cpp:2773
#9  0x0000007fab6e4378 in GuaConnectionToClient::wait (this=0x3283f040, timeout=-1, eventConsumed=0x0, ipcAwaitiox=0x0) at ../common/IpcGuardian.cpp:2252
#10 0x0000007fab6c5e3c in IpcWaitableSetOfConnections::waitOnSet (this=0x7fa1d19738, timeout=-1, calledByESP=0, timedout=0x0) at ../common/Ipc.cpp:2006
#11 0x0000007fab6c9418 in IpcMessageStream::waitOnMsgStream (this=0x7fa1d19608, timeout=-1) at ../common/Ipc.cpp:3593
#12 0x0000007fab6c9380 in IpcMessageStream::receive (this=0x7fa1d19608, waited=1) at ../common/Ipc.cpp:3575
#13 0x0000000000408c5c in thread_main (p_arg=0x0) at ../bin/arkcmp.cpp:326
#14 0x0000007fa7f53fb4 in start_thread () from /lib/aarch64-linux-gnu/libpthread.so.0
#15 0x0000007fa6c1abd0 in ?? () from /lib/aarch64-linux-gnu/libc.so.6

It seems that thread 1 is waiting for thread 5, and thread 5 is waiting for something unknown.
As I understand it, a `tdm_arkcmp` process is short-lived and should not wait on anything for a long time, right?

So what's wrong here? How can I find out which thread (or process) thread 5 is waiting for?
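One possible reading of the backtrace, offered as an interpretation rather than a diagnosis: every frame of thread 5 carries `timeout=-1`, and the top of its own code is `IpcMessageStream::receive(waited=1)` called from `thread_main`. That shape is what a server loop looks like while it waits for its next request, with the main thread parked in `pthread_join()` until the loop exits. The sketch below reproduces that structure with hypothetical names (`Inbox`, `serve_one`); it is not the arkcmp code, and it only shows why a `-1` timeout combined with no incoming message leaves both threads waiting.

```cpp
#include <chrono>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <thread>

// Minimal message queue standing in for the IPC connection.
class Inbox {
public:
    void post(const std::string& msg) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            queue_.push_back(msg);
        }
        cv_.notify_one();
    }

    // timeout_ms < 0 means "wait forever", mirroring timeout=-1 in the
    // backtrace. Returns false if the timeout expired with no message.
    bool receive(int timeout_ms, std::string* out) {
        std::unique_lock<std::mutex> lock(mu_);
        auto ready = [this] { return !queue_.empty(); };
        if (timeout_ms < 0) {
            cv_.wait(lock, ready);
        } else if (!cv_.wait_for(lock, std::chrono::milliseconds(timeout_ms), ready)) {
            return false;
        }
        *out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::deque<std::string> queue_;
};

// The worker waits for one request; the caller joins it, as main() does
// with pthread_join(gv_main_thread_id, ...).
std::string serve_one(int timeout_ms, bool client_sends) {
    Inbox inbox;
    std::string result = "idle";
    std::thread worker([&inbox, &result, timeout_ms] {
        std::string msg;
        if (inbox.receive(timeout_ms, &msg)) result = "handled:" + msg;
    });
    if (client_sends) inbox.post("compile");
    worker.join();
    return result;
}
```

With `serve_one(-1, false)` the call would never return, which matches the picture in gdb. To tell an idle server from a stuck one, the useful question is which client process is supposed to send the next message and what that client is doing.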



