You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@trafodion.apache.org by "Anuradha Hegde (JIRA)" <ji...@apache.org> on 2016/06/06 16:31:21 UTC

[jira] [Work started] (TRAFODION-887) LP Bug: 1410504 - Lots of mxosrvrs kicked off for proper number to stay up.

     [ https://issues.apache.org/jira/browse/TRAFODION-887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on TRAFODION-887 started by Anuradha Hegde.
------------------------------------------------
> LP Bug: 1410504 - Lots of mxosrvrs kicked off for proper number to stay up.
> ---------------------------------------------------------------------------
>
>                 Key: TRAFODION-887
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-887
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: connectivity-dcs
>            Reporter: Guy Groulx
>            Assignee: Anuradha Hegde
>             Fix For: 2.1-incubating
>
>
> We’ve noticed this before but never make anything out of it since we did not have proof.   But now that master_exec files are created in the logs directory, I believe we have proof that on dcsstart, we need to start many mxosrvrs before we get to the number of mxosrvrs we actually want.   This concerns us a bit as it means that we are creating processes that die right away.
> The last start time on zircon4 was around at 15:48.
> /opt/hp/squser4/git150112dcs/logs> ps -C mxosrvr | wc –l
> 129
> So this is showing that we have 128 mxosrvrs running on this node + 1 header line from the PS command.
> /opt/hp/squser4/git150112dcs/logs> ls master_exec* | wc –l
> 598
> This is telling that we have 598 master_exec* file under the directory.    And yes, these are all from this start as the mod date is all 15:48.
> /opt/hp/squser4/git150112dcs/logs> ls -lt master*
> -rw-r----- 1 squser4 seaquest 42037 Jan 13 15:48 master_exec_0_21145.log
> -rw-r----- 1 squser4 seaquest 42543 Jan 13 15:48 master_exec_0_21018.log
> -rw-r----- 1 squser4 seaquest 41706 Jan 13 15:48 master_exec_0_21012.log
> -rw-r----- 1 squser4 seaquest 41375 Jan 13 15:48 master_exec_0_20991.log
> …
> -rw-r----- 1 squser4 seaquest  1499 Jan 13 15:48 master_exec_0_63779.log
> -rw-r----- 1 squser4 seaquest  1168 Jan 13 15:48 master_exec_0_63462.log
> -rw-r----- 1 squser4 seaquest  1499 Jan 13 15:48 master_exec_0_63652.log
> -rw-r----- 1 squser4 seaquest   662 Jan 13 15:48 master_exec_0_63673.log
> -rw-r----- 1 squser4 seaquest   331 Jan 13 15:48 master_exec_0_63598.log
> -rw-r----- 1 squser4 seaquest   837 Jan 13 15:48 master_exec_0_63667.log
> -rw-r----- 1 squser4 seaquest     0 Jan 13 15:48 master_exec_0_63594.log
> -rw-r----- 1 squser4 seaquest     0 Jan 13 15:47 master_exec
> Looking at some of those:
> /opt/hp/squser4/git150112dcs/logs> ps 63667
>   PID TTY      STAT   TIME COMMAND
> /opt/hp/squser4/git150112dcs/logs> cat master_exec_0_63667.log 
> 2015-01-13 15:48:27,292, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error  Platform: NSK, Transport: TCPIP, Api: UNKNOWN_API, Error type: LISTENER, Process: ListenToPort, Operation: INIT_PROCESS, function: ACCEPT, error: 98, error_detail: 0.  has occurred.
> CEE Error Text: The specified address is already in use.
> 2015-01-13 15:48:27,293, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: bailout ListenToPort[21002][-2], Operation: INIT_PROCESS, function: SOCKET, error: 0, error_detail: 0.  has occurred.
> CEE Error Text: .
> 2015-01-13 15:48:27,293, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:63667, Process Name:$Z001GZ2 , , ,A network component error 5012 has occurred.
> CEE Error Text: <5>.
> The above is showing that 63667 did not survive as an mxosrvr.
> /opt/hp/squser4/git150112dcs/logs> ps 65102
>   PID TTY      STAT   TIME COMMAND
> 65102 ?        Sl     0:07 mxosrvr -ZKHOST n014.cm.cluster:2181,n015.cm.cluster:2181,n016.cm.cluster:2181 -RZ zircon-n011.usa.hp.com
> /opt/hp/squser4/git150112dcs/logs> cat master_exec_0_65102.log 
> 2015-01-13 15:48:27,832, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:65102, Process Name:$Z001I52 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: verifyPortAvailable:[21002], Operation: INIT_PROCESS, function: BIND, error: 98, error_detail: 0.  has occurred.
> CEE Error Text: The specified address is already in use.
> 2015-01-13 15:48:27,832, ERROR, MXOSRVR, Node Number: 0, CPU: 0, PIN:65102, Process Name:$Z001I52 , , ,A network component error UNKNOWN_ERROR_TYPE 0, Process: verifyPortAvailable:[21003], Operation: INIT_PROCESS, function: BIND, error: 98, error_detail: 0.  has occurred.
> CEE Error Text: The specified address is already in use.
> ****   The above msg is giving 13 more times.  (Exact same date time).
> But finally it worked as the process is running.
> If I was to look at all the master_exec above I would find all of them have at least one error and (598  - 128) ~> 470 of them did not start.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)