Posted to dev@trafodion.apache.org by Radu Marias <ra...@gmail.com> on 2015/10/06 18:21:24 UTC

trafodion won't start core files are generated

Hi,

At some point a node in the 5-node cluster stopped and we needed to
restart it. After that I restarted all the ambari and hdp services, but
trafodion fails to start.

Below are some stack traces, plus details for the core files from which I
can't get any stack. The files are from node1 and node2, dated Oct  2 (when
I think node 2 went down) and Oct  6 (when we rebooted the node and tried to
start trafodion). Feel free to connect and debug the issue on our cluster;
Amanda has the credentials.

*FROM NODE1*

Oct  2 22:27 core.39347
core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000 00009 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39347
no stack

Oct  2 22:41 core.15144
Program terminated with signal 6, Aborted.
#0  0x00007f77bcbbb625 in ?? ()
#1  0x00007f77bcbbce05 in ?? ()
#2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
#3  0x00007f77bee62130 in ?? ()
#4  0x00007ffe8e796ec0 in ?? ()
#5  0x00007f77bdeced00 in ?? ()
#6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
#7  0x0000000001b3a310 in ?? ()
#8  0x0000000000000000 in ?? ()

Oct  2 22:41 core.39240
#0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
#1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
#2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000046e213 in CExtTmLeaderReq::performRequest (this=0x7f53340008c0) at reqtmleader.cxx:126
#5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value optimized out>) at reqworker.cxx:79
#6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
#7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6

Oct  2 22:41 core.15309
core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000 00134 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.15309
no stack


*FROM NODE2*

Oct  2 22:29 core.39491
core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001 00003 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39491
no stack

Oct  6 15:23 core.1394
Program terminated with signal 6, Aborted.
#0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
#1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
#2  0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
#3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448 "euve79672", pnid=0, rank=0) at pnode.cxx:153
#4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
#5  0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x20757b0) at cluster.cxx:2740
#6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at cluster.cxx:567
#7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x20757b0) at tmsync.cxx:137
#8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0, procTermSig=9) at monitor.cxx:323
#9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at monitor.cxx:1152

Oct  6 15:43 core.17626
Program terminated with signal 6, Aborted.
#0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
#1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
#2  0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
#3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458 "euve79672", pnid=0, rank=0) at pnode.cxx:153
#4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
#5  0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x11867c0) at cluster.cxx:2740
#6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at cluster.cxx:567
#7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x11867c0) at tmsync.cxx:137
#8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0, procTermSig=9) at monitor.cxx:323
#9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at monitor.cxx:1152
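
A possible explanation for the "no stack" entries above: file reports those
cores as coming from a 'tm' process, while gdb was pointed at the
tdm_udrserv binary. gdb generally needs the executable that produced the
core in order to resolve frames. A minimal sketch of pulling the program
name out of the file output, assuming that binary can be found on PATH (the
helper name core_program is made up for illustration):

```shell
#!/bin/sh
# core_program: hypothetical helper, not part of Trafodion.
# Takes the text printed by `file core.NNNN` and prints the first word
# inside the from '...' part, i.e. the program that dumped the core.
core_program() {
    printf '%s\n' "$1" | sed -n "s/.*from '\([^ ']*\).*/\1/p"
}

# Example usage (assumes the program is on PATH; adjust to the install):
#   prog=$(core_program "$(file core.39347)")
#   gdb -batch -ex bt "$(command -v "$prog")" core.39347
```

With the first core above this prints tm, so the backtrace would be attempted
against the tm executable rather than tdm_udrserv.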

-- 
And in the end, it's not the years in your life that count. It's the life
in your years.

Re: trafodion won't start core files are generated

Posted by Radu Marias <ra...@gmail.com>.
java -version also shows the problem, but I mainly hit it when accessing
hbase and in the hbase shell.
I'm now trying a wrapper around the java file in jdk/bin that starts it with
-Xmx512m. It seems to work for now, but I'll have to see the impact on other
java processes.
On Tue, Oct 13, 2015, 17:54 Suresh Subbiah <su...@gmail.com>
wrote:

> Hi Radu,
>
> Is it possible to tell which process is unable to start java? Or is it that
> none of the java processes are starting, including datanodes and
> regionservers ?
>
> Thanks
> Suresh
>
> On Tue, Oct 13, 2015 at 4:27 AM, Radu Marias <ra...@gmail.com> wrote:
>
> > Managed to start trafodion with the latest daily build. But now I'm
> > having some OpenVZ container issues when a java process is started:
> > Error occurred during initialization of VM
> > Could not reserve enough space for object heap
> >
> > Tried with alias to java as alias java="java -Xms128m -Xmx512m" and also
> > with JAVA_TOOL_OPTIONS but the same. Searching now for other fixes.
> >
> > On Tue, Oct 13, 2015 at 12:02 AM Steve Varnau <st...@esgyn.com>
> > wrote:
> >
> > > In upcoming changes to Jenkins automation, I will add the daily build
> > > downloads link to the daily-build test result email that gets sent to
> > > this list.
> > >
> > > --Steve
> > >
> > > -----Original Message-----
> > > From: Roberta Marton [mailto:roberta.marton@esgyn.com]
> > > Sent: Thursday, October 8, 2015 9:51 AM
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: RE: trafodion won't start core files are generated
> > >
> > > Is this something that should be added to the Apache Trafodion
> > > website/wiki?
> > >
> > >      Roberta
> > >
> > > -----Original Message-----
> > > From: Steve Varnau [mailto:steve.varnau@esgyn.com]
> > > Sent: Thursday, October 8, 2015 9:47 AM
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: RE: trafodion won't start core files are generated
> > >
> > > Daily builds for development/test are posted at
> > > http://traf-downloads.esgyn.com/
> > >
> > > --Steve
> > >
> > > -----Original Message-----
> > > From: Suresh Subbiah [mailto:suresh.subbiah60@gmail.com]
> > > Sent: Thursday, October 8, 2015 7:10 AM
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: Re: trafodion won't start core files are generated
> > >
> > > Hi,
> > >
> > > What is the suggested procedure to pick up a daily build?
> > >
> > > Thanks
> > > Suresh
> > >
> > > On Thu, Oct 8, 2015 at 1:02 AM, Prashanth Vasudev <
> > > prashanth.vasudev@esgyn.com> wrote:
> > >
> > > > Memorymonitor.cpp fix is part of this
> > > > https://issues.apache.org/jira/browse/TRAFODION-1492
> > > > Please pick up latest daily build.
> > > >
> > > > Also max locked memory 64kb below appears very small.
> > > >
> > > > Regards,
> > > > Prashanth
> > > >
> > > > -----Original Message-----
> > > > From: Radu Marias [mailto:radumarias@gmail.com]
> > > > Sent: Wednesday, October 7, 2015 8:45 AM
> > > > To: dev <de...@trafodion.incubator.apache.org>
> > > > Subject: Re: trafodion won't start core files are generated
> > > >
> > > > Hi,
> > > >
> > > > I have these:
> > > >
> > > > # pwd
> > > > /dev/shm
> > > > # ls -la
> > > > total 4
> > > > drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
> > > > drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
> > > > -rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07
> > > > sem.monitor.sem.trafodion
> > > >
> > > > kernel.shmmax = 68719476736
> > > > kernel.shmall = 4294967296
> > > >
> > > > # ulimit -a
> > > > core file size          (blocks, -c) 0
> > > > data seg size           (kbytes, -d) unlimited
> > > > scheduling priority             (-e) 0
> > > > file size               (blocks, -f) unlimited
> > > > pending signals                 (-i) 1805076
> > > > max locked memory       (kbytes, -l) 64
> > > > max memory size         (kbytes, -m) unlimited
> > > > open files                      (-n) 65535
> > > > pipe size            (512 bytes, -p) 8
> > > > POSIX message queues     (bytes, -q) 819200
> > > > real-time priority              (-r) 0
> > > > stack size              (kbytes, -s) 10240
> > > > cpu time               (seconds, -t) unlimited
> > > > max user processes              (-u) 65535
> > > > virtual memory          (kbytes, -v) unlimited
> > > > file locks                      (-x) unlimited
> > > >
> > > > > I would try to reinstall trafodion to see if something got corrupted
> > > > and maybe that would fix the issue but I know there was a crash on
> > > > sqstart and one of your guys fixed it and copied the lib file to our
> > > > cluster:
> > > >
> > > > This is a response from Narendra in a previous thread where the issue
> > > > was fixed to start the trafodion:
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > *I updated the code: sql/cli/memmonitor.cpp, so that if
> > > > > /proc/meminfo does not have the ‘Committed_AS’ entry, it will
> ignore
> > > > > it. Built it and put the
> > > > > binary: libcli.so on the veracity box (in the
> > > > > $MY_SQROOT/export/lib64 directory – on all the nodes). Restarted
> the
> > > > > env and ‘sqlci’ worked fine.
> > > > > Was able to ‘initialize trafodion’ and create a table.*
> > > >
> > > >
> > > > > There was another similar issue, which I see is closed:
> > > > > https://issues.apache.org/jira/browse/TRAFODION-1492
> > > > >
> > > > > So the question is: are these fixes in the latest daily build, so
> > > > > that I can try to reinstall? If not, please send the changed files
> > > > > so I can overwrite them after the reinstall.
> > > >
> > > > On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
> > > > selva.govindarajan@esgyn.com> wrote:
> > > >
> > > > > > You would want to retain the shared segment size across reboots.
> > > > > > So, please check if the following settings are available in
> > > > > > /etc/sysctl.conf
> > > > > >
> > > > > > # Controls the maximum shared segment size, in bytes
> > > > > > kernel.shmmax = 134217728
> > > > > >
> > > > > > # Controls the maximum number of shared memory segments, in pages
> > > > > > kernel.shmall = 4294967296
> > > > > >
> > > > > >
> > > > > > shmmax needs to be at least 64 MB. By default, Trafodion RMS shared
> > > > > > segment size is 64 MB. Trafodion RMS shared segment can be expanded
> > > > > > to 128 MB. So, it is better to set shmmax to 128 MB, just in case we
> > > > > > need to expand it later.
> > > > >
> > > > > Selva
> > > > >
> > > > > -----Original Message-----
> > > > > From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
> > > > > Sent: Tuesday, October 6, 2015 2:19 PM
> > > > > To: dev@trafodion.incubator.apache.org
> > > > > Subject: RE: trafodion won't start core files are generated
> > > > >
> > > > > Hi,
> > > > > > From the stack trace below, it appears the trafodion monitor is
> > > > > > unable to create shared memory objects.
> > > > > > Please make sure the ulimit settings on all nodes have high limits
> > > > > > for max locked memory.
> > > > > > Also make sure /dev/shm on all nodes has the correct write
> > > > > > permissions for the trafodion user id.
> > > > >
> > > > > Regards,
> > > > > Prashanth
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > And in the end, it's not the years in your life that count. It's the
> > > > life in your years.
> > > >
> > >
> >
>
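
The checks suggested in the quoted replies (shmmax of at least 128 MB, a
higher max locked memory limit, /dev/shm writable by the trafodion user)
could be run on each node with something like the sketch below. The helper
name shmmax_ok is made up for illustration and is not part of Trafodion;
the 128 MB threshold is the value recommended in the thread.

```shell
#!/bin/sh
# shmmax_ok: hypothetical helper. Succeeds when the given kernel.shmmax
# value (in bytes) is at least 128 MB (134217728 bytes).
shmmax_ok() {
    # Some kernels report a default too large for shell integer tests,
    # so any value with more than 18 digits is accepted outright.
    [ "${#1}" -gt 18 ] || [ "$1" -ge 134217728 ]
}

# Example usage, run as the trafodion user on each node:
#   shmmax_ok "$(cat /proc/sys/kernel/shmmax)" || echo "kernel.shmmax is below 128 MB"
#   ulimit -l                                   # max locked memory, in kbytes
#   [ -w /dev/shm ] || echo "/dev/shm is not writable by $(id -un)"
```

The shmmax value posted in the thread (68719476736) passes this check,
whereas the 64 kbytes shown for max locked memory is the limit that was
flagged as very small.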

Re: trafodion won't start core files are generated

Posted by Suresh Subbiah <su...@gmail.com>.
Hi Radu,

Is it possible to tell which process is unable to start java? Or is it that
none of the java processes are starting, including datanodes and
regionservers?

Thanks
Suresh

Re: trafodion won't start core files are generated

Posted by Radu Marias <ra...@gmail.com>.
Managed to start trafodion with the latest daily build. But now I'm having
some OpenVZ container issues when a java process is started:
Error occurred during initialization of VM
Could not reserve enough space for object heap

I tried aliasing java with alias java="java -Xms128m -Xmx512m" and also
setting JAVA_TOOL_OPTIONS, but got the same error. Searching now for other
fixes.

On Tue, Oct 13, 2015 at 12:02 AM Steve Varnau <st...@esgyn.com>
wrote:

> In upcoming changes to Jenkins automation, I will add the daily build
> downloads link to the daily-build test result email that gets sent to this
> list.
>
> --Steve
>
> -----Original Message-----
> From: Roberta Marton [mailto:roberta.marton@esgyn.com]
> Sent: Thursday, October 8, 2015 9:51 AM
> To: dev@trafodion.incubator.apache.org
> Subject: RE: trafodion won't start core files are generated
>
> Is this something that should be added to the Apache Trafodion
> website/wiki?
>
>      Roberta
>
> -----Original Message-----
> From: Steve Varnau [mailto:steve.varnau@esgyn.com]
> Sent: Thursday, October 8, 2015 9:47 AM
> To: dev@trafodion.incubator.apache.org
> Subject: RE: trafodion won't start core files are generated
>
> Daily builds for development/test are posted at
> http://traf-downloads.esgyn.com/
>
> --Steve
>
> -----Original Message-----
> From: Suresh Subbiah [mailto:suresh.subbiah60@gmail.com]
> Sent: Thursday, October 8, 2015 7:10 AM
> To: dev@trafodion.incubator.apache.org
> Subject: Re: trafodion won't start core files are generated
>
> Hi,
>
> What is the suggested procedure to pick up a daily build?
>
> Thanks
> Suresh
>
> On Thu, Oct 8, 2015 at 1:02 AM, Prashanth Vasudev <
> prashanth.vasudev@esgyn.com> wrote:
>
> > The Memorymonitor.cpp fix is part of
> > https://issues.apache.org/jira/browse/TRAFODION-1492
> > Please pick up the latest daily build.
> >
> > Also, the max locked memory limit of 64 KB shown below appears very small.
> >
> > Regards,
> > Prashanth
> >
> > -----Original Message-----
> > From: Radu Marias [mailto:radumarias@gmail.com]
> > Sent: Wednesday, October 7, 2015 8:45 AM
> > To: dev <de...@trafodion.incubator.apache.org>
> > Subject: Re: trafodion won't start core files are generated
> >
> > Hi,
> >
> > I have these:
> >
> > # pwd
> > /dev/shm
> > # ls -la
> > total 4
> > drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
> > drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
> > -rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07 sem.monitor.sem.trafodion
> >
> > kernel.shmmax = 68719476736
> > kernel.shmall = 4294967296
> >
> > # ulimit -a
> > core file size          (blocks, -c) 0
> > data seg size           (kbytes, -d) unlimited
> > scheduling priority             (-e) 0
> > file size               (blocks, -f) unlimited
> > pending signals                 (-i) 1805076
> > max locked memory       (kbytes, -l) 64
> > max memory size         (kbytes, -m) unlimited
> > open files                      (-n) 65535
> > pipe size            (512 bytes, -p) 8
> > POSIX message queues     (bytes, -q) 819200
> > real-time priority              (-r) 0
> > stack size              (kbytes, -s) 10240
> > cpu time               (seconds, -t) unlimited
> > max user processes              (-u) 65535
> > virtual memory          (kbytes, -v) unlimited
> > file locks                      (-x) unlimited
> >
> > I would try reinstalling Trafodion to see if something got corrupted,
> > which might fix the issue, but I know there was a crash on sqstart
> > that one of your guys fixed by copying a rebuilt lib file to our
> > cluster.
> >
> > This is a response from Narendra in a previous thread, describing the
> > fix that allowed Trafodion to start:
> >
> > >
> > >
> > >
> > > *I updated the code: sql/cli/memmonitor.cpp, so that if
> > > /proc/meminfo does not have the ‘Committed_AS’ entry, it will ignore
> > > it. Built it and put the
> > > binary: libcli.so on the veracity box (in the
> > > $MY_SQROOT/export/lib64 directory – on all the nodes). Restarted the
> > > env and ‘sqlci’ worked fine.
> > > Was able to ‘initialize trafodion’ and create a table.*
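
The condition behind that fix can be checked directly on a node. The sketch below reports whether /proc/meminfo exposes a Committed_AS entry, which the quoted message says was missing on this system. has_committed_as is a hypothetical helper name used only for illustration; it is not part of Trafodion.

```shell
#!/usr/bin/env bash
# Sketch only: has_committed_as is a hypothetical helper, not Trafodion code.
# It succeeds if a meminfo-style file contains a Committed_AS entry.
has_committed_as() {
  local meminfo="${1:-/proc/meminfo}"
  grep -q '^Committed_AS:' "$meminfo" 2>/dev/null
}

if has_committed_as; then
  echo "Committed_AS present: $(grep '^Committed_AS:' /proc/meminfo)"
else
  echo "Committed_AS missing: a build without the fix may fail here"
fi
```

If the entry is missing, that matches the situation the quoted fix was written for, so a build that includes the change tracked in TRAFODION-1492 would be the one to try.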
> >
> >
> > There was another similar issue, which I see is closed:
> > https://issues.apache.org/jira/browse/TRAFODION-1492
> >
> > So the question is: are these fixes in the latest daily build, so that
> > I can try to reinstall? If not, please send the changed files so I can
> > overwrite them after reinstalling.
> >
> > On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
> > selva.govindarajan@esgyn.com> wrote:
> >
> > > You would want to retain the shared segment size across reboots. So,
> > > please check if the following settings are available in
> > > /etc/sysctl.conf
> > >
> > > # Controls the maximum shared segment size, in bytes
> > > kernel.shmmax = 134217728
> > >
> > > # Controls the maximum number of shared memory segments, in pages
> > > kernel.shmall = 4294967296
> > >
> > >
> > > shmmax needs to be at least 64 MB. By default, the Trafodion RMS shared
> > > segment size is 64 MB, and it can be expanded to 128 MB. So it is
> > > better to set shmmax to 128 MB, just in case we need to expand it
> > > later.
> > >
> > > Selva
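
The shmmax advice above can be turned into a quick per-node check. This is a minimal sketch: shmmax_ok is a hypothetical helper name, and the 134217728-byte (128 MB) threshold is taken from the message above. Very large values are handled separately because some kernels report a default shmmax too big for the shell's integer comparison.

```shell
#!/usr/bin/env bash
# Sketch only: shmmax_ok is a hypothetical helper, not part of Trafodion.
# 134217728 bytes = 128 MB, the value suggested in the message above.
REQUIRED_SHMMAX=134217728

shmmax_ok() {
  local current="$1"
  # Values longer than 18 digits exceed the threshold and may overflow
  # the shell's signed 64-bit integer comparison, so accept them directly.
  if [ "${#current}" -gt 18 ]; then
    return 0
  fi
  [ "$current" -ge "$REQUIRED_SHMMAX" ]
}

# Read the live value if this kernel exposes it (Linux only).
if [ -r /proc/sys/kernel/shmmax ]; then
  current="$(cat /proc/sys/kernel/shmmax)"
  if shmmax_ok "$current"; then
    echo "kernel.shmmax=$current is large enough"
  else
    echo "kernel.shmmax=$current is below $REQUIRED_SHMMAX"
  fi
fi
```

The kernel.shmmax value of 68719476736 reported elsewhere in this thread passes this check. To keep a setting across reboots it has to be present in /etc/sysctl.conf, as the message says; running sysctl -p as root reloads that file without a reboot.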
> > >
> > > -----Original Message-----
> > > From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
> > > Sent: Tuesday, October 6, 2015 2:19 PM
> > > To: dev@trafodion.incubator.apache.org
> > > Subject: RE: trafodion won't start core files are generated
> > >
> > > Hi,
> > > From the stack trace below, it appears the Trafodion monitor is unable
> > > to create shared memory objects.
> > > Please make sure the ulimit settings on all nodes have high limits for
> > > max locked memory.
> > > Also make sure /dev/shm on all nodes has the correct write
> > > permissions for the trafodion user ID.
> > >
> > > Regards,
> > > Prashanth
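
Both checks suggested above can be scripted and run as the trafodion user on each node. This is a sketch under stated assumptions: memlock_ok and shm_writable are hypothetical helper names, and the 65536 KB threshold is purely illustrative, since the thread does not say how high the max locked memory limit needs to be.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of Trafodion.

# memlock_ok LIMIT MIN_KB: succeed if LIMIT (as printed by `ulimit -l`,
# in kilobytes or the word "unlimited") is at least MIN_KB.
memlock_ok() {
  local limit="$1" min_kb="$2"
  [ "$limit" = "unlimited" ] || [ "$limit" -ge "$min_kb" ]
}

# shm_writable DIR: succeed if DIR exists and the current user can write to it.
shm_writable() {
  local dir="${1:-/dev/shm}"
  [ -d "$dir" ] && [ -w "$dir" ]
}

# MIN_KB is an illustrative threshold, not a documented Trafodion requirement.
MIN_KB=65536
limit="$(ulimit -l)"
if memlock_ok "$limit" "$MIN_KB"; then
  echo "max locked memory: $limit (ok)"
else
  echo "max locked memory: $limit (below ${MIN_KB} KB)"
fi

if shm_writable /dev/shm; then
  echo "/dev/shm is writable"
else
  echo "/dev/shm is not writable by $(id -un)"
fi
```

With the limit of 64 shown in the ulimit output in this thread, the first check fails against any threshold above 64 KB. Persistent per-user limits are normally set through the memlock item in /etc/security/limits.conf, which takes effect at the next login.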
> > >
> > > -----Original Message-----
> > > From: Radu Marias [mailto:radumarias@gmail.com]
> > > Sent: Tuesday, October 6, 2015 9:21 AM
> > > To: dev <de...@trafodion.incubator.apache.org>
> > > Subject: trafodion won't start core files are generated
> > >
> > > Hi,
> > >
> > > At some point a node in the 5-node cluster stopped and we
> > > needed to restart it. After that I restarted all the Ambari and
> > > HDP services, but Trafodion fails to start.
> > >
> > > Below are some stack traces, plus details for the core files for which
> > > I'm not getting any stack. The files are from node1 and node2 and are
> > > dated Oct  2 (when I think node 2 was down) and Oct  6 (when we
> > > rebooted the node and tried to start Trafodion). Feel free to connect
> > > and debug the issue on our cluster; Amanda has the credentials.
> > >
> > > *FROM NODE1*
> > >
> > > Oct  2 22:27 core.39347
> > > core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000 00009 SPAR'
> > > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39347
> > > no stack
> > >
> > > Oct  2 22:41 core.15144
> > > Program terminated with signal 6, Aborted.
> > > #0  0x00007f77bcbbb625 in ?? ()
> > > #1  0x00007f77bcbbce05 in ?? ()
> > > #2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
> > > #3  0x00007f77bee62130 in ?? ()
> > > #4  0x00007ffe8e796ec0 in ?? ()
> > > #5  0x00007f77bdeced00 in ?? ()
> > > #6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
> > > #7  0x0000000001b3a310 in ?? ()
> > > #8  0x0000000000000000 in ?? ()
> > >
> > > Oct  2 22:41 core.39240
> > > #0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
> > > #1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
> > > #2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
> > > #3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
> > > #4  0x000000000046e213 in CExtTmLeaderReq::performRequest (this=0x7f53340008c0) at reqtmleader.cxx:126
> > > #5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value optimized out>) at reqworker.cxx:79
> > > #6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
> > > #7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
> > > #8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6
> > >
> > > Oct  2 22:41 core.15309
> > > core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000 00134 SPAR'
> > > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.15309
> > > no stack
> > >
> > >
> > > *FROM NODE2*
> > >
> > > Oct  2 22:29 core.39491
> > > core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001 00003 SPAR'
> > > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv core.39491
> > > no stack
> > >
> > > Oct  6 15:23 core.1394
> > > Program terminated with signal 6, Aborted.
> > > #0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
> > > #1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
> > > #2  0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
> > > #3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448 "euve79672", pnid=0, rank=0) at pnode.cxx:153
> > > #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
> > > #5  0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x20757b0) at cluster.cxx:2740
> > > #6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at cluster.cxx:567
> > > #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x20757b0) at tmsync.cxx:137
> > > #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0, procTermSig=9) at monitor.cxx:323
> > > #9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at monitor.cxx:1152
> > >
> > > Oct  6 15:43 core.17626
> > > Program terminated with signal 6, Aborted.
> > > #0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
> > > #1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
> > > #2  0x000000000041d07d in CProcessContainer::CProcessContainer (this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
> > > #3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458 "euve79672", pnid=0, rank=0) at pnode.cxx:153
> > > #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1564
> > > #5  0x00000000004169a5 in CCluster::InitializeConfigCluster (this=0x11867c0) at cluster.cxx:2740
> > > #6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at cluster.cxx:567
> > > #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container (this=0x11867c0) at tmsync.cxx:137
> > > #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0, procTermSig=9) at monitor.cxx:323
> > > #9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at monitor.cxx:1152
> > >
> > > --
> > > And in the end, it's not the years in your life that count. It's the
> > > life in your years.
> > >
> >
> >
> >
> > --
> > And in the end, it's not the years in your life that count. It's the
> > life in your years.
> >
>

RE: trafodion won't start core files are generated

Posted by Steve Varnau <st...@esgyn.com>.
In upcoming changes to Jenkins automation, I will add the daily build
downloads link to the daily-build test result email that gets sent to this
list.

--Steve

RE: trafodion won't start core files are generated

Posted by Roberta Marton <ro...@esgyn.com>.
Is this something that should be added to the Apache Trafodion website/wiki?

     Roberta

RE: trafodion won't start core files are generated

Posted by Steve Varnau <st...@esgyn.com>.
Daily builds for development/test are posted at
http://traf-downloads.esgyn.com/

--Steve

-----Original Message-----
From: Suresh Subbiah [mailto:suresh.subbiah60@gmail.com]
Sent: Thursday, October 8, 2015 7:10 AM
To: dev@trafodion.incubator.apache.org
Subject: Re: trafodion won't start core files are generated

Hi,

What is the suggested procedure to pick up a daily build?

Thanks
Suresh

On Thu, Oct 8, 2015 at 1:02 AM, Prashanth Vasudev <
prashanth.vasudev@esgyn.com> wrote:

> The Memorymonitor.cpp fix is part of
> https://issues.apache.org/jira/browse/TRAFODION-1492
> Please pick up the latest daily build.
>
> Also, the max locked memory limit of 64 KB shown below appears very small.
>
> Regards,
> Prashanth
>
> -----Original Message-----
> From: Radu Marias [mailto:radumarias@gmail.com]
> Sent: Wednesday, October 7, 2015 8:45 AM
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: Re: trafodion won't start core files are generated
>
> Hi,
>
> I have these:
>
> # pwd
> /dev/shm
> # ls -la
> total 4
> drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
> drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
> -rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07 sem.monitor.sem.trafodion
>
> kernel.shmmax = 68719476736
> kernel.shmall = 4294967296
>
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1805076
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 65535
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> I would try to reinstall Trafodion to see if something got corrupted, and
> maybe that would fix the issue, but I know there was a crash on sqstart
> that one of your guys fixed by copying a patched lib file to our cluster.
>
> This is the response from Narendra in a previous thread, where the issue
> that kept Trafodion from starting was fixed:
>
>
> >
> >
> >
> > *I updated the code: sql/cli/memmonitor.cpp, so that if /proc/meminfo
> > does not have the ‘Committed_AS’ entry, it will ignore it. Built it
> > and put the
> > binary: libcli.so on the veracity box (in the $MY_SQROOT/export/lib64
> > directory – on all the nodes). Restarted the env and ‘sqlci’ worked
> > fine.
> > Was able to ‘initialize trafodion’ and create a table.*
>
>
> There was another, similar issue, which I see is closed:
> https://issues.apache.org/jira/browse/TRAFODION-1492
>
> So the question is: are these fixes in the latest daily build, so that I
> can try to reinstall? If not, please send the changed files so I can
> overwrite them after reinstalling.
>
> On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
> selva.govindarajan@esgyn.com> wrote:
>
> > You would want to retain the shared segment size across reboots. So,
> > please check if the following settings are available in
> > /etc/sysctl.conf
> >
> > # Controls the maximum shared segment size, in bytes
> > kernel.shmmax = 134217728
> >
> > # Controls the maximum number of shared memory segments, in pages
> > kernel.shmall = 4294967296
> >
> >
> > shmmax needs to be at least 64 MB. By default, Trafodion RMS shared
> > segment size is 64 MB. Trafodion RMS shared segment can be expanded to
> > 128 MB. So, it is better to set shmmax to 128 mb, just in case we need
> > to expand it later.
> >
> > Selva
> >
> > -----Original Message-----
> > From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
> > Sent: Tuesday, October 6, 2015 2:19 PM
> > To: dev@trafodion.incubator.apache.org
> > Subject: RE: trafodion won't start core files are generated
> >
> > Hi,
> > From the stack trace below, it appears the Trafodion monitor is unable
> > to create shared memory objects.
> > Please make sure the ulimit settings on all nodes have high limits for
> > max locked memory.
> > Also make sure /dev/shm on all nodes has the correct write
> > permissions for the trafodion user ID.
> >
> > Regards,
> > Prashanth
> >
> > -----Original Message-----
> > From: Radu Marias [mailto:radumarias@gmail.com]
> > Sent: Tuesday, October 6, 2015 9:21 AM
> > To: dev <de...@trafodion.incubator.apache.org>
> > Subject: trafodion won't start core files are generated
> >
> > Hi,
> >
> > At some point a node in the 5-node cluster stopped and we needed to
> > restart it. After that I restarted all the Ambari and HDP services,
> > but Trafodion fails to start.
> >
> > Below are some stack traces, plus details for the core files from
> > which I'm not getting any stack. The files are from node1 and node2
> > and date from Oct  2 (when I think node 2 was down) and Oct  6 (when
> > we rebooted the node and tried to start Trafodion). Feel free to
> > connect and debug the issue on our cluster; Amanda has the
> > credentials.
> >
> > *FROM NODE1*
> >
> > Oct  2 22:27 core.39347
> > core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> > SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0
> > 188.138.61.175:60186 00002 00000
> > 00009 SPAR'
> > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> > core.39347
> > no stack
> >
> > Oct  2 22:41 core.15144
> > Program terminated with signal 6, Aborted.
> > #0  0x00007f77bcbbb625 in ?? ()
> > #1  0x00007f77bcbbce05 in ?? ()
> > #2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
> > #3  0x00007f77bee62130 in ?? ()
> > #4  0x00007ffe8e796ec0 in ?? ()
> > #5  0x00007f77bdeced00 in ?? ()
> > #6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
> > #7  0x0000000001b3a310 in ?? ()
> > #8  0x0000000000000000 in ?? ()
> >
> > Oct  2 22:41 core.39240
> > #0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
> > #1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
> > #2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
> > #3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
> > #4  0x000000000046e213 in CExtTmLeaderReq::performRequest
> > (this=0x7f53340008c0) at reqtmleader.cxx:126
> > #5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value
> > optimized
> > out>) at reqworker.cxx:79
> > #6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at
> > reqworker.cxx:147
> > #7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
> > #8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6
> >
> > Oct  2 22:41 core.15309
> > core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> > SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0
> > 188.138.61.175:60186 00002 00000
> > 00134 SPAR'
> > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> > core.15309
> > no stack
> >
> >
> > *FROM NODE2*
> >
> > Oct  2 22:29 core.39491
> > core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> > SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1
> > 188.138.61.177:38680 00002 00001
> > 00003 SPAR'
> > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> > core.39491
> > no stack
> >
> > Oct  6 15:23 core.1394
> > Program terminated with signal 6, Aborted.
> > #0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
> > #1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
> > #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> > (this=0x2071880, nodeContainer=<value optimized out>) at
> > process.cxx:3366
> > #3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448
> > "euve79672", pnid=0, rank=0) at pnode.cxx:153
> > #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value
> > optimized
> > out>) at pnode.cxx:1564
> > #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> > (this=0x20757b0) at cluster.cxx:2740
> > #6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at
> > cluster.cxx:567
> > #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> > (this=0x20757b0) at tmsync.cxx:137
> > #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0,
> > procTermSig=9) at monitor.cxx:323
> > #9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at
> > monitor.cxx:1152
> >
> > Oct  6 15:43 core.17626
> > Program terminated with signal 6, Aborted.
> > #0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
> > #1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
> > #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> > (this=0x1182890, nodeContainer=<value optimized out>) at
> > process.cxx:3366
> > #3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458
> > "euve79672", pnid=0, rank=0) at pnode.cxx:153
> > #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value
> > optimized
> > out>) at pnode.cxx:1564
> > #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> > (this=0x11867c0) at cluster.cxx:2740
> > #6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at
> > cluster.cxx:567
> > #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> > (this=0x11867c0) at tmsync.cxx:137
> > #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0,
> > procTermSig=9) at monitor.cxx:323
> > #9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at
> > monitor.cxx:1152
> >
> > --
> > And in the end, it's not the years in your life that count. It's the
> > life in your years.
> >
>
>
>
> --
> And in the end, it's not the years in your life that count. It's the life
> in your years.
>

Re: trafodion won't start core files are generated

Posted by Suresh Subbiah <su...@gmail.com>.
Hi,

What is the suggested procedure to pick up a daily build?

Thanks
Suresh

On Thu, Oct 8, 2015 at 1:02 AM, Prashanth Vasudev <
prashanth.vasudev@esgyn.com> wrote:

> The Memorymonitor.cpp fix is part of this:
> https://issues.apache.org/jira/browse/TRAFODION-1492
> Please pick up the latest daily build.
>
> Also, the max locked memory of 64 KB shown below appears very small.
>
> Regards,
> Prashanth
>
> -----Original Message-----
> From: Radu Marias [mailto:radumarias@gmail.com]
> Sent: Wednesday, October 7, 2015 8:45 AM
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: Re: trafodion won't start core files are generated
>
> Hi,
>
> I have these:
>
> # pwd
> /dev/shm
> # ls -la
> total 4
> drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
> drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
> -rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07 sem.monitor.sem.trafodion
>
> kernel.shmmax = 68719476736
> kernel.shmall = 4294967296
>
> # ulimit -a
> core file size          (blocks, -c) 0
> data seg size           (kbytes, -d) unlimited
> scheduling priority             (-e) 0
> file size               (blocks, -f) unlimited
> pending signals                 (-i) 1805076
> max locked memory       (kbytes, -l) 64
> max memory size         (kbytes, -m) unlimited
> open files                      (-n) 65535
> pipe size            (512 bytes, -p) 8
> POSIX message queues     (bytes, -q) 819200
> real-time priority              (-r) 0
> stack size              (kbytes, -s) 10240
> cpu time               (seconds, -t) unlimited
> max user processes              (-u) 65535
> virtual memory          (kbytes, -v) unlimited
> file locks                      (-x) unlimited
>
> I would try to reinstall Trafodion to see if something got corrupted, and
> maybe that would fix the issue, but I know there was a crash on sqstart
> that one of your guys fixed by copying a patched lib file to our cluster.
>
> This is the response from Narendra in a previous thread, where the issue
> that kept Trafodion from starting was fixed:
>
>
> >
> >
> >
> > *I updated the code: sql/cli/memmonitor.cpp, so that if /proc/meminfo
> > does not have the ‘Committed_AS’ entry, it will ignore it. Built it
> > and put the
> > binary: libcli.so on the veracity box (in the $MY_SQROOT/export/lib64
> > directory – on all the nodes). Restarted the env and ‘sqlci’ worked fine.
> > Was able to ‘initialize trafodion’ and create a table.*
>
>
> There was another, similar issue, which I see is closed:
> https://issues.apache.org/jira/browse/TRAFODION-1492
>
> So the question is: are these fixes in the latest daily build, so that I
> can try to reinstall? If not, please send the changed files so I can
> overwrite them after reinstalling.
>
> On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
> selva.govindarajan@esgyn.com> wrote:
>
> > You would want to retain the shared segment size across reboots. So,
> > please check if the following settings are available in
> > /etc/sysctl.conf
> >
> > # Controls the maximum shared segment size, in bytes
> > kernel.shmmax = 134217728
> >
> > # Controls the maximum number of shared memory segments, in pages
> > kernel.shmall = 4294967296
> >
> >
> > shmmax needs to be at least 64 MB. By default, Trafodion RMS shared
> > segment size is 64 MB. Trafodion RMS shared segment can be expanded to
> > 128 MB. So, it is better to set shmmax to 128 mb, just in case we need
> > to expand it later.
> >
> > Selva
> >
> > -----Original Message-----
> > From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
> > Sent: Tuesday, October 6, 2015 2:19 PM
> > To: dev@trafodion.incubator.apache.org
> > Subject: RE: trafodion won't start core files are generated
> >
> > Hi,
> > From the stack trace below, it appears the Trafodion monitor is unable
> > to create shared memory objects.
> > Please make sure the ulimit settings on all nodes have high limits for
> > max locked memory.
> > Also make sure /dev/shm on all nodes has the correct write
> > permissions for the trafodion user ID.
> >
> > Regards,
> > Prashanth
> >
> > -----Original Message-----
> > From: Radu Marias [mailto:radumarias@gmail.com]
> > Sent: Tuesday, October 6, 2015 9:21 AM
> > To: dev <de...@trafodion.incubator.apache.org>
> > Subject: trafodion won't start core files are generated
> >
> > Hi,
> >
> > At some point a node in the 5-node cluster stopped and we needed to
> > restart it. After that I restarted all the Ambari and HDP services,
> > but Trafodion fails to start.
> >
> > Below are some stack traces, plus details for the core files from
> > which I'm not getting any stack. The files are from node1 and node2
> > and date from Oct  2 (when I think node 2 was down) and Oct  6 (when
> > we rebooted the node and tried to start Trafodion). Feel free to
> > connect and debug the issue on our cluster; Amanda has the
> > credentials.
> >
> > *FROM NODE1*
> >
> > Oct  2 22:27 core.39347
> > core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> > SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0
> > 188.138.61.175:60186 00002 00000
> > 00009 SPAR'
> > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> > core.39347
> > no stack
> >
> > Oct  2 22:41 core.15144
> > Program terminated with signal 6, Aborted.
> > #0  0x00007f77bcbbb625 in ?? ()
> > #1  0x00007f77bcbbce05 in ?? ()
> > #2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
> > #3  0x00007f77bee62130 in ?? ()
> > #4  0x00007ffe8e796ec0 in ?? ()
> > #5  0x00007f77bdeced00 in ?? ()
> > #6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
> > #7  0x0000000001b3a310 in ?? ()
> > #8  0x0000000000000000 in ?? ()
> >
> > Oct  2 22:41 core.39240
> > #0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
> > #1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
> > #2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
> > #3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
> > #4  0x000000000046e213 in CExtTmLeaderReq::performRequest
> > (this=0x7f53340008c0) at reqtmleader.cxx:126
> > #5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value
> > optimized
> > out>) at reqworker.cxx:79
> > #6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at
> > reqworker.cxx:147
> > #7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
> > #8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6
> >
> > Oct  2 22:41 core.15309
> > core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> > SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0
> > 188.138.61.175:60186 00002 00000
> > 00134 SPAR'
> > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> > core.15309
> > no stack
> >
> >
> > *FROM NODE2*
> >
> > Oct  2 22:29 core.39491
> > core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> > SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1
> > 188.138.61.177:38680 00002 00001
> > 00003 SPAR'
> > gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> > core.39491
> > no stack
> >
> > Oct  6 15:23 core.1394
> > Program terminated with signal 6, Aborted.
> > #0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
> > #1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
> > #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> > (this=0x2071880, nodeContainer=<value optimized out>) at
> > process.cxx:3366
> > #3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448
> > "euve79672", pnid=0, rank=0) at pnode.cxx:153
> > #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value
> > optimized
> > out>) at pnode.cxx:1564
> > #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> > (this=0x20757b0) at cluster.cxx:2740
> > #6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at
> > cluster.cxx:567
> > #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> > (this=0x20757b0) at tmsync.cxx:137
> > #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0,
> > procTermSig=9) at monitor.cxx:323
> > #9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at
> > monitor.cxx:1152
> >
> > Oct  6 15:43 core.17626
> > Program terminated with signal 6, Aborted.
> > #0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
> > #1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
> > #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> > (this=0x1182890, nodeContainer=<value optimized out>) at
> > process.cxx:3366
> > #3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458
> > "euve79672", pnid=0, rank=0) at pnode.cxx:153
> > #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value
> > optimized
> > out>) at pnode.cxx:1564
> > #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> > (this=0x11867c0) at cluster.cxx:2740
> > #6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at
> > cluster.cxx:567
> > #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> > (this=0x11867c0) at tmsync.cxx:137
> > #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0,
> > procTermSig=9) at monitor.cxx:323
> > #9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at
> > monitor.cxx:1152
> >
> > --
> > And in the end, it's not the years in your life that count. It's the
> > life in your years.
> >
>
>
>
> --
> And in the end, it's not the years in your life that count. It's the life
> in your years.
>

RE: trafodion won't start core files are generated

Posted by Prashanth Vasudev <pr...@esgyn.com>.
The Memorymonitor.cpp fix is part of this:
https://issues.apache.org/jira/browse/TRAFODION-1492
Please pick up the latest daily build.

Also, the max locked memory of 64 KB shown below appears very small.

Regards,
Prashanth

-----Original Message-----
From: Radu Marias [mailto:radumarias@gmail.com]
Sent: Wednesday, October 7, 2015 8:45 AM
To: dev <de...@trafodion.incubator.apache.org>
Subject: Re: trafodion won't start core files are generated

Hi,

I have these:

# pwd
/dev/shm
# ls -la
total 4
drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
-rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07 sem.monitor.sem.trafodion

kernel.shmmax = 68719476736
kernel.shmall = 4294967296

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1805076
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

I would try to reinstall Trafodion to see if something got corrupted, and
maybe that would fix the issue, but I know there was a crash on sqstart
that one of your guys fixed by copying a patched lib file to our cluster.

This is the response from Narendra in a previous thread, where the issue
that kept Trafodion from starting was fixed:


>
>
>
> *I updated the code: sql/cli/memmonitor.cpp, so that if /proc/meminfo
> does not have the ‘Committed_AS’ entry, it will ignore it. Built it
> and put the
> binary: libcli.so on the veracity box (in the $MY_SQROOT/export/lib64
> directory – on all the nodes). Restarted the env and ‘sqlci’ worked fine.
> Was able to ‘initialize trafodion’ and create a table.*


There was another, similar issue, which I see is closed:
https://issues.apache.org/jira/browse/TRAFODION-1492

So the question is: are these fixes in the latest daily build, so that I
can try to reinstall? If not, please send the changed files so I can
overwrite them after reinstalling.

On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
selva.govindarajan@esgyn.com> wrote:

> You would want to retain the shared segment size across reboots. So,
> please check if the following settings are available in
> /etc/sysctl.conf
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 134217728
>
> # Controls the maximum number of shared memory segments, in pages
> kernel.shmall = 4294967296
>
>
> shmmax needs to be at least 64 MB. By default, Trafodion RMS shared
> segment size is 64 MB. Trafodion RMS shared segment can be expanded to
> 128 MB. So, it is better to set shmmax to 128 mb, just in case we need
> to expand it later.
>
> Selva
>
> -----Original Message-----
> From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
> Sent: Tuesday, October 6, 2015 2:19 PM
> To: dev@trafodion.incubator.apache.org
> Subject: RE: trafodion won't start core files are generated
>
> Hi,
> From the stack trace below, it appears the Trafodion monitor is unable
> to create shared memory objects.
> Please make sure the ulimit settings on all nodes have high limits for
> max locked memory.
> Also make sure /dev/shm on all nodes has the correct write
> permissions for the trafodion user ID.
>
> Regards,
> Prashanth
>
> -----Original Message-----
> From: Radu Marias [mailto:radumarias@gmail.com]
> Sent: Tuesday, October 6, 2015 9:21 AM
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: trafodion won't start core files are generated
>
> Hi,
>
> At some point a node in the 5-node cluster stopped and we needed to
> restart it. After that I restarted all the Ambari and HDP services,
> but Trafodion fails to start.
>
> Below are some stack traces, plus details for the core files from
> which I'm not getting any stack. The files are from node1 and node2
> and date from Oct  2 (when I think node 2 was down) and Oct  6 (when
> we rebooted the node and tried to start Trafodion). Feel free to
> connect and debug the issue on our cluster; Amanda has the
> credentials.
>
> *FROM NODE1*
>
> Oct  2 22:27 core.39347
> core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> SVR4-style, from 'tm SQMON1.1 00000 00000 039347 $TM0
> 188.138.61.175:60186 00002 00000
> 00009 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> core.39347
> no stack
>
> Oct  2 22:41 core.15144
> Program terminated with signal 6, Aborted.
> #0  0x00007f77bcbbb625 in ?? ()
> #1  0x00007f77bcbbce05 in ?? ()
> #2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
> #3  0x00007f77bee62130 in ?? ()
> #4  0x00007ffe8e796ec0 in ?? ()
> #5  0x00007f77bdeced00 in ?? ()
> #6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
> #7  0x0000000001b3a310 in ?? ()
> #8  0x0000000000000000 in ?? ()
>
> Oct  2 22:41 core.39240
> #0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
> #1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
> #2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
> #4  0x000000000046e213 in CExtTmLeaderReq::performRequest
> (this=0x7f53340008c0) at reqtmleader.cxx:126
> #5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value
> optimized
> out>) at reqworker.cxx:79
> #6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at
> reqworker.cxx:147
> #7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6
>
> Oct  2 22:41 core.15309
> core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> SVR4-style, from 'tm SQMON1.1 00000 00000 015309 $TM0
> 188.138.61.175:60186 00002 00000
> 00134 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> core.15309
> no stack
>
>
> *FROM NODE2*
>
> Oct  2 22:29 core.39491
> core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV),
> SVR4-style, from 'tm SQMON1.1 00001 00001 039491 $TM1
> 188.138.61.177:38680 00002 00001
> 00003 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> core.39491
> no stack
>
> Oct  6 15:23 core.1394
> Program terminated with signal 6, Aborted.
> #0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
> #1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
> #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> (this=0x2071880, nodeContainer=<value optimized out>) at
> process.cxx:3366
> #3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448
> "euve79672", pnid=0, rank=0) at pnode.cxx:153
> #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value
> optimized
> out>) at pnode.cxx:1564
> #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> (this=0x20757b0) at cluster.cxx:2740
> #6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at
> cluster.cxx:567
> #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> (this=0x20757b0) at tmsync.cxx:137
> #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0,
> procTermSig=9) at monitor.cxx:323
> #9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at
> monitor.cxx:1152
>
> Oct  6 15:43 core.17626
> Program terminated with signal 6, Aborted.
> #0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
> #1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
> #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> (this=0x1182890, nodeContainer=<value optimized out>) at
> process.cxx:3366
> #3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458
> "euve79672", pnid=0, rank=0) at pnode.cxx:153
> #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value
> optimized
> out>) at pnode.cxx:1564
> #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> (this=0x11867c0) at cluster.cxx:2740
> #6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at
> cluster.cxx:567
> #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> (this=0x11867c0) at tmsync.cxx:137
> #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0,
> procTermSig=9) at monitor.cxx:323
> #9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at
> monitor.cxx:1152
>
> --
> And in the end, it's not the years in your life that count. It's the
> life in your years.
>



-- 
And in the end, it's not the years in your life that count. It's the life
in your years.

Re: trafodion won't start core files are generated

Posted by Radu Marias <ra...@gmail.com>.
Hi,

I have these:

# pwd
/dev/shm
# ls -la
total 4
drwxrwxrwx 2 root      root        60 Oct  6 21:07 .
drwxr-xr-x 9 root      root      2180 Oct  2 22:28 ..
-rw-r--r-- 1 trafodion trafodion   32 Oct  6 21:07 sem.monitor.sem.trafodion

kernel.shmmax = 68719476736
kernel.shmall = 4294967296
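
For reference, the reported kernel.shmmax of 68719476736 bytes is 64 GiB,
well above the 128 MB that Selva suggests in the message quoted below. A
minimal Python sketch of that comparison follows. The helper names
parse_sysctl and shmmax_is_sufficient are made up for illustration, and the
input is assumed to be text in the "key = value" form shown above.

```python
# Hypothetical helpers, not part of Trafodion: compare a reported
# kernel.shmmax against the 128 MB suggested in this thread.
REQUIRED_SHMMAX = 128 * 1024 * 1024  # 134217728 bytes


def parse_sysctl(text):
    """Parse 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def shmmax_is_sufficient(text, required=REQUIRED_SHMMAX):
    """Return True if kernel.shmmax is present and at least `required` bytes."""
    value = parse_sysctl(text).get("kernel.shmmax")
    return value is not None and value.isdigit() and int(value) >= required


if __name__ == "__main__":
    reported = "kernel.shmmax = 68719476736\nkernel.shmall = 4294967296"
    print(shmmax_is_sufficient(reported))
```

The same function can be pointed at the contents of /etc/sysctl.conf to see
whether the setting would survive a reboot.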

# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 1805076
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65535
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
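
The "max locked memory" line above is the RLIMIT_MEMLOCK resource limit.
As a rough sketch, the same value can be read from Python's standard
resource module on each node; format_limit and memlock_limits are
illustrative names, and the thread does not say what value counts as high
enough, so the sketch only reports the limits.

```python
import resource


def format_limit(value_bytes):
    """Render an rlimit the way `ulimit -l` does: kbytes or 'unlimited'."""
    if value_bytes == resource.RLIM_INFINITY:
        return "unlimited"
    return str(value_bytes // 1024)


def memlock_limits():
    """Return the (soft, hard) max-locked-memory limits of this process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"max locked memory (kbytes): soft={soft} hard={hard}")
```

The limits are per process and inherited from the launching shell, so the
script would need to run as the trafodion user to reflect what the monitor
sees.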

I would try to reinstall Trafodion to see if something got corrupted, and
maybe that would fix the issue, but I know there was a crash on sqstart
that one of your guys fixed by copying a patched lib file to our cluster.

This is the response from Narendra in a previous thread, where the issue
that kept Trafodion from starting was fixed:


>
>
>
> *I updated the code: sql/cli/memmonitor.cpp, so that if /proc/meminfo does
> not have the ‘Committed_AS’ entry, it will ignore it. Built it and put the
> binary: libcli.so on the veracity box (in the $MY_SQROOT/export/lib64
> directory – on all the nodes). Restarted the env and ‘sqlci’ worked fine.
> Was able to ‘initialize trafodion’ and create a table.*
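
To illustrate the behaviour Narendra describes (tolerating a /proc/meminfo
that has no Committed_AS entry), here is a small Python sketch. It is not
the actual patch, which was made in the C++ file named in the quote; the
function names are invented for the example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: integer value} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def committed_as_kb(text):
    """Return Committed_AS in kB, or None when the entry is absent."""
    return parse_meminfo(text).get("Committed_AS")


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        value = committed_as_kb(f.read())
    print("Committed_AS missing, ignoring" if value is None else value)
```

Returning None instead of failing lets the caller skip the metric on
systems where the kernel or container does not expose that field.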


There was another, similar issue, which I see is closed:
https://issues.apache.org/jira/browse/TRAFODION-1492

So the question is: are these fixes in the latest daily build, so that I
can try to reinstall? If not, please send the changed files so I can
overwrite them after reinstalling.

On Wed, Oct 7, 2015 at 6:02 PM, Selva Govindarajan <
selva.govindarajan@esgyn.com> wrote:

> You would want to retain the shared segment size across reboots. So, please
> check if the following settings are available in /etc/sysctl.conf
>
> # Controls the maximum shared segment size, in bytes
> kernel.shmmax = 134217728
>
> # Controls the maximum number of shared memory segments, in pages
> kernel.shmall = 4294967296
>
>
> shmmax needs to be at least 64 MB. By default, Trafodion RMS shared segment
> size is 64 MB. Trafodion RMS shared segment can be expanded to 128 MB. So,
> it is better to set shmmax to 128 mb, just in case we need to expand it
> later.
>
> Selva
>
> -----Original Message-----
> From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
> Sent: Tuesday, October 6, 2015 2:19 PM
> To: dev@trafodion.incubator.apache.org
> Subject: RE: trafodion won't start core files are generated
>
> Hi,
> From the stack trace below, it appears the Trafodion monitor is unable to
> create shared memory objects.
> Please make sure the ulimit settings on all nodes have high limits for max
> locked memory.
> Also make sure /dev/shm on all nodes has the correct write permissions for
> the trafodion user ID.
>
> Regards,
> Prashanth
>
> -----Original Message-----
> From: Radu Marias [mailto:radumarias@gmail.com]
> Sent: Tuesday, October 6, 2015 9:21 AM
> To: dev <de...@trafodion.incubator.apache.org>
> Subject: trafodion won't start core files are generated
>
> Hi,
>
> At some point a node in the 5-node cluster stopped and we needed to
> restart it. After that I restarted all the Ambari and HDP services, but
> Trafodion fails to start.
>
> Below are some stack traces, plus details for the core files from which
> I'm not getting any stack. The files are from node1 and node2 and date
> from Oct  2 (when I think node 2 was down) and Oct  6 (when we rebooted
> the node and tried to start Trafodion). Feel free to connect and debug the
> issue on our cluster; Amanda has the credentials.
>
> *FROM NODE1*
>
> Oct  2 22:27 core.39347
> core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
> from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000
> 00009 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> core.39347
> no stack
>
> Oct  2 22:41 core.15144
> Program terminated with signal 6, Aborted.
> #0  0x00007f77bcbbb625 in ?? ()
> #1  0x00007f77bcbbce05 in ?? ()
> #2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
> #3  0x00007f77bee62130 in ?? ()
> #4  0x00007ffe8e796ec0 in ?? ()
> #5  0x00007f77bdeced00 in ?? ()
> #6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
> #7  0x0000000001b3a310 in ?? ()
> #8  0x0000000000000000 in ?? ()
>
> Oct  2 22:41 core.39240
> #0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
> #1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
> #2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
> #3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
> #4  0x000000000046e213 in CExtTmLeaderReq::performRequest
> (this=0x7f53340008c0) at reqtmleader.cxx:126
> #5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value
> optimized
> out>) at reqworker.cxx:79
> #6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
> #7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
> #8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6
>
> Oct  2 22:41 core.15309
> core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
> from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000
> 00134 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> core.15309
> no stack
>
>
> *FROM NODE2*
>
> Oct  2 22:29 core.39491
> core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
> from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001
> 00003 SPAR'
> gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
> core.39491
> no stack
>
> Oct  6 15:23 core.1394
> Program terminated with signal 6, Aborted.
> #0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
> #1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
> #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> (this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
> #3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448
> "euve79672", pnid=0, rank=0) at pnode.cxx:153
> #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized
> out>) at pnode.cxx:1564
> #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> (this=0x20757b0) at cluster.cxx:2740
> #6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at
> cluster.cxx:567
> #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> (this=0x20757b0) at tmsync.cxx:137
> #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0,
> procTermSig=9) at monitor.cxx:323
> #9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at
> monitor.cxx:1152
>
> Oct  6 15:43 core.17626
> Program terminated with signal 6, Aborted.
> #0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
> #1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
> #2  0x000000000041d07d in CProcessContainer::CProcessContainer
> (this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
> #3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458
> "euve79672", pnid=0, rank=0) at pnode.cxx:153
> #4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized
> out>) at pnode.cxx:1564
> #5  0x00000000004169a5 in CCluster::InitializeConfigCluster
> (this=0x11867c0) at cluster.cxx:2740
> #6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at
> cluster.cxx:567
> #7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
> (this=0x11867c0) at tmsync.cxx:137
> #8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0,
> procTermSig=9) at monitor.cxx:323
> #9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at
> monitor.cxx:1152
>
> --
> And in the end, it's not the years in your life that count. It's the life
> in your years.
>



-- 
And in the end, it's not the years in your life that count. It's the life
in your years.

RE: trafodion won't start core files are generated

Posted by Selva Govindarajan <se...@esgyn.com>.
You would want the shared segment size settings to persist across reboots.
So, please check whether the following settings are present in
/etc/sysctl.conf:

# Controls the maximum shared segment size, in bytes
kernel.shmmax = 134217728

# Controls the total amount of shared memory that can be allocated, in pages
kernel.shmall = 4294967296


shmmax needs to be at least 64 MB, which is the default size of the
Trafodion RMS shared segment. The RMS shared segment can be expanded to
128 MB, so it is better to set shmmax to 128 MB in case we need to expand
it later.
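
The check above could be scripted roughly as follows. This is only a
minimal sketch and not part of Trafodion: the helper names (parse_sysctl,
shmmax_ok) and the constants are made up for illustration, and it only
inspects text in sysctl.conf format; it does not query the running kernel.

```python
# Hypothetical helper, not part of Trafodion: checks sysctl.conf-style text
# for a kernel.shmmax value large enough for the RMS shared segment.

MB = 1024 * 1024
RMS_DEFAULT_BYTES = 64 * MB    # default RMS segment size, per the note above
RMS_EXPANDED_BYTES = 128 * MB  # size it can be expanded to (134217728 bytes)


def parse_sysctl(text):
    """Return {key: value} for 'key = value' lines, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def shmmax_ok(text, required=RMS_EXPANDED_BYTES):
    """True if kernel.shmmax is present and at least `required` bytes."""
    value = parse_sysctl(text).get("kernel.shmmax")
    return value is not None and value.isdigit() and int(value) >= required


if __name__ == "__main__":
    try:
        with open("/etc/sysctl.conf") as f:
            conf = f.read()
    except OSError as exc:
        print("could not read /etc/sysctl.conf:", exc)
    else:
        print("kernel.shmmax >= 128 MB:", shmmax_ok(conf))
```

Passing required=RMS_DEFAULT_BYTES checks against the 64 MB minimum
instead of the 128 MB recommendation.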

Selva

-----Original Message-----
From: Prashanth Vasudev [mailto:prashanth.vasudev@esgyn.com]
Sent: Tuesday, October 6, 2015 2:19 PM
To: dev@trafodion.incubator.apache.org
Subject: RE: trafodion won't start core files are generated

Hi,
From the stack trace below, it appears the trafodion monitor is unable to
create shared memory objects.
Please make sure the ulimit settings on all nodes have high limits for max
locked memory.
Also make sure /dev/shm on all nodes has the correct write permissions for
the trafodion user id.

Regards,
Prashanth

-----Original Message-----
From: Radu Marias [mailto:radumarias@gmail.com]
Sent: Tuesday, October 6, 2015 9:21 AM
To: dev <de...@trafodion.incubator.apache.org>
Subject: trafodion won't start core files are generated

Hi,

At some point a node in the 5-node cluster stopped and we needed to
restart it. After that I restarted all the ambari and hdp services, but
trafodion fails to start.

Below are some stack traces, plus details for core files from which I'm not
getting any stack. The files are from node1 and node2, dated Oct  2 (when I
think node 2 was down) and Oct  6 (when we rebooted the node and tried to
start trafodion). Feel free to connect and debug the issue on our cluster;
Amanda has the credentials.

*FROM NODE1*

Oct  2 22:27 core.39347
core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000
00009 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
core.39347
no stack

Oct  2 22:41 core.15144
Program terminated with signal 6, Aborted.
#0  0x00007f77bcbbb625 in ?? ()
#1  0x00007f77bcbbce05 in ?? ()
#2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
#3  0x00007f77bee62130 in ?? ()
#4  0x00007ffe8e796ec0 in ?? ()
#5  0x00007f77bdeced00 in ?? ()
#6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
#7  0x0000000001b3a310 in ?? ()
#8  0x0000000000000000 in ?? ()

Oct  2 22:41 core.39240
#0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
#1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
#2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000046e213 in CExtTmLeaderReq::performRequest
(this=0x7f53340008c0) at reqtmleader.cxx:126
#5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value optimized
out>) at reqworker.cxx:79
#6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
#7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6

Oct  2 22:41 core.15309
core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000
00134 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
core.15309
no stack


*FROM NODE2*

Oct  2 22:29 core.39491
core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001
00003 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
core.39491
no stack

Oct  6 15:23 core.1394
Program terminated with signal 6, Aborted.
#0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
#1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
#2  0x000000000041d07d in CProcessContainer::CProcessContainer
(this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
#3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448
"euve79672", pnid=0, rank=0) at pnode.cxx:153
#4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized
out>) at pnode.cxx:1564
#5  0x00000000004169a5 in CCluster::InitializeConfigCluster
(this=0x20757b0) at cluster.cxx:2740
#6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at
cluster.cxx:567
#7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
(this=0x20757b0) at tmsync.cxx:137
#8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0,
procTermSig=9) at monitor.cxx:323
#9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at
monitor.cxx:1152

Oct  6 15:43 core.17626
Program terminated with signal 6, Aborted.
#0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
#1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
#2  0x000000000041d07d in CProcessContainer::CProcessContainer
(this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
#3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458
"euve79672", pnid=0, rank=0) at pnode.cxx:153
#4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized
out>) at pnode.cxx:1564
#5  0x00000000004169a5 in CCluster::InitializeConfigCluster
(this=0x11867c0) at cluster.cxx:2740
#6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at
cluster.cxx:567
#7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
(this=0x11867c0) at tmsync.cxx:137
#8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0,
procTermSig=9) at monitor.cxx:323
#9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at
monitor.cxx:1152

--
And in the end, it's not the years in your life that count. It's the life in
your years.

RE: trafodion won't start core files are generated

Posted by Prashanth Vasudev <pr...@esgyn.com>.
Hi,
From the stack trace below, it appears the trafodion monitor is unable to
create shared memory objects.
Please make sure the ulimit settings on all nodes have high limits for max
locked memory.
Also make sure /dev/shm on all nodes has the correct write permissions for
the trafodion user id.
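
Both checks can be sketched in a few lines of Python, to be run as the
trafodion user on each node. This is an illustration only, not a Trafodion
tool: the function names are made up, and what counts as a "high" limit is
left to the reader. Note that getrlimit reports bytes, whereas ulimit -l
reports kilobytes.

```python
# Hypothetical pre-start check, not part of Trafodion. Run it as the
# trafodion user on each node.
import os
import resource


def memlock_limit():
    """Return the (soft, hard) max-locked-memory limits in bytes.

    A value equal to resource.RLIM_INFINITY means unlimited.
    """
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def describe_limit(value):
    """Format a limit value for display."""
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"


def can_write(path="/dev/shm"):
    """True if the current user may create entries in the directory `path`."""
    return os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK)


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print("max locked memory: soft =", describe_limit(soft),
          "hard =", describe_limit(hard))
    print("/dev/shm writable:", can_write())
```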

Regards,
Prashanth

-----Original Message-----
From: Radu Marias [mailto:radumarias@gmail.com]
Sent: Tuesday, October 6, 2015 9:21 AM
To: dev <de...@trafodion.incubator.apache.org>
Subject: trafodion won't start core files are generated

Hi,

At some point a node in the 5-node cluster stopped and we needed to
restart it. After that I restarted all the ambari and hdp services, but
trafodion fails to start.

Below are some stack traces, plus details for core files from which I'm not
getting any stack. The files are from node1 and node2, dated Oct  2 (when I
think node 2 was down) and Oct  6 (when we rebooted the node and tried to
start trafodion). Feel free to connect and debug the issue on our cluster;
Amanda has the credentials.

*FROM NODE1*

Oct  2 22:27 core.39347
core.39347: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from 'tm SQMON1.1 00000 00000 039347 $TM0 188.138.61.175:60186 00002 00000
00009 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
core.39347
no stack

Oct  2 22:41 core.15144
Program terminated with signal 6, Aborted.
#0  0x00007f77bcbbb625 in ?? ()
#1  0x00007f77bcbbce05 in ?? ()
#2  0x0000000000000010 in ?? () at ../common/Collections.cpp:109
#3  0x00007f77bee62130 in ?? ()
#4  0x00007ffe8e796ec0 in ?? ()
#5  0x00007f77bdeced00 in ?? ()
#6  0x0000000000000004 in ?? () at ../common/Collections.cpp:109
#7  0x0000000001b3a310 in ?? ()
#8  0x0000000000000000 in ?? ()

Oct  2 22:41 core.39240
#0  0x00007f534d03c625 in raise () from /lib64/libc.so.6
#1  0x00007f534d03de05 in abort () from /lib64/libc.so.6
#2  0x00007f534d03574e in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f534d035810 in __assert_fail () from /lib64/libc.so.6
#4  0x000000000046e213 in CExtTmLeaderReq::performRequest
(this=0x7f53340008c0) at reqtmleader.cxx:126
#5  0x000000000045a64a in CReqWorker::reqWorkerThread (this=<value optimized
out>) at reqworker.cxx:79
#6  0x000000000045a86d in reqWorker (arg=0xc6f9a0) at reqworker.cxx:147
#7  0x00007f534db45a51 in start_thread () from /lib64/libpthread.so.0
#8  0x00007f534d0f29ad in clone () from /lib64/libc.so.6

Oct  2 22:41 core.15309
core.15309: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from 'tm SQMON1.1 00000 00000 015309 $TM0 188.138.61.175:60186 00002 00000
00134 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
core.15309
no stack


*FROM NODE2*

Oct  2 22:29 core.39491
core.39491: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style,
from 'tm SQMON1.1 00001 00001 039491 $TM1 188.138.61.177:38680 00002 00001
00003 SPAR'
gdb /home/trafodion/trafodion-20150828_0830/export/bin64/tdm_udrserv
core.39491
no stack

Oct  6 15:23 core.1394
Program terminated with signal 6, Aborted.
#0  0x00007fb97acbf625 in raise () from /lib64/libc.so.6
#1  0x00007fb97acc0e05 in abort () from /lib64/libc.so.6
#2  0x000000000041d07d in CProcessContainer::CProcessContainer
(this=0x2071880, nodeContainer=<value optimized out>) at process.cxx:3366
#3  0x0000000000453f5c in CNode::CNode (this=0x2071880, name=0x204c448
"euve79672", pnid=0, rank=0) at pnode.cxx:153
#4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized
out>) at pnode.cxx:1564
#5  0x00000000004169a5 in CCluster::InitializeConfigCluster
(this=0x20757b0) at cluster.cxx:2740
#6  0x0000000000417645 in CCluster::CCluster (this=0x20757b0) at
cluster.cxx:567
#7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
(this=0x20757b0) at tmsync.cxx:137
#8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x20757b0,
procTermSig=9) at monitor.cxx:323
#9  0x00000000004086ad in main (argc=2, argv=0x7fff8322e298) at
monitor.cxx:1152

Oct  6 15:43 core.17626
Program terminated with signal 6, Aborted.
#0  0x00007fcf11aea625 in raise () from /lib64/libc.so.6
#1  0x00007fcf11aebe05 in abort () from /lib64/libc.so.6
#2  0x000000000041d07d in CProcessContainer::CProcessContainer
(this=0x1182890, nodeContainer=<value optimized out>) at process.cxx:3366
#3  0x0000000000453f5c in CNode::CNode (this=0x1182890, name=0x115d458
"euve79672", pnid=0, rank=0) at pnode.cxx:153
#4  0x00000000004558e0 in CNodeContainer::AddNodes (this=<value optimized
out>) at pnode.cxx:1564
#5  0x00000000004169a5 in CCluster::InitializeConfigCluster
(this=0x11867c0) at cluster.cxx:2740
#6  0x0000000000417645 in CCluster::CCluster (this=0x11867c0) at
cluster.cxx:567
#7  0x0000000000431e1a in CTmSync_Container::CTmSync_Container
(this=0x11867c0) at tmsync.cxx:137
#8  0x0000000000407bb6 in CMonitor::CMonitor (this=0x11867c0,
procTermSig=9) at monitor.cxx:323
#9  0x00000000004086ad in main (argc=2, argv=0x7ffcaca91f68) at
monitor.cxx:1152

--
And in the end, it's not the years in your life that count. It's the life in
your years.