Posted to issues@trafodion.apache.org by "Arvind Narain (JIRA)" <ji...@apache.org> on 2017/03/31 18:33:41 UTC

[jira] [Commented] (TRAFODION-2547) Daily 2.1 builds seeing leftover semaphore dev files after running db_uninstall.py

    [ https://issues.apache.org/jira/browse/TRAFODION-2547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951464#comment-15951464 ] 

Arvind Narain commented on TRAFODION-2547:
------------------------------------------

Hi Eason, could you check whether sqstop and ckillall should be run in the uninstall script? Thanks.

> Daily 2.1 builds seeing leftover semaphore dev files after running db_uninstall.py
> ----------------------------------------------------------------------------------
>
>                 Key: TRAFODION-2547
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2547
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: installer
>    Affects Versions: 2.1-incubating
>         Environment: Release 2.1 with py installer.
>            Reporter: Arvind Narain
>            Assignee: Eason Zhang
>
> After running the db_uninstall.py script (Release 2.1), the semaphore device files are always left behind. This does not happen when trafodion_uninstaller (master) is run.
> The leftover semaphore dev files from the Release 2.1 daily run cause problems for the next daily run of the master branch.
> On CDH we don't see the failures in master builds because the userid picked up by the Release 2.1 python installer script is the same as the one picked up by the master installer script (506). In the HDP environment it may be different (1003).
> Though Steve is fixing the Jenkins jobs to clear out /dev/shm, I think we have two issues:
> 1. db_uninstall.py in Release 2.1 does not stop all the trafodion processes. This may be due to a recent checkin after which ckillall is no longer run. pkillall (called by ckillall) handles all the trafodion processes and also clears the semaphores.
> https://github.com/apache/incubator-trafodion/pull/991
> 2. The monitor could be modified to create semaphores the way rms does, or at least to use the userid instead of the username.
> HDP job in Release 2.1:
> 2017-03-22 06:41:47 + ./python-installer/db_uninstall.py --verbose --silent --config-file ./Install_Config
> 2017-03-22 06:41:48 *****************************
> 2017-03-22 06:41:48   Trafodion Uninstall Start
> 2017-03-22 06:41:48 *****************************
> 2017-03-22 06:41:48 
> 2017-03-22 06:41:48 ***[INFO]: Remove Trafodion on node [slave-ahw23] ...
> 2017-03-22 06:41:48 *********************************
> 2017-03-22 06:41:48   Trafodion Uninstall Completed
> 2017-03-22 06:41:48 *********************************
> 2017-03-22 06:41:48 + uninst_ret=0
> 2017-03-22 06:41:48 + sudo rm -f /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + sudo mv /home/jenkins/workspace/pyodbc_test-hdp/traf_run.save /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + sudo chmod -R a+rX /home/jenkins/workspace/pyodbc_test-hdp/traf_run
> 2017-03-22 06:41:48 + exit 0
> 2017-03-22 06:41:48 + rc=0
> 2017-03-22 06:41:48 + echo 'Checking shared mem'
> 2017-03-22 06:41:48 Checking shared mem
> 2017-03-22 06:41:48 + ls -ld /dev/shm
> 2017-03-22 06:41:48 drwxrwxrwt 2 root root 100 Mar 22 06:35 /dev/shm
> 2017-03-22 06:41:48 + ls -l /dev/shm
> 2017-03-22 06:41:48 total 12
> 2017-03-22 06:41:48 -rw-r--r-- 1 1003 509 32 Mar 22 06:34 sem.monitor.sem.trafodion
> 2017-03-22 06:41:48 -rw------- 1 1003 509 32 Mar 22 06:35 sem.rms.1003.268469813
> 2017-03-22 06:41:48 -rw------- 1 1003 509 32 Mar 22 06:35 sem.rms.1003.268477888
> 2017-03-22 06:41:48 + echo ============
> 2017-03-22 06:41:48 ============
> 2017-03-22 06:41:48 + exit 0
> 2017-03-22 06:41:48 + '[' 0 -ne 0 ']'
> 2017-03-22 06:41:48 + exit 0
> CDH job in Release 2.1:
> 2017-03-22 07:38:28 + ./python-installer/db_uninstall.py --verbose --silent --config-file ./Install_Config
> 2017-03-22 07:38:28 *****************************
> 2017-03-22 07:38:28   Trafodion Uninstall Start
> 2017-03-22 07:38:28 *****************************
> 2017-03-22 07:38:28 
> 2017-03-22 07:38:28 ***[INFO]: Remove Trafodion on node [slave-cm54] ...
> 2017-03-22 07:38:28 *********************************
> 2017-03-22 07:38:28   Trafodion Uninstall Completed
> 2017-03-22 07:38:28 *********************************
> 2017-03-22 07:38:28 + uninst_ret=0
> 2017-03-22 07:38:28 + sudo rm -f /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + sudo mv /home/jenkins/workspace/pyodbc_test-cdh/traf_run.save /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + sudo chmod -R a+rX /home/jenkins/workspace/pyodbc_test-cdh/traf_run
> 2017-03-22 07:38:28 + exit 0
> 2017-03-22 07:38:28 + rc=0
> 2017-03-22 07:38:28 + echo 'Checking shared mem'
> 2017-03-22 07:38:28 Checking shared mem
> 2017-03-22 07:38:28 + ls -ld /dev/shm
> 2017-03-22 07:38:28 drwxrwxrwt 2 root root 100 Mar 22 07:33 /dev/shm
> 2017-03-22 07:38:28 + ls -l /dev/shm
> 2017-03-22 07:38:28 total 12
> 2017-03-22 07:38:28 -rw-r--r-- 1 506 507 32 Mar 22 07:33 sem.monitor.sem.trafodion
> 2017-03-22 07:38:28 -rw------- 1 506 507 32 Mar 22 07:33 sem.rms.506.268474535
> 2017-03-22 07:38:28 -rw------- 1 506 507 32 Mar 22 07:33 sem.rms.506.268480568
> 2017-03-22 07:38:28 + echo ============
> 2017-03-22 07:38:28 ============
> 2017-03-22 07:38:28 + exit 0
> 2017-03-22 07:38:28 + '[' 0 -ne 0 ']'
> 2017-03-22 07:38:28 + exit 0
> ===
> Previous emails:
> From: Selva Govindarajan [mailto:selva.govindarajan@esgyn.com] 
> Sent: Tuesday, March 21, 2017 12:25 PM
> To: dev@trafodion.incubator.apache.org
> Cc: Steve Varnau <st...@esgyn.com>
> Subject: Re: Trafodion Master daily build failures
> Thanks, Arvind and Steve, for following up. I had said RMS uses the port number. Actually, the segment id is obtained from the foundation layer and used in the semaphore name.
> SEG_ID getStatsSegmentId()
> {
>   Int32 segid;
>   Int32 error;
>   if (gStatsSegmentId_ == -1)
>   {
>     error = msg_mon_get_my_segid(&segid);
>     assert(error == 0); // XZFIL_ERR_OK
>     gStatsSegmentId_ = segid + RMS_SEGMENT_ID_OFFSET;
>   }
>   return gStatsSegmentId_;
> }
> RMS gets it once and stores the created semaphore name for later use. I think the process ID can also be used in the monitor's case, because the semaphore is valid only as long as the monitor is alive. In the case of RMS, the semaphore name needs to remain the same even if RMS processes are restarted, as long as the node is up.
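
To make the naming options discussed here concrete, below is a minimal Python sketch. It only builds name strings and does not create any semaphores. The username scheme and the sem.rms.<uid>.<segment id> pattern are inferred from the /dev/shm listings in this thread; the uid-based and pid-based monitor names are hypothetical alternatives, not what the monitor does today, and all function names are made up for illustration.

```python
import os

SHM_DIR = "/dev/shm"  # where Linux keeps the files backing POSIX named semaphores


def shm_path(sem_name):
    """Map a semaphore name such as '/monitor.sem.trafodion' to its
    backing file, e.g. /dev/shm/sem.monitor.sem.trafodion."""
    return os.path.join(SHM_DIR, "sem." + sem_name.lstrip("/"))


def monitor_name_by_user(username):
    # Scheme seen in the monitor log: keyed on the user *name*.
    return "/monitor.sem.%s" % username


def monitor_name_by_uid(uid):
    # Hypothetical alternative: keyed on the numeric uid.
    return "/monitor.sem.%d" % uid


def monitor_name_by_pid(uid, pid):
    # Hypothetical alternative: uid plus the monitor's process id.
    return "/monitor.sem.%d.%d" % (uid, pid)


def rms_name(uid, segment_id):
    # Pattern inferred from the listings: sem.rms.<uid>.<segment id>.
    return "/rms.%d.%d" % (uid, segment_id)
```

With the username scheme, two installs that both use the name trafodion but get uids 1003 and 506 map to the same file, while monitor_name_by_uid(1003) and monitor_name_by_uid(506) do not collide.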
> Selva
> ________________________________
> From: Arvind N <na...@gmail.com>
> Sent: Tuesday, March 21, 2017 12:03:22 PM
> To: dev@trafodion.incubator.apache.org
> Cc: Steve Varnau; Selva Govindarajan
> Subject: RE: Trafodion Master daily build failures
> Steve modified the scripts to print out the contents of /dev/shm before
> install and after uninstall. As the following output shows, it does seem to
> be a leftover semaphore in /dev/shm from a previous build.
> I noticed that the failures are restricted to the hdp environment. They
> happen when the slave system was first used by a daily build for Release 2.1
> (which leaves files in /dev/shm for id 1003) and is then used for a daily
> build for master. Maybe the logic for finding the next available id differs
> between the py installer and the bash installer?
> Selva's suggestion to attach the process ID to the semaphore name should
> clear this problem.
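
The "ls -l /dev/shm" check in the Jenkins job could also be done programmatically. The following is a rough Python sketch, not part of any Trafodion installer: the function name and its parameters are made up for illustration. It lists sem.* files in a directory that are owned by a uid other than the one about to run Trafodion.

```python
import os


def leftover_semaphores(current_uid, shm_dir="/dev/shm"):
    """Return (file name, owner uid) for every sem.* file in shm_dir
    that is not owned by current_uid."""
    leftovers = []
    for name in sorted(os.listdir(shm_dir)):
        if not name.startswith("sem."):
            continue  # not a semaphore backing file
        owner = os.stat(os.path.join(shm_dir, name)).st_uid
        if owner != current_uid:
            leftovers.append((name, owner))
    return leftovers
```

Calling leftover_semaphores(os.getuid()) before an install would flag files such as the ones owned by uid 1003 in the listings below. The sketch only reports them; removing them still needs the owner or root.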
> From master daily build: https://jenkins.esgyn.com/job/core-regress-privs1-hdp/505/console
> AHW 2.3 (i-014c7dcfa0719ec26)
> 2017-03-21 09:18:58 === Tue Mar 21 09:18:58 UTC 2017: /usr/local/bin/install-traf.sh
> 2017-03-21 09:18:58 === Setting up Trafodion
> 2017-03-21 09:18:58 ========================================================
> 2017-03-21 09:18:58 Source /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/core/sqf/conf/install_features
> 2017-03-21 09:18:58 Java for Trafodion install: /usr/lib/jvm/java-1.7.0-openjdk.x86_64
> 2017-03-21 09:18:58 Saving output in Install_Start.log
> 2017-03-21 09:18:58 + chmod o+r /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apache-trafodion_installer-2.2.0-incubating.tar.gz /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apache-trafodion_server-2.2.0-RH6-x86_64-incubating.tar.gz /home/jenkins/workspace/core-regress-privs1-hdp/trafodion/distribution/apache-trafodion-regress.tgz
> 2017-03-21 09:18:58 + echo 'Checking shared mem'
> 2017-03-21 09:18:58 Checking shared mem
> 2017-03-21 09:18:58 + ls -ld /dev/shm
> 2017-03-21 09:18:58 drwxrwxrwt 2 root root 100 Mar 21 09:18 /dev/shm
> 2017-03-21 09:18:58 + ls -l /dev/shm
> 2017-03-21 09:18:58 total 12
> 2017-03-21 09:18:58 -rw-r--r-- 1 1003 509 32 Mar 21 08:03 sem.monitor.sem.trafodion
> 2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268468606
> 2017-03-21 09:18:58 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268490614
> 2017-03-21 09:18:58 + echo ============
> 2017-03-21 09:18:58 ============
> Leftover from the Release 2.1 build: https://jenkins.esgyn.com/job/phoenix_part2_T4-hdp/580/consoleFull - 2.1 build
> 2017-03-21 09:16:05 *********************************
> 2017-03-21 09:16:05   Trafodion Uninstall Completed
> 2017-03-21 09:16:05 *********************************
> 2017-03-21 09:16:05 + uninst_ret=0
> 2017-03-21 09:16:05 + sudo rm -f /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
> 2017-03-21 09:16:05 + sudo mv /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run.save /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
> 2017-03-21 09:16:05 + sudo chmod -R a+rX /home/jenkins/workspace/phoenix_part2_T4-hdp/traf_run
> 2017-03-21 09:16:05 + exit 0
> 2017-03-21 09:16:05 + rc=0
> 2017-03-21 09:16:05 + echo 'Checking shared mem'
> 2017-03-21 09:16:05 Checking shared mem
> 2017-03-21 09:16:05 + ls -ld /dev/shm
> 2017-03-21 09:16:05 drwxrwxrwt 2 root root 100 Mar 21 09:15 /dev/shm
> 2017-03-21 09:16:05 + ls -l /dev/shm
> 2017-03-21 09:16:05 total 12
> 2017-03-21 09:16:05 -rw-r--r-- 1 1003 509 32 Mar 21 08:03 sem.monitor.sem.trafodion
> 2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268468606
> 2017-03-21 09:16:05 -rw------- 1 1003 509 32 Mar 21 08:03 sem.rms.1003.268490614
> 2017-03-21 09:16:05 + echo ============
> 2017-03-21 09:16:05 ============
> 2017-03-21 09:16:05 + exit 0
> 2017-03-21 09:16:05 + exit 0
> Regards
> Arvind
> -----Original Message-----
> From: Narendra Goyal [mailto:narendra.goyal@esgyn.com]
> Sent: Friday, March 17, 2017 2:39 PM
> To: dev@trafodion.incubator.apache.org
> Subject: RE: Trafodion Master daily build failures
> I checked the /dev/shm directory on the build machine and it was empty. I
> was able to create a file /dev/shm/foo (as the 'trafodion' user id), so it
> does not look like a permissions issue (on /dev/shm at least).
> I am not sure whether any build has happened on that build machine, but I do
> not see any orphan semaphore in /dev/shm.
> Thanks,
> -Narendra
> -----Original Message-----
> From: Selva Govindarajan [mailto:selva.govindarajan@esgyn.com]
> Sent: Friday, March 17, 2017 11:07 AM
> To: dev@trafodion.incubator.apache.org
> <ma...@trafodion.incubator.apache.org>
> Subject: Trafodion Master daily build failures
> First, I changed the subject line so that this message doesn't get filtered
> out. The Trafodion master daily build has been failing randomly with the
> following stack trace in the monitor.
> (gdb) bt
> #0  0x00007feaee0eb625 in raise () from /lib64/libc.so.6
> #1  0x00007feaee0ece05 in abort () from /lib64/libc.so.6
> #2  0x000000000041f8b3 in CProcessContainer::CProcessContainer (this=0x270e340, nodeContainer=<value optimized out>) at process.cxx:3389
> #3  0x00000000004569cc in CNode::CNode (this=0x270e340, name=0x26e9548 "slave-ahw23", pnid=0, rank=0) at pnode.cxx:152
> #4  0x0000000000458050 in CNodeContainer::AddNodes (this=<value optimized out>) at pnode.cxx:1572
> #5  0x0000000000419185 in CCluster::InitializeConfigCluster (this=0x2712270) at cluster.cxx:2818
> #6  0x0000000000419e25 in CCluster::CCluster (this=0x2712270) at cluster.cxx:597
> #7  0x000000000043473a in CTmSync_Container::CTmSync_Container (this=0x2712270) at tmsync.cxx:137
> #8  0x0000000000408f36 in CMonitor::CMonitor (this=0x2712270, procTermSig=9) at monitor.cxx:329
> #9  0x000000000040a5ab in main (argc=2, argv=0x7ffd157c0b48) at monitor.cxx:1308
> (gdb) f 2
> The monitor log shows:
> 2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101020103, [CMonitor::main], monitor Version 1.0.1 prodver Release 2.2.0 (Build release [2.0.1rc3-1425-g6155ff1_Bld883], branch 6155ff1_no_branch, date 20170316_0832), Started! CommType: Sockets
> 2017-03-16 09:21:48,327, INFO, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101010401, [CCluster::CCluster] Validation of node down is disabled
> 2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101030703, [CProcessContainer::CProcessContainer], Can't create semaphore /monitor.sem.trafodion! (Permission denied)
> 2017-03-16 09:21:48,328, ERROR, MON, Node Number: 0,, PIN: 17918 , Process Name: $MONITOR,,, TID: 17918, Message ID: 101030704, [CProcessContainer::CProcessContainer], Can't unlink semaphore /monitor.sem.trafodion! (Permission denied)
> I came up with the following theory.
> When a semaphore is created, the process creates a device file with the
> given semaphore name in /dev/shm. The process owner needs write permission
> to create this file. Initially I suspected a permission issue on the
> /dev/shm directory.
> I just looked at /dev/shm in the Jenkins VM. It did have write permission.
> If that's the case, it is possible the previous semaphore was not cleaned
> up correctly. The monitor seems to create the semaphore as
> /dev/shm/sem.monitor.sem.<user_name>. If trafodion gets a different uid
> between two runs, it may be unable to clean it up.
> In case of RMS, we attach the port number to the semaphore name so that
> every run from the same user name will get a different semaphore name.
> ---------------------
> The sem_open documentation shows:
> EACCES The semaphore exists, but the caller does not have permission
>               to open it
> EACCES is 13, the errno returned in gdb.
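
To illustrate why both the create and the unlink would fail under this theory, here is a simplified Python model of the classic Unix permission checks. It is a sketch only: it ignores root, capabilities and ACLs, it assumes the caller can write to the directory, and it assumes (as I understand glibc's sem_open) that an existing semaphore file is opened read-write. The function names, the new uid 1004 and the gid 510 are hypothetical; the file mode, owner and the sticky bit on /dev/shm come from the listings in this thread.

```python
import stat


def can_open_rw(mode, owner_uid, owner_gid, uid, gids):
    """True if a non-root caller may open the file for reading and writing."""
    if uid == owner_uid:
        bits = stat.S_IRUSR | stat.S_IWUSR
    elif owner_gid in gids:
        bits = stat.S_IRGRP | stat.S_IWGRP
    else:
        bits = stat.S_IROTH | stat.S_IWOTH
    return (mode & bits) == bits


def can_unlink(dir_mode, dir_uid, owner_uid, uid):
    """True if a non-root caller with write access to the directory may
    remove the file. A sticky directory restricts this to the file owner
    or the directory owner."""
    if dir_mode & stat.S_ISVTX:
        return uid in (owner_uid, dir_uid)
    return True


# Leftover file: -rw-r--r-- owned by 1003:509, in /dev/shm (drwxrwxrwt, root).
# New run as hypothetical uid 1004 with hypothetical group 510:
reopen_ok = can_open_rw(0o644, 1003, 509, 1004, {510})  # False
unlink_ok = can_unlink(0o1777, 0, 1003, 1004)           # False
```

Both checks come out False for the new uid, which matches the two "Permission denied" errors in the monitor log. They come out True when the run uses uid 1003 again.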
> Please offer your help to resolve this issue if you have any other ideas.
> Selva



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)