Posted to dev@ambari.apache.org by "Nahappan Somasundaram (JIRA)" <ji...@apache.org> on 2015/09/04 02:37:46 UTC

[jira] [Created] (AMBARI-13007) Stopping ambari-server may kill ambari-agent running on the same machine in some cases

Nahappan Somasundaram created AMBARI-13007:
----------------------------------------------

             Summary: Stopping ambari-server may kill ambari-agent running on the same machine in some cases
                 Key: AMBARI-13007
                 URL: https://issues.apache.org/jira/browse/AMBARI-13007
             Project: Ambari
          Issue Type: Bug
          Components: ambari-server
    Affects Versions: 2.2.0
            Reporter: Nahappan Somasundaram
            Assignee: Nahappan Somasundaram
             Fix For: 2.2.0


I launch multinode Ambari clusters using a simple Python script. It logs in to every node via ssh and runs the following shell script:

{code}
#!/usr/bin/env bash

while [[ $# -gt 0 ]]
do
  key="$1"
  case ${key} in
      --server)
        ASERVER="$2"        # Server hostname
        shift # past argument
      ;;

      --noserver)
        NOSERVER="NOSERVER"  # Don't install/start server
      ;;

      *)
        echo unknown option
        exit 1
      ;;
  esac
  shift # past argument or value
done


yum clean all
curl http://s3.amazonaws.com/dev.hortonworks.com/ambari/centos6/2.x/latest/trunk/ambaribn.repo > /etc/yum.repos.d/ambari.repo


# server
if [ "${ASERVER}" = "$(hostname -f)" ] && [ -z "${NOSERVER}" ] ; then
  yum install sudo postgresql-server wget -y
  rpm -i /tmp/rpms/ambari-server*.rpm
  # Disable iptables
  iptables -F
  ambari-server setup -s
  # Enable remote debug
  sed -ri 's/-server -XX:NewRatio/-server -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -XX:NewRatio/g' /usr/sbin/ambari_server_main.py
  ## Sleep until debugger connects
  # sed -rie 's/dt_socket,server=y,suspend=.,address=5005/dt_socket,server=y,suspend=y,address=5005/g' /usr/sbin/ambari-server.py
  # Fix an issue with UI client version
  gunzip /usr/lib/ambari-server/web/javascripts/app.js.gz
  amb=$(ambari-server --version); sed -i "s/App\.version = '';/App.version = '$amb';/" /usr/lib/ambari-server/web/javascripts/app.js
  gzip /usr/lib/ambari-server/web/javascripts/app.js
  # Increase task timeout
  sed -ri 's/agent.package.install.task.timeout=1800/agent.package.install.task.timeout=3600/g' /etc/ambari-server/conf/ambari.properties
  find /var/lib/ambari-server/resources/ -name metainfo.xml | xargs -L 1 sed -ri 's/<timeout>[[:digit:]]+[[:digit:]]*<\//<timeout>1800<\//g'
  # Start the server
  ambari-server start -v || exit 1
fi


# Agent
iptables -F
yum clean all
yum install -y wget
rpm -i /tmp/rpms/ambari-agent*.rpm
# Replace server hostname
sed -ri "s/hostname=localhost/hostname=$ASERVER/g" /etc/ambari-agent/conf/ambari-agent.ini
# Enable debug mode at agent
# sed -rie 's/=INFO/=DEBUG/g' /etc/ambari-agent/conf/ambari-agent.ini
ambari-agent start || exit 1
{code}

When I restart ambari-server, the agent running on the same node is killed every time. This happens because the agent is launched in the same process group as ambari-server, and ambari-server kills everything that belongs to its process group. I assume this situation is common whenever ambari-server and ambari-agent are launched from the same shell script via ssh, and possibly also via configuration management tools such as Puppet or Chef (I did not check this assumption).
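
The mechanism can be reproduced outside Ambari. The following is a minimal sketch, not Ambari code: two sleeping Python processes stand in for the server and the agent, and the helper names (start_in_group, stop_group_and_check) are hypothetical. It assumes Python 3 on a POSIX system. Signalling the server's process group also terminates the agent when both share that group.

```python
import os
import signal
import subprocess
import sys

# Stand-in for a long-running daemon (hypothetical; not an Ambari process).
SLEEP_CMD = [sys.executable, "-c", "import time; time.sleep(60)"]


def start_in_group(pgid):
    """Start a sleeping child in process group `pgid`.

    pgid == 0 makes the child the leader of a brand-new group.
    """
    return subprocess.Popen(SLEEP_CMD, preexec_fn=lambda: os.setpgid(0, pgid))


def stop_group_and_check(agent_shares_group):
    """Kill the "server" by process group; return True if the "agent" survives."""
    server = start_in_group(0)  # stand-in for ambari-server
    agent_pgid = server.pid if agent_shares_group else 0
    agent = start_in_group(agent_pgid)  # stand-in for ambari-agent

    # Stop the server the way the report describes: signal its whole group.
    os.killpg(os.getpgid(server.pid), signal.SIGTERM)
    server.wait()

    try:
        agent.wait(timeout=1)
        survived = False
    except subprocess.TimeoutExpired:
        survived = True
        agent.kill()  # clean up the surviving stand-in
        agent.wait()
    return survived


if __name__ == "__main__":
    print("agent survives with shared group:", stop_group_and_check(True))
    print("agent survives with own group:", stop_group_and_check(False))
```

With a shared group the agent stand-in dies together with the server stand-in. With its own group it keeps running until the sketch cleans it up.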

*More info:*

{code}
[root@dlysnichenko-ru3-1 ~]# ps -ejH
  PID  PGID   SID TTY          TIME CMD
 1584  1584  1584 ?        00:00:00   sshd
 2659  2659  2659 ?        00:00:00     sshd
 2662  2662  2662 pts/0    00:00:00       bash
 3268  3268  2662 pts/0    00:00:00         ps
 2056  2041  2041 ?        00:00:00   postmaster
 2058  2058  2058 ?        00:00:00     postmaster
 2060  2060  2060 ?        00:00:00     postmaster
 2061  2061  2061 ?        00:00:00     postmaster
 2062  2062  2062 ?        00:00:00     postmaster
 2063  2063  2063 ?        00:00:00     postmaster
 2380  2380  2380 ?        00:00:00     postmaster
 2397  2397  2397 ?        00:00:00     postmaster
 2649  2649  2649 ?        00:00:01     postmaster
 2654  2654  2654 ?        00:00:00     postmaster
 2655  2655  2655 ?        00:00:00     postmaster
 2656  2656  2656 ?        00:00:00     postmaster
 2360  1644  1644 ?        00:00:59   java
 2507  1644  1644 ?        00:00:00   python2.6
 2515  1644  1644 ?        00:00:01     python2.6
 3230  3230  3230 ?        00:00:00   anacron
[root@dlysnichenko-ru3-1 ~]# ambari-agent status
Found ambari-agent PID: 2515
ambari-agent running.
Agent PID at: /var/run/ambari-agent/ambari-agent.pid
Agent out at: /var/log/ambari-agent/ambari-agent.out
Agent log at: /var/log/ambari-agent/ambari-agent.log
[root@dlysnichenko-ru3-1 ~]# ambari-server stop
Using python  /usr/bin/python2.6
Stopping ambari-server
Ambari Server stopped
[root@dlysnichenko-ru3-1 ~]# ambari-agent status
Found ambari-agent PID: 2515
ambari-agent not running. Stale PID File at: /var/run/ambari-agent/ambari-agent.pid
[root@dlysnichenko-ru3-1 ~]# 
{code}

Note: both the agent and the server share the same process group, 1644. We should either not kill the whole process group when stopping ambari-server, or create a dedicated process group when launching it.
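
Both options can be sketched in a few lines of Python. This is an illustration under assumptions, not the actual Ambari fix: launch_detached and stop_by_pid are hypothetical helpers, and a sleeping Python process stands in for the server. It assumes Python 3 on a POSIX system. Passing start_new_session=True makes the child call setsid(), so it becomes the leader of its own session and process group, and os.kill signals a single PID rather than a group.

```python
import os
import signal
import subprocess
import sys


def launch_detached(cmd):
    """Start `cmd` in a new session, and therefore in its own process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def stop_by_pid(proc, timeout=5):
    """Send SIGTERM to one process only, never to its process group."""
    os.kill(proc.pid, signal.SIGTERM)
    return proc.wait(timeout=timeout)


if __name__ == "__main__":
    # Hypothetical stand-in for the server process.
    server = launch_detached([sys.executable, "-c", "import time; time.sleep(60)"])
    print("launcher PGID:", os.getpgid(0))
    print("server PGID:  ", os.getpgid(server.pid))
    stop_by_pid(server)
```

Either measure alone should be enough to keep an agent started from the same script alive. Applying both guards against future changes to how the server is stopped.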




