Posted to dev@ambari.apache.org by "Jeff Sposetti (JIRA)" <ji...@apache.org> on 2015/12/01 01:29:11 UTC

[jira] [Updated] (AMBARI-13007) Stopping ambari-server may kill ambari-agent running on the same machine in some cases

     [ https://issues.apache.org/jira/browse/AMBARI-13007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Sposetti updated AMBARI-13007:
-----------------------------------
    Fix Version/s: 2.1.3

> Stopping ambari-server may kill ambari-agent running on the same machine in some cases
> --------------------------------------------------------------------------------------
>
>                 Key: AMBARI-13007
>                 URL: https://issues.apache.org/jira/browse/AMBARI-13007
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-server
>    Affects Versions: 2.1.2
>            Reporter: Nahappan Somasundaram
>            Assignee: Nahappan Somasundaram
>             Fix For: 2.2.0, 2.1.3
>
>
> I launch multinode Ambari clusters using a simple Python script. It logs in to every node via ssh and runs the following shell script:
> {code}
> #!/usr/bin/env bash
> while [[ $# -gt 0 ]]
> do
>   key="$1"
>   case ${key} in
>       --server)
>         ASERVER="$2"        # Server hostname
>         shift # past argument
>       ;;
>       --noserver)
>         NOSERVER="NOSERVER"  # Don't install/start server
>       ;;
>       *)
>         echo unknown option
>         exit 1
>       ;;
>   esac
>   shift # past argument or value
> done
> yum clean all
> curl http://s3.amazonaws.com/dev.hortonworks.com/ambari/centos6/2.x/latest/trunk/ambaribn.repo > /etc/yum.repos.d/ambari.repo
> # server
> if [ "${ASERVER}" = "$(hostname -f)" ] && [ -z "${NOSERVER}" ] ; then
>   yum install sudo postgresql-server wget -y
>   rpm -i /tmp/rpms/ambari-server*.rpm
>   # Disable iptables
>   iptables -F
>   ambari-server setup -s
>   # Enable remote debug
>   sed -ri 's/-server -XX:NewRatio/-server -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -XX:NewRatio/g' /usr/sbin/ambari_server_main.py
>   ## Sleep until debugger connects
>   # sed -ri 's/dt_socket,server=y,suspend=.,address=5005/dt_socket,server=y,suspend=y,address=5005/g' /usr/sbin/ambari-server.py
>   # Fix an issue with UI client version
>   gunzip /usr/lib/ambari-server/web/javascripts/app.js.gz
>   amb=$(ambari-server --version); sed -i "s/App\.version = '';/App.version = '$amb';/" /usr/lib/ambari-server/web/javascripts/app.js
>   gzip /usr/lib/ambari-server/web/javascripts/app.js
>   # Increase task timeout
>   sed -ri 's/agent.package.install.task.timeout=1800/agent.package.install.task.timeout=3600/g' /etc/ambari-server/conf/ambari.properties
>   find /var/lib/ambari-server/resources/ -name metainfo.xml | xargs -L 1 sed -ri 's/<timeout>[[:digit:]]+[[:digit:]]*<\//<timeout>1800<\//g'
>   # Start the server
>   ambari-server start -v || exit 1
> fi
> # Agent
> iptables -F
> yum clean all
> yum install -y wget
> rpm -i /tmp/rpms/ambari-agent*.rpm
> # Replace server hostname
> sed -ri "s/hostname=localhost/hostname=$ASERVER/g" /etc/ambari-agent/conf/ambari-agent.ini
> # Enable debug mode at agent
> # sed -ri 's/=INFO/=DEBUG/g' /etc/ambari-agent/conf/ambari-agent.ini
> ambari-agent start || exit 1
> {code}
> When I restart ambari-server, the agent running on the same node is always killed. This happens because the agent is launched in the same process group as ambari-server, and stopping ambari-server kills everything that belongs to its process group. I assume this situation is common when ambari-server and ambari-agent are launched from the same shell script via ssh, and possibly also via configuration management tools such as Puppet or Chef (I did not verify this assumption).
> *More info:*
> {code}
> [root@dlysnichenko-ru3-1 ~]# ps -ejH
>   PID  PGID   SID TTY          TIME CMD
>  1584  1584  1584 ?        00:00:00   sshd
>  2659  2659  2659 ?        00:00:00     sshd
>  2662  2662  2662 pts/0    00:00:00       bash
>  3268  3268  2662 pts/0    00:00:00         ps
>  2056  2041  2041 ?        00:00:00   postmaster
>  2058  2058  2058 ?        00:00:00     postmaster
>  2060  2060  2060 ?        00:00:00     postmaster
>  2061  2061  2061 ?        00:00:00     postmaster
>  2062  2062  2062 ?        00:00:00     postmaster
>  2063  2063  2063 ?        00:00:00     postmaster
>  2380  2380  2380 ?        00:00:00     postmaster
>  2397  2397  2397 ?        00:00:00     postmaster
>  2649  2649  2649 ?        00:00:01     postmaster
>  2654  2654  2654 ?        00:00:00     postmaster
>  2655  2655  2655 ?        00:00:00     postmaster
>  2656  2656  2656 ?        00:00:00     postmaster
>  2360  1644  1644 ?        00:00:59   java
>  2507  1644  1644 ?        00:00:00   python2.6
>  2515  1644  1644 ?        00:00:01     python2.6
>  3230  3230  3230 ?        00:00:00   anacron
> [root@dlysnichenko-ru3-1 ~]# ambari-agent status
> Found ambari-agent PID: 2515
> ambari-agent running.
> Agent PID at: /var/run/ambari-agent/ambari-agent.pid
> Agent out at: /var/log/ambari-agent/ambari-agent.out
> Agent log at: /var/log/ambari-agent/ambari-agent.log
> [root@dlysnichenko-ru3-1 ~]# ambari-server stop
> Using python  /usr/bin/python2.6
> Stopping ambari-server
> Ambari Server stopped
> [root@dlysnichenko-ru3-1 ~]# ambari-agent status
> Found ambari-agent PID: 2515
> ambari-agent not running. Stale PID File at: /var/run/ambari-agent/ambari-agent.pid
> [root@dlysnichenko-ru3-1 ~]# 
> {code}
> Note: both the agent and the server share process group 1644. We should either not kill the whole process group when stopping ambari-server, or create a dedicated process group when launching it.
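
The second option in the note (a dedicated process group at launch) can be sketched in Python, the language of Ambari's launcher scripts. This is a minimal illustration under stated assumptions, not the actual Ambari fix: `launch_in_own_group` and `launch_in_callers_group` are hypothetical helper names, and a sleeping child stands in for the server process. It relies on `subprocess.Popen(..., start_new_session=True)`, available in Python 3.2+; on the Python 2.6 shown in the log above, passing `preexec_fn=os.setsid` should have the same effect.

```python
import os
import subprocess
import sys


def launch_in_own_group(cmd):
    """Start cmd as the leader of a new session and process group.

    Hypothetical helper: the child calls setsid() before exec, so its
    PGID equals its own PID and differs from the caller's PGID.
    """
    return subprocess.Popen(cmd, start_new_session=True)


def launch_in_callers_group(cmd):
    """Start cmd the default way: it inherits the caller's process group."""
    return subprocess.Popen(cmd)


if __name__ == "__main__":
    # A sleeping child stands in for a long-running daemon.
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    shared = launch_in_callers_group(sleeper)
    isolated = launch_in_own_group(sleeper)
    try:
        print("caller PGID:  ", os.getpgrp())
        print("shared PGID:  ", os.getpgid(shared.pid))
        print("isolated PGID:", os.getpgid(isolated.pid))
    finally:
        # Clean up both children regardless of what was printed.
        for proc in (shared, isolated):
            proc.kill()
            proc.wait()
```

With the server isolated this way, a group-wide signal sent on stop would reach only the server and its own children, not an agent that happened to be started from the same ssh session.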


