Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2015/03/11 19:02:38 UTC

[jira] [Resolved] (AMBARI-10031) Ambari-agent died under SLES (and could not even restart automatically)

     [ https://issues.apache.org/jira/browse/AMBARI-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Onischuk resolved AMBARI-10031.
--------------------------------------
    Resolution: Fixed

Committed to trunk and branch-2.0.0

> Ambari-agent died under SLES (and could not even restart automatically)
> -----------------------------------------------------------------------
>
>                 Key: AMBARI-10031
>                 URL: https://issues.apache.org/jira/browse/AMBARI-10031
>             Project: Ambari
>          Issue Type: Bug
>            Reporter: Andrew Onischuk
>            Assignee: Andrew Onischuk
>             Fix For: 2.0.0
>
>
> I was performing a rolling upgrade (RU) over the weekend and left the cluster
> running to finalize it later. The cluster ran unattended for 2 days, and
> ambari-agent died because it ran out of memory. Agents on other nodes are running well.  
> The node has 8 GB of RAM, and memory does not look exhausted (unless the agent
> needs more than 1100 MB of RAM)
>     
>     
>     
>     dmitriusan-sles3-ru1-6:~ # free -m
>                  total       used       free     shared    buffers     cached
>     Mem:          7872       7077        795          0        134        222
>     -/+ buffers/cache:       6720       1151
>     Swap:            0          0          0
>     
> So I suspect a memory leak (probably due to status checks/jobs). Log files
> are attached.
>     
>     
>     
>     WARNING 2015-03-10 06:10:30,692 scheduler.py:496 - Run time of job "c811d199-b07f-4eaf-995b-bf91e5ff848f (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.480393)" was missed by 0:00:03.212293
>     WARNING 2015-03-10 06:10:38,214 scheduler.py:496 - Run time of job "5c219f4e-62e1-482c-88fc-e11b40935541 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:29.881993)" was missed by 0:00:08.332634
>     INFO 2015-03-10 06:10:38,995 scheduler.py:527 - Job "13163515-f895-4342-b802-12ce39c65fb9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473685)" executed successfully
>     INFO 2015-03-10 06:10:39,088 scheduler.py:527 - Job "6186b998-9eb6-4f7b-af8b-96c27c0da962 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472139)" executed successfully
>     INFO 2015-03-10 06:10:39,089 scheduler.py:527 - Job "1531e319-25e9-4909-b461-bec0ba59c1d9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472907)" executed successfully
>     INFO 2015-03-10 06:10:39,123 Controller.py:247 - Heartbeat response received (id = 21240)
>     INFO 2015-03-10 06:10:39,408 Controller.py:291 - No commands sent from dmitriusan-sles3-ru1-5.cs1cloud.internal
>     INFO 2015-03-10 06:10:42,672 scheduler.py:527 - Job "81137f2d-a1a8-433f-9446-4167a06b6fa3 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473320)" executed successfully
>     WARNING 2015-03-10 06:10:43,575 scheduler.py:496 - Run time of job "84ac5821-646b-41c1-8ac7-a561cd75d3ef (trigger: interval[0:01:00], next run at: 2015-03-10 06:10:41.837046)" was missed by 0:00:01.737801
>     ERROR 2015-03-10 06:10:45,043 CustomServiceOrchestrator.py:201 - Caught an exception while executing custom service command: <type 'exceptions.OSError'>: [Errno 12] Cannot allocate memory; [Errno 12] Cannot allocate memory
>     Traceback (most recent call last):
>       File "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 176, in runCommand
>         task_id, override_output_files, handle = handle)
>       File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 84, in run_file
>         process = self.launch_python_subprocess(pythonCommand, tmpout, tmperr)
>       File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 151, in launch_python_subprocess
>         stderr=tmperr, close_fds=close_fds, env=command_env)
>       File "/usr/lib64/python2.6/subprocess.py", line 623, in __init__
>         errread, errwrite)
>       File "/usr/lib64/python2.6/subprocess.py", line 1051, in _execute_child
>         self.pid = os.fork()
>     OSError: [Errno 12] Cannot allocate memory
>     
> Also, the agent could not restart automatically:
>     
>     
>     
>     INFO 2015-03-10 06:11:44,312 NetUtil.py:60 - Connecting to https://dmitriusan-sles3-ru1-5.cs1cloud.internal:8440/connection_info
>     INFO 2015-03-10 06:11:44,639 security.py:93 - SSL Connect being called.. connecting to the server
>     INFO 2015-03-10 06:11:44,730 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
>     INFO 2015-03-10 06:11:44,733 Controller.py:247 - Heartbeat response received (id = 21240)
>     ERROR 2015-03-10 06:11:44,733 Controller.py:261 - Error in responseId sequence - restarting
>     INFO 2015-03-10 06:11:46,986 main.py:68 - loglevel=logging.INFO
>     INFO 2015-03-10 06:11:46,988 DataCleaner.py:36 - Data cleanup thread started
>     INFO 2015-03-10 06:11:46,997 DataCleaner.py:117 - Data cleanup started
>     INFO 2015-03-10 06:11:47,222 DataCleaner.py:119 - Data cleanup finished
>     ERROR 2015-03-10 06:11:47,641 main.py:243 - Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root      1421     1  0 06:07 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
>     
>     INFO 2015-03-10 06:11:47,654 PingPortListener.py:62 - Ping port listener killed
>     
> A manual restart failed as well:
>     
>     
>     
>     ERROR: ambari-agent start failed. For more details, see /var/log/ambari-agent/ambari-agent.out:
>     ====================
>     Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
>     UID        PID  PPID  C STIME TTY          TIME CMD
>     root     25597     1  0 05:59 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
>     ====================
>     Agent out at: /var/log/ambari-agent/ambari-agent.out
>     Agent log at: /var/log/ambari-agent/ambari-agent.log
>     
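
The "about 1100 MB" figure in the description can be reproduced from the
`free -m` output quoted above. The following is only a sketch for
illustration: the function name `parse_free_mem` and the dictionary keys are
made up here, and it assumes the older `free` layout with separate `buffers`
and `cached` columns, as shown in the report.

```python
def parse_free_mem(output):
    """Derive application-level memory figures (in MB) from old-style `free -m` output."""
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            # Column order: total, used, free, shared, buffers, cached
            total, used, free, shared, buffers, cached = map(int, fields[1:7])
            return {
                "total": total,
                # Memory held by processes, excluding reclaimable buffers/cache
                "used_by_apps": used - buffers - cached,
                # Memory that could still be handed to processes
                "available": free + buffers + cached,
            }
    raise ValueError("no 'Mem:' line found")


# Output copied from the issue description
SAMPLE = """\
             total       used       free     shared    buffers     cached
Mem:          7872       7077        795          0        134        222
-/+ buffers/cache:       6720       1151
Swap:            0          0          0
"""

if __name__ == "__main__":
    print(parse_free_mem(SAMPLE))
```

For the quoted sample this gives 1151 MB available, matching the
`-/+ buffers/cache` line. The computed "used" value is 6721 MB, which differs
from the 6720 shown by `free`, presumably because `free` rounds from kilobytes.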

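Both restart attempts above fail because port 8670 is still held by a leftover
process. A quick way to check whether a port can be bound before starting the
agent might look like the sketch below. The helper name `port_is_free` and the
default host `127.0.0.1` are assumptions for illustration and are not part of
the Ambari code.

```python
import errno
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to (host, port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError as e:
        # EADDRINUSE means some other socket already owns the port
        if e.errno == errno.EADDRINUSE:
            return False
        raise
    finally:
        sock.close()


if __name__ == "__main__":
    # 8670 is the ping port named in the error messages of this report
    print("port 8670 free:", port_is_free(8670))
```

The check only reports whether the port is in use. Finding the owning process
still requires a tool such as the process listing shown in the agent's error
message.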

