Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2015/03/11 19:02:38 UTC
[jira] [Resolved] (AMBARI-10031) Ambari-agent died under SLES (and
could not even restart automatically)
[ https://issues.apache.org/jira/browse/AMBARI-10031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andrew Onischuk resolved AMBARI-10031.
--------------------------------------
Resolution: Fixed
Committed to trunk and branch-2.0.0
> Ambari-agent died under SLES (and could not even restart automatically)
> -----------------------------------------------------------------------
>
> Key: AMBARI-10031
> URL: https://issues.apache.org/jira/browse/AMBARI-10031
> Project: Ambari
> Issue Type: Bug
> Reporter: Andrew Onischuk
> Assignee: Andrew Onischuk
> Fix For: 2.0.0
>
>
> I was performing a rolling upgrade (RU) over the weekend and left the cluster
> running so I could finalize it later. The cluster ran unattended for 2 days,
> and ambari-agent died with an out-of-memory error. Agents on the other nodes
> are running fine.
> The node has 8 GB of RAM, and memory does not look exhausted (unless the agent
> needs more than 1100 MB of RAM):
>
>
>
> dmitriusan-sles3-ru1-6:~ # free -m
>              total       used       free     shared    buffers     cached
> Mem:          7872       7077        795          0        134        222
> -/+ buffers/cache:       6720       1151
> Swap:            0          0          0
>
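For readers checking the numbers: the "-/+ buffers/cache" row can be reproduced from the "Mem:" row. The sketch below is only an illustration of that arithmetic using the values quoted above; the function name is made up for this example.

```python
# Values copied from the "Mem:" row of the `free -m` output above (MiB).
used, free, buffers, cached = 7077, 795, 134, 222


def effective_memory(used, free, buffers, cached):
    """Return (used, free) with buffers and page cache counted as reclaimable.

    Hypothetical helper for illustration; it mirrors how the
    "-/+ buffers/cache" row relates to the "Mem:" row.
    """
    reclaimable = buffers + cached
    return used - reclaimable, free + reclaimable


app_used, app_free = effective_memory(used, free, buffers, cached)
print("used by applications: %d MiB" % app_used)
print("available to applications: %d MiB" % app_free)
```

The second figure matches the 1151 MiB reported by free. The first comes out as 6721 rather than the reported 6720, which is presumably a rounding difference in the MiB conversion. Either way, roughly 1.1 GB was available to applications and no swap was configured.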
> So I suspect a memory leak (probably in the status checks/jobs). Log files
> are attached.
>
>
>
> WARNING 2015-03-10 06:10:30,692 scheduler.py:496 - Run time of job "c811d199-b07f-4eaf-995b-bf91e5ff848f (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.480393)" was missed by 0:00:03.212293
> WARNING 2015-03-10 06:10:38,214 scheduler.py:496 - Run time of job "5c219f4e-62e1-482c-88fc-e11b40935541 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:29.881993)" was missed by 0:00:08.332634
> INFO 2015-03-10 06:10:38,995 scheduler.py:527 - Job "13163515-f895-4342-b802-12ce39c65fb9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473685)" executed successfully
> INFO 2015-03-10 06:10:39,088 scheduler.py:527 - Job "6186b998-9eb6-4f7b-af8b-96c27c0da962 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472139)" executed successfully
> INFO 2015-03-10 06:10:39,089 scheduler.py:527 - Job "1531e319-25e9-4909-b461-bec0ba59c1d9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472907)" executed successfully
> INFO 2015-03-10 06:10:39,123 Controller.py:247 - Heartbeat response received (id = 21240)
> INFO 2015-03-10 06:10:39,408 Controller.py:291 - No commands sent from dmitriusan-sles3-ru1-5.cs1cloud.internal
> INFO 2015-03-10 06:10:42,672 scheduler.py:527 - Job "81137f2d-a1a8-433f-9446-4167a06b6fa3 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473320)" executed successfully
> WARNING 2015-03-10 06:10:43,575 scheduler.py:496 - Run time of job "84ac5821-646b-41c1-8ac7-a561cd75d3ef (trigger: interval[0:01:00], next run at: 2015-03-10 06:10:41.837046)" was missed by 0:00:01.737801
> ERROR 2015-03-10 06:10:45,043 CustomServiceOrchestrator.py:201 - Caught an exception while executing custom service command: <type 'exceptions.OSError'>: [Errno 12] Cannot allocate memory; [Errno 12] Cannot allocate memory
> Traceback (most recent call last):
>   File "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 176, in runCommand
>     task_id, override_output_files, handle = handle)
>   File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 84, in run_file
>     process = self.launch_python_subprocess(pythonCommand, tmpout, tmperr)
>   File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 151, in launch_python_subprocess
>     stderr=tmperr, close_fds=close_fds, env=command_env)
>   File "/usr/lib64/python2.6/subprocess.py", line 623, in __init__
>     errread, errwrite)
>   File "/usr/lib64/python2.6/subprocess.py", line 1051, in _execute_child
>     self.pid = os.fork()
> OSError: [Errno 12] Cannot allocate memory
>
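The traceback above ends in os.fork() raising OSError with errno 12 (ENOMEM), which means the kernel refused to create the child process. As a minimal sketch of how a caller could tell this case apart from other launch failures, assuming a hypothetical helper named launch_or_report (it is not part of ambari-agent), something like the following would work:

```python
import errno
import subprocess


def launch_or_report(command, popen=subprocess.Popen):
    """Start `command`; return the process, or None if fork() hit ENOMEM.

    Hypothetical helper for illustration only. `popen` is injectable so
    the out-of-memory path can be exercised without exhausting memory.
    """
    try:
        return popen(command)
    except OSError as e:
        if e.errno == errno.ENOMEM:
            # Same failure as in the traceback: the kernel could not
            # allocate what it needs to fork the child process.
            return None
        # Any other OSError (e.g. missing executable) is re-raised.
        raise
```

Returning None here is just one possible policy. Whether the agent should skip the command, retry later, or shut down cleanly at that point is a separate design decision that this sketch does not settle.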
> Also, the agent could not restart automatically:
>
>
>
> INFO 2015-03-10 06:11:44,312 NetUtil.py:60 - Connecting to https://dmitriusan-sles3-ru1-5.cs1cloud.internal:8440/connection_info
> INFO 2015-03-10 06:11:44,639 security.py:93 - SSL Connect being called.. connecting to the server
> INFO 2015-03-10 06:11:44,730 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
> INFO 2015-03-10 06:11:44,733 Controller.py:247 - Heartbeat response received (id = 21240)
> ERROR 2015-03-10 06:11:44,733 Controller.py:261 - Error in responseId sequence - restarting
> INFO 2015-03-10 06:11:46,986 main.py:68 - loglevel=logging.INFO
> INFO 2015-03-10 06:11:46,988 DataCleaner.py:36 - Data cleanup thread started
> INFO 2015-03-10 06:11:46,997 DataCleaner.py:117 - Data cleanup started
> INFO 2015-03-10 06:11:47,222 DataCleaner.py:119 - Data cleanup finished
> ERROR 2015-03-10 06:11:47,641 main.py:243 - Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
> UID PID PPID C STIME TTY TIME CMD
> root 1421 1 0 06:07 ? 00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
>
> INFO 2015-03-10 06:11:47,654 PingPortListener.py:62 - Ping port listener killed
>
> A manual restart failed as well:
>
>
>
> ERROR: ambari-agent start failed. For more details, see /var/log/ambari-agent/ambari-agent.out:
> ====================
> Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
> UID PID PPID C STIME TTY TIME CMD
> root 25597 1 0 05:59 ? 00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
> ====================
> Agent out at: /var/log/ambari-agent/ambari-agent.out
> Agent log at: /var/log/ambari-agent/ambari-agent.log
>
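Both restart attempts above fail because port 8670 is still held by another process. In both listings that process is a Hive check (sudo su ambari-qa ... beeline) rather than an agent, which would be consistent with a child process having inherited the listening socket, though the report does not confirm this. As a rough sketch, assuming a hypothetical helper named port_is_free, one way to check whether the ping port can be bound before starting the agent is:

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to (host, port).

    Hypothetical helper for illustration; 8670 is the ping port named
    in the error message above.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except socket.error as e:
        if e.errno == errno.EADDRINUSE:
            # Another process (or a leftover child) still holds the port.
            return False
        raise
    finally:
        sock.close()


if __name__ == "__main__":
    print("port 8670 free: %s" % port_is_free(8670))
```

If this reports the port as busy after the agent has exited, the process holding it has to be stopped before the agent can start again.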
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)