Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2015/03/11 18:56:38 UTC
[jira] [Created] (AMBARI-10031) Ambari-agent died under SLES (and could not even restart automatically)
Andrew Onischuk created AMBARI-10031:
----------------------------------------
Summary: Ambari-agent died under SLES (and could not even restart automatically)
Key: AMBARI-10031
URL: https://issues.apache.org/jira/browse/AMBARI-10031
Project: Ambari
Issue Type: Bug
Reporter: Andrew Onischuk
Assignee: Andrew Onischuk
Fix For: 2.0.0
Cluster where the issue was reproduced:
| 172.18.145.30| dmitriusan-sles3-ru1-5.cs1cloud.internal| dmitriusan-sles3-ru1-5|
| 172.18.145.45| dmitriusan-sles3-ru1-6.cs1cloud.internal| dmitriusan-sles3-ru1-6|
| 172.18.145.55| dmitriusan-sles3-ru1-7.cs1cloud.internal| dmitriusan-sles3-ru1-7|
I was performing an RU (rolling upgrade) over the weekend and left the cluster running, intending to finalize it later. So the cluster ran unattended for 2 days, and ambari-agent died due to an out-of-memory condition. The agents on the other nodes are running fine.
The node has 8 GB of RAM, and it does not look like memory is exhausted (unless the agent needs more than 1100 MB of RAM):
dmitriusan-sles3-ru1-6:~ # free -m
             total       used       free     shared    buffers     cached
Mem:          7872       7077        795          0        134        222
-/+ buffers/cache:       6720       1151
Swap:            0          0          0
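For reference, the "-/+ buffers/cache" line is derived from the Mem: line: memory actually available to applications is free + buffers + cached. A minimal sketch (not Ambari code; the function name and sample text are illustrative) of that arithmetic:

```python
def effective_free_mb(free_output):
    """Return MB available to applications, as free -m's
    "-/+ buffers/cache" free column reports it."""
    for line in free_output.splitlines():
        if line.startswith("Mem:"):
            # Columns: total used free shared buffers cached
            _, total, used, free, shared, buffers, cached = line.split()
            return int(free) + int(buffers) + int(cached)
    raise ValueError("no Mem: line found")

sample = """\
             total       used       free     shared    buffers     cached
Mem:          7872       7077        795          0        134        222
-/+ buffers/cache:       6720       1151
Swap:            0          0          0
"""

print(effective_free_mb(sample))  # -> 1151, matching the free column above
```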
So I suspect a memory leak (probably due to status checks/jobs). Log files are attached.
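One way to confirm a suspected leak like this would be to log the agent's resident set size on every status-check cycle and watch whether it grows without bound. A hypothetical stdlib-only sampling helper (not part of the agent):

```python
import resource

def max_rss_kb():
    """Peak resident set size of this process (kilobytes on Linux).

    A value that keeps climbing across long-running status-check
    cycles is consistent with a leak; a flat value rules one out.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

print(max_rss_kb() > 0)  # True: every live process has a nonzero RSS
```

Note that ru_maxrss is the peak, not the instantaneous RSS, but for detecting monotonic growth over days the two behave the same.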
WARNING 2015-03-10 06:10:30,692 scheduler.py:496 - Run time of job "c811d199-b07f-4eaf-995b-bf91e5ff848f (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.480393)" was missed by 0:00:03.212293
WARNING 2015-03-10 06:10:38,214 scheduler.py:496 - Run time of job "5c219f4e-62e1-482c-88fc-e11b40935541 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:29.881993)" was missed by 0:00:08.332634
INFO 2015-03-10 06:10:38,995 scheduler.py:527 - Job "13163515-f895-4342-b802-12ce39c65fb9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473685)" executed successfully
INFO 2015-03-10 06:10:39,088 scheduler.py:527 - Job "6186b998-9eb6-4f7b-af8b-96c27c0da962 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472139)" executed successfully
INFO 2015-03-10 06:10:39,089 scheduler.py:527 - Job "1531e319-25e9-4909-b461-bec0ba59c1d9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472907)" executed successfully
INFO 2015-03-10 06:10:39,123 Controller.py:247 - Heartbeat response received (id = 21240)
INFO 2015-03-10 06:10:39,408 Controller.py:291 - No commands sent from dmitriusan-sles3-ru1-5.cs1cloud.internal
INFO 2015-03-10 06:10:42,672 scheduler.py:527 - Job "81137f2d-a1a8-433f-9446-4167a06b6fa3 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473320)" executed successfully
WARNING 2015-03-10 06:10:43,575 scheduler.py:496 - Run time of job "84ac5821-646b-41c1-8ac7-a561cd75d3ef (trigger: interval[0:01:00], next run at: 2015-03-10 06:10:41.837046)" was missed by 0:00:01.737801
ERROR 2015-03-10 06:10:45,043 CustomServiceOrchestrator.py:201 - Caught an exception while executing custom service command: <type 'exceptions.OSError'>: [Errno 12] Cannot allocate memory; [Errno 12] Cannot allocate memory
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 176, in runCommand
    task_id, override_output_files, handle = handle)
  File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 84, in run_file
    process = self.launch_python_subprocess(pythonCommand, tmpout, tmperr)
  File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 151, in launch_python_subprocess
    stderr=tmperr, close_fds=close_fds, env=command_env)
  File "/usr/lib64/python2.6/subprocess.py", line 623, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1051, in _execute_child
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
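The traceback shows os.fork() failing with ENOMEM while the agent launches a command subprocess, which kills the whole command. One defensive option (a hypothetical wrapper, not the Ambari PythonExecutor API) would be to retry the launch with backoff rather than let a transient allocation failure take the agent down:

```python
import errno
import subprocess
import time

def popen_with_retry(cmd, retries=3, delay=1.0, **kwargs):
    """Launch cmd, retrying with exponential backoff if the
    underlying fork fails with ENOMEM (errno 12)."""
    for attempt in range(retries + 1):
        try:
            return subprocess.Popen(cmd, **kwargs)
        except OSError as e:
            if e.errno != errno.ENOMEM or attempt == retries:
                raise  # unrelated error, or out of retries
            time.sleep(delay * (2 ** attempt))

proc = popen_with_retry(["echo", "ok"], stdout=subprocess.PIPE)
print(proc.communicate()[0].decode().strip())  # -> ok
```

This only papers over transient pressure, of course; if the leak suspected above is real, the retries would eventually fail too.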
Also, the agent could not restart automatically:
INFO 2015-03-10 06:11:44,312 NetUtil.py:60 - Connecting to https://dmitriusan-sles3-ru1-5.cs1cloud.internal:8440/connection_info
INFO 2015-03-10 06:11:44,639 security.py:93 - SSL Connect being called.. connecting to the server
INFO 2015-03-10 06:11:44,730 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
INFO 2015-03-10 06:11:44,733 Controller.py:247 - Heartbeat response received (id = 21240)
ERROR 2015-03-10 06:11:44,733 Controller.py:261 - Error in responseId sequence - restarting
INFO 2015-03-10 06:11:46,986 main.py:68 - loglevel=logging.INFO
INFO 2015-03-10 06:11:46,988 DataCleaner.py:36 - Data cleanup thread started
INFO 2015-03-10 06:11:46,997 DataCleaner.py:117 - Data cleanup started
INFO 2015-03-10 06:11:47,222 DataCleaner.py:119 - Data cleanup finished
ERROR 2015-03-10 06:11:47,641 main.py:243 - Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
UID PID PPID C STIME TTY TIME CMD
root 1421 1 0 06:07 ? 00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
INFO 2015-03-10 06:11:47,654 PingPortListener.py:62 - Ping port listener killed
A manual restart failed as well:
ERROR: ambari-agent start failed. For more details, see /var/log/ambari-agent/ambari-agent.out:
====================
Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
UID PID PPID C STIME TTY TIME CMD
root 25597 1 0 05:59 ? 00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
====================
Agent out at: /var/log/ambari-agent/ambari-agent.out
Agent log at: /var/log/ambari-agent/ambari-agent.log
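The "Could not open port 8670" failure amounts to a failed bind() on the ping port while a stale process still holds it. Illustrative only (none of these names are Ambari's): a quick probe that reports whether a TCP port is already bound:

```python
import socket

def port_in_use(port, host="127.0.0.1"):
    """Try to bind the port; failure to bind means another
    process already holds it (EADDRINUSE)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return False
    except OSError:
        return True
    finally:
        s.close()

# Hold a port, then probe it:
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))      # kernel picks a free port
busy_port = holder.getsockname()[1]
print(port_in_use(busy_port))      # True: the port is held by `holder`
holder.close()
```

A probe like this could name the offending PID in the error path instead of dumping raw ps output, which would have made the stuck sudo/beeline process here easier to spot.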
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)