Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2015/03/11 18:56:38 UTC

[jira] [Created] (AMBARI-10031) Ambari-agent died under SLES (and could not even restart automatically)

Andrew Onischuk created AMBARI-10031:
----------------------------------------

             Summary: Ambari-agent died under SLES (and could not even restart automatically)
                 Key: AMBARI-10031
                 URL: https://issues.apache.org/jira/browse/AMBARI-10031
             Project: Ambari
          Issue Type: Bug
            Reporter: Andrew Onischuk
            Assignee: Andrew Onischuk
             Fix For: 2.0.0


Cluster where the issue was reproduced:

    
    
    
    | 172.18.145.30 | dmitriusan-sles3-ru1-5.cs1cloud.internal | dmitriusan-sles3-ru1-5 |
    | 172.18.145.45 | dmitriusan-sles3-ru1-6.cs1cloud.internal | dmitriusan-sles3-ru1-6 |
    | 172.18.145.55 | dmitriusan-sles3-ru1-7.cs1cloud.internal | dmitriusan-sles3-ru1-7 |
    

I was performing an RU over the weekend and left the cluster running to finalize
it later. The cluster ran unattended for 2 days, and ambari-agent died due to
running out of memory. The agents on the other nodes are running fine.
The node has 8 GB of RAM, and memory does not look exhausted (unless the agent
needs more than 1100 MB of RAM):

    
    
    
    dmitriusan-sles3-ru1-6:~ # free -m
                 total       used       free     shared    buffers     cached
    Mem:          7872       7077        795          0        134        222
    -/+ buffers/cache:       6720       1151
    Swap:            0          0          0
    
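To check how much memory the agent process itself actually uses (and whether it
keeps growing), something like the helper below can sample VmRSS from /proc.
This is a hypothetical helper, not part of ambari-agent; the pid file location
is an assumption and may differ on a given install.

    # rss_check.py - hypothetical helper, not part of ambari-agent.
    # Samples the agent's resident set size so the "needs more than 1100 MB"
    # question above can be answered by watching it over time.
    import time

    PID_FILE = "/var/run/ambari-agent/ambari-agent.pid"  # assumed location

    def rss_mb(pid):
        """Return VmRSS of the given pid in MB, read from /proc/<pid>/status."""
        with open("/proc/%d/status" % pid) as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1]) / 1024.0  # value is reported in kB
        return 0.0

    if __name__ == "__main__":
        pid = int(open(PID_FILE).read().strip())
        while True:
            print("ambari-agent pid %d: %.1f MB RSS" % (pid, rss_mb(pid)))
            time.sleep(60)  # one sample per minute is enough to spot a leak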

So I suspect a memory leak (probably caused by the status checks/jobs). Log files
are attached.

    
    
    
    WARNING 2015-03-10 06:10:30,692 scheduler.py:496 - Run time of job "c811d199-b07f-4eaf-995b-bf91e5ff848f (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.480393)" was missed by 0:00:03.212293
    WARNING 2015-03-10 06:10:38,214 scheduler.py:496 - Run time of job "5c219f4e-62e1-482c-88fc-e11b40935541 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:29.881993)" was missed by 0:00:08.332634
    INFO 2015-03-10 06:10:38,995 scheduler.py:527 - Job "13163515-f895-4342-b802-12ce39c65fb9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473685)" executed successfully
    INFO 2015-03-10 06:10:39,088 scheduler.py:527 - Job "6186b998-9eb6-4f7b-af8b-96c27c0da962 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472139)" executed successfully
    INFO 2015-03-10 06:10:39,089 scheduler.py:527 - Job "1531e319-25e9-4909-b461-bec0ba59c1d9 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.472907)" executed successfully
    INFO 2015-03-10 06:10:39,123 Controller.py:247 - Heartbeat response received (id = 21240)
    INFO 2015-03-10 06:10:39,408 Controller.py:291 - No commands sent from dmitriusan-sles3-ru1-5.cs1cloud.internal
    INFO 2015-03-10 06:10:42,672 scheduler.py:527 - Job "81137f2d-a1a8-433f-9446-4167a06b6fa3 (trigger: interval[0:01:00], next run at: 2015-03-10 06:11:27.473320)" executed successfully
    WARNING 2015-03-10 06:10:43,575 scheduler.py:496 - Run time of job "84ac5821-646b-41c1-8ac7-a561cd75d3ef (trigger: interval[0:01:00], next run at: 2015-03-10 06:10:41.837046)" was missed by 0:00:01.737801
    ERROR 2015-03-10 06:10:45,043 CustomServiceOrchestrator.py:201 - Caught an exception while executing custom service command: <type 'exceptions.OSError'>: [Errno 12] Cannot allocate memory; [Errno 12] Cannot allocate memory
    Traceback (most recent call last):
      File "/usr/lib/python2.6/site-packages/ambari_agent/CustomServiceOrchestrator.py", line 176, in runCommand
        task_id, override_output_files, handle = handle)
      File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 84, in run_file
        process = self.launch_python_subprocess(pythonCommand, tmpout, tmperr)
      File "/usr/lib/python2.6/site-packages/ambari_agent/PythonExecutor.py", line 151, in launch_python_subprocess
        stderr=tmperr, close_fds=close_fds, env=command_env)
      File "/usr/lib64/python2.6/subprocess.py", line 623, in __init__
        errread, errwrite)
      File "/usr/lib64/python2.6/subprocess.py", line 1051, in _execute_child
        self.pid = os.fork()
    OSError: [Errno 12] Cannot allocate memory
    
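Note that the failure comes from os.fork() inside subprocess, not from Python
allocating objects: with no swap configured (Swap is 0 in the free output above),
the kernel can refuse to fork once the agent's own address space has grown large,
even though ~1.1 GB still looks free. A defensive wrapper along these lines (an
illustrative sketch only, not the actual PythonExecutor code; the names are made
up) would at least record the agent's footprint when that happens:

    # enomem_aware_launch.py - illustrative sketch, not the real PythonExecutor
    # code; function and variable names are invented for this example.
    import errno
    import logging
    import subprocess

    logger = logging.getLogger(__name__)

    def launch_subprocess(command, stdout, stderr):
        """Start a child process; if fork fails with ENOMEM, log the parent's
        own memory footprint before re-raising, so a leak shows up in the log."""
        try:
            return subprocess.Popen(command, stdout=stdout, stderr=stderr)
        except OSError as e:
            if e.errno == errno.ENOMEM:
                with open("/proc/self/status") as f:
                    vm = [l.strip() for l in f if l.startswith(("VmSize:", "VmRSS:"))]
                logger.error("fork failed with ENOMEM; agent memory: %s", ", ".join(vm))
            raise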

Also, the agent could not restart automatically:

    
    
    
    INFO 2015-03-10 06:11:44,312 NetUtil.py:60 - Connecting to https://dmitriusan-sles3-ru1-5.cs1cloud.internal:8440/connection_info
    INFO 2015-03-10 06:11:44,639 security.py:93 - SSL Connect being called.. connecting to the server
    INFO 2015-03-10 06:11:44,730 security.py:55 - SSL connection established. Two-way SSL authentication is turned off on the server.
    INFO 2015-03-10 06:11:44,733 Controller.py:247 - Heartbeat response received (id = 21240)
    ERROR 2015-03-10 06:11:44,733 Controller.py:261 - Error in responseId sequence - restarting
    INFO 2015-03-10 06:11:46,986 main.py:68 - loglevel=logging.INFO
    INFO 2015-03-10 06:11:46,988 DataCleaner.py:36 - Data cleanup thread started
    INFO 2015-03-10 06:11:46,997 DataCleaner.py:117 - Data cleanup started
    INFO 2015-03-10 06:11:47,222 DataCleaner.py:119 - Data cleanup finished
    ERROR 2015-03-10 06:11:47,641 main.py:243 - Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
    UID        PID  PPID  C STIME TTY          TIME CMD
    root      1421     1  0 06:07 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
    
    INFO 2015-03-10 06:11:47,654 PingPortListener.py:62 - Ping port listener killed
    
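The process that holds port 8670 here is the leftover beeline alert check that
was started through sudo. One plausible way such a child ends up owning the
agent's ping port is file descriptor inheritance: on Python 2, subprocess.Popen
defaults to close_fds=False, so a spawned command inherits the listening socket
and keeps the port bound even after the agent process itself has died. A minimal
standalone demonstration of that mechanism (hypothetical, not Ambari code):

    # port_inheritance_demo.py - hypothetical demonstration, not Ambari code.
    # The parent binds the ping port and spawns a child without close_fds=True;
    # after the parent exits, the child still holds the socket, so a new
    # bind() on the same port fails until the child is gone.
    import socket
    import subprocess
    import sys

    PORT = 8670  # used only for illustration; pick a free port when trying this

    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("0.0.0.0", PORT))
    listener.listen(5)

    # close_fds defaults to False on Python 2.x, so the child inherits the
    # listening descriptor along with everything else.
    subprocess.Popen(["sleep", "600"])

    # Simulate the agent dying: the parent exits, "sleep" keeps the port busy.
    sys.exit(0)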

Manual restart failed as well:

    
    
    
    ERROR: ambari-agent start failed. For more details, see /var/log/ambari-agent/ambari-agent.out:
    ====================
    Failed to start ping port listener of: Could not open port 8670 because port already used by another process:
    UID        PID  PPID  C STIME TTY          TIME CMD
    root     25597     1  0 05:59 ?        00:00:00 /usr/bin/sudo su ambari-qa -l -s /bin/bash -c export  PATH='/usr/sbin:/sbin:/usr/lib/ambari-server/*:/sbin:/usr/sbin:/usr/local/sbin:/root/bin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/var/lib/ambari-agent:/bin/:/usr/bin/:/usr/lib/hive/bin/:/usr/sbin/' ; ! beeline -u 'jdbc:hive2://dmitriusan-sles3-ru1-6.cs1cloud.internal:10000' -e '' 2>&1| awk '{print}'|grep -i -e 'Connection refused' -e 'Invalid URL'
    ====================
    Agent out at: /var/log/ambari-agent/ambari-agent.out
    Agent log at: /var/log/ambari-agent/ambari-agent.log
    
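Until this is fixed, a practical workaround is to make sure nothing is still
holding port 8670 before retrying the start. A small pre-check along these lines
(hypothetical helper, not part of ambari-agent; it only tests the bind and
points at lsof, which is assumed to be installed) tells whether the stale alert
process still needs to be killed:

    # check_ping_port.py - hypothetical pre-start check, not part of ambari-agent.
    import errno
    import socket
    import sys

    PING_PORT = 8670  # the agent ping port from the log above

    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("0.0.0.0", PING_PORT))
    except socket.error as e:
        if e.errno == errno.EADDRINUSE:
            print("Port %d is still held by another process; find it with" % PING_PORT)
            print("'lsof -i :%d' (or the ps output above) and stop it first." % PING_PORT)
            sys.exit(1)
        raise
    finally:
        s.close()

    print("Port %d is free; ambari-agent should be able to start." % PING_PORT)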

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)