Posted to dev@ambari.apache.org by "Andrew Onischuk (JIRA)" <ji...@apache.org> on 2015/04/22 16:07:58 UTC

[jira] [Created] (AMBARI-10657) Ambari restart/stop operation loses control of Flume agents

Andrew Onischuk created AMBARI-10657:
----------------------------------------

             Summary: Ambari restart/stop operation loses control of Flume agents
                 Key: AMBARI-10657
                 URL: https://issues.apache.org/jira/browse/AMBARI-10657
             Project: Ambari
          Issue Type: Bug
            Reporter: Andrew Onischuk
            Assignee: Andrew Onischuk
             Fix For: 2.1.0


PROBLEM: Ambari seems to lose control of Flume agents - reporting them as
stopped even though the processes are still running.  
Trying to start the agents again results in:

    
    
    Please shutdown the agent or disable this component, or the agent will be in an undefined state. 
    
    Failed to bind to: /0.0.0.0:4545 Caused by: java.net.BindException: Address already in use
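
The BindException above means something is still listening on the port the agent wants. As a minimal sketch (not part of Ambari or Flume; the helper name `port_is_free` is hypothetical), a bind probe like the following can confirm that before attempting a start:

```python
import errno
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now.

    Hypothetical helper for illustration only. It probes by binding and
    immediately closing, so the answer can change right after it returns.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # Same condition Java reports as "Address already in use".
                return False
            raise  # permission problems etc. are not "port busy"
    return True
```

Calling `port_is_free(4545)` on the affected host would be expected to return False while the orphaned agent is still bound to 0.0.0.0:4545.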

STEPS TO REPRODUCE:  
1. Kill all agents using kill -9 (this step was necessary because the agents
were still running but reported as stopped in Ambari)

2. Start agents using Ambari

3. Check the content of the pid file. In this case it was 29873

4. Check the pid using "ps -aux | grep flume". The output in this case was:

    
    
    Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 
    flume 29873 0.0 0.0 106060 1308 ? Ss 13:50 0:00 bash -c export JAVA_HOME=/usr/jdk64/jdk1.7.0_45; /usr/hdp/current/flume-server/bin/flume-ng agent --name a1 --conf /etc/flume/conf/a1 --conf-file /etc/flume/conf/a1/flume.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655 
    flume 29874 35.7 0.5 17222116 272028 ? Sl 13:50 0:10 /usr/jdk64/jdk1.7.0_45/bin/java -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=10.26.118.10:8651 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655 
    

Everything is running fine at this point.

6. Restart the Flume agents using Ambari

7. Check the content of the pid file. In this case it was still 29873

8. Check the pid using "ps -aux | grep flume". The output in this case was:

    
    
    Warning: bad syntax, perhaps a bogus '-'? See /usr/share/doc/procps-3.2.8/FAQ 
    flume 3097 0.0 0.0 106060 1308 ? Ss 13:54 0:00 bash -c export JAVA_HOME=/usr/jdk64/jdk1.7.0_45; /usr/hdp/current/flume-server/bin/flume-ng agent --name a1 --conf /etc/flume/conf/a1 --conf-file /etc/flume/conf/a1/flume.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655 
    flume 3098 7.2 0.5 17222116 271076 ? Sl 13:54 0:10 /usr/jdk64/jdk1.7.0_45/bin/java -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=10.26.118.10:8651 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655 
    

As you can see, the pid file was not updated (it still holds 29873, while the
new processes are 3097 and 3098), and shortly after the restart Ambari reports
the agents as stopped.
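
To illustrate why a stale pid file produces a "stopped" status, here is a minimal sketch of a pid-file-based liveness check. This is an assumption about how such a check typically works on Unix, not Ambari's actual code; the function names are hypothetical.

```python
import os


def read_pid(pid_file):
    """Return the integer pid stored in pid_file, or None if missing/invalid."""
    try:
        with open(pid_file) as handle:
            return int(handle.read().strip())
    except (OSError, ValueError):
        return None


def pid_is_alive(pid):
    """Return True if a process with this pid exists (Unix only)."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0: existence check, nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # process exists but belongs to another user
    return True


def agent_status(pid_file):
    """Hypothetical status check: trusts only the pid recorded in pid_file."""
    return "RUNNING" if pid_is_alive(read_pid(pid_file)) else "STOPPED"
```

With a check of this shape, a pid file that still holds the pid of a process that has since exited yields STOPPED even though a newer agent process is running under a different pid.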

SUPPORT ANALYSIS:

"cat /var/run/flume/a1.pid" returns 10056, last written 16 March 2015 13:04.

When I check the running processes using "ps -aux | grep flume" it shows pids
26288 and 26289.

    
    
    flume 26288 0.0 0.0 106060 1308 ? Ss 13:04 0:00 bash -c export JAVA_HOME=/usr/jdk64/jdk1.7.0_45; /usr/hdp/current/flume-server/bin/flume-ng agent --name a1 --conf /etc/flume/conf/a1 --conf-file /etc/flume/conf/a1/flume.conf -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655 
    flume 26289 13.2 0.5 18359888 294220 ? Sl 13:04 1:15 /usr/jdk64/jdk1.7.0_45/bin/java -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=10.26.118.10:8651 -Dflume.monitoring.type=ganglia -Dflume.monitoring.hosts=hdp-mn00.c.onsight.nl:8655 
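
The comparison done by hand here (pid file value versus pids in the ps output) could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, and the sample lines are shortened versions of the ps output in this report.

```python
def pids_from_ps(ps_output, needle="flume"):
    """Collect pids (second column of `ps aux`-style lines) mentioning needle."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip warnings and other lines whose second column is not a pid.
        if len(fields) > 1 and needle in line and fields[1].isdigit():
            pids.append(int(fields[1]))
    return pids


def pid_file_is_stale(recorded_pid, ps_output, needle="flume"):
    """True if the pid recorded in the pid file is not among the running pids."""
    return recorded_pid not in pids_from_ps(ps_output, needle)


# Shortened sample based on the output above (command lines abbreviated).
SAMPLE = (
    "flume 26288 0.0 0.0 106060 1308 ? Ss 13:04 0:00 bash -c flume-ng agent --name a1\n"
    "flume 26289 13.2 0.5 18359888 294220 ? Sl 13:04 1:15 /usr/jdk64/jdk1.7.0_45/bin/java\n"
)

if __name__ == "__main__":
    print(pid_file_is_stale(10056, SAMPLE))
```

For the values in this report (pid file 10056, running pids 26288 and 26289) the pid file counts as stale.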
    

The content of "/var/run/flume/ambari-state.txt" is RUNNING.

When I check the Flume log file, nothing out of the ordinary is shown around
the time the pid was updated.  
I used: cat /var/log/flume/flume-a1.log | grep "16 Mar 2015 12:04"

    
    
    16 Mar 2015 12:04:13,166 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:214) - Start checkpoint for /home/flume/.flume/file-channel/checkpoint/checkpoint_1426501435529, elements to sync = 18272 
    16 Mar 2015 12:04:13,241 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:239) - Updating checkpoint metadata: logWriteOrderID: 1426503859575, queueSize: 576, queueHead: 475305 
    16 Mar 2015 12:04:13,341 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.writeCheckpoint:1025) - Updated checkpoint for file: /home/flume/.flume/file-channel/data/log-6 position: 9108128 logWriteOrderID: 1426503859575 
    16 Mar 2015 12:04:13,342 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.LogFile$RandomReader.close:504) - Closing RandomReader /home/flume/.flume/file-channel/data/log-4 
    16 Mar 2015 12:04:43,348 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.beginCheckpoint:214) - Start checkpoint for /home/flume/.flume/file-channel/checkpoint/checkpoint_1426501435529, elements to sync = 20332 
    16 Mar 2015 12:04:43,519 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:239) - Updating checkpoint metadata: logWriteOrderID: 1426503900154, queueSize: 0, queueHead: 495637 
    16 Mar 2015 12:04:43,628 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.writeCheckpoint:1025) - Updated checkpoint for file: /home/flume/.flume/file-channel/data/log-6 position: 19009888 logWriteOrderID: 1426503900154 
    16 Mar 2015 12:04:43,629 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.removeOldLogs:1080) - Removing old file: /home/flume/.flume/file-channel/data/log-4 
    16 Mar 2015 12:04:43,632 INFO [Log-BackgroundWorker-c1] (org.apache.flume.channel.file.Log.removeOldLogs:1080) - Removing old file: /home/flume/.flume/file-channel/data/log-4.meta 
    

Attached are the Flume conf, the output of the restart operation in Ambari
(taken when the agents are reported as stopped but are still running), the
agent log, and a screenshot of Ambari.





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)