You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flume.apache.org by "Matthew Rathbone (JIRA)" <ji...@apache.org> on 2011/09/09 23:14:09 UTC

[jira] [Commented] (FLUME-626) When using rpcSource flume SILENTLY fails when you update the node config via zookeeper

    [ https://issues.apache.org/jira/browse/FLUME-626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13101569#comment-13101569 ] 

Matthew Rathbone commented on FLUME-626:
----------------------------------------

This issue still exists in the trunk. I'm attempting to take a look at it.

Here is an example I'm going to try and reproduce:
(two nodes; api1 and collector1)
configs:
api1 = rpcSource(9100) => agentBESink(collector1)
collector1 = collectorSource() => collectorSink("some://hdfs/path")

Timeline:
1) collector1 goes into an error state consistant with FLUME-762
2) I execute "exec config collector1 <repeating the config above>"
3) api1 stops receiving messages from the rpcSource until collector1 is restarted
  -- here are the logs that api1 produces:

11/09/09 18:06:20 INFO thrift.ThriftEventSink: ThriftEventSink to logs-2:9109 opened
11/09/09 18:06:20 INFO debug.InsistentOpenDecorator: Opened StubbornAppend on try 4
11/09/09 20:00:10 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 1 heartbeats
11/09/09 20:00:15 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 2 heartbeats
11/09/09 20:00:20 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
11/09/09 20:01:01 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
11/09/09 20:01:06 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 4 heartbeats
11/09/09 20:01:11 ERROR agent.LogicalNode: Forcing driver to exit uncleanly
11/09/09 20:01:11 ERROR connector.DirectDriver: Closing down due to exception during append calls
java.io.IOException: Waiting for queue element was interrupted! null
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:222)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:105)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:209)
	... 1 more
11/09/09 20:01:11 INFO connector.DirectDriver: Connector logicalNode api-29-27 exited with error: Waiting for queue element was interrupted! null
java.io.IOException: Waiting for queue element was interrupted! null
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:222)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:105)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:209)
	... 1 more
11/09/09 20:01:11 INFO thrift.ThriftEventSource: Closed server on port 9108...
11/09/09 20:01:11 INFO thrift.ThriftEventSource: Queue still has 0 elements ...
11/09/09 20:01:11 INFO thrift.ThriftEventSink: ThriftEventSink on port 9109 closed
11/09/09 20:01:11 ERROR connector.DirectDriver: Exiting driver logicalNode api-29-27 in error state ThriftEventSource | Agent because Waiting for queue element was interrupted! null
11/09/09 20:01:11 INFO agent.LogicalNode: Node config successfully set to FlumeConfigData: {srcVer:'Fri Sep 09 20:00:39 UTC 2011' snkVer:'Fri Sep 09 20:00:39 UTC 2011'  ts='Fri Sep 09 20:00:39 UTC 2011' flowId:'default-flow' source:'rpcSource( 9108 )' sink:'agentBESink( "logs-2", 9109 )' }
11/09/09 20:01:11 INFO thrift.ThriftEventSource: Starting blocking thread pool server on port 9108...
11/09/09 20:01:11 ERROR server.TSaneThreadPoolServer: Error occurred during listening.
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9108.
	at org.apache.thrift.transport.TSaneServerSocket.bind(TSaneServerSocket.java:110)
	at org.apache.thrift.transport.TSaneServerSocket.listen(TSaneServerSocket.java:116)
	at org.apache.thrift.server.TSaneThreadPoolServer.start(TSaneThreadPoolServer.java:162)
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.open(ThriftEventSource.java:151)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:87)
11/09/09 20:01:11 INFO thrift.ThriftEventSink: ThriftEventSink to logs-2:9109 opened
11/09/09 20:01:11 INFO debug.InsistentOpenDecorator: Opened StubbornAppend on try 0

So I tried re-executing the config for this node, and it basically does the same thing, it throws a bunch of exceptions, then says it connects, but doesn't receive or send any events.

11/09/09 20:02:11 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 1 heartbeats
11/09/09 20:02:16 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 2 heartbeats
11/09/09 20:02:21 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
11/09/09 20:02:26 WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 4 heartbeats
11/09/09 20:02:31 ERROR agent.LogicalNode: Forcing driver to exit uncleanly
11/09/09 20:02:31 ERROR connector.DirectDriver: Closing down due to exception during append calls
java.io.IOException: Waiting for queue element was interrupted! null
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:222)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:105)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:209)
	... 1 more
11/09/09 20:02:31 INFO agent.LogicalNode: Node config successfully set to FlumeConfigData: {srcVer:'Fri Sep 09 20:01:57 UTC 2011' snkVer:'Fri Sep 09 20:01:57 UTC 2011'  ts='Fri Sep 09 20:01:57 UTC 2011' flowId:'default-flow' source:'rpcSource( 9108 )' sink:'agentBESink( "logs-2", 9109 )' }
11/09/09 20:02:31 INFO thrift.ThriftEventSource: Starting blocking thread pool server on port 9108...
11/09/09 20:02:31 INFO connector.DirectDriver: Connector logicalNode api-29-28 exited with error: Waiting for queue element was interrupted! null
java.io.IOException: Waiting for queue element was interrupted! null
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:222)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:105)
Caused by: java.lang.InterruptedException
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
	at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:424)
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.next(ThriftEventSource.java:209)
	... 1 more
11/09/09 20:02:31 INFO thrift.ThriftEventSource: Closed server on port 9108...
11/09/09 20:02:31 INFO thrift.ThriftEventSource: Queue still has 0 elements ...
11/09/09 20:02:31 INFO thrift.ThriftEventSink: ThriftEventSink on port 9109 closed
11/09/09 20:02:31 ERROR connector.DirectDriver: Exiting driver logicalNode api-29-28 in error state ThriftEventSource | Agent because Waiting for queue element was interrupted! null
11/09/09 20:02:31 ERROR server.TSaneThreadPoolServer: Error occurred during listening.
org.apache.thrift.transport.TTransportException: Could not create ServerSocket on address 0.0.0.0/0.0.0.0:9108.
	at org.apache.thrift.transport.TSaneServerSocket.bind(TSaneServerSocket.java:110)
	at org.apache.thrift.transport.TSaneServerSocket.listen(TSaneServerSocket.java:116)
	at org.apache.thrift.server.TSaneThreadPoolServer.start(TSaneThreadPoolServer.java:162)
	at com.cloudera.flume.handlers.thrift.ThriftEventSource.open(ThriftEventSource.java:151)
	at com.cloudera.flume.core.connector.DirectDriver$PumperThread.run(DirectDriver.java:87)
11/09/09 20:02:31 INFO thrift.ThriftEventSink: ThriftEventSink to logs-2:9109 opened







> When using rpcSource flume SILENTLY fails when you update the node config via zookeeper
> ---------------------------------------------------------------------------------------
>
>                 Key: FLUME-626
>                 URL: https://issues.apache.org/jira/browse/FLUME-626
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v0.9.3
>         Environment: yum install
>            Reporter: Disabled imported user
>             Fix For: v0.9.5
>
>         Attachments: error_loop.txt
>
>
> Sometimes, when you update a node's config via zookeeper, that node will going into an infinite error loop (but it doesn't ERROR, or print any error messages).
> This seems to happen when that node is running an rpcSource(). It loops: open port, close port
> See attached file as a sample of debugging output.
> This is another MASSIVE reliability issue when using the rpc source. There is no way to really monitor this.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira