You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flume.apache.org by "Mingjie Lai (Created) (JIRA)" <ji...@apache.org> on 2011/10/20 08:29:10 UTC

[jira] [Created] (FLUME-808) Source/sink.close() is not invoked by DirectDriver.

Source/sink.close() is not invoked by DirectDriver.
---------------------------------------------------

                 Key: FLUME-808
                 URL: https://issues.apache.org/jira/browse/FLUME-808
             Project: Flume
          Issue Type: Bug
          Components: Node, Sinks+Sources
    Affects Versions: v0.9.4
            Reporter: Mingjie Lai


I just noticed an issue that DirectDriver doesn't really call the sink/source.close() at all, if flume node gets reconfigured. 

Very easy to reproduce: 
1. start a master, and node at the same machine
2. configure the node thru master, with any configure, e.g.: tail("/tmp/aaa"), null
3. reconfigure the node thru master: collectorSource(12345), null

The source.close() didn't get called at all for reconfigure, or rolled. And later on, Driver will decide to interrupt the thread since it seems not be able to be closed gracefully. 

{code}
2011-10-19 22:46:03,646 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 1 heartbeats
2011-10-19 22:46:08,648 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 2 heartbeats
2011-10-19 22:46:13,650 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
2011-10-19 22:46:18,652 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 4 heartbeats
2011-10-19 22:46:23,648 [Check config] ERROR agent.LogicalNode: Forcing driver to exit uncleanly
2011-10-19 22:46:23,648 [logicalNode c1-19] ERROR connector.DirectDriver: Closing down due to exception during append calls
2011-10-19 22:46:23,648 [Check config] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@39a2f02e
2011-10-19 22:46:23,648 [logicalNode c1-19] INFO connector.DirectDriver: Connector logicalNode c1-19 exited with error: Waiting for queue element was interrupted! null
{code}

It may be brought by Flume-596 which is a big refactor of Driver. I tried to git-reset to 21b74010c34cef9a977c75ab5dec4dc747d8f5aa, and cannot reproduce the problem. 

Expected result -- source should be close()'ed
{code}
2011-10-19 23:20:45,811 [SpawningLogicalNode c1] INFO collector.CollectorSource: closed
2011-10-19 23:20:45,812 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Closed server on port 35853...
2011-10-19 23:20:45,817 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
2011-10-19 23:20:45,852 [logicalNode c1-20] INFO collector.CollectorSource: closed
2011-10-19 23:20:45,852 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
2011-10-19 23:20:45,853 [logicalNode c1-20] INFO collector.CollectorSource: closed
2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
2011-10-19 23:20:45,853 [logicalNode c1-20] INFO agent.LogicalNode: c1: Connector stopped: CollectorSource | NullSink
2011-10-19 23:20:45,853 [SpawningLogicalNode c1] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@2682d210
2011-10-19 23:20:45,864 [logicalNode c1-23] INFO agent.LogicalNode: Connector started: TailSource | NullSink

{code}

It might be the root cause of FLUME-798, and related to all the recent Interrupted exception discussion on user@.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (FLUME-808) Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.

Posted by "Mingjie Lai (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/FLUME-808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mingjie Lai updated FLUME-808:
------------------------------

    Summary: Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.  (was: Source/sink.close() is not invoked by DirectDriver.)
    
> Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.
> --------------------------------------------------------------------------------
>
>                 Key: FLUME-808
>                 URL: https://issues.apache.org/jira/browse/FLUME-808
>             Project: Flume
>          Issue Type: Bug
>          Components: Node, Sinks+Sources
>    Affects Versions: v0.9.4
>            Reporter: Mingjie Lai
>
> I just noticed an issue that DirectDriver doesn't really call the sink/source.close() at all, if flume node gets reconfigured. 
> Very easy to reproduce: 
> 1. start a master, and node at the same machine
> 2. configure the node thru master, with any configure, e.g.: tail("/tmp/aaa"), null
> 3. reconfigure the node thru master: collectorSource(12345), null
> The source.close() didn't get called at all for reconfigure, or rolled. And later on, Driver will decide to interrupt the thread since it seems not be able to be closed gracefully. 
> {code}
> 2011-10-19 22:46:03,646 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 1 heartbeats
> 2011-10-19 22:46:08,648 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 2 heartbeats
> 2011-10-19 22:46:13,650 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
> 2011-10-19 22:46:18,652 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 4 heartbeats
> 2011-10-19 22:46:23,648 [Check config] ERROR agent.LogicalNode: Forcing driver to exit uncleanly
> 2011-10-19 22:46:23,648 [logicalNode c1-19] ERROR connector.DirectDriver: Closing down due to exception during append calls
> 2011-10-19 22:46:23,648 [Check config] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@39a2f02e
> 2011-10-19 22:46:23,648 [logicalNode c1-19] INFO connector.DirectDriver: Connector logicalNode c1-19 exited with error: Waiting for queue element was interrupted! null
> {code}
> It may be brought by Flume-596 which is a big refactor of Driver. I tried to git-reset to 21b74010c34cef9a977c75ab5dec4dc747d8f5aa, and cannot reproduce the problem. 
> Expected result -- source should be close()'ed
> {code}
> 2011-10-19 23:20:45,811 [SpawningLogicalNode c1] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,812 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,817 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO agent.LogicalNode: c1: Connector stopped: CollectorSource | NullSink
> 2011-10-19 23:20:45,853 [SpawningLogicalNode c1] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@2682d210
> 2011-10-19 23:20:45,864 [logicalNode c1-23] INFO agent.LogicalNode: Connector started: TailSource | NullSink
> {code}
> It might be the root cause of FLUME-798, and related to all the recent Interrupted exception discussion on user@.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (FLUME-808) Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.

Posted by "jiraposter@reviews.apache.org (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132362#comment-13132362 ] 

jiraposter@reviews.apache.org commented on FLUME-808:
-----------------------------------------------------


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/2523/
-----------------------------------------------------------

Review request for Flume.


Summary
-------

DirectDriver's lifecycle management is in one thread. It indicate that DirectDriver cannot close source/sink gracefully if the source.next() or sink.append() is blocking. Before the patch, in most of the cases, source and sink's close() relies on an interruption exception by DirectDriver.cancel(). 

The patch moved source.close() and sink.close() from the pumper thread to the main thread of DirectDriver. 

Put the patch for review. Expect to see more comments. 


This addresses bug Flume-808.
    https://issues.apache.org/jira/browse/Flume-808


Diffs
-----

  flume-core/src/main/java/com/cloudera/flume/core/connector/DirectDriver.java 79eb67a 
  flume-core/src/test/java/com/cloudera/flume/core/connector/TestDirectDriver.java 63a0e38 

Diff: https://reviews.apache.org/r/2523/diff


Testing
-------

TestDirectDriver passed locally. But haven't run a full tests against the trunk. I hope we can fix jenkins and make sure trunk can pass tests firstly. 


Thanks,

Mingjie


                
> Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.
> --------------------------------------------------------------------------------
>
>                 Key: FLUME-808
>                 URL: https://issues.apache.org/jira/browse/FLUME-808
>             Project: Flume
>          Issue Type: Bug
>          Components: Node, Sinks+Sources
>    Affects Versions: v0.9.4
>            Reporter: Mingjie Lai
>
> I just noticed an issue that DirectDriver doesn't really call the sink/source.close() at all, if flume node gets reconfigured. 
> Very easy to reproduce: 
> 1. start a master, and node at the same machine
> 2. configure the node thru master, with any configure, e.g.: tail("/tmp/aaa"), null
> 3. reconfigure the node thru master: collectorSource(12345), null
> The source.close() didn't get called at all for reconfigure, or rolled. And later on, Driver will decide to interrupt the thread since it seems not be able to be closed gracefully. 
> {code}
> 2011-10-19 22:46:03,646 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 1 heartbeats
> 2011-10-19 22:46:08,648 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 2 heartbeats
> 2011-10-19 22:46:13,650 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
> 2011-10-19 22:46:18,652 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 4 heartbeats
> 2011-10-19 22:46:23,648 [Check config] ERROR agent.LogicalNode: Forcing driver to exit uncleanly
> 2011-10-19 22:46:23,648 [logicalNode c1-19] ERROR connector.DirectDriver: Closing down due to exception during append calls
> 2011-10-19 22:46:23,648 [Check config] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@39a2f02e
> 2011-10-19 22:46:23,648 [logicalNode c1-19] INFO connector.DirectDriver: Connector logicalNode c1-19 exited with error: Waiting for queue element was interrupted! null
> {code}
> It may be brought by Flume-596 which is a big refactor of Driver. I tried to git-reset to 21b74010c34cef9a977c75ab5dec4dc747d8f5aa, and cannot reproduce the problem. 
> Expected result -- source should be close()'ed
> {code}
> 2011-10-19 23:20:45,811 [SpawningLogicalNode c1] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,812 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,817 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO agent.LogicalNode: c1: Connector stopped: CollectorSource | NullSink
> 2011-10-19 23:20:45,853 [SpawningLogicalNode c1] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@2682d210
> 2011-10-19 23:20:45,864 [logicalNode c1-23] INFO agent.LogicalNode: Connector started: TailSource | NullSink
> {code}
> It might be the root cause of FLUME-798, and related to all the recent Interrupted exception discussion on user@.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (FLUME-808) Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.

Posted by "Mingjie Lai (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/FLUME-808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13132149#comment-13132149 ] 

Mingjie Lai commented on FLUME-808:
-----------------------------------

After looked at the source code, I found that the DirectDriver's lifecycle management is within one thread. It indicate that DirectDriver cannot close source/sink gracefully if the next() or append() is blocking. And most of source/sinks are blocking. 

The user guide's explanation makes more sense:

...close() This call is usually made from a separate thread than the open() or next() calls...

An easy fix is to move source/sink.append() to main thread, to give source/sink.close() a chance to exit next() gracefully. However it seems brings impact to some existing source/sink which rely on interruption exception to close it. 
                
> Source/sink.close() is not invoked by DirectDriver if source.next() is blocking.
> --------------------------------------------------------------------------------
>
>                 Key: FLUME-808
>                 URL: https://issues.apache.org/jira/browse/FLUME-808
>             Project: Flume
>          Issue Type: Bug
>          Components: Node, Sinks+Sources
>    Affects Versions: v0.9.4
>            Reporter: Mingjie Lai
>
> I just noticed an issue that DirectDriver doesn't really call the sink/source.close() at all, if flume node gets reconfigured. 
> Very easy to reproduce: 
> 1. start a master, and node at the same machine
> 2. configure the node thru master, with any configure, e.g.: tail("/tmp/aaa"), null
> 3. reconfigure the node thru master: collectorSource(12345), null
> The source.close() didn't get called at all for reconfigure, or rolled. And later on, Driver will decide to interrupt the thread since it seems not be able to be closed gracefully. 
> {code}
> 2011-10-19 22:46:03,646 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 1 heartbeats
> 2011-10-19 22:46:08,648 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 2 heartbeats
> 2011-10-19 22:46:13,650 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 3 heartbeats
> 2011-10-19 22:46:18,652 [Heartbeat] WARN agent.LivenessManager: Heartbeats are backing up, currently behind by 4 heartbeats
> 2011-10-19 22:46:23,648 [Check config] ERROR agent.LogicalNode: Forcing driver to exit uncleanly
> 2011-10-19 22:46:23,648 [logicalNode c1-19] ERROR connector.DirectDriver: Closing down due to exception during append calls
> 2011-10-19 22:46:23,648 [Check config] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@39a2f02e
> 2011-10-19 22:46:23,648 [logicalNode c1-19] INFO connector.DirectDriver: Connector logicalNode c1-19 exited with error: Waiting for queue element was interrupted! null
> {code}
> It may be brought by Flume-596 which is a big refactor of Driver. I tried to git-reset to 21b74010c34cef9a977c75ab5dec4dc747d8f5aa, and cannot reproduce the problem. 
> Expected result -- source should be close()'ed
> {code}
> 2011-10-19 23:20:45,811 [SpawningLogicalNode c1] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,812 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,817 [SpawningLogicalNode c1] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,852 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO collector.CollectorSource: closed
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Closed server on port 35853...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO thrift.ThriftEventSource: Queue still has 0 elements ...
> 2011-10-19 23:20:45,853 [logicalNode c1-20] INFO agent.LogicalNode: c1: Connector stopped: CollectorSource | NullSink
> 2011-10-19 23:20:45,853 [SpawningLogicalNode c1] INFO agent.LogicalNode: Node config successfully set to com.cloudera.flume.conf.FlumeConfigData@2682d210
> 2011-10-19 23:20:45,864 [logicalNode c1-23] INFO agent.LogicalNode: Connector started: TailSource | NullSink
> {code}
> It might be the root cause of FLUME-798, and related to all the recent Interrupted exception discussion on user@.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira