Posted to users@activemq.apache.org by "Hendley, Sam" <Sa...@sensus.com> on 2015/02/20 21:43:09 UTC

Faster detection of failure on network connections in "cable pull" scenario

We are doing "break testing" on our clustered deployment and are running into an issue with ActiveMQ being slow to notice a node that has been "cable pulled". We are seeking advice on how best to configure the connections between the brokers so this error is noticed quickly.

We have a producer box and two consumer boxes connected with a network connector with the options below. Our producer is enqueuing messages to a distributed queue that is serviced by both of the consumer boxes. Normally traffic is round-robined between the two consumer boxes correctly. In all of these tests we are producing messages at a constant rate.

Producer configuration:
<networkConnector name="producer-to-consumer1" uri="static:tcp://consumer1:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000" />
<networkConnector name="producer-to-consumer2" uri="static:tcp://consumer2:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000" />

Consumer configuration:
<networkConnector name="consumer-to-produce" uri="static:tcp://producerbox:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000" />

If we do a "cable pull" on one of the consumer boxes, it can take many minutes before the broker notices that the connection is down. Eventually we did get a failure from the writeTimeoutFilter, but instead of coming after two seconds as expected, the failure didn't arrive for nearly 5 minutes. When it finally does trip, all of the messages that had been enqueued for the bad consumer are correctly resent to the good consumer and all future traffic is switched over to the good consumer.

Below is the log from the producer side. We pulled the cable at 14:20:36, and the expected inactivity failure on the "reverse bridge" from the consumer broker came a few seconds later. Our "forward bridge" doesn't fail for around 5 minutes.

2015-02-20 14:20:41,798 | WARN  | Transport Connection to: tcp://10.22.19.206:44208 failed: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>3000) long: tcp://10.22.19.206:44208 | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ InactivityMonitor Worker
2015-02-20 14:25:15,276 | WARN  | Forced write timeout for:tcp://10.22.19.206:61616 | org.apache.activemq.transport.WriteTimeoutFilter | WriteTimeoutFilter-Timeout-1
2015-02-20 14:25:15,278 | WARN  | Caught an exception processing local command | org.apache.activemq.network.DemandForwardingBridgeSupport | ActiveMQ BrokerService[scaleha-gw2] Task-37
java.net.SocketException: Socket closed
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:121)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:159)
        at org.apache.activemq.transport.tcp.TcpBufferedOutputStream.flush(TcpBufferedOutputStream.java:115)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.activemq.transport.tcp.TcpTransport.oneway(TcpTransport.java:176)
        at org.apache.activemq.transport.AbstractInactivityMonitor.doOnewaySend(AbstractInactivityMonitor.java:304)
        at org.apache.activemq.transport.AbstractInactivityMonitor.oneway(AbstractInactivityMonitor.java:286)
        at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
        at org.apache.activemq.transport.WireFormatNegotiator.oneway(WireFormatNegotiator.java:104)
        at org.apache.activemq.transport.TransportFilter.oneway(TransportFilter.java:85)
        at org.apache.activemq.transport.WriteTimeoutFilter.oneway(WriteTimeoutFilter.java:73)
        at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
        at org.apache.activemq.transport.ResponseCorrelator.asyncRequest(ResponseCorrelator.java:81)
        at org.apache.activemq.network.DemandForwardingBridgeSupport.serviceLocalCommand(DemandForwardingBridgeSupport.java:997)
        at org.apache.activemq.network.DemandForwardingBridgeSupport$2.onCommand(DemandForwardingBridgeSupport.java:206)
        at org.apache.activemq.transport.ResponseCorrelator.onCommand(ResponseCorrelator.java:116)
        at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
        at org.apache.activemq.transport.vm.VMTransport.doDispatch(VMTransport.java:138)
        at org.apache.activemq.transport.vm.VMTransport.dispatch(VMTransport.java:130)
        at org.apache.activemq.transport.vm.VMTransport.oneway(VMTransport.java:107)
        at org.apache.activemq.transport.MutexTransport.oneway(MutexTransport.java:68)
        at org.apache.activemq.transport.ResponseCorrelator.oneway(ResponseCorrelator.java:60)
        at org.apache.activemq.broker.TransportConnection.dispatch(TransportConnection.java:1370)
        at org.apache.activemq.broker.TransportConnection.processDispatch(TransportConnection.java:889)
        at org.apache.activemq.broker.TransportConnection.iterate(TransportConnection.java:935)
        at org.apache.activemq.thread.PooledTaskRunner.runTask(PooledTaskRunner.java:133)
        at org.apache.activemq.thread.PooledTaskRunner$1.run(PooledTaskRunner.java:48)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

I believe I know what is occurring, but we don't know how to fix it. We believe the soWriteTimeout filter isn't firing because each socket write is actually succeeding and returning quickly; the data is being queued into the Linux socket buffers, and the write doesn't block until those buffers fill. We validated this by watching netstat as we ran the test: the Send-Q slowly filled up as we ran traffic, and once it stopped growing the socket write timeout tripped a few seconds later.

tcp        0        0 10.22.19.205:39466          10.22.19.206:61616          ESTABLISHED   - // normal
tcp        0     7653 10.22.19.205:39466          10.22.19.206:61616          ESTABLISHED   - // early after cable pull
tcp        0    53400 10.22.19.205:39466          10.22.19.206:61616          ESTABLISHED   - // two minutes after cable pull
tcp        0   129489 10.22.19.205:39466          10.22.19.206:61616          FIN_WAIT1     - // final state after writeTimeout fires
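
For what it's worth, this kernel-buffering behavior is easy to reproduce outside of ActiveMQ with a plain socket. The standalone sketch below (hypothetical code, not ActiveMQ; host and port are just placeholders from our setup) writes in a loop while you pull the cable mid-run; each write returns immediately until the send buffer fills, and only then does a write finally block:

import java.io.OutputStream;
import java.net.Socket;

public class SendBufferDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; in our test this would be the "pulled" consumer box.
        try (Socket socket = new Socket("consumer1", 61616)) {
            OutputStream out = socket.getOutputStream();
            byte[] chunk = new byte[1024];
            long total = 0;
            while (true) {
                long start = System.nanoTime();
                out.write(chunk);  // returns immediately while the kernel send buffer has room
                out.flush();
                total += chunk.length;
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.println(total + " bytes handed to the kernel, last write took " + ms + " ms");
                Thread.sleep(100); // once the buffer is full, the write above blocks instead
            }
        }
    }
}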

I tried to rerun this same experiment without any messages flowing and the detection took 15 minutes. I think this took much longer because the number of bytes being written to the socket was much lower so we didn't fill the buffer as quickly.

I was expecting the InactivityMonitor timeout to fire to protect us from this case, but I think it considers the connection to be active every time it dispatches a message to that consumer, so that timer never fires either.

Normally this is handled with an application-level timeout. Presumably in this case we should be waiting for the acknowledgement of receipt of the message from the other broker. The stats appear to show the messages as dispatched but not dequeued while we keep the consumer box off the network. I was expecting that the inactivity timer was looking for the successful completion of a message, not just its send.

Is there a setting somewhere I am missing? It seems like this should be a relatively common failure mode; maybe other people have enough data flowing that they fill those buffers incredibly quickly? We have investigated using keepAlive, but in general those timeouts are still too slow for our purposes.

Sam Hendley


Re: Faster detection of failure on network connections in "cable pull" scenario

Posted by Tim Bain <tb...@alumni.duke.edu>.
I would have thought that there would be keepalives sent in both directions
on any connection irrespective of whether it's used for sending or
receiving messages, and that the lack of receipt of them on that connection
would cause it to be declared dead.  If the former is true, then there's a bug in
the detection algorithm (and a missing unit test!); if the former isn't
true, then I'd ask why not (and why the wiki page at
http://activemq.apache.org/activemq-inactivitymonitor.html is so
misleading/inaccurate) and propose that we implement it that way.  Handling
this directly with the inactivity monitor sounds better than the checks you
proposed, since it falls more cleanly into the paradigms ActiveMQ is
already using.

Can you test whether keepalives are being sent to the producer side of your
connection while the network cable is connected, so we know which of the
two scenarios I described above is the accurate one?

Tim


RE: Faster detection of failure on network connections in "cable pull" scenario

Posted by Sam Hendley <Sa...@sensus.com>.
We have been trying to run with the "small socketBuffer" strategy, but this
hasn't been as successful as I was hoping. Apparently the socketBuffer size
we are attempting to configure is more of a suggestion than an absolute
setting. Even when I configure it down to 2048 or 1024, Linux will actually
allow the socket buffer to contain up to 12k of data before blocking any
writes. Since my packets are small and my clients throttle, it can take
minutes before I have generated enough data to cause the socket write to
actually block long enough to trigger the WriteTimeoutFilter.
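
A quick standalone check (hypothetical code, not ActiveMQ; host and port are
just placeholders) shows what I mean; the kernel treats the requested size as
a hint, and on Linux it typically reports back a larger effective value:

import java.net.InetSocketAddress;
import java.net.Socket;

public class SendBufferSizeCheck {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket()) {
            socket.setSendBufferSize(2048);  // ask for a tiny send buffer
            socket.connect(new InetSocketAddress("consumer1", 61616));
            // Linux generally doubles the request and enforces a minimum, so the
            // effective buffer is larger than what was asked for.
            System.out.println("requested 2048, kernel reports " + socket.getSendBufferSize());
        }
    }
}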

I have been looking over the ActiveMQ source code and think the behavior
could be adjusted to react faster to this problem. Using
DemandForwardingBridge as my example, I think we could instrument the
enqueueCounter and dequeueCounter to look for an inactivity gap larger
than X seconds between an enqueue and any dequeue.

Basically I think the logic needs to be (rough sketch below):
* On dequeue: if the enqueue count is still > 0, reset the timeout to now;
if it is 0, clear the timeout
* On enqueue: if a timeout is already set and it is older than the max,
break the connection; if no timeout is set, set it to the current time
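
Something like this rough sketch is what I have in mind (hypothetical class
and method names, none of this exists in ActiveMQ; the real change would
hang off the bridge's enqueue/dequeue counters):

import java.util.concurrent.atomic.AtomicLong;

// Illustrative only: watchdog for the enqueue/dequeue gap described above.
public class BridgeAckWatchdog {
    private final long maxGapMillis;
    private final AtomicLong outstanding = new AtomicLong(); // enqueued but not yet dequeued
    private volatile long timeoutStart = 0;                  // 0 means no timer running

    public BridgeAckWatchdog(long maxGapMillis) {
        this.maxGapMillis = maxGapMillis;
    }

    // Called when the bridge forwards (enqueues) a message to the remote broker.
    public void onEnqueue() {
        outstanding.incrementAndGet();
        long started = timeoutStart;
        if (started == 0) {
            timeoutStart = System.currentTimeMillis();       // start the timer
        } else if (System.currentTimeMillis() - started > maxGapMillis) {
            // Stand-in for "break the connection" so the bridge reconnects.
            throw new IllegalStateException("no dequeue seen within " + maxGapMillis + " ms");
        }
    }

    // Called when the remote broker acknowledges (dequeues) a message.
    public void onDequeue() {
        if (outstanding.decrementAndGet() > 0) {
            timeoutStart = System.currentTimeMillis();       // still waiting on more, reset
        } else {
            timeoutStart = 0;                                // nothing outstanding, clear
        }
    }
}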

It's not clear to me if this would fit in any of the current abstractions.
In the DemandForwardingBridge this looks like it might only work if
asyncSend was set to true?

I am open to other suggestions; this is our last big issue in our HA testing,
so I am eager and willing to try other alternatives.

Thanks again, 

Sam Hendley



--
View this message in context: http://activemq.2283324.n4.nabble.com/Faster-detection-of-failure-on-network-connections-in-cable-pull-scenario-tp4691835p4693362.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.

RE: Faster detection of failure on network connections in "cable pull" scenario

Posted by "Hendley, Sam" <Sa...@sensus.com>.
Hi, thanks for looking at this. I don't think I was clear enough in my description of the issue. 

Each of our brokers has a network connector configured to all of the other brokers. This means there are two connections between each pair of brokers. We have found that the outbound (to a remote broker) direction will always log with a port number of 61616, and the inbound direction will use an ephemeral port. That means this log message is on the read side, since the port is ephemeral.

> 2015-02-20 14:20:41,798 | WARN  | Transport Connection to: tcp://10.22.19.206:44208 failed: org.apache.activemq.transport.InactivityIOException: Channel was inactive for too (>3000) long: tcp://10.22.19.206:44208 | org.apache.activemq.broker.TransportConnection.Transport | ActiveMQ InactivityMonitor Worker

So we find that the "inbound" connection drops at the expected time, but the outbound connection stays alive for a long time.

By "successful completion of a message" I was hoping that the inactivity timer was tied to the "ACKs" from the remote broker. I know that the responses are being tracked since all of the normal "bookkeeping" on the piped message works (they get Dispatched but not Dequeued from their queues). I had been hoping that the inactivity monitor would be watching for the responses from the remote side to determine if the connection was inactive or not. However it seems to treat every attempted write we are doing as proof that the connection is "active", I would have thought we would wait for the resonses before 

In the meantime we found a workaround by setting the "socketBufferSize" to a very small value (2048). This meant that the socket buffer filled up very quickly, and the detection time went from 5 minutes to around 10 seconds. This obviously has throughput ramifications, so we are still looking for a better solution.
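
For reference, the workaround is just the socketBufferSize option added to the existing connector URI, something like this (shown for one of the producer-side connectors from above):

<networkConnector name="producer-to-consumer1" uri="static:tcp://consumer1:61616?soTimeout=2000&amp;soWriteTimeout=2000&amp;wireFormat.maxInactivityDuration=3000&amp;socketBufferSize=2048" />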

Thanks for your input, hopefully this is a bit clearer.

Sam Hendley

Re: Faster detection of failure on network connections in "cable pull" scenario

Posted by Tim Bain <tb...@alumni.duke.edu>.
It sounds to me like what's going on is that the inactivity interval is
tripping but either 1) the inactivity exception doesn't trigger the bridge
to try to re-establish the connection, or 2) it tries to re-establish the
connection but the connection establishment attempt send() succeeds (going
into the OS socket buffer but not onto the network card) and nothing
notices that it's been more than an acceptable amount of time without a
response from the remote broker.  Can you try to determine which of those
it is by stepping through the code with a debugger?
org.apache.activemq.transport.AbstractInactivityMonitor.readCheck() is
where you would want the breakpoint, in the Runnable's call to
onException().

Also, as far as I can tell, the inactivity interval algorithm correctly
considers only reads to be an indication of connection liveness, so unless
I'm missing something in the code of AbstractInactivityMonitor.readCheck(),
I don't think your theory that the inactivity monitor isn't firing is
right.  (After all, the log you posted shows that it does in fact fire as
expected at 2015-02-20 14:20:41,798; the problem is simply what happens
after that.)

I'm not at all clear what you meant by "I was expecting that the inactivity
timer was looking for the successful completion of a message"; if what you
meant wasn't addressed by what I wrote above and is still relevant, please
clarify the question.

Tim
