Posted to dev@qpid.apache.org by "Daniil Kirilyuk (Jira)" <ji...@apache.org> on 2021/04/06 14:34:00 UTC

[jira] [Commented] (QPID-8514) High CPU usage and stuck connections

    [ https://issues.apache.org/jira/browse/QPID-8514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315616#comment-17315616 ] 

Daniil Kirilyuk commented on QPID-8514:
---------------------------------------

[~orudyy],

According to the thread dump, it seems that for some reason a queue entry in the DeadLetterQueue gets stuck in the ACQUIRED state, preventing RingOverflowPolicyHandler from deleting it. The handler keeps looping, calling AbstractQueue.deleteEntry(), but this call fails to acquire the queue entry (since it is already acquired, probably by another thread) and performs no deletion. It appears to be a rare concurrency issue which we were not able to reproduce in our dev environments. Colleagues were able to trigger the problem on a test environment, but could not reproduce it again either (according to them, a similar issue was seen about 5-6 months ago). So unfortunately we do not have a stable test case for this issue, but we would like to develop a fix because of the serious impact on broker performance and stability.

 Since the root cause remains unknown, a possible fix could be to detect the looping inside RingOverflowPolicyHandler.checkOverflow() and delete the queue entry directly via entry.delete(), as sketched below.
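
 For illustration only, here is a rough sketch of what such a loop-detection safeguard could look like. The helper names queueDepthExceeded(), getOldestEntry() and MAX_DELETE_ATTEMPTS, as well as the isAcquired() check on the entry, are hypothetical placeholders used only to convey the idea, not the actual broker code:

{noformat}
// Hypothetical sketch only -- not the actual RingOverflowPolicyHandler code.
// queueDepthExceeded(), getOldestEntry() and MAX_DELETE_ATTEMPTS are made-up
// names used purely to illustrate bounding the retry loop.
int failedAttempts = 0;
while (queueDepthExceeded())
{
    final QueueEntry oldest = getOldestEntry();
    if (!oldest.isAcquired())
    {
        // Normal case: the entry is not acquired and can be removed as usual.
        _queue.deleteEntry(oldest);
        failedAttempts = 0;
    }
    else if (++failedAttempts > MAX_DELETE_ATTEMPTS)
    {
        // The entry appears stuck in the ACQUIRED state, so deleteEntry()
        // keeps failing to acquire it and the overflow check would otherwise
        // spin forever. Delete the entry directly as a last resort.
        oldest.delete();
        failedAttempts = 0;
    }
}
{noformat}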

 Could you perhaps give us a few hints regarding further analysis directions / possible fix ideas?

> High CPU usage and stuck connections
> ------------------------------------
>
>                 Key: QPID-8514
>                 URL: https://issues.apache.org/jira/browse/QPID-8514
>             Project: Qpid
>          Issue Type: Bug
>          Components: Broker-J
>    Affects Versions: qpid-java-broker-8.0.4
>            Reporter: Daniil Kirilyuk
>            Priority: Minor
>         Attachments: threaddump.zip
>
>
> There were around 600 connections in the CLOSE_WAIT state: 
> {noformat}
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep TIME_WAIT | wc -l
> 38
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep CLOSE_WAIT | wc -l
> 625
> [prod01@fdmchd log]$ netstat -Ainet | grep 22101 | grep -v CLOSE_WAIT | grep -V TIME_WAIT | wc -l
> 7
> {noformat}
> CPU usage rises to 90% and higher.
> Connections could not be closed via the web GUI. The broker log contains the message below, but the connections were not closed.
> {noformat}
> 2021-03-26 15:12:28,474 INFO  [Broker-Config] (q.m.c.model_delete) - [mng:3hpU+miq(admin@/127.0.0.1:62743)] [con:751(TEST@/127.0.0.1:40422/default)] CON-1007 : Connection close initiated by operator
> {noformat}
> Thread dumps show around 49 threads with the same stack trace related to the ring queue:
> {noformat}
> java.lang.Thread.State: RUNNABLE
> 	at org.apache.qpid.server.queue.QueueEntryImpl.addStateChangeListener(QueueEntryImpl.java:674)
> 	at org.apache.qpid.server.queue.QueueEntryImpl.acquireOrSteal(QueueEntryImpl.java:316)
> 	at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1913)
> 	at org.apache.qpid.server.queue.AbstractQueue.deleteEntry(AbstractQueue.java:1863)
> 	at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.checkOverflow(RingOverflowPolicyHandler.java:102)
> 	at org.apache.qpid.server.queue.RingOverflowPolicyHandler$Handler.access$000(RingOverflowPolicyHandler.java:44)
> 	at org.apache.qpid.server.queue.RingOverflowPolicyHandler.checkOverflow(RingOverflowPolicyHandler.java:41)
> 	at org.apache.qpid.server.queue.AbstractQueue.doEnqueue(AbstractQueue.java:1382)
> 	at org.apache.qpid.server.queue.AbstractQueue.enqueue(AbstractQueue.java:1290)
> 	at org.apache.qpid.server.message.RoutingResult$1.postCommit(RoutingResult.java:135)
> 	at org.apache.qpid.server.txn.AsyncAutoCommitTransaction$3.postCommit(AsyncAutoCommitTransaction.java:309)
> 	at org.apache.qpid.server.txn.AsyncCommand.complete(AsyncCommand.java:84)
> 	at org.apache.qpid.server.protocol.v1_0.StandardReceivingLinkEndpoint.receiveComplete(StandardReceivingLinkEndpoint.java:588)
> 	at org.apache.qpid.server.protocol.v1_0.Session_1_0.lambda$receivedComplete$5(Session_1_0.java:1343)
> 	at org.apache.qpid.server.protocol.v1_0.Session_1_0$$Lambda$142/959267941.accept(Unknown Source)
> 	at java.lang.Iterable.forEach(Iterable.java:75)
> 	at org.apache.qpid.server.protocol.v1_0.Session_1_0.receivedComplete(Session_1_0.java:1343)
> 	at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.lambda$receivedComplete$11(AMQPConnection_1_0Impl.java:1349)
> 	at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl$$Lambda$141/1774823837.run(Unknown Source)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.receivedComplete(AMQPConnection_1_0Impl.java:1347)
> 	at org.apache.qpid.server.protocol.v1_0.AMQPConnection_1_0Impl.onReceive(AMQPConnection_1_0Impl.java:1328)
> 	at org.apache.qpid.server.transport.AbstractAMQPConnection.lambda$received$2(AbstractAMQPConnection.java:576)
> 	at org.apache.qpid.server.transport.AbstractAMQPConnection$$Lambda$119/6380061.run(Unknown Source)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at org.apache.qpid.server.transport.AbstractAMQPConnection.received(AbstractAMQPConnection.java:571)
> 	at org.apache.qpid.server.transport.MultiVersionProtocolEngine.received(MultiVersionProtocolEngine.java:135)
> 	at org.apache.qpid.server.transport.NonBlockingConnection.processAmqpData(NonBlockingConnection.java:611)
> 	at org.apache.qpid.server.transport.NonBlockingConnectionPlainDelegate.processData(NonBlockingConnectionPlainDelegate.java:58)
> 	at org.apache.qpid.server.transport.NonBlockingConnection.doRead(NonBlockingConnection.java:496)
> 	at org.apache.qpid.server.transport.NonBlockingConnection.doWork(NonBlockingConnection.java:270)
> 	at org.apache.qpid.server.transport.NetworkConnectionScheduler.processConnection(NetworkConnectionScheduler.java:134)
> 	at org.apache.qpid.server.transport.SelectorThread$ConnectionProcessor.processConnection(SelectorThread.java:575)
> 	at org.apache.qpid.server.transport.SelectorThread$SelectionTask.performSelect(SelectorThread.java:366)
> 	at org.apache.qpid.server.transport.SelectorThread$SelectionTask.run(SelectorThread.java:97)
> 	at org.apache.qpid.server.transport.SelectorThread.run(SelectorThread.java:533)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory.lambda$null$0(QpidByteBufferFactory.java:464)
> 	at org.apache.qpid.server.bytebuffer.QpidByteBufferFactory$$Lambda$10/1487500813.run(Unknown Source)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
> Also the ring queue behaves strangely: the queue {{DeadLetterQueue}} has a maximum message depth of {{1}}, but there were 55 messages in the {{Acquired}} state.
>  First message in {{DeadLetterQueue}} - {{2021-03-25, 16:50:59.381 (00:00 UTC)}}.
>  Last message in {{DeadLetterQueue}} - {{2021-03-25, 16:50:59.683 (00:00 UTC)}}.
> These times correlate with the drain request that timed out at 16:52:00,410, because the default drain request timeout is 60 seconds:
> {noformat}
> ^[[0m^[[0m2021-03-25 16:52:00,408 INFO  [org.apache.qpid.jms.JmsSession] (QpidJMS Connection Executor: ID:986cd915-d072-4711-bd71-50a74aab4be8:1) A JMS MessageConsumer has been closed: JmsConsumerInfo: { ID:986cd915-d072-4711-bd71-50a74aab4be8:1:1:1, destination = broadcast.XX.PRISMA_LIVERiskLimitTrigger }
> ^[[0m^[[33m2021-03-25 16:52:00,410 WARN  [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #3 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger' - trying to recover. Cause: Remote did not respond to a drain request in time
> ^[[0m^[[33m2021-03-25 16:52:01,414 WARN  [org.apache.camel.component.jms.DefaultJmsMessageListenerContainer] (Camel (camel-1) thread #6 - JmsConsumer[broadcast.XX.PRISMA_LIVERiskLimitTrigger]) Setup of JMS message listener invoker failed for destination 'broadcast.XX.PRISMA_LIVERiskLimitTrigger'
> {noformat}
> After the problem was triggered, the broker started to reject client connections:
> {noformat}
>  2021-03-25 17:55:43,716 INFO  [Broker-Config] (q.m.c.open) - [con:901(/127.0.0.1:56012)] CON-1001 : Open : Destination : amqp(127.0.0.1:22101) : Protocol Version : 1.0
> 2021-03-25 17:55:43,745 INFO  [IO-/127.0.0.1:56012] (q.m.c.open) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1001 : Open : Destination : amqp(172.16.75.197:22101) : Protocol Version : 1.0 : Client ID : ID:af50d087-0610-406a-9792-ffa5bf6b68b4:5 : Client Version : 0.51.0 : Client Product : QpidJMS
> 2021-03-25 17:55:43,747 INFO  [IO-/127.0.0.1:56012] (q.m.a.denied) - [con:901(ARP@/127.0.0.1:56012/default)] ACL-1002 : Denied : Access Virtualhost NAME=default,VIRTUALHOST_NAME=default
> 2021-03-25 17:55:43,755 INFO  [Broker-Config] (q.m.c.dropped_connection) - [con:901(ARP@/127.0.0.1:56012/default)] CON-1004 : Connection dropped
> {noformat}
> The issue seems to be quite rare: the same test was run in a loop over several days and the issue was not triggered.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org