Posted to server-dev@james.apache.org by "René Cordier (Jira)" <se...@james.apache.org> on 2023/11/03 07:33:00 UTC

[jira] [Created] (JAMES-3955) James stops consuming sometimes RabbitMQ queue

René Cordier created JAMES-3955:
-----------------------------------

             Summary: James stops consuming sometimes RabbitMQ queue
                 Key: JAMES-3955
                 URL: https://issues.apache.org/jira/browse/JAMES-3955
             Project: James Server
          Issue Type: Improvement
          Components: rabbitmq
            Reporter: René Cordier


We have occasionally had trouble with RabbitMQ in some production environments: James would stop consuming some queues (such as the mail queue), we never really understood why, and we would simply restart James when it happened.

Recently I hit a similar issue, this time with TaskManagerWorkQueue, except that we managed to reproduce the problem manually. We have a task that runs at night and can take a long time to complete. With some other tasks scheduled after it, we could observe the following pattern:

While James executes the heavy task, the other tasks pile up in the TaskManagerWorkQueue. James holds them unacked, meaning it is telling RabbitMQ that it will consume them later (as James executes one task at a time). However, 30 minutes after the first item in the queue became unacked, we could see James stop consuming the queue, and all items went back to the ready state.
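The timing in this pattern can be sketched with a few lines of Java. This is only an illustration of the arithmetic, not James code: the class `AckDeadline`, its method, and the sample timestamps are hypothetical, and the 30-minute value is the timeout observed above.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper for illustration only; not part of James or the RabbitMQ client.
final class AckDeadline {
    // Acknowledgement timeout observed in this report (30 minutes).
    static final Duration OBSERVED_TIMEOUT = Duration.ofMinutes(30);

    // The channel is at risk once the OLDEST unacked delivery exceeds the timeout,
    // so the earliest possible close is "oldest delivery time + timeout".
    static Instant channelDeadline(List<Instant> deliveredAt, Duration timeout) {
        return deliveredAt.stream()
            .min(Instant::compareTo)
            .map(oldest -> oldest.plus(timeout))
            .orElseThrow(() -> new IllegalArgumentException("no unacked deliveries"));
    }

    public static void main(String[] args) {
        // Made-up delivery times for three tasks held unacked by the consumer.
        Instant first = Instant.parse("2023-11-02T22:00:00Z");
        List<Instant> unacked = List.of(first, first.plusSeconds(600), first.plusSeconds(1200));

        // prints 2023-11-02T22:30:00Z
        System.out.println(channelDeadline(unacked, OBSERVED_TIMEOUT));
    }
}
```

The later deliveries do not matter here: the oldest unacked item alone determines when the channel can be closed, which is consistent with all items returning to the ready state at once.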

The RabbitMQ documentation explains this behaviour: [https://www.rabbitmq.com/consumers.html#acknowledgement-timeout]

RabbitMQ closes the channel with a `PRECONDITION_FAILED` channel exception when it detects that a delivery (here, the first unacked item) has not been acknowledged within 30 minutes. This matches what we observed.
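A minimal sketch of how such a close could be told apart from other `PRECONDITION_FAILED` errors, for instance to log it distinctly. The class and method names are hypothetical. The reply code 406 is the AMQP 0-9-1 code for `PRECONDITION_FAILED`; the matching on the reply text is an assumption about the broker's wording and should be checked against real broker logs. With the official Java client, the code and text would typically be read from the `AMQP.Channel.Close` method carried by a `ShutdownSignalException`; the sketch takes them as plain parameters so it stays self-contained.

```java
import java.util.Locale;

// Hypothetical classifier for illustration; not part of James or the RabbitMQ client.
final class ChannelCloseClassifier {
    // AMQP 0-9-1 reply code for PRECONDITION_FAILED.
    static final int PRECONDITION_FAILED = 406;

    // Returns true when a channel close looks like an acknowledgement timeout.
    // ASSUMPTION: the reply text mentions "acknowledgement" and "timed out".
    static boolean isAckTimeout(int replyCode, String replyText) {
        if (replyCode != PRECONDITION_FAILED || replyText == null) {
            return false;
        }
        String text = replyText.toLowerCase(Locale.ROOT);
        return text.contains("acknowledgement") && text.contains("timed out");
    }

    public static void main(String[] args) {
        // Illustrative sample text, not copied from a real incident.
        String sample = "PRECONDITION_FAILED - delivery acknowledgement on channel 1 timed out";
        if (isAckTimeout(406, sample)) {
            System.err.println("Channel closed by broker after consumer ack timeout: " + sample);
        }
    }
}
```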

From this, I guess we can deduce that when we had the similar issue with the mail queue, James may have failed to consume a message properly, or failed to acknowledge it for some reason, and RabbitMQ closed the channel as a result.

From there, there are some actions we can take to prevent this:
 * adding error logs when the channel gets closed on such an exception
 * trying to reconnect to the channel when such an exception occurs
 * doing so at least on important queues such as the task manager queue, mail queue and event bus
 * potentially auditing whether there are cases where we do not ack/nack the message back
 * making it possible to increase the consumer timeout of the above queues with the `x-consumer-timeout` queue argument (this would require running RabbitMQ 3.12 at least)
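For the last point, a minimal sketch of building the queue arguments. The `x-consumer-timeout` value is expressed in milliseconds. The class name, method name, and the one-hour value are hypothetical, and how James would expose this in its configuration is left open. With the official Java client, such a map would be passed as the `arguments` parameter of `Channel.queueDeclare(queue, durable, exclusive, autoDelete, arguments)`.

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical helper for illustration; not existing James code.
final class ConsumerTimeoutArguments {
    // Builds queue arguments carrying a per-queue consumer timeout in milliseconds.
    static Map<String, Object> withConsumerTimeout(Duration timeout) {
        if (timeout.isZero() || timeout.isNegative()) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        return Map.<String, Object>of("x-consumer-timeout", Long.valueOf(timeout.toMillis()));
    }

    public static void main(String[] args) {
        // One hour is an arbitrary example value.
        // prints {x-consumer-timeout=3600000}
        System.out.println(withConsumerTimeout(Duration.ofHours(1)));
    }
}
```

One thing to keep in mind: redeclaring an existing queue with different arguments is itself rejected by RabbitMQ, so applying this to queues that already exist would probably need a migration step.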

For now, we can also increase that timeout in rabbitmq.conf to mitigate the problem.
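As an example of that workaround, the broker-wide setting is `consumer_timeout`, expressed in milliseconds. The one-hour value below is only an arbitrary illustration; the right value depends on how long the longest task can run.

```ini
# rabbitmq.conf
# Acknowledgement timeout for all consumers, in milliseconds (example: one hour)
consumer_timeout = 3600000
```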



--
This message was sent by Atlassian Jira
(v8.20.10#820010)