Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2021/06/14 02:59:00 UTC

[jira] [Created] (JAMES-3599) Improve the design of the RabbitMQ eventbus

Benoit Tellier created JAMES-3599:
-------------------------------------

             Summary: Improve the design of the RabbitMQ eventbus
                 Key: JAMES-3599
                 URL: https://issues.apache.org/jira/browse/JAMES-3599
             Project: James Server
          Issue Type: Task
          Components: mailbox, rabbitmq
    Affects Versions: 3.6.0
            Reporter: Benoit Tellier
             Fix For: 3.7.0


Mailing list discussion: https://www.mail-archive.com/server-dev@james.apache.org/msg70437.html

I spent some time digging into RabbitMQ performance and stability.


Weeks ago, I was surprised to discover the amount of work performed by
the play.json library, and could not quite explain why it was hogging
3% of CPU time, making it the biggest CPU consumer for mailbox events.
RabbitMQ acks account for another 1.20% of CPU time.

Investigating the RabbitMQ eventbus, I realized that events are routed
to all group queues, where they are dispatched and deserialized, then
applied only if relevant.

Given 200 events/s, and given that the JMAP server has 10 groups, we end
up deserializing 2000 events/s, even when an event is irrelevant to the
group receiving it.
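To make the arithmetic above concrete, here is a minimal Python model of
the routing as described, assuming JSON payloads (a stand-in for the
play.json serialization) and hypothetical group names. It does not talk
to RabbitMQ; it only counts how many deserializations one second of
traffic costs when every group queue receives its own copy.

```python
import json
from collections import Counter
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class Group:
    """Hypothetical listener group: a name plus the event types it cares about."""
    name: str
    relevant_types: FrozenSet[str]


def fanout_current_design(raw_events: Iterable[str], groups: List[Group]) -> Tuple[int, Counter]:
    """Model of the current routing: each serialized event is copied to every
    group queue, and each group deserializes its copy before deciding whether
    the event is relevant."""
    deserializations = 0
    applied: Counter = Counter()
    for raw in raw_events:
        for group in groups:            # one queued message per (event, group) pair
            event = json.loads(raw)     # cost paid even if the group ignores the event
            deserializations += 1
            if event["type"] in group.relevant_types:
                applied[group.name] += 1
    return deserializations, applied


# One second of traffic at 200 events/s, with 10 groups of which only one cares.
groups = [Group("group-0", frozenset({"Added"}))] + [
    Group(f"group-{i}", frozenset({"Expunged"})) for i in range(1, 10)
]
events = [json.dumps({"type": "Added"}) for _ in range(200)]
count, applied = fanout_current_design(events, groups)
print(count)               # 2000
print(applied["group-0"])  # 200
```

Only 200 of the 2000 deserializations lead to a listener actually
applying the event in this toy setup; the rest are pure overhead.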

As I recall, we wanted the event per group to be the unit of retry.
A noble design goal.

I think parallelizing groups is a non-goal: this kind of optimization
would not improve response time, as the processing is asynchronous and
runs in the background, and it makes little sense at thousands of
requests per second.

However, ending up with one message per group queue for each event is
likely sub-optimal. I think the design can be improved by transmitting,
in the nominal case, only one message to all groups. The receiver then
tries to execute the listeners of all groups. We can keep retries for
individual groups (with their dedicated exchanges and queues): upon
failure, we republish to the retry exchange of the failing listener.
This makes the upgrade path easy too, as the group queue keeps being
consumed. One would just need to remove some bindings...
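A minimal Python sketch of the proposed nominal path, assuming JSON
payloads and an in-memory stand-in for the per-group retry exchanges.
`Listener`, `RetryExchanges` and `consume_nominal` are hypothetical
names, not James or RabbitMQ APIs. The sketch shows the payload being
deserialized once per event, and a listener failure being confined to a
republish onto that group's retry exchange.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Listener:
    group: str                           # hypothetical group name
    handle: Callable[[dict], None]


@dataclass
class RetryExchanges:
    """In-memory stand-in for the per-group retry exchanges (not a RabbitMQ client)."""
    published: Dict[str, List[str]] = field(default_factory=dict)

    def republish(self, group: str, raw: str) -> None:
        self.published.setdefault(group, []).append(raw)


def consume_nominal(raw: str, listeners: List[Listener], retries: RetryExchanges) -> List[str]:
    """Proposed nominal path: a single message carries the event to all groups.

    The payload is deserialized once, every listener runs, and a failure only
    republishes the event to the failing group's retry exchange, so the
    event-per-group unit of retry is preserved. Returns the failed groups."""
    event = json.loads(raw)              # one deserialization instead of one per group
    failed = []
    for listener in listeners:
        try:
            listener.handle(event)
        except Exception:
            retries.republish(listener.group, raw)
            failed.append(listener.group)
    return failed


seen = []
listeners = [
    Listener("indexing", lambda event: seen.append(event["type"])),
    Listener("quota", lambda event: 1 / 0),  # simulated listener failure
]
retries = RetryExchanges()
raw = json.dumps({"type": "Added"})
print(consume_nominal(raw, listeners, retries))  # ['quota']
print(retries.published)                         # {'quota': ['{"type": "Added"}']}
```

Consuming the retry queues, retry limits and dead-lettering are left out
of this sketch; they would stay per group, as in the existing design.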

Note that such an evolution would:
 - also enable us, if we want, to enforce some execution order for
listeners, opening the way to fix things like JAMES-3561
<https://issues.apache.org/jira/browse/JAMES-3561> ...
 - serve as an inspiration for future eventBus implementations like the
Pulsar one; hence getting feedback on the existing design is, IMO,
useful.
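The ordering point in the first bullet can be pictured as follows: once
a single consumer runs every group's listener for an event, it can run
them in a declared order, whereas independent per-group queues consume
their copies with no cross-group ordering. This Python sketch assumes a
hypothetical integer `order` key per listener and hypothetical
"metadata"/"index" listeners; it is not James' listener API and is not
presented as the fix for JAMES-3561.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class OrderedListener:
    group: str                       # hypothetical group name
    order: int                       # hypothetical priority key: lower runs first
    handle: Callable[[dict], None]


def dispatch_in_order(event: dict, listeners: List[OrderedListener]) -> List[str]:
    """Run all listeners for one event, sorted by `order`, and return the groups
    in execution order. Error handling is omitted for brevity."""
    executed = []
    for listener in sorted(listeners, key=lambda listener: listener.order):
        listener.handle(event)
        executed.append(listener.group)
    return executed


log: List[str] = []


def write_metadata(event: dict) -> None:
    log.append(f"metadata:{event['id']}")


def update_index(event: dict) -> None:
    # Relies on write_metadata having already run for this event.
    assert f"metadata:{event['id']}" in log
    log.append(f"index:{event['id']}")


listeners = [
    OrderedListener("index", 2, update_index),       # declared first, but runs second
    OrderedListener("metadata", 1, write_metadata),
]
print(dispatch_in_order({"id": 42}, listeners))  # ['metadata', 'index']
print(log)                                       # ['metadata:42', 'index:42']
```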

I will create a JIRA ticket holding the design proposal (schema),
describing how it differs from the previous one, along with some
RabbitMQ management screenshots.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
