Posted to notifications@james.apache.org by GitBox <gi...@apache.org> on 2022/01/10 04:30:12 UTC

[GitHub] [james-project] Arsnael commented on a change in pull request #829: [ADR] 51. Pulsar MailQueue

Arsnael commented on a change in pull request #829:
URL: https://github.com/apache/james-project/pull/829#discussion_r780885140



##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 

Review comment:
       ```suggestion
   Distributed James currently ships a distributed MailQueue composing the following software with the following 
   ```

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 
+responsibilities:
+
+ - **RabbitMQ** for messaging. A rabbitMQ consumer will trigger dequeue operations.
+ - A time series projection of the queue content (order by time list of mail metadata) will be maintained in **Cassandra** . 
+ Time series avoid the aforementioned tombstone anti-pattern, and no polling is performed on this projection.
+ - **ObjectStorage** (Swift or S3) holds large byte content. This avoids overwhelming other software which do not scale
+ as well in term of Input/Output operation per seconds.
+ 
+This implementation suffers from the following pitfall:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Cluster queues were only added in the 3.8 release line. Consistency 
+ guaranties for exchanges are unclear.
+ - The RabbitMQ Java driver is boiler plate and error prone. Things like retries, 
+ exponential back-offs, dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster aware and would operate connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that are never restarted.
+ - Throughput and scalability of RabbitMQ is questionable.
+ - The current implementation do not support priorities, delays.
+ - The current implementation is known to be complex, hard to maintain, with some non-obvious tradeoffs.
+
+### A few words about Apache Pulsar
+
+Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It is horizontally scalable, low latency 
+with durability, persistent, multi-tenant, geo replicated. The count of topics can reach several millions, making it suitable
+for all queuing usages existing in James, including the one of the Event Bus (cf [ADR 37](0037-eventbus.md) and
+[ADR 38](0038-distributed-eventbus.md)).
+
+Pulsar supports advanced features like delayed messages, priorities, for instance, making it suitable to a MailQueue 
+implementation.
+
+Helm charts to ease deployments are available.
+
+Pulsar is however complex to deploy and relies on the following components:
+
+ - Stateless brokers
+ - Bookies (Bookkeeper) maintaining the persistent log of messages
+ - ZooKeeper quorum used for cluster-level configuration and coordination
+ 
+This would make it suitable for large to very-large deployments or PaaS.
+
+The Pulsar SDK is handy and handle natively reactive calls, retries, dead lettering, making implementation less 

Review comment:
       ```suggestion
   The Pulsar SDK is handy and handles natively reactive calls, retries, dead lettering, making implementation less 
   ```

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 
+responsibilities:
+
+ - **RabbitMQ** for messaging. A rabbitMQ consumer will trigger dequeue operations.
+ - A time series projection of the queue content (order by time list of mail metadata) will be maintained in **Cassandra** . 
+ Time series avoid the aforementioned tombstone anti-pattern, and no polling is performed on this projection.
+ - **ObjectStorage** (Swift or S3) holds large byte content. This avoids overwhelming other software which do not scale
+ as well in term of Input/Output operation per seconds.
+ 
+This implementation suffers from the following pitfall:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Cluster queues were only added in the 3.8 release line. Consistency 
+ guaranties for exchanges are unclear.
+ - The RabbitMQ Java driver is boiler plate and error prone. Things like retries, 
+ exponential back-offs, dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster aware and would operate connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that are never restarted.
+ - Throughput and scalability of RabbitMQ is questionable.
+ - The current implementation do not support priorities, delays.
+ - The current implementation is known to be complex, hard to maintain, with some non-obvious tradeoffs.
+
+### A few words about Apache Pulsar
+
+Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It is horizontally scalable, low latency 
+with durability, persistent, multi-tenant, geo replicated. The count of topics can reach several millions, making it suitable
+for all queuing usages existing in James, including the one of the Event Bus (cf [ADR 37](0037-eventbus.md) and
+[ADR 38](0038-distributed-eventbus.md)).
+
+Pulsar supports advanced features like delayed messages, priorities, for instance, making it suitable to a MailQueue 
+implementation.
+
+Helm charts to ease deployments are available.
+
+Pulsar is however complex to deploy and relies on the following components:
+
+ - Stateless brokers
+ - Bookies (Bookkeeper) maintaining the persistent log of messages
+ - ZooKeeper quorum used for cluster-level configuration and coordination
+ 
+This would make it suitable for large to very-large deployments or PaaS.
+
+The Pulsar SDK is handy and handle natively reactive calls, retries, dead lettering, making implementation less 
+boiler plate.
+
+## Decision
+
+Provide a distributed mail queue implemented on top of Pulsar for email metadata, using the blobStore to store email 
+content.
+
+Package this mail queue in a simple artifact dedicated to distributed mail processing.
+
+## Consequences
+
+We expect an easier to operate, cheaper, more reliable MailQueue. 
+
+We expect delays being supported as well.
+
+## Complementary work
+
+Pulsar technology would benefit from a broader adoption in James, eventually becoming the de-facto standard solution 
+backing Apache James messaging capabilities.
+
+To reach this status the following work needs to be under-taken:
+ - The Pulsar MailQueue need to work on top of a deduplicated blob store. To do this we need to be able to list blobs 

Review comment:
       ```suggestion
    - The Pulsar MailQueue needs to work on top of a deduplicated blob store. To do this we need to be able to list blobs 
   ```

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 
+responsibilities:
+
+ - **RabbitMQ** for messaging. A rabbitMQ consumer will trigger dequeue operations.
+ - A time series projection of the queue content (order by time list of mail metadata) will be maintained in **Cassandra** . 
+ Time series avoid the aforementioned tombstone anti-pattern, and no polling is performed on this projection.
+ - **ObjectStorage** (Swift or S3) holds large byte content. This avoids overwhelming other software which do not scale
+ as well in term of Input/Output operation per seconds.
+ 
+This implementation suffers from the following pitfall:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Cluster queues were only added in the 3.8 release line. Consistency 
+ guaranties for exchanges are unclear.
+ - The RabbitMQ Java driver is boiler plate and error prone. Things like retries, 
+ exponential back-offs, dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster aware and would operate connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that are never restarted.
+ - Throughput and scalability of RabbitMQ is questionable.
+ - The current implementation do not support priorities, delays.
+ - The current implementation is known to be complex, hard to maintain, with some non-obvious tradeoffs.
+
+### A few words about Apache Pulsar
+
+Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It is horizontally scalable, low latency 
+with durability, persistent, multi-tenant, geo replicated. The count of topics can reach several millions, making it suitable
+for all queuing usages existing in James, including the one of the Event Bus (cf [ADR 37](0037-eventbus.md) and
+[ADR 38](0038-distributed-eventbus.md)).
+
+Pulsar supports advanced features like delayed messages, priorities, for instance, making it suitable to a MailQueue 
+implementation.
+
+Helm charts to ease deployments are available.
+
+Pulsar is however complex to deploy and relies on the following components:
+
+ - Stateless brokers
+ - Bookies (Bookkeeper) maintaining the persistent log of messages
+ - ZooKeeper quorum used for cluster-level configuration and coordination
+ 
+This would make it suitable for large to very-large deployments or PaaS.
+
+The Pulsar SDK is handy and handle natively reactive calls, retries, dead lettering, making implementation less 
+boiler plate.
+
+## Decision
+
+Provide a distributed mail queue implemented on top of Pulsar for email metadata, using the blobStore to store email 
+content.
+
+Package this mail queue in a simple artifact dedicated to distributed mail processing.
+
+## Consequences
+
+We expect an easier to operate, cheaper, more reliable MailQueue. 

Review comment:
       ```suggestion
   We expect an easier way to operate a cheaper and more reliable MailQueue.
   ```

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 
+responsibilities:
+
+ - **RabbitMQ** for messaging. A rabbitMQ consumer will trigger dequeue operations.
+ - A time series projection of the queue content (order by time list of mail metadata) will be maintained in **Cassandra** . 
+ Time series avoid the aforementioned tombstone anti-pattern, and no polling is performed on this projection.
+ - **ObjectStorage** (Swift or S3) holds large byte content. This avoids overwhelming other software which do not scale
+ as well in term of Input/Output operation per seconds.
+ 
+This implementation suffers from the following pitfall:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Cluster queues were only added in the 3.8 release line. Consistency 
+ guaranties for exchanges are unclear.
+ - The RabbitMQ Java driver is boiler plate and error prone. Things like retries, 
+ exponential back-offs, dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster aware and would operate connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that are never restarted.
+ - Throughput and scalability of RabbitMQ is questionable.
+ - The current implementation do not support priorities, delays.
+ - The current implementation is known to be complex, hard to maintain, with some non-obvious tradeoffs.
+
+### A few words about Apache Pulsar
+
+Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It is horizontally scalable, low latency 
+with durability, persistent, multi-tenant, geo replicated. The count of topics can reach several millions, making it suitable
+for all queuing usages existing in James, including the one of the Event Bus (cf [ADR 37](0037-eventbus.md) and
+[ADR 38](0038-distributed-eventbus.md)).
+
+Pulsar supports advanced features like delayed messages, priorities, for instance, making it suitable to a MailQueue 
+implementation.
+
+Helm charts to ease deployments are available.
+
+Pulsar is however complex to deploy and relies on the following components:
+
+ - Stateless brokers
+ - Bookies (Bookkeeper) maintaining the persistent log of messages
+ - ZooKeeper quorum used for cluster-level configuration and coordination
+ 
+This would make it suitable for large to very-large deployments or PaaS.
+
+The Pulsar SDK is handy and handle natively reactive calls, retries, dead lettering, making implementation less 
+boiler plate.
+
+## Decision
+
+Provide a distributed mail queue implemented on top of Pulsar for email metadata, using the blobStore to store email 
+content.
+
+Package this mail queue in a simple artifact dedicated to distributed mail processing.
+
+## Consequences
+
+We expect an easier to operate, cheaper, more reliable MailQueue. 
+
+We expect delays being supported as well.
+
+## Complementary work
+
+Pulsar technology would benefit from a broader adoption in James, eventually becoming the de-facto standard solution 
+backing Apache James messaging capabilities.
+
+To reach this status the following work needs to be under-taken:
+ - The Pulsar MailQueue need to work on top of a deduplicated blob store. To do this we need to be able to list blobs 
+ referenced by the Pulsar MailQueue, see [JIRA-XXXX](TODO).
+ - The event bus (described in [ADR 37](0037-eventbus.md)) would benefit from a Pulsar implementation, replacing the 
+ existing RabbitMQ one (described in [ADR-38](0038-distributed-eventbus.md)). See [JIRA-XXXX](TODO).
+ - While being less critical, a task manager implementation would be needed as well to replace the RabbitMQ one
+ described in [ADR 2](0002-make-taskmanager-distributed.md) [ADR 3](0003-distributed-workqueue.md) 
+ [ADR 4](0004-distributed-tasks-listing.md) [ADR 5](0005-distributed-task-termination-ackowledgement.md) 
+ [ADR 6](0006-task-serialization.md) [ADR 7](0007-distributed-task-cancellation.md) 
+ [ADR 8](0008-distributed-task-await.md), eventually allowing to drop the RabbitMQ technology all-together.
+
+We could then create a new artifact relying solely on Pulsar, and deprecate the RabbitMQ based artifact.
+
+Priorities are not yet supported by the current implementation. See [JIRA-XXXX](TODO).
+
+A bug regarding clear not purging delayed messages had been 
+[reported](https://github.com/apache/james-project/pull/808#discussion_r780162174) as well.
+
+A broader adoption of Pulsar would benefit from performance insights.
+
+This work could be continued, for instance under the form of a Google Summer of Code for 2022.
+
+## Technical details
+
+[[This section requires a deep review]]
+
+[Akka](https://akka.io/) actor system is used in single node mode as a processing framework.
+
+The MailQueue relies on the following topology:
+
+ - out topic :  contains the mail that are ready to be dequeued.
+ - scheduled topic: emails that are delayed are first enqueued there.
+ - filter topic: Deletions (name, sender, recipients) prior a given sequence are synchronized between nodes using this topic.
+
+Upon enqueue, the blobs are first saved, then the Pulsar message payload is generated and published to the relevant 
+topic (out or scheduled).
+
+Scheduled messages have their `deliveredAt` property set to the desired value. When the delay is 
+expired, the message will be consumed and thus moved to the out topic. Flushes simply copy content of the scheduled
+topic to the out topic then reset the offset of the scheduled queue, atomically. Expired filters are removed.
+
+note that in current versions of pulsar there is a scheduled job that handles scheduled messages, the accuracy of scheduling is limited by the frequency at which this job runs.
+
+
+The size of the mail queue can be simply computed from the out and scheduled topics.
+
+Upon deletes, the condition of this deletion, as well as the sequence before which it applies is synchronized across
+nodes an in-memory datastructures wrapped in an actor. Each instance uses a unique subscription and thus will maintain a
+set of all deletions ever performed.
+
+Upon dequeues, messages of the out topic are filtered using that in-memory data structure, then exposed as a reactive 
+publisher.
+
+Upon browsing, both the out and scheduled topic are read from the consumption offset and filtering is applied.
+
+Upon clear, the out topic is deleted.
+
+
+Miscellaneous remarks:
+
+ - The pulsar admin client is used to list existing queues and to move  the current offset of scheduled message subscription upon flushes.
+ - Priorities are not yet supported.
+ - Only metadata transit through Pulsar. The general purpose James blobStore, backed by a S3 compatible API, is used to

Review comment:
       ```suggestion
    - Only metadata transit through Pulsar. The general purpose of James blobStore, backed by a S3 compatible API, is used to
   ```
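
For readers following the "Technical details" hunk above, the interplay between the scheduled and out topics can be modelled in a few lines. This is only an in-memory sketch under stated assumptions: it does not use the Pulsar client, and `InMemoryMailQueue`, `enqueue`, `tick` and `flush` are hypothetical names chosen for illustration, not identifiers from the James code base.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class InMemoryMailQueue:
    """Toy model of the out/scheduled topics; not the James implementation."""

    out: list = field(default_factory=list)        # mails ready to be dequeued
    scheduled: list = field(default_factory=list)  # heap of (deliver_at, seq, mail)
    _seq: int = 0                                  # tie-breaker keeping heap order stable

    def enqueue(self, mail: str, now: float, delay: float = 0.0) -> None:
        # Undelayed mail goes straight to "out"; delayed mail waits in "scheduled".
        if delay <= 0:
            self.out.append(mail)
        else:
            self._seq += 1
            heapq.heappush(self.scheduled, (now + delay, self._seq, mail))

    def tick(self, now: float) -> None:
        # Move every mail whose delay has expired from "scheduled" to "out".
        while self.scheduled and self.scheduled[0][0] <= now:
            _, _, mail = heapq.heappop(self.scheduled)
            self.out.append(mail)

    def flush(self) -> None:
        # Copy all scheduled mail to "out", ignoring the remaining delay.
        while self.scheduled:
            _, _, mail = heapq.heappop(self.scheduled)
            self.out.append(mail)

    def size(self) -> int:
        # The queue size is derived from both topics.
        return len(self.out) + len(self.scheduled)
```

In this model `tick` stands in for the periodic job mentioned in the ADR: the less often it is called, the less accurate the effective delivery time becomes.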

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 
+responsibilities:
+
+ - **RabbitMQ** for messaging. A rabbitMQ consumer will trigger dequeue operations.
+ - A time series projection of the queue content (order by time list of mail metadata) will be maintained in **Cassandra** . 
+ Time series avoid the aforementioned tombstone anti-pattern, and no polling is performed on this projection.
+ - **ObjectStorage** (Swift or S3) holds large byte content. This avoids overwhelming other software which do not scale
+ as well in term of Input/Output operation per seconds.
+ 
+This implementation suffers from the following pitfall:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Cluster queues were only added in the 3.8 release line. Consistency 
+ guaranties for exchanges are unclear.
+ - The RabbitMQ Java driver is boiler plate and error prone. Things like retries, 
+ exponential back-offs, dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster aware and would operate connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that are never restarted.
+ - Throughput and scalability of RabbitMQ is questionable.
+ - The current implementation do not support priorities, delays.
+ - The current implementation is known to be complex, hard to maintain, with some non-obvious tradeoffs.
+
+### A few words about Apache Pulsar
+
+Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It is horizontally scalable, low latency 
+with durability, persistent, multi-tenant, geo replicated. The count of topics can reach several millions, making it suitable
+for all queuing usages existing in James, including the one of the Event Bus (cf [ADR 37](0037-eventbus.md) and
+[ADR 38](0038-distributed-eventbus.md)).
+
+Pulsar supports advanced features like delayed messages, priorities, for instance, making it suitable to a MailQueue 
+implementation.
+
+Helm charts to ease deployments are available.
+
+Pulsar is however complex to deploy and relies on the following components:
+
+ - Stateless brokers
+ - Bookies (Bookkeeper) maintaining the persistent log of messages
+ - ZooKeeper quorum used for cluster-level configuration and coordination
+ 
+This would make it suitable for large to very-large deployments or PaaS.
+
+The Pulsar SDK is handy and handle natively reactive calls, retries, dead lettering, making implementation less 
+boiler plate.
+
+## Decision
+
+Provide a distributed mail queue implemented on top of Pulsar for email metadata, using the blobStore to store email 
+content.
+
+Package this mail queue in a simple artifact dedicated to distributed mail processing.
+
+## Consequences
+
+We expect an easier to operate, cheaper, more reliable MailQueue. 
+
+We expect delays being supported as well.
+
+## Complementary work
+
+Pulsar technology would benefit from a broader adoption in James, eventually becoming the de-facto standard solution 
+backing Apache James messaging capabilities.
+
+To reach this status the following work needs to be under-taken:
+ - The Pulsar MailQueue need to work on top of a deduplicated blob store. To do this we need to be able to list blobs 
+ referenced by the Pulsar MailQueue, see [JIRA-XXXX](TODO).
+ - The event bus (described in [ADR 37](0037-eventbus.md)) would benefit from a Pulsar implementation, replacing the 
+ existing RabbitMQ one (described in [ADR-38](0038-distributed-eventbus.md)). See [JIRA-XXXX](TODO).
+ - While being less critical, a task manager implementation would be needed as well to replace the RabbitMQ one
+ described in [ADR 2](0002-make-taskmanager-distributed.md) [ADR 3](0003-distributed-workqueue.md) 
+ [ADR 4](0004-distributed-tasks-listing.md) [ADR 5](0005-distributed-task-termination-ackowledgement.md) 
+ [ADR 6](0006-task-serialization.md) [ADR 7](0007-distributed-task-cancellation.md) 
+ [ADR 8](0008-distributed-task-await.md), eventually allowing to drop the RabbitMQ technology all-together.
+
+We could then create a new artifact relying solely on Pulsar, and deprecate the RabbitMQ based artifact.
+
+Priorities are not yet supported by the current implementation. See [JIRA-XXXX](TODO).
+
+A bug regarding clear not purging delayed messages had been 
+[reported](https://github.com/apache/james-project/pull/808#discussion_r780162174) as well.
+
+A broader adoption of Pulsar would benefit from performance insights.
+
+This work could be continued, for instance under the form of a Google Summer of Code for 2022.
+
+## Technical details
+
+[[This section requires a deep review]]
+
+[Akka](https://akka.io/) actor system is used in single node mode as a processing framework.
+
+The MailQueue relies on the following topology:
+
+ - out topic :  contains the mail that are ready to be dequeued.
+ - scheduled topic: emails that are delayed are first enqueued there.
+ - filter topic: Deletions (name, sender, recipients) prior a given sequence are synchronized between nodes using this topic.
+
+Upon enqueue, the blobs are first saved, then the Pulsar message payload is generated and published to the relevant 
+topic (out or scheduled).
+
+Scheduled messages have their `deliveredAt` property set to the desired value. When the delay is 
+expired, the message will be consumed and thus moved to the out topic. Flushes simply copy content of the scheduled
+topic to the out topic then reset the offset of the scheduled queue, atomically. Expired filters are removed.
+
+note that in current versions of pulsar there is a scheduled job that handles scheduled messages, the accuracy of scheduling is limited by the frequency at which this job runs.
+
+
+The size of the mail queue can be simply computed from the out and scheduled topics.
+
+Upon deletes, the condition of this deletion, as well as the sequence before which it applies is synchronized across
+nodes an in-memory datastructures wrapped in an actor. Each instance uses a unique subscription and thus will maintain a
+set of all deletions ever performed.
+
+Upon dequeues, messages of the out topic are filtered using that in-memory data structure, then exposed as a reactive 
+publisher.
+
+Upon browsing, both the out and scheduled topic are read from the consumption offset and filtering is applied.

Review comment:
       ```suggestion
   Upon browsing, both the out and scheduled topics are read from the consumption offset and filtering is applied.
   ```
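
As a reading aid for the dequeue and browse paragraphs quoted above, the deletion filtering can be sketched as follows. This is a simplified, in-memory illustration only: `Mail`, `DeletionFilter` and `visible` are hypothetical names, and treating "prior a given sequence" as a strict `<` comparison is an assumption rather than something the ADR states.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass(frozen=True)
class Mail:
    sequence: int  # position of the message in the topic (assumed monotonic)
    name: str
    sender: str


@dataclass(frozen=True)
class DeletionFilter:
    before_sequence: int              # the deletion applies to older messages only
    condition: Callable[[Mail], bool]  # e.g. match on name, sender or recipient

    def removes(self, mail: Mail) -> bool:
        # Assumption: "prior" means strictly before the recorded sequence.
        return mail.sequence < self.before_sequence and self.condition(mail)


def visible(mails: Iterable[Mail], filters: List[DeletionFilter]) -> Iterator[Mail]:
    """Yield the mails that no recorded deletion applies to."""
    for mail in mails:
        if not any(f.removes(mail) for f in filters):
            yield mail
```

A mail enqueued after a deletion was issued carries a higher sequence, so it is kept even when it matches the deletion condition.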

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+# 51. Pulsar MailQueue
+
+Date: 2022-01-07
+
+## Status
+
+Accepted (lazy consensus).
+
+Implemented.
+
+Provides an alternative to [ADR-31 Distributed MailQueue (RabbitMQ + Cassandra)](0031-distributed-mail-queue.md).
+
+## Context
+
+### Mail Queue
+
+MailQueue is a central component of SMTP infrastructure allowing asynchronous mail processing. This enables a short 
+SMTP reply time despite a potentially longer mail processing time. It also works as a buffer during SMTP peak workload
+to not overload a server. 
+
+Furthermore, when used as a Mail Exchange server (MX), the ability to add delays being observed before dequeing elements
+allows, among others:
+
+ - Delaying retries upon MX delivery failure to a remote site.
+ - Throttling, which could be helpful for not being considered a spammer.
+
+A mailqueue also enables advanced administration operations like traffic review, discarding emails, resetting wait 
+delays, purging the queue, etc.
+
+### Existing distributed MailQueue
+
+Distributed James currently ship a distributed MailQueue composing the following software with the following 
+responsibilities:
+
+ - **RabbitMQ** for messaging. A rabbitMQ consumer will trigger dequeue operations.
+ - A time series projection of the queue content (order by time list of mail metadata) will be maintained in **Cassandra** . 
+ Time series avoid the aforementioned tombstone anti-pattern, and no polling is performed on this projection.
+ - **ObjectStorage** (Swift or S3) holds large byte content. This avoids overwhelming other software which do not scale
+ as well in term of Input/Output operation per seconds.
+ 
+This implementation suffers from the following pitfall:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Cluster queues were only added in the 3.8 release line. Consistency 
+ guaranties for exchanges are unclear.
+ - The RabbitMQ Java driver is boilerplate-heavy and error-prone. Things like retries, 
+ exponential back-offs and dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster-aware and operates connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that were never restarted.
+ - Throughput and scalability of RabbitMQ are questionable.
+ - The current implementation do not support priorities, delays.
+ - The current implementation is known to be complex, hard to maintain, with some non-obvious tradeoffs.
+
+### A few words about Apache Pulsar
+
+Apache Pulsar is a cloud-native, distributed messaging and streaming platform. It is horizontally scalable, offers low
+latency with durable, persistent storage, and supports multi-tenancy and geo-replication. The count of topics can reach
+several million, making it suitable for all queuing usages existing in James, including that of the Event Bus
+(cf. [ADR 37](0037-eventbus.md) and [ADR 38](0038-distributed-eventbus.md)).
+
+Pulsar supports advanced features such as delayed messages and priorities, making it suitable for a MailQueue 
+implementation.
+
+Helm charts to ease deployments are available.
+
+Pulsar is however complex to deploy and relies on the following components:
+
+ - Stateless brokers
+ - Bookies (Bookkeeper) maintaining the persistent log of messages
+ - ZooKeeper quorum used for cluster-level configuration and coordination
+ 
+This would make it suitable for large to very-large deployments or PaaS.
+
+The Pulsar SDK is handy and natively handles reactive calls, retries and dead lettering, which reduces boilerplate 
+in the implementation.
+
+## Decision
+
+Provide a distributed mail queue implemented on top of Pulsar for email metadata, using the blobStore to store email 
+content.
+
+Package this mail queue in a simple artifact dedicated to distributed mail processing.
+
+## Consequences
+
+We expect an easier to operate, cheaper, more reliable MailQueue. 
+
+We expect delays to be supported as well.
+
+## Complementary work
+
+Pulsar technology would benefit from a broader adoption in James, eventually becoming the de-facto standard solution 
+backing Apache James messaging capabilities.
+
+To reach this status the following work needs to be undertaken:
+ - The Pulsar MailQueue needs to work on top of a deduplicated blob store. To do this we need to be able to list blobs 
+ referenced by the Pulsar MailQueue, see [JIRA-XXXX](TODO).
+ - The event bus (described in [ADR 37](0037-eventbus.md)) would benefit from a Pulsar implementation, replacing the 
+ existing RabbitMQ one (described in [ADR-38](0038-distributed-eventbus.md)). See [JIRA-XXXX](TODO).
+ - While less critical, a task manager implementation would be needed as well to replace the RabbitMQ one
+ described in [ADR 2](0002-make-taskmanager-distributed.md), [ADR 3](0003-distributed-workqueue.md), 
+ [ADR 4](0004-distributed-tasks-listing.md), [ADR 5](0005-distributed-task-termination-ackowledgement.md), 
+ [ADR 6](0006-task-serialization.md), [ADR 7](0007-distributed-task-cancellation.md) and 
+ [ADR 8](0008-distributed-task-await.md), eventually allowing us to drop the RabbitMQ technology altogether.
+
+We could then create a new artifact relying solely on Pulsar, and deprecate the RabbitMQ based artifact.
+
+Priorities are not yet supported by the current implementation. See [JIRA-XXXX](TODO).
+
+A bug regarding `clear` not purging delayed messages has also been 
+[reported](https://github.com/apache/james-project/pull/808#discussion_r780162174).
+
+A broader adoption of Pulsar would benefit from performance insights.
+
+This work could be continued, for instance in the form of a Google Summer of Code project for 2022.
+
+## Technical details
+
+[[This section requires a deep review]]
+
+The [Akka](https://akka.io/) actor system is used in single-node mode as a processing framework.
+
+The MailQueue relies on the following topology:
+
+ - out topic :  contains the mail that are ready to be dequeued.
+ - scheduled topic: emails that are delayed are first enqueued there.
+ - filter topic: deletions (name, sender, recipients) prior to a given sequence are synchronized between nodes using this topic.
+
+Upon enqueue, the blobs are first saved, then the Pulsar message payload is generated and published to the relevant 
+topic (out or scheduled).
+
+Scheduled messages have their `deliveredAt` property set to the desired value. Once the delay has 
+expired, the message is consumed and thus moved to the out topic. Flushes simply copy the content of the scheduled
+topic to the out topic, then reset the offset of the scheduled queue, atomically. Expired filters are removed.
+
+note that in current versions of pulsar there is a scheduled job that handles scheduled messages, the accuracy of scheduling is limited by the frequency at which this job runs.

Review comment:
       ```suggestion
   Note that in current versions of Pulsar, a scheduled job handles scheduled messages, so the accuracy of scheduling is limited by the frequency at which this job runs.
   ```
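
The topology and the scheduling-accuracy caveat discussed above can be illustrated with a minimal in-memory sketch. This is not the James or Pulsar implementation: the class `MailQueueSketch`, its methods and the `tickMillis` parameter are hypothetical names introduced only for illustration, and the periodic job is modelled by explicit calls to `tick`. It shows how a delayed mail first lands in the scheduled topic, is only moved to the out topic when the job runs, and how a flush moves everything at once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;

/** Hypothetical in-memory model of the "scheduled" and "out" topics; not James code. */
class MailQueueSketch {
    /** Mail name plus the instant (epoch millis) at which it becomes deliverable. */
    static final class Scheduled {
        final String mailName;
        final long deliverAtMillis;

        Scheduled(String mailName, long deliverAtMillis) {
            this.mailName = mailName;
            this.deliverAtMillis = deliverAtMillis;
        }
    }

    private final List<Scheduled> scheduledTopic = new ArrayList<>();
    private final Queue<String> outTopic = new ArrayDeque<>();
    private final long tickMillis; // assumed period of the job that moves expired messages

    public MailQueueSketch(long tickMillis) {
        this.tickMillis = tickMillis;
    }

    /** Enqueue: immediate mails go to "out", delayed ones go to "scheduled". */
    public void enqueue(String mailName, long nowMillis, long delayMillis) {
        if (delayMillis <= 0) {
            outTopic.add(mailName);
        } else {
            scheduledTopic.add(new Scheduled(mailName, nowMillis + delayMillis));
        }
    }

    /** One run of the periodic job: expired messages move from "scheduled" to "out". */
    public void tick(long nowMillis) {
        Iterator<Scheduled> it = scheduledTopic.iterator();
        while (it.hasNext()) {
            Scheduled s = it.next();
            if (s.deliverAtMillis <= nowMillis) {
                outTopic.add(s.mailName);
                it.remove();
            }
        }
    }

    /** Flush: copy everything from "scheduled" to "out", then empty "scheduled". */
    public void flush() {
        for (Scheduled s : scheduledTopic) {
            outTopic.add(s.mailName);
        }
        scheduledTopic.clear();
    }

    /** Dequeue only ever reads from the "out" topic; returns null when it is empty. */
    public String dequeue() {
        return outTopic.poll();
    }

    /** First job run (a multiple of tickMillis) at which a message can reach "out". */
    public long effectiveDeliveryMillis(long deliverAtMillis) {
        return ((deliverAtMillis + tickMillis - 1) / tickMillis) * tickMillis;
    }

    public static void main(String[] args) {
        MailQueueSketch queue = new MailQueueSketch(1000);
        queue.enqueue("mail-1", 0, 1500);
        queue.tick(1000);
        System.out.println(queue.dequeue()); // nothing yet: the delay has not expired
        queue.tick(2000);
        System.out.println(queue.dequeue()); // mail-1, seen at 2000 rather than at 1500
    }
}
```

With a job period of 1000 ms, a mail delayed by 1500 ms only becomes visible at the run that happens at 2000 ms, which is the loss of accuracy the note refers to.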

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+The MailQueue relies on the following topology:
+
+ - out topic :  contains the mail that are ready to be dequeued.

Review comment:
       ```suggestion
    - out topic :  contains the mails that are ready to be dequeued.
   ```

##########
File path: src/adr/0051-pulsar-mailqueue.md
##########
@@ -0,0 +1,168 @@
+This implementation suffers from the following pitfalls:
+
+ - **RabbitMQ** is hard to reliably operate in a cluster. Quorum queues were only added in the 3.8 release line. Consistency 
+ guarantees for exchanges are unclear.
+ - The RabbitMQ Java driver is boilerplate-heavy and error-prone. Things like retries, 
+ exponential back-offs and dead-lettering do not come out of the box. Publish confirms are tricky. Blocking calls are 
+ often performed. The driver is not cluster-aware and operates connected to a single host.
+ - The driver reliability is questionable: we experienced some crashed consumers that were never restarted.
+ - Throughput and scalability of RabbitMQ are questionable.
+ - The current implementation do not support priorities, delays.

Review comment:
       ```suggestion
    - The current implementation does not support priorities, delays.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@james.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


