Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2022/08/20 06:16:00 UTC

[jira] [Closed] (JAMES-3784) Ease mail repository / event dead letter operation

     [ https://issues.apache.org/jira/browse/JAMES-3784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Tellier closed JAMES-3784.
---------------------------------
    Resolution: Fixed

Done

> Ease mail repository / event dead letter operation
> --------------------------------------------------
>
>                 Key: JAMES-3784
>                 URL: https://issues.apache.org/jira/browse/JAMES-3784
>             Project: James Server
>          Issue Type: Improvement
>          Components: mailbox, MailStore & MailRepository
>    Affects Versions: master
>            Reporter: Benoit Tellier
>            Priority: Major
>             Fix For: 3.8.0
>
>          Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> h3. Mailing list thread
> https://www.mail-archive.com/server-dev@james.apache.org/msg72012.html
> h3. context
> James mostly does two kinds of processing:
>  - Mail processing: when a mail is received over SMTP, it is enqueued and then mailet processing is executed. Mailets/matchers are called against it and a series of decisions can be made: store this mail in a user mailbox, forward it, bounce it, ignore it, etc.
>  - Event processing: once actions are taken in a user mailbox, an event is emitted to the event bus. Listeners are then called to "decorate" features of the mailbox manager in a non-invasive way: ElasticSearch indexing, quota management, JMAP projections and state, IMAP/JMAP notifications...
> Of course, each kind of processing can fail, and error management is applied.
> Regarding mail processing, the failed mail is stored in /var/mail/error.
> To detect incidents:
>   - ERROR logs during processing
>   - WebAdmin calls show a non-zero /var/mail/error repository size
> To fix this incident:
>   - Explicit admin action is required, and if needed a reprocessing can be attempted (webadmin)
> Regarding event processing, listener execution is retried several times with a delay. If it keeps failing, the event is eventually stored in the dead letter.
> To detect incidents:
>  - ERROR logs during processing
>  - WebAdmin reports a non-zero size for deadletter
>  - Health check, which eventually emits a recurring WARNING log that cannot be missed.
> To fix this incident:
>   - Explicit admin action is required, and if needed a redelivery can be attempted (webadmin) 
> h3. Problem statement
> Most users miss this critical part of error management in James.
> Actions are never taken, and problems pile up.
> While major incidents with thousands of problems would clearly benefit from an admin intervention, I would like small incidents to self-recover without human intervention.
> In practice, none of my clients (me included) managed to set up a reliable action plan regarding processing failures. Problems could be tackled months after they arise, thus needlessly escalating into major issues.
> h3. Proposed solution
>  - Implement a healthcheck that verifies var/mail/error is empty
>  - An upper bound on redelivery/reprocessing exposed by webadmin
> The goal of this limit is to prevent unbounded processing that could consume unbounded resources. Auto-healing could be budgeted for (e.g. 10 mails/min).
> A human intervention is still needed in some cases:
>  - Massive outages that require a full redelivery/reprocessing
>  - Bugs that cause recurring failure.
> The goal is to have auto-healing in place, given those tasks are called with CRONs.
> CRONs remove the need for extra James-based developments that add complexity.
> h3. Proposed changes
> Add a `limit` parameter to reprocessing/redelivery.
> If specified, it limits the count of elements reprocessed/redelivered. If unspecified, the count of processed elements is unbounded (as it is today).
> Endpoints to modify:
>   - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_reprocessing_mails_from_a_mail_repository
>  - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_all_events
>  - https://james.staged.apache.org/james-distributed-app/3.7.0/operate/webadmin.html#_redeliver_group_events
> We also need:
>  - to update webadmin documentation accordingly.
>  - to recommend a CRON of e.g. 10 redeliveries/reprocessings per minute in our operation guides. (https://james.apache.org/server/manage-guice-distributed-james.html + https://james.staged.apache.org/james-distributed-app/3.7.0/operate/guide.html)
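
To make the proposal concrete, here is a minimal Python sketch of what a CRON-driven call with the proposed `limit` parameter might look like. It is an illustration under assumptions, not the implemented API: the WebAdmin address (127.0.0.1:8000), the helper names, and the exact route shapes (`/mailRepositories/{encodedPath}/mails?action=reprocess` and `/events/deadLetter?action=reDeliver`) are written from memory of the WebAdmin documentation linked above and should be checked against it. The `limit` query parameter is the one proposed in this issue.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

# Assumption: WebAdmin listens here; adjust to your deployment.
WEBADMIN = "http://127.0.0.1:8000"


def reprocess_url(repository_path: str, limit: Optional[int] = None,
                  base: str = WEBADMIN) -> str:
    """Build the URL reprocessing mails of a repository (hypothetical helper)."""
    params = {"action": "reprocess"}
    if limit is not None:
        # `limit` is the parameter proposed in this issue.
        params["limit"] = limit
    # The repository path is a single URL segment, so "/" must be escaped.
    encoded_path = quote(repository_path, safe="")
    return f"{base}/mailRepositories/{encoded_path}/mails?{urlencode(params)}"


def redeliver_url(group: Optional[str] = None, limit: Optional[int] = None,
                  base: str = WEBADMIN) -> str:
    """Build the URL redelivering dead-letter events (hypothetical helper)."""
    params = {"action": "reDeliver"}
    if limit is not None:
        params["limit"] = limit
    path = "/events/deadLetter"
    if group is not None:
        # Restrict redelivery to the events of one listener group.
        path += f"/groups/{quote(group, safe='')}"
    return f"{base}{path}?{urlencode(params)}"


def trigger(url: str, method: str) -> int:
    """Send the request and return the HTTP status code."""
    with urlopen(Request(url, method=method)) as response:
        return response.status
```

Calling `trigger(reprocess_url("var/mail/error", limit=10), "PATCH")` and `trigger(redeliver_url(limit=10), "POST")` once per minute from a CRON would give the budget of 10 mails/events per minute suggested above (assuming reprocessing is a PATCH and redelivery a POST, as in the linked documentation). Omitting `limit` keeps the current unbounded behaviour.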



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
