Posted to server-dev@james.apache.org by "Benoit Tellier (Jira)" <se...@james.apache.org> on 2022/01/28 02:31:00 UTC

[jira] [Commented] (JAMES-3710) Restarting James while deleting using POP3 causes inconsistency

    [ https://issues.apache.org/jira/browse/JAMES-3710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17483524#comment-17483524 ] 

Benoit Tellier commented on JAMES-3710:
---------------------------------------

Hello, sadly there is no silver bullet for this, as the underlying data store does not support transactions.

For completeness, you should include the pop3metadata table in the analysis.

Here is how a delete proceeds:

(Delete pointers pointing to the data)
1. Delete data in the table of truth: ImapUidTable.
2. Delete data in the denormalisation table: MessageIdTable. This is retried with exponential backoff, in a best-effort attempt to match the content of the table of truth.
3. A couple of mailbox-related metadata projections are updated too. Retried if idempotent.
4. From there, the deletion is broadcast to the mailbox listeners; in RabbitMQ, events are retried across the cluster.
5. The POP3 listener removes the entry from pop3metadata.
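
To make the ordering of steps 1 and 2 concrete, here is a minimal in-memory sketch. It is not the James implementation: the class DeleteOrderSketch, the helper retryWithBackoff, the two sets standing in for the Cassandra tables, and the attempt count and delays are all assumptions made for illustration.

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

public class DeleteOrderSketch {
    // Hypothetical in-memory stand-ins for the tables named in steps 1 and 2.
    static final Set<String> imapUidTable = new HashSet<>();
    static final Set<String> messageIdTable = new HashSet<>();

    /** Runs an idempotent action, doubling the delay after each failure. Returns the attempts used. */
    static int retryWithBackoff(Runnable action, int maxAttempts, Duration firstDelay) {
        Duration delay = firstDelay;
        for (int attempt = 1; ; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (RuntimeException e) {
                if (attempt == maxAttempts) {
                    throw e; // give up: the two tables may now disagree
                }
                try {
                    Thread.sleep(delay.toMillis());
                } catch (InterruptedException interrupted) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", interrupted);
                }
                delay = delay.multipliedBy(2);
            }
        }
    }

    /** Step 1 first (table of truth), then step 2 with retries (denormalisation table). */
    static int delete(String messageId, Consumer<String> denormalisedDelete) {
        imapUidTable.remove(messageId);
        // If the process stops here, the entry is gone from the table of truth only.
        return retryWithBackoff(() -> denormalisedDelete.accept(messageId), 5, Duration.ofMillis(10));
    }

    public static void main(String[] args) {
        imapUidTable.add("m1");
        messageIdTable.add("m1");
        int[] failuresLeft = {2}; // simulate two transient failures of the second delete
        int attempts = delete("m1", id -> {
            if (failuresLeft[0]-- > 0) {
                throw new IllegalStateException("transient failure");
            }
            messageIdTable.remove(id);
        });
        System.out.println("attempts=" + attempts + " remaining=" + messageIdTable);
    }
}
```

The sketch only shows that, without transactions, there is a window between the two deletes in which a stopped process leaves the tables out of step; the retries narrow that window but cannot close it.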

There are additional mechanisms (applicative read repairs, consistency tasks) to further limit such inconsistencies.

The POP3 server detects such inconsistencies and fixes them (emitting a log in the process, as you pointed out).
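
The detect-and-fix behaviour can be pictured roughly as follows. This is a simplified sketch and not the code of DistributedMailboxAdapter: the class Pop3ReadRepairSketch, the method listAndRepair, and the use of plain sets of message ids in place of pop3metadata and the table of truth are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.logging.Logger;

public class Pop3ReadRepairSketch {
    private static final Logger LOGGER = Logger.getLogger(Pop3ReadRepairSketch.class.getName());

    /**
     * Returns the projection entries still backed by the table of truth and
     * removes the stale ones from the projection as a side effect.
     */
    static List<String> listAndRepair(Set<String> pop3Projection, Set<String> tableOfTruth) {
        List<String> visible = new ArrayList<>();
        // Iterate over a copy so stale entries can be removed from the projection itself.
        for (String messageId : new ArrayList<>(pop3Projection)) {
            if (tableOfTruth.contains(messageId)) {
                visible.add(messageId);
            } else {
                LOGGER.warning("Removing " + messageId
                    + " from POP3 projection as it is not backed by a message");
                pop3Projection.remove(messageId);
            }
        }
        return visible;
    }
}
```

In this picture a stale entry costs one warning and is repaired on first access, so the inconsistency is self-healing rather than permanent.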

So there is little more that can be done to limit denormalisation inconsistencies.

Having a look at the POP3 server, https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/ExecutorService.html#shutdown() is called during graceful shutdowns, which, as far as I am aware, is supposed to give the underlying tasks time to complete. I'm not an expert here, and I would be happy to have another pair of eyes looking at this.
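
One point worth checking here: per its Javadoc, ExecutorService.shutdown() stops accepting new tasks and lets previously submitted ones run, but it does not itself wait for them to complete; awaitTermination does that. Whether the James shutdown path waits in this way is something I have not verified. The sketch below shows the usual pattern under that assumption; the class GracefulShutdownSketch, the helper shutdownAndAwait, and the timeout values are hypothetical.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class GracefulShutdownSketch {
    /**
     * Stops accepting new tasks, then waits up to the timeout for in-flight ones.
     * Returns true only if every submitted task finished in time.
     */
    static boolean shutdownAndAwait(ExecutorService executor, long timeout, TimeUnit unit) {
        executor.shutdown(); // non-blocking: returns immediately
        try {
            if (executor.awaitTermination(timeout, unit)) {
                return true;
            }
            executor.shutdownNow(); // timed out: interrupt whatever is still running
            return false;
        } catch (InterruptedException e) {
            executor.shutdownNow();
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService executor = Executors.newFixedThreadPool(2);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < 4; i++) {
            executor.submit(() -> {
                try {
                    Thread.sleep(50); // stands in for an in-flight delete
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                completed.incrementAndGet();
            });
        }
        boolean clean = shutdownAndAwait(executor, 5, TimeUnit.SECONDS);
        System.out.println("clean=" + clean + " completed=" + completed.get());
    }
}
```

If the process exits right after shutdown() without such a wait, or if the orchestrator kills it before the timeout elapses, in-flight deletes could be cut between two of the steps listed above.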



> Restarting James while deleting using POP3 causes inconsistency
> ---------------------------------------------------------------
>
>                 Key: JAMES-3710
>                 URL: https://issues.apache.org/jira/browse/JAMES-3710
>             Project: James Server
>          Issue Type: Bug
>    Affects Versions: master
>            Reporter: Ilja Weis
>            Priority: Major
>         Attachments: logs.zip
>
>
> Running James (distributed-pop3-app) and restarting it while
> it is processing a lot of POP3 sessions that are deleting messages
> causes the tables messagev3, pop3metadata and imapuidtable to be
> inconsistent, leading to the problems discussed in JAMES-3709.
> The effect after applying the changes from https://github.com/apache/james-project/pull/861 is that POP3 sessions TOPing messages that no longer exist
> but still are in pop3metadata now cause "-ERR Message (12) does not exist."
> as expected but no longer destroy the whole session.
> The interesting part is why the data becomes inconsistent. This is probably
> an edge case because we're probably going to restart James all the time
> but perhaps there's a problem somewhere that's just more likely to happen
> when restarting?
> My setup: 4 James instances in Kubernetes, S3 storage, Cassandra
> cluster, 2 datacenters with 3 nodes each, replication DC1:3,DC2:3. Relevant
> cassandra properties:
> cassandra.consistency_level.regular=QUORUM
> cassandra.consistency_level.lightweight_transaction=SERIAL
> message.read.strong.consistency=false
> message.write.strong.consistency.unsafe=false
> mailbox.read.strong.consistency=false
> mailbox.read.repair.chance=0.00
> mailbox.counters.read.repair.chance.max=0.000
> mailbox.counters.read.repair.chance.one.hundred=0.000
> What I did:
> - Send a number of mails to 200 mailboxes. After this:
> > select count(*) from messagev3;
>  count
> -------
>  16837
> > select count(*) from imapuidtable;
>  count
> -------
>  16837
> > select count(*) from pop3metadata ;
>  count
> -------
>  16837
> Then, start deleting all those messages with 40 parallel sessions.
> There are no concurrent sessions to the same account.
> While the deletes are running, restart all the james instances.
> After a moment, we have:
> > select count(*) from messagev3;
>  count
> -------
>  14669
> > select count(*) from imapuidtable;
>  count
> -------
>  14194
> > select count(*) from pop3metadata ;
>  count
> -------
>  14669
> Not sure if messagev3 is relevant here, just adding it for completeness.
> Now if I'm accessing the mailboxes, this will touch some of the
> messages that are no longer in imapuidtable which of course leads to
> -ERR Message (12) does not exist.
> and
> 16:32:48.175 [WARN ] o.a.j.p.m.DistributedMailboxAdapter - Removing cc9b9f70-7f8d-11ec-b659-0f643f8b11e7 from 8870f760-7f87-11ec-9109-5fc36fc8b18d POP3 projection for user james-032@james.testing at it is not backed by a MailboxMessage
> Running the fixPop3Inconsistencies task in such a situation would then
> clean up all the pop3metadata messages as "stale".
> I have attached the James log while deleting/restarting and after
> restarting for all 4 instances.
> What do we make of this? Is this something relevant or more like something
> that just can happen if we restart James at the wrong time?


