Posted to dev@ponymail.apache.org by sebb <se...@gmail.com> on 2022/03/19 11:48:32 UTC

Are all emails properly archived?

AFAICT, all distinct email sources are currently stored in the
database, because the id is derived from a hash of the source. (*)

However, that does not mean that they are recoverable.

The current database design requires that source entries are retrieved
via the corresponding mbox entry.
If a second email is received that hashes to the same mbox index, the
pointer back to the existing source entry will be overwritten.

Such duplicates are not unknown; mail transport glitches can result in
duplication of email content (with different ezmlm archive numbers and
some other differing headers).

In such cases, it is no longer possible to recover the original source.
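
To illustrate the failure mode described above, here is a minimal sketch
using plain dicts in place of the two indices. Everything in it is an
assumption made for illustration: the names source_index, mbox_index,
source_id, mbox_id and archive are hypothetical, the hash inputs are
simplified, and this is not the actual archiver code or schema.

```python
import hashlib

# Hypothetical in-memory stand-ins for the two indices (not the real schema).
source_index = {}  # source id -> raw message bytes
mbox_index = {}    # mbox id -> parsed document holding a pointer to its source


def source_id(raw: bytes) -> str:
    """Id derived from the complete raw message, so each distinct source is kept."""
    return hashlib.sha256(raw).hexdigest()


def mbox_id(body: str, list_id: str) -> str:
    """Id derived from content only (an assumption), so redelivered copies collide."""
    return hashlib.sha256(f"{list_id}\n{body}".encode()).hexdigest()


def archive(raw: bytes, body: str, list_id: str) -> str:
    """Store the raw source, then write the mbox document under its own id."""
    sid = source_id(raw)
    mid = mbox_id(body, list_id)
    source_index[sid] = raw
    # Plain assignment replaces any earlier document, including its pointer.
    mbox_index[mid] = {"body": body, "source": sid}
    return mid


# Two deliveries with identical content but different transport headers.
first = b"X-Archive-Number: 1\n\nhello"
second = b"X-Archive-Number: 2\n\nhello"
mid1 = archive(first, "hello", "dev.example.org")
mid2 = archive(second, "hello", "dev.example.org")

# Sources that no mbox document points to any more.
reachable = {doc["source"] for doc in mbox_index.values()}
orphaned = set(source_index) - reachable
```

Both sources are still stored, but only the second can be reached through
the mbox entry; the first ends up in orphaned.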

I think this could be fixed, but until it is, I don't think Pony Mail
can be considered as a complete archival application, as it does not
give access to all the emails received by a mailing list.

Sebb
(*) discounting hash collisions, which should be vanishingly rare

Re: Are all emails properly archived?

Posted by sebb <se...@gmail.com>.
On Sat, 19 Mar 2022 at 13:39, Daniel Gruno <hu...@apache.org> wrote:
>
> On 19/03/2022 12.48, sebb wrote:
> > AFAICT, all distinct email sources are currently stored in the
> > database, because the id is derived from a hash of the source. (*)
> >
> > However, that does not mean that they are recoverable.
> >
> > The current database design requires that source entries are retrieved
> > via the corresponding mbox entry.
> > If a second email is received that hashes to the same mbox index, the
> > pointer back to the existing source entry will be overwritten.
> >
> > Such duplicates are not unknown; mail transport glitches can result in
> > duplication of email content (but different ezmlm archive numbers and
> > some other headers).
> >
> > In such cases, it is no longer possible to recover the original source.
>
> I think we discussed this before. One solution is to change the behavior
> at
> https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L728
> - if an email is found to already exist with the same DKIM ID, we should
> fetch it and append the new mbox_source ID to the existing document. As
> Elasticsearch doesn't care if something is a string or an array of
> strings, this should be fine. A check for the source could then perhaps
> result in an HTTP 300 Multiple Choices response?

Yes, something like that should fix it.

> >
> > I think this could be fixed, but until it is, I don't think Pony Mail
> > can be considered as a complete archival application, as it does not
> > give access to all the emails received by a mailing list.
>
> We have to manage expectations and define what we mean by "archive". In

Exactly.

I expect an archive of an email list to contain the same emails as I
receive as a subscriber.

> my world, Pony Mail exists as a searchable/interactive archive for users
> to find _content_ and _intentions_, not necessarily as a bit-for-bit
> verbatim backup for system administrators. If people wish to insure
> against disasters, there are ways of doing that.

That is not the point.

> I find it sufficient that as long as you can find your email in the
> right place, it does not matter if it's technically a "de-duplicated
> duplicate".
>

I don't find it sufficient.

If I cannot find all my emails in the archive, I don't consider it complete.

> >
> > Sebb
> > (*) discounting hash collisions, which should be vanishingly small
>

Re: Are all emails properly archived?

Posted by Daniel Gruno <hu...@apache.org>.
On 19/03/2022 12.48, sebb wrote:
> AFAICT, all distinct email sources are currently stored in the
> database, because the id is derived from a hash of the source. (*)
> 
> However, that does not mean that they are recoverable.
> 
> The current database design requires that source entries are retrieved
> via the corresponding mbox entry.
> If a second email is received that hashes to the same mbox index, the
> pointer back to the existing source entry will be overwritten.
> 
> Such duplicates are not unknown; mail transport glitches can result in
> duplication of email content (but different ezmlm archive numbers and
> some other headers).
> 
> In such cases, it is no longer possible to recover the original source.

I think we discussed this before. One solution is to change the behavior 
at 
https://github.com/apache/incubator-ponymail-foal/blob/master/tools/archiver.py#L728 
- if an email is found to already exist with the same DKIM ID, we should 
fetch it and append the new mbox_source ID to the existing document. As
Elasticsearch doesn't care if something is a string or an array of
strings, this should be fine. A check for the source could then perhaps
result in an HTTP 300 Multiple Choices response?
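
A rough sketch of that idea, using a plain dict for the stored document.
The field name "source" and the helpers link_source and
source_lookup_status are hypothetical names chosen for illustration; this
is not the code at the link above, and the real change would need to
fetch and update the document in the database.

```python
def as_list(value):
    """Accept either a single id string or a list of ids."""
    return [value] if isinstance(value, str) else list(value)


def link_source(doc: dict, new_source_id: str) -> dict:
    """Return a copy of doc whose source pointer also includes new_source_id."""
    sources = as_list(doc.get("source", []))
    if new_source_id not in sources:
        sources.append(new_source_id)
    return {**doc, "source": sources}


def source_lookup_status(doc: dict):
    """Pick an HTTP status for a source request: 200 for one source, 300 for several."""
    sources = as_list(doc["source"])
    return (200 if len(sources) == 1 else 300), sources


existing = {"body": "hello", "source": "source-id-1"}
updated = link_source(existing, "source-id-2")
status, choices = source_lookup_status(updated)
```

Appending rather than overwriting keeps every received source reachable;
what the 300 response body should look like is left open here.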

> 
> I think this could be fixed, but until it is, I don't think Pony Mail
> can be considered as a complete archival application, as it does not
> give access to all the emails received by a mailing list.

We have to manage expectations and define what we mean by "archive". In 
my world, Pony Mail exists as a searchable/interactive archive for users 
to find _content_ and _intentions_, not necessarily as a bit-for-bit 
verbatim backup for system administrators. If people wish to insure 
against disasters, there are ways of doing that.

I find it sufficient that as long as you can find your email in the 
right place, it does not matter if it's technically a "de-duplicated 
duplicate".


> 
> Sebb
> (*) discounting hash collisions, which should be vanishingly small