Posted to dev@ponymail.apache.org by Daniel Gruno <hu...@apache.org> on 2020/01/07 14:56:12 UTC
Lowercase headers and hashes
Hi folks,
I stumbled upon an issue with Gmail today, and am wondering how to proceed
here. It seems that Google modifies SMTP headers and lowercases things
like From: addresses (for instance, jim@jaguNET.com becomes
jim@jagunet.com in Gmail), meaning you can't currently use a Gmail
store of email to recreate a database unless the database itself was
sourced from Gmail, as the digest would differ due to casing.
One "fix" would be to simply lowercase all header fields when we
calculate the document ID, but that would also mean "yet another
algorithm change" because someone at Google thought it fun to modify
sources (they also hard-wrap the body at 78 characters, even from other
origins, but that doesn't affect us from what I can tell).
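To make the casing problem and the proposed "fix" concrete, here is a minimal sketch. It is not the actual Pony Mail ID generator: the function name `doc_id`, the `ID_HEADERS` list, and the use of SHA-256 are all assumptions chosen for illustration, and the sample message is made up apart from the jim@jaguNET.com address mentioned above.

```python
import hashlib
from email import message_from_string

# Hypothetical set of headers fed into the digest; the real generators
# use their own inputs.
ID_HEADERS = ("From", "To", "Subject", "Message-ID", "Date")


def doc_id(raw: str, lowercase: bool = False) -> str:
    """Hash selected header values, optionally lowercasing them first."""
    msg = message_from_string(raw)
    parts = []
    for name in ID_HEADERS:
        value = msg.get(name, "")  # missing headers contribute an empty value
        if lowercase:
            value = value.lower()
        parts.append(f"{name.lower()}:{value}")
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()


# The same message as originally sent, and as it might come back from Gmail.
original = (
    "From: jim@jaguNET.com\n"
    "Subject: Test\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "body\n"
)
via_gmail = original.replace("jim@jaguNET.com", "jim@jagunet.com")

# Without normalisation the digests differ; with lowercasing they match.
print(doc_id(original) == doc_id(via_gmail))
print(doc_id(original, lowercase=True) == doc_id(via_gmail, lowercase=True))
```

Note that lowercasing every value, as this sketch does, also makes messages that differ only in header casing (for example "Test" vs "test" in the Subject) hash to the same ID, which is part of why it amounts to a new algorithm rather than a transparent fix.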
I am open to suggestions and feedback.
With regards,
Daniel.
Re: Lowercase headers and hashes
Posted by sebb <se...@gmail.com>.
On Tue, 7 Jan 2020 at 14:56, Daniel Gruno <hu...@apache.org> wrote:
> Hi folks,
> I stumbled upon an issue with Gmail today, and am wondering how to proceed
> here. It seems that Google modifies SMTP headers and lowercases things
> like From: addresses (for instance, jim@jaguNET.com becomes
> jim@jagunet.com in Gmail), meaning you can't currently use a Gmail
> store of email to recreate a database unless the database itself was
> sourced from Gmail, as the digest would differ due to casing.
>
> One "fix" would be to simply lowercase all header fields when we
> calculate the document ID, but that would also mean "yet another
> algorithm change" because someone at Google thought it fun to modify
> sources (they also hard-wrap the body at 78 characters, even from other
> origins, but that doesn't affect us from what I can tell).
>
> I am open to suggestions and feedback.
>
>
This is an edge use case, so whatever is done should be optional (and well
documented as to the effects of using it).
Having said that, I think a new algorithm *is* still needed, because the
current algorithms are clearly still not sufficiently independent of the
mail transport or message routing.
However, whilst it is easy to create new algorithms, it is not easy to
manage the resulting Permalinks long-term.
The first part of the exercise would be to collect as many different
examples of emails as possible.
These can then be studied to see which parts are invariant across
transports etc.
This is not an easy task, as the current range of algorithms demonstrates.
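The comparison step described above could be sketched roughly as follows, assuming several raw copies of the same email have been collected from different transports. The function name `invariant_headers` and the sample copies (including the `mx.example.com` host) are hypothetical and only meant to illustrate the idea.

```python
from email import message_from_string


def invariant_headers(copies):
    """Return lowercased header names whose values match in every copy."""
    messages = [message_from_string(raw) for raw in copies]

    # Only headers present in every copy can be candidates.
    common = {name.lower() for name in messages[0].keys()}
    for msg in messages[1:]:
        common &= {name.lower() for name in msg.keys()}

    stable = set()
    for name in common:
        # get_all() is case-insensitive and returns every occurrence.
        values = {tuple(msg.get_all(name)) for msg in messages}
        if len(values) == 1:
            stable.add(name)
    return stable


# Two copies of one message: as sent, and as stored by another provider.
copy_a = (
    "From: jim@jaguNET.com\n"
    "Subject: Test\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "body\n"
)
copy_b = (
    "Received: by mx.example.com\n"
    "From: jim@jagunet.com\n"
    "Subject: Test\n"
    "Message-ID: <1@example.org>\n"
    "\n"
    "body\n"
)

print(sorted(invariant_headers([copy_a, copy_b])))
```

A header surviving this check on a small sample is only a hint that it is stable; a larger and more varied corpus would be needed before relying on it for an ID.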
I think we need more people to get involved in the design/review process.
The hash is currently used for both database id and Permalink.
These have different requirements. This makes it harder to design the
algorithm.
> With regards,
> Daniel.