Posted to dev@ponymail.apache.org by Daniel Gruno <hu...@apache.org> on 2020/01/07 14:56:12 UTC

Lowercase headers and hashes

Hi folks,
I stumbled upon an issue with Gmail today, and am wondering how to
proceed here. It seems that Google modifies SMTP headers and lowercases
things like From: addresses (for instance, jim@jaguNET.com becomes
jim@jagunet.com in Gmail), meaning you can't currently use a Gmail
store of email to recreate a database unless that database was itself
sourced from Gmail, as the digest would differ due to casing.

One "fix" would be to simply lowercase all header fields when we
calculate the document ID, but that would also mean "yet another
algorithm change" because someone at Google thought it fun to modify
sources (they also hard-wrap the body at 78 characters, even for mail
from other origins, but that doesn't affect us from what I can tell).
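
To make the casing problem concrete, here is a minimal sketch in Python. It is not Pony Mail's actual ID generator: the function name document_id, the choice of header fields, and the use of SHA-256 are all assumptions made purely for illustration. It only shows that a digest over the raw From: value differs between the two copies, while a digest over lowercased values does not.

```python
import hashlib

def document_id(headers, body, lowercase=False):
    """Hypothetical digest over a few header fields plus the body.

    Not the real Pony Mail algorithm; field list and hash are assumed.
    """
    parts = []
    for name in ("from", "to", "subject", "message-id", "date"):
        value = headers.get(name, "")
        if lowercase:
            # The proposed "fix": ignore casing when hashing.
            value = value.lower()
        parts.append(f"{name}:{value}")
    parts.append(body)
    return hashlib.sha256("\n".join(parts).encode("utf-8")).hexdigest()

# Same message, once as originally sent and once as stored by Gmail.
original = {"from": "jim@jaguNET.com", "subject": "Hello"}
via_gmail = {"from": "jim@jagunet.com", "subject": "Hello"}
body = "Hi there\n"

same_raw = document_id(original, body) == document_id(via_gmail, body)
same_lowered = (document_id(original, body, lowercase=True)
                == document_id(via_gmail, body, lowercase=True))
print(same_raw, same_lowered)
```

Note that turning lowercasing on changes the digest for any message with mixed-case header values, which is why it amounts to yet another algorithm change for existing archives.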

I am open to suggestions and feedback.

With regards,
Daniel.

Re: Lowercase headers and hashes

Posted by sebb <se...@gmail.com>.

On Tue, 7 Jan 2020 at 14:56, Daniel Gruno <hu...@apache.org> wrote:

> Hi folks,
> I stumbled upon an issue with Gmail today, and am wondering how to
> proceed here. It seems that Google modifies SMTP headers and lowercases
> things like From: addresses (for instance, jim@jaguNET.com becomes
> jim@jagunet.com in Gmail), meaning you can't currently use a Gmail
> store of email to recreate a database unless that database was itself
> sourced from Gmail, as the digest would differ due to casing.
>
> One "fix" would be to simply lowercase all header fields when we
> calculate the document ID, but that would also mean "yet another
> algorithm change" because someone at Google thought it fun to modify
> sources (they also hard-wrap the body at 78 characters, even for mail
> from other origins, but that doesn't affect us from what I can tell).
>
> I am open to suggestions and feedback.
>
>
This is an edge use case, so whatever is done should be optional (and well
documented as to the effects of using it).

Having said that, I think a new algorithm *is* still needed, because the
current algorithms are clearly still not sufficiently independent of the
mail transport or message routing.
However, whilst it is easy to create new algorithms, it is not easy to
manage the resulting Permalinks long-term.

The first part of the exercise would be to collect as many different
examples of emails as possible.
These can then be studied to see which parts are invariant across
transports etc.
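
As a rough sketch of what such a study could look like, the Python below compares several raw copies of the same message and reports which header fields are identical in all of them. The function name invariant_headers and the sample messages (including the Message-ID and Received values) are made up for illustration; a real survey would need many more samples and would also have to look at the body.

```python
from email import message_from_string

def invariant_headers(raw_copies):
    """Return lowercased header names whose values match in every copy."""
    messages = [message_from_string(raw) for raw in raw_copies]
    # Only headers present in every copy can be invariant.
    common = {name.lower() for name in messages[0].keys()}
    for msg in messages[1:]:
        common &= {name.lower() for name in msg.keys()}
    stable = set()
    for name in common:
        # get_all() is case-insensitive and keeps repeated headers in order.
        values = {tuple(msg.get_all(name)) for msg in messages}
        if len(values) == 1:
            stable.add(name)
    return stable

# Two copies of one message: direct from the list, and as stored by Gmail.
direct = (
    "From: jim@jaguNET.com\n"
    "Subject: Hello\n"
    "Message-ID: <abc@example.org>\n"
    "\n"
    "Hi\n"
)
gmail = (
    "From: jim@jagunet.com\n"
    "Subject: Hello\n"
    "Message-ID: <abc@example.org>\n"
    "Received: by example relay\n"
    "\n"
    "Hi\n"
)
print(sorted(invariant_headers([direct, gmail])))
```

With these two samples, From: drops out because of the casing difference and Received: because it is missing from one copy.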

This is not an easy task, as the current range of algorithms demonstrates.
I think we need more people to get involved in the design/review process.

The hash is currently used for both database id and Permalink.
These have different requirements. This makes it harder to design the
algorithm.

> With regards,
> Daniel.
>