You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ponymail.apache.org by GitBox <gi...@apache.org> on 2020/08/11 11:43:13 UTC

[GitHub] [incubator-ponymail] sbp opened a new pull request #517: Add DKIM style ID generation

sbp opened a new pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517


   This PR adds **DKIM style Ponymail ID generation**.
   
   ## Why?
   
   There are a number of existing Ponymail ID generators, two of which are currently recommended: `full` for a single node, and `cluster` for multiple nodes. The purpose of the latter in particular is to generate an ID based on the hash of the email once elements that may vary across cluster nodes have been excluded.
   
   There are other situations in which elements of an email may vary from environment to environment. Once such scenario is that archives of a mailing list collected by different people or organisations in different locations may contain different actual email sources. The `Received` headers, for example, will be different depending on the routes of emails sent out by the mailing list software.
   
   Variability in emails is a problem in another area too: guaranteeing email authenticity. Authenticity has been studied and gradually solved over the years by technologies such as SPF, DKIM, and ARC. Since DKIM involves signing emails, the designers had to solve email variability in such a way that signatures would be consistent. They did this by allowing signers to (a) subset the original headers before signing, and (b) apply different canonicalisation algorithms to the headers and to the body.
   
   This PR solves the problem of robust ID generation by leveraging the existing DKIM mechanism for robust signing.
   
   ## What?
   
   To generate a DKIM style Ponymail ID, we first parse the email using a superset of the algorithm used by the popular Python [dkimpy](https://pypi.org/project/dkimpy/) package. We then subset the headers to an RFC recommended subset, and apply DKIM `relaxed/simple` canonicalisation. Finally, we hash the canonicalised subset message using SHA3-256, and encode the first 80 bits of the digest using a custom base32 alphabet. The encoded digest prefix is the resulting Ponymail ID.
   
   ## How?
   
   We first implemented a superset of the RFC 822 parser in `dkimpy` (the _reference parser_) from scratch.
   
   Why did we use this reference parser instead of following an algorithm in an RFC? Neither RFC 822 nor RFC 6376 (the DKIM RFC) nor any other RFC that we found gives a parsing algorithm for inputs which are broken. Ponymail must generate an ID no matter what the form of the input is. Therefore we followed an RFC 6376 implementation as a reference because it already covers some broken inputs.
   
   But we went further. Our version shares no code with the original parser, and has other advantages in that it:
   
   * Gives identical output to the reference parser in all cases we tested for which the reference parser gave output.
   * Provides output for _all_ byte sequence inputs, unlike the reference parser. In cases where the reference parser would throw an error, we make reasonable assumptions as to what the form of the output should be.
   * Performs about 2x faster. It's so fast that even when it performs canonicalisation, it's still faster than the reference parser.
   * Depends on no modules, even from the Python standard library.
   * Can exclude headers whilst parsing.
   
   We subset the email using headers recommended by RFC 4871, the precursor to RFC 6376. We do this because the former had a more comprehensive set of recommendations on which headers to include. We sent the following email, which explains the situation in detail, to the authors of RFC 6376:
   
   ```
   In RFC 6376 you removed several entries from the list of headers in
   _Recommended Signature Content_ (5.4.1) that had been present in RFC
   4871. The (Resent-)Sender and (Resent-)Message-ID headers were
   removed, as well as all MIME headers.
   
   In an _INFORMATIVE OPERATIONS NOTE_ (5.4) earlier on in RFC 6376,
   though, it is "highly advised" that both Sender and all MIME headers
   are present. This earlier NOTE is unchanged from RFC 4871.
   
   Later on in RFC 6376 the removal of Message-ID is discussed, and its
   new status is explained to be contextual. But what was the rationale
   for removing the other headers, and what is their new status? Was the
   NOTE accidentally not updated when the other headers were removed from
   the list? Or are (Resent-)Sender and the MIME headers still
   recommended for inclusion?
   ```
   
   RFC 6376, whilst contradicting itself, does at least say that ultimately the choice is up to the implementer. We use the more comprehensive recommendations of RFC 4871 as proving a better baseline. We also add the `DKIM-Signature` header itself so that signed and unsigned messages cannot have the same ID.
   
   We hash the result with SHA3-256. RFC 6376 and its successors do not yet provide for signatures using SHA3-256. We chose SHA3-256 as it is more likely to be resistant to cryptanalysis than SHA2, and the sponge construction has more practical uses such as STROBE, and therefore SHA3 is more likely to be robust. We could have used SHAKE for a shorter digest to truncate from, but SHA3-256 is more ubiquitous. We use only the first 80 bits, to make IDs easier to share; Ponymail IDs are IDs, not hashes. Although 80 bytes is enough to make collisions in normal use unlikely, collision attacks are still possible. To avoid collision attacks, we allow configuration of a nonce which is added as a header to the email before canonicalisation.
   
   Finally, we encode the output using base32, but with the alphabet `[0-9b-df-hj-tv-z]` in place of the normal base32 alphabet. We position 0-9 first by analogy with base16. We use a-z without a, e, i, u to minimise the probability that cultural taboo words will appear in generated IDs. We call this the _pibble_ encoding, which is short for "pony nibble".
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] rbowen commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
rbowen commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679299947


   So ... it sounds like the concerns have been considered, addressed, and tested for?
   
   Like I said, I really look forward to more manageable permalinks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691482048


   What if it only appends that *if a list override has been used* ?
   So, normal use of List-ID in the email header would give you a short pibble, but using a --lid overrride would give you a longer ID for recovery purposes. That way installations that always have a list ID could keep things shrot and neat, and lists that rely on an override would have something helpful.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-674709117


   Arguably, if you use a custom list ID different from what's in the source, you are going to potentially 404 your permalink in any case if you change it or forget what it was when reimporting, no matter what generator you use. All current generators use the list ID in their input/output. The only difference of real importance, in my view, is that the list ID is visible in current generated IDs, and hidden in the dkim generator to get as short an ID as is reasonably possible (by hidden I mean it's used inside the hash, instead of being plaintext outside the hash).
   
   I will agree that having the list ID in the generated ID can be very helpful for administrators if they need to reimport and had a lot of custom overrides they can't remember or don't have a backed-up configuration of. What if we make this an option for the administrator to decide on? Thus, we could accept this as is, and then have a second, `dkim_long` or such, where the list-id is appended to the generated ID instead.
   
   the current dkim generator appeals to a certain group of people, and your suggestion of appending the list ID will appeal to other groups of people. Neither solution will be 100.00% stable against all edge cases (take for instance losing your database and re-importing from gmail mbox sources, that would not work).
   
   Let the administrators running PM decide between a shorter ID for neater URLs, or a longer ID that could make recovery easier. But let that be their decision, I don't think we should be imposing one or the other option.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-674759668


   "re-importing from gmail mbox sources, that would not work" - why not?
   
   It would certainly work with the mod_mbox software, as that relies on an intrinsic part of the message (Message-Id)
   
   That is surely one of the main ideas behind the dkim hash - generate an id that is the same for all instances of the same original message?
   It would not be needed if Message-Id were universal and unique, but unfortunately that's not the case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-702772743


   It does not suffice to "[use a best guess as to the original LID that was used](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-693372710)" to reconstruct an **OL** permalink. If somebody has an `opaque_lid` permalink, and the message is lost but the Ponymail operator guesses an incorrect new lid, then the original `opaque_lid` link will break. If links break then they cannot be considered permalinks.
   
   Moreover, if the appended lid in **OL** were not an integral part of the permalink then it would be necessary to use it anyway to disambiguate which list UI to apply to a message in the archives. This is because, for example and amongst other possible scenarios, the [same message could be imported twice](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692033864) with different manual command line lid overrides. Using lid UI disambiguation has [already been proposed](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-693372710), and requires [handling of the case](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691953754) where the lid is omitted from the link.
   
   One could consider that the current commit dfd18eb already implements **OL** in the presence of lid UI disambiguation. It is therefore not an accurate characterisation that the current commit "[will need to be fixed](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-693372710)" to include the lid in the input, because the intention was that lid UI disambiguation would be introduced alongside the current commit, or implemented in a future version with a warning to users in the documentation meanwhile.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-675058525


   gmail does some (nasty) normalization of header values, such as lower-casing email addresses, which is not standard practice, so you cannot reliably generate the same hash for all emails if your previous import was based off non-gmail mbox files. I've run into this problem a few times, where the mbox address had caps in it, and gmail removed those caps, hence why I know.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679067179


   Sorry, but I don't think the discussion is yet complete.
   We need more reviews by other interested parties. 
   And we need more tests using the same emails from multiple sources.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687812671


   I'm just working my way through the new documentation, which seems very thorough.
   
   There is a section in dkim_id_test.html which states that a mbox From line may be confused with a header. It says that:
   
   From MAILER-DAEMON Fri Jul  8 12:08:34 2011
   
   could be interpreted as a header with field name:
   
   From MAILER-DAEMON Fri Jul  8 12
   
   According to the original rfc822 and current rfc5322, header field-names cannot contain spaces, so it's not possible to confuse a line that starts with "From text...:" for a valid header.
   
   [There is an obsolete syntax that allows trailing white-space just before the colon, but that could not match a valid mbox header which must have some printable chars before the colon]


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692672620


   This PR is now more than a month old.
   I am satisfied that the generator, as intended, works. I intend to merge it.
   I agree that we should work on the alternate strategy elsewhere. I see the benefits of it, but I think it's ultimately unfair to the original submission to keep requiring changes that are sort of tangential at best to the original issue, as well as a "side-step" from DKIM.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673408683


   AIUI, the destination list-id (not the one in the origin, which may not exist) is appended in the generator with:
   ~~~python
           headers.append([b"X-Archive-List-ID", xali_value])
   ~~~
   the destination list ID is thus both in the hash and also in the mbox document (as it always is).
   I'm unsure what you mean by the permalink containing the list id. I don't see that as a must, it's in the document itself, and the pibble would change if the destination list ID changed, thus two identical emails for the two separate lists would have different pibbles for each list.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673011821


   @sebbASF
   
   An 80 bit truncated hash provides 80 bits of preimage resistance, but only 40 bits of collision resistance. In terms of Ponymail, preimage resistance prevents forgery of new messages with *existing* IDs. Collision resistance prevents a forger generating two different messages with a single *new* ID.
   
   Back of the envelope calculations, which I am _not claiming as the actual security of this PR code_ (i.e. this must be independently verified), using SUPERCOP benchmarks show that 40 bits enables a forgery under that model in a matter of hours. 64 bit security would be possible only by a nation-state actor. Forgery under 80 bit security is infeasible now, and is likely to remain so for a reasonable duration.
   
   In other words, 80 bit encoded DKIM style IDs are not suitable for use in the way that you suggest, as a nonce-free identifier for abstract emails. They would be far too easily forged, in a matter of hours even with standard hardware. This is why the option for adding a nonce exists.
   
   General forgeries could be mitigated by storing the actual DKIM signature (if present) as metadata within Ponymail, and this would probably be a good idea in any case. Identifier forgeries can be partially mitigated in Ponymail by canonically showing the first message to have been received by the system. But this only works for currently hosted mailing lists; it does not work for message archive imports.
   
   A minimum of 128 bit security is standard for security in general scenarios. That is why 256 bit hashes have become widespread, because they provide 128 bits of collision resistance security. But for Ponymail preimage resistance is the primary concern, and collision resistance less so.
   
   For your use case I would suggest 128 bit identifiers, providing 64 bits of collision resistance. The code could still be written with the idea that collision attacks would be feasible, providing further mitigations even in this very unlikely case. Another advantage of using 128 bits is that SHAKE-128 output could be used untruncated.
   
   The alternative base32 pibble encoding proposed in this PR has the advantages that it is case insensitive and avoids a wide range of cultural taboo substrings, but it is less compact than urlsafe base64, which would be a potential alternative if using longer IDs. Here is an example of a 16 character pibble encoded 80 bit identifier:
   
       t3wfqdtjor8kghwn
   
   And here is a 22 character base64 encoded 128 bit identifier:
   
       MTIzNDU2Nzg5MDEyMzQ1Ng
   
   The hashes of original messages should use 256 bit hashes. They could use SHA3-256 and be stored using urlsafe base64 encoding. Specifically, it is not secure to use 128 bit identifiers because the threat model is different. The DKIM style identifer is a one-to-many mapping, where the many values are preserved. Collisions in that context can be mitigated. The message source identifier is a one-to-one mapping, and so a collision would result in lost data, which is not acceptable.
   
   In summary: if the nonce option is preserved, 80 bit truncated cryptographically secure hash (CSH) identifiers may be suitable for generic messages but not for message sources. If the nonce option is removed, 128 bit CSH identifiers may be suitable for generic messages, but not for message sources. Only 256 bit CSH identifiers are suitable for message sources.
   
   I apologise that you were led astray by the _ignis fatuus_ of the `dkim` options. They are not user facing options. They are only there for:
   
   * The participants of this PR thread, to facilitate changing parameters if the initially submitted defaults are agreed to be unreasonable.
   * Advanced programmer users who have the ability to patch Ponymail code, would like to customise their identifiers, and can therefore use the `dkim` options to achieve what would have taken them more effort as programmers.
   
   The options can even be safely removed if `dkim` is standardised.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691187392






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on a change in pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on a change in pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#discussion_r479368516



##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)
+            if (head_subset is None) or (k_lower in head_subset):
+                keep = True
+                headers.append([k, v + end])
+            else:
+                keep = False
+    # The remaining suffix is the body
+    body = (suffix or b"").replace(b"\r\n", b"\n")
+    body = body.replace(b"\n", b"\r\n")
+
+    # Optional X-Archive-List-ID augmentation
+    if (archive_list_id is not None) and (archive_list_id not in list_ids):
+        xali_value = b" " + bytes(archive_list_id, "ascii")
+        headers.append([b"X-Archive-List-ID", xali_value])
+    # Optional nonce from local config
+    if config is not None:
+        if (config.has_section("archiver") and
+            config.has_option("archiver", "nonce")):
+            nonce = config.get("archiver", "nonce")
+            headers.append([b"X-Archive-Nonce", nonce])
+    # Optional head canonicalisation (DKIM relaxed)
+    if head_canon is True:
+        for i in range(len(headers)):
+            k, v = headers[i]
+            crlf = v.endswith(b"\r\n")
+            if crlf is True:
+                v = v[:-2]
+            v = v.replace(b"\r\n", b"")
+            v = v.replace(b"\t", b" ")
+            v = v.strip(b" ")
+            v = b" ".join(vv for vv in v.split(b" ") if vv)
+            if crlf is True:
+                v = v + b"\r\n"
+            headers[i] = [k.lower(), v]
+    # Optional body canonicalisation (DKIM simple)
+    if body_canon is True:
+        while body.endswith(b"\r\n\r\n"):
+            body = body[:-2]

Review comment:
       It looks like most of our mbox files have only LF line-ends, so I doubt this will ever be triggered.
   
   AFAICT mails arrive at the archiver with LF only as well

##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)

Review comment:
       The above line looks wrong, as does the comparison -- surely k_lower is bytes?
   
   There should ideally be some unit tests to detect such issues...




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673476509


   I don't think the ID should include the list name by default, I like it short and neat - makes life easier for people using links :)
   It could perhaps be an option to append, but I don't really see a need to always have the list ID in the permalink.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on a change in pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on a change in pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#discussion_r483526359



##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)
+            if (head_subset is None) or (k_lower in head_subset):
+                keep = True
+                headers.append([k, v + end])
+            else:
+                keep = False
+    # The remaining suffix is the body
+    body = (suffix or b"").replace(b"\r\n", b"\n")
+    body = body.replace(b"\n", b"\r\n")
+
+    # Optional X-Archive-List-ID augmentation
+    if (archive_list_id is not None) and (archive_list_id not in list_ids):
+        xali_value = b" " + bytes(archive_list_id, "ascii")
+        headers.append([b"X-Archive-List-ID", xali_value])
+    # Optional nonce from local config
+    if config is not None:
+        if (config.has_section("archiver") and
+            config.has_option("archiver", "nonce")):
+            nonce = config.get("archiver", "nonce")
+            headers.append([b"X-Archive-Nonce", nonce])
+    # Optional head canonicalisation (DKIM relaxed)
+    if head_canon is True:
+        for i in range(len(headers)):
+            k, v = headers[i]
+            crlf = v.endswith(b"\r\n")
+            if crlf is True:
+                v = v[:-2]
+            v = v.replace(b"\r\n", b"")
+            v = v.replace(b"\t", b" ")
+            v = v.strip(b" ")
+            v = b" ".join(vv for vv in v.split(b" ") if vv)
+            if crlf is True:
+                v = v + b"\r\n"

Review comment:
       Lines [58-62](https://github.com/apache/incubator-ponymail/pull/517/commits/b1643f7bedad52276a0977e6a9e497cedb9fe799#diff-29178abd2d43d3075e7721e865041930R58-R62) have the effect of normalising LF line endings to CRLF. The following message, for example, converts to the same output whether it uses LF or CRLF as separators:
   
   ```
   >>> from generators import rfc822_parse_dkim
   >>> rfc822_parse_dkim(b"To: You\nFrom: Me\n\nHello\n")
   ([[b'To', b' You\r\n'], [b'From', b' Me\r\n']], b'Hello\r\n')
   >>> rfc822_parse_dkim(b"To: You\r\nFrom: Me\r\n\r\nHello\r\n")
   ([[b'To', b' You\r\n'], [b'From', b' Me\r\n']], b'Hello\r\n')
   ```
   
   This normalisation is part of the RFC 6376 algorithm. In a revision of the code which I am preparing, this normalisation will be modularised, tested, and documented in such a way that it will be made much clearer.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679215118


   I agree that the generator looks very promising.
   However, I think it has some issues that need to be solved.
   
   For example, the current implementation includes the lid override in the hash.
   The override is not an intrinsic part of the source message, so unless the same override is used when reloading, a different id will be generated.
   This is going to cause problems if lists are renamed.
   e.g. if one imports an archive for dev@httpd.apache.org, re-importing using the lid '<de...@httpd.apache.org>' changes the ids even though it is for the same list.
   It is advisable to use the lid override, especially when importing older archives which may not have any list headers.
   
   It needs to be tested to see whether it properly distinguishes duplicates. Sometimes MTAs hiccup, and cause the same message to appear twice on a mailing list. These should have different ids.
   
   Does the generator work well with mails that don't have list headers?
   For example, aliases and older mail archives.
   
   Also the hash strength has yet to be decided; is the current length sufficient?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691515959


   Previous [arguments against](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679952122) including manual List-IDs in generated permalink IDs include:
   
   * There is only a single and unlikely scenario where backups of manual List-IDs would be required
   * Permalink IDs are an unreliable and *highly idiosyncratic* backup medium
   * Backups recording manual List-ID metadata would be compact, making it easy to create distributed replicas
   * Even if manual List-IDs were to be a part of URLs, they should not be part of their permalink ID portions
   * Separating manual List-IDs from permalink IDs would follow a widely established URL pattern
   
   Were these arguments overlooked?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691190787


   Perhaps we could accept this PR and make a dkim_mboxrd or some such additional generator that addresses this mbox issue.
   Leave the DKIM as is (I think it's strictly following the spec, and so making additional changes would diverge from that), but make a second option, pick a name for it, we could even make it the default. If it's not following DKIM specs, it shouldn't call itself that, is what I guess I'm trying to say.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672849648


   On closer examination, I see that the DKIM generator has several options as to how the hash is generated.
   
   This means that the generated hash will depend on which options are chosen, rather than just on the mail content.
   
   If for some reason emails have to be reloaded either in the same installation or another, it is vital that the same hashes are generated. 
   
   My conclusion is that options cannot be allowed if Permalinks are to be truly permanent.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679033012


   I am not aware of an RFC with the changes that GMail employs - in other words, I don't know what they do in addition to lowercasing the sender address, so I can't really say for sure what should happen.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672237119


   Isn't that just trading one potential risk (if you will) for another?
   If you keep *any* received headers in the permalink ID, you risk losing reproducibility in permalinks for emails that either don't have a list-id, or are sent via multiple lists.
   
   I think the DKIM list makes sense, and would trade the stable generated hash over whether or not you can debug ezmlm with it. 
   ezmlm's return path only applies if your email address is the direct recipient after the list was invoked. If it goes past that, you may lose it, and the index number will be lost anyway. I'm also unsure whether it's Pony Mail's job to debug whether postfix/etc received an email or not.
   
   Having said all that, what I could envision is for the next generation pony mail to use this generator for the permalink, but store multiple copies of the _source_, as they come in, and referencing them in a metadata table (or just storing the pibble in the source as well, but not use it for the document ID). When viewing the source, you could be presented with all existing options. Thus, you would have just the one copy of the email when viewing the list (which I think makes the most sense in any case), but multiple options when viewing the source.
   
   That is, to sum up, I like this generator and what it accomplishes - I think it looks far better than our previous/current solutions, and if not for this version, I'd be interested in pulling this PR into the next generation of pony mail.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-703113706


   If this PR were to be modified so that the lid is used in the input of the DKIM-ID algorithm, but only when it does not match the `List-Id` header value of an email, would that be sufficient for this PR to be merged? Or would there be further objections?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691010564


   We should consider canonicalising 'From ' lines in the message body.
   
   This is because it's not possible to ensure that an email imported from an mbox file will have 'From ' lines in its body correctly unprefixed such that it matches the original.
   
   I think the only way to handle this is to remove all the '>' prefixes before 'From ', i.e. something like
   
   s/\n>+From /\nFrom /
   
   The result may not be the same as the original email, but it would remove all instances of prefixing.
   
   Note: this is only intended to apply to the hash input, not the raw source as stored when archiving or importing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-693372710


   In the case of an OL Permalink which is missing from a rebuilt database, there are two possibilities:
   - The link contains the LID as a suffix, in which case the lid is obviously known.
   - The link does not contain a LID, in which case it is not needed.
   
   When rebuilding a database using OL links, any mails that don't have LID headers will need a suffix.
   If possible the original LIDs should be used, but unlike with the O-style links it is not essential.
   It suffices to use a best guess as to the original LID that was used.
   This is because the opaque hash part is unaffected by the chosen LID.
   The opaque hash prefix can be used to find a missing link; it is not essential to rebuild the full Permalink (though of course that is better).
   
   In the case of an O-style Permalink, in order to match the opaque hash, it is necessary to know the exact LID that was used.
   Of course if the LID was the same as in the headers that's not necessary, but in that case the O and OL Permalinks are the same anyway.
   If the LID is different from the headers, then it's not possible to regenerate the same hash unless the same LID is used.
   Given a missing Permalink, AFAICT it will be necessary to generate hashes for each mail with each possible LID, of which there may be hundreds.
   
   **Note**: I am assuming here that the O-style Permalink includes the lid in the hash, either as an existing header, or as an addition. *That is not actually the case for the amended PR, however as I indicated earlier that will need to be fixed.*
   **If the same message is posted to two different lists, the copies need to have different Permalinks**.
   If the O-style link does not include a LID in the hash, then some other means will need to be found to distinguish copies on other lists, for example by appending the LID in clear, which is the OL-style approach.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673468896


   @sebbASF 
   
   One nonce can be used for all messages archived by a host, but it must never be disclosed. It is more accurately called a [pepper](https://en.wikipedia.org/wiki/Pepper_(cryptography)), which is a secret salt. Once it is set in the configuration it does not have to be changed, but disclosure is catastrophic.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672927852


   Yes, you would have to change the options.
   However if different instances have different settings, then their hashes won't be the same.
   
   I see the DKIM hash being used as a more reliable Message-Id.
   This means it must be immutable.
   Changing the options is akin to changing a Message-Id.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687210087


   @sebbASF 
   
   The test that you linked to also produces equal hashes when modified to use the current Ponymail recommended `cluster` generator. The modified test is included below for independent verification. Given that `cluster` and `dkim` exhibit the same behaviour, why is `cluster` included and explicitly recommended but `dkim` regarded as unsuitable?
   
   ```
   generators:
     # Identical mails, different archivers
     # N.B. Should be different from below
     corpus/ponymail-dev-1079-mailarchiesao.mbox:
     corpus/ponymail-dev-1079-mboxvm.mbox:
       cluster:
       - index: 0
         message-id: <CA...@mail.gmail.com>
         generated: ra504b0e636a1ac3de84a3372744b5738bc6a9e34a28645cf94855d5f@<dev.ponymail.apache.org>
     # Identical mails, different archivers
     # N.B. Should be different from above
     corpus/ponymail-dev-1080-mailarchiesao.mbox:
     corpus/ponymail-dev-1080-mboxvm.mbox:
     corpus/ponymail-dev-1080-listsao.mbox:
       cluster:
       - index: 0
         message-id: <CA...@mail.gmail.com>
         generated: ra504b0e636a1ac3de84a3372744b5738bc6a9e34a28645cf94855d5f@<dev.ponymail.apache.org>
   ```
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692276883


   Discussion of DKIM-IDs with manual List-IDs appended should be moved to Issue #523.
   
   This thread is for discussion of DKIM-IDs as they appear in the present commit of this PR.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691187392






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672954409


   If different instance have different settings, then that is the problem of the person that set that up, not us, not the generator's fault.
   Having options make it easier for others to reuse the code in my view. I don't see it as a fault or a bad thing.
   
   As for the other comment, infinite values is *exactly the point* of the nonce as I read it. It is to prevent collision attacks from being possible. In my mind, this is a great addition to the generator. If I set up an instance with a nonce, only I can reasonably collide IDs (with a lot of effort still), no one else. If you don't want that extra security, don't set a nonce.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672852222


   Color me stupid, but...you would manually have to go in and change those settings to get a different result, would you not?
   Whether there are options set or not would then not matter, as you could change the behavior however you see fit in any case.
   What matters is that the defaults, and what is used, stay the same, IMHO.
   
   The nonce is the only option I could see anyone purposefully changing, and that's not defined in the generator itself, but in the pony mail config.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-703105355


   AFAICT, commit dfd18eb does *not* implement OL, at least not in the way that I defined it.
   The output from the generator does not depend on the LID, whereas my definition requires the LID to be appended if the LID differs from the List-Id in the message headers.
   
   If the generator output is also used as the database id, as is currently the case, only one message will be stored.
   That needs to be fixed, either by changing the generator output, or by changing the underlying design of Ponymail.
   
   This issue occurs where a message is sent to multiple lists and where the headers used by dkim are identical.
   I think this applies mainly to list aliases. However I think Ponymail should be able to handle any type of mail archives.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-686813208


   This PR affects various different aspects of generation:
   - the fields used to build the hash source
   - how many bits of hash are generated
   - how the hash is presented for use.
   
   In order to determine its suitability, it first needs to be decided whether the hash is to be used for Permalinks and database ids or just Permalinks, as this will affect the fields that must be taken into account.
   
   If it is to be used only as a Permalink, then it does not matter if a few non-identical messages generate the same hash, so long as the system can display all matches. Obviously the chance of any duplicates should be minimised. I think the current design of the PR largely fulfils that requirement. [There are some edge cases where a single message is sent to a list twice that still need to be investigated. This can occur anywhere between the originator and the mailing list software e.g. if a retransmission occurs (I hope to add some examples to the test corpus soon). There are also some lists with aliases and mails have been copied to the alias. These appear as a separate emails on the list.]
   
   However if it is also used as a database id, then the hash must be unique for different messages.
   
   As to the hash presentation, shorter is better for Permalinks, and the charset is important.
   However for the internal database id those are not really a concern.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673298592


   @sbp This sounds like we have two options here then:
   1) pibble with 80 bits if nonce is set, 128 bits if no nonce?
   2) always use 128 bits for pibbling?
   
   As for using the complete 256 bit CSH for sources, I am fully on board there, but I think this is best saved for the next gen. That doesn't mean the dkim generator isn't suitable for this gen (from what I can tell, it is far superior for clustered environments compared to any other algorithm), but rather that we'd save using the 256 bit CSH for sources for the next generation. Does that make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691010564






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-720445607


   @sebbASF Would it be acceptable to insert the lid into the DKIM-ID input in case of a mismatch instead of appending it to the DKIM-ID output? Thirty days have elapsed since this thread was active.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691660802


   Mails where we use a list override (that is, any mail that arrives to an archiver with --lid $something) should have the list ID appended to the generated ID as you proposed. All emails that do not go through a list override should not need to have anything appended. Does this satisfy you both? :) It would satisfy my requirements.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673462395


   I mean that the Permalink should include the list id, as it does at present. For example: aabbcc@<lid>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-727569901


   @sbp I find it acceptable, I think a solution here is to implement that as you wish, and if someone else wants to make an alternate DKIM generator that appends to the ID, then they can make that a reality at a later point in time.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-674584890


   If an  externally provided list-id is included in the hash, then the has will change if the lid changes.
   Suppose there is an mbox to be imported.
   If the individual messages all have list-ids there is no need to specify the lid on the command-line.
   However if you do provide one (which is advisable in case there are any missing ids), then the hashes will be different.
   
   It is vital that the hash depends *only* on the message source (possibly plus a fixed pepper), otherwise reloading the messages may generate a different hash and thus Permalink.
   
   As to characters such as @&lt;&gt; being a problem, I agree, but there are other characters that could be used to separate the lid from the hash. For example one could use hash_dev.ponymail.apache.org.
   
   I agree that it would be nice to have shorter links, but that should not be at the cost of unstable Permalinks.
   Now it already looks like it should be possible to use a shorter hash by using base64 so a Permalink of the form:
   
   r3358b63557d3a40e179f6ca498a38b9aaf0b2532aba48bfc03c7a1a0@&lt;dev.project.apache.org&gt;
   might become
   MTIzNDU2Nzg5MDEyMzQ1Ng_dev.project.apache.org
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692046481


   Obviously it is better to have backups that simplify and speed up restoration.
   My point is that with the appropriate Permalink design all is not lost even if such backups are *not* available.
   
   ==
   
   I still don't understand your point about de-duplication.
   Perhaps you can provide some examples of such emails?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679033211


   (furthermore, we shouldn't be beholden to proprietary changes outside RFCs)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687236055


   I do not personally recommend the cluster generator.
   Its output depends on the parsed message and how attachments are processed.
   This is not stable over time.
   
   The dkim generator is better than any existing one we have seen so far.
   It's more a question of whether it can be improved further.
   If not, so be it, but I think this needs to be determined before proceeding with what may be a temporary solution.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692261291


   The `Received` headers are the same.
   
   There is no `Delivered-To` header. It is not a standard header.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679282204


   As for `Does the generator work well with mails that don't have list headers?`, I see no indication that it would make any difference if a header is present or not. 
   
   The pibble length can be debated, but we have to balance the requirement of the administrators and that of the end users.
   The current length is good enough that random collisions are unlikely to happen (1:10^24 chance of that).
   It is short enough that people will use the permalinks and not use a URL shortener.
   I'm not knowledgeable enough to know about the precise preimage attack chances here. 
   
   I am however satisfied that this generator works as intended, and improves upon what we have to offer. If that means going from 16 bytes in the pibble to 24, for added preimage security, I'd be okay with that. What I would dislike is doing a full length digest, as that discourages using permalinks in their raw form, which is counterproductive for permalinks.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692102834


   If an email is sent with the addresses of two or more mailing lists in the `To` headers, then Ponymail will store identical copies of that email in `mbox_source` except that the `List-ID` headers will be different. This has been tested with Mailman as the mailing list software.
   
   Since the discussion topic is the case where `List-ID` is for some reason not added, and since the rest of the `mbox_source` values are identical apart from that header, the values would be exact duplicates if `List-ID` were not added.
   
   As for examples, any email with two or more mailing lists in the `To` header should suffice. The test did not involve any special conditions or setup, just writing a regular mailing list message in a regular email client. Nor was there any Mailman or Ponymail configuration that would affect the test.
   
   This is not the only possible example scenario, as mentioned.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691976137


   Seems to me if no-one has a copy of the permalink then it does not matter as  much if it is lost.
   
   Whereas the other way round, if there is a known permalink, but it is not in the database, it is vital to be able to reconstruct it.
   
   If the link contains a lid suffix, then it should be possible to find the mail by re-loading all the relevant archives.
   
   If there is no separate lid suffix in the missing Permalink, then the entire corpus may need to be reloaded unless there is some context that helps narrow down the search (e.g. dates, possible list names).
   
   The advantage of the above approach is that there is no need for additional backup data beyond the original archives.
   Indeed it should be possible to use sibling archives recorded by other subscribers to the list.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691515959






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691187392






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691010564






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672167623


   Thanks very much. This looks very good.
   
   However I think the hash won't be sufficiently unique, given that PonyMail uses the hash both for the Permalink and for storing the copy of the email.
   
   It is possible that two separate messages sent to a mailing list can end up with the same hash.
   AFAICT, the sender has control over all the fields that are checked (assuming the MTA does not generate DKIM headers), so can send mails that are duplicates as far as the DKIM calculation is concerned. 
   
   But list software such as ezmlm generally does not care what an email contains and would treat this as a new message, so would generate a new sequence number for the message. This could result in gaps in the PonyMail record, since that uses the hash as a unique id for the message.
   
   There would be no way to know what caused the missing sequence number -- was it such a duplicate, or did the email get lost in transit? -- without access to a separate archive.
   
   I think this problem can be avoided by including some of the last received headers in the hash. The ones after any List-xxx headers added by the list software will relate to the journey the mail took to reach the mail server, so should not be affected by subsequent deliveries to mail archivers.
   
   Note: this is only an issue because PonyMail uses the same hash for Permalinks and the database id.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672255008


   They would have the same pibble, but different SHA3 if the SHA3 is done using the full message source. What headers are/aren't in the source wouldn't matter, as it would refer to the same pibble at any time because of how the DKIM generator works.
   
   As for the return path you mention, that is specific to a specific installation. I don't think we should allow/deny PRs based on just that. I for one don't get such return paths in my mbox source, as it goes through another alias before it hits my inbox - the same could be true for the archiver, in which case it wouldn't matter what the original return path was.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691515959


   Previous [arguments against](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679952122) including manual List-IDs in generated permalink IDs include:
   
   * There is only a single and unlikely scenario where backups of manual List-IDs would be required
   * Permalink IDs are an unreliable and *highly idiosyncratic* backup medium
   * Backups recording manual List-ID metadata would be compact, making it easy to create distributed replicas
   * Even if manual List-IDs were to be a part of URLs, they should not be part of their permalink ID portions
   * Separating manual List-IDs from permalink IDs would follow a widely established URL pattern
   
   Were these arguments overlooked?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-693333242


   In **Style OL** it is not possible to "regenerate Permalinks from just the mail sources" because the LID must still be known to obtain the suffix of the full Permalink. It is only possible to regenerate a prefix of the Permalink. Therefore the LIDs must still be known even in **Style OL**, to obtain the full Permalink. This disproves the supposed advantage of **Style OL**.  And since the LIDs must be known even in **Style OL**, **Style O** is always possible too. Therefore **Style OL** is an inferior variant of **Style O**.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-703115457


   I have no issue with the code in dkim-id.py per se.
   
   However the dkimid method in generators.py needs to take note of the LID where it differs from the List-ID in the headers.
   For example by appending it to the returned id.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679287598


   @sbp what are your thoughts on the pibble length? Is it safe as is, does it need to be longer?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on a change in pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on a change in pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#discussion_r482187709



##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)
+            if (head_subset is None) or (k_lower in head_subset):
+                keep = True
+                headers.append([k, v + end])
+            else:
+                keep = False
+    # The remaining suffix is the body
+    body = (suffix or b"").replace(b"\r\n", b"\n")
+    body = body.replace(b"\n", b"\r\n")
+
+    # Optional X-Archive-List-ID augmentation
+    if (archive_list_id is not None) and (archive_list_id not in list_ids):
+        xali_value = b" " + bytes(archive_list_id, "ascii")
+        headers.append([b"X-Archive-List-ID", xali_value])
+    # Optional nonce from local config
+    if config is not None:
+        if (config.has_section("archiver") and
+            config.has_option("archiver", "nonce")):
+            nonce = config.get("archiver", "nonce")
+            headers.append([b"X-Archive-Nonce", nonce])
+    # Optional head canonicalisation (DKIM relaxed)
+    if head_canon is True:
+        for i in range(len(headers)):
+            k, v = headers[i]
+            crlf = v.endswith(b"\r\n")
+            if crlf is True:
+                v = v[:-2]
+            v = v.replace(b"\r\n", b"")
+            v = v.replace(b"\t", b" ")
+            v = v.strip(b" ")
+            v = b" ".join(vv for vv in v.split(b" ") if vv)
+            if crlf is True:
+                v = v + b"\r\n"

Review comment:
       Whilst the standard requires the trailing CRLF to be kept, I'm not sure that is the best approach for a Permalink.
   
   This is because many/most of the mbox files use LF rather than CRLF, so they will end up without any line terminator.
   
   I think it would be better to strip all terminators.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh edited a comment on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh edited a comment on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673476509


   I don't think the ID should include the list name by default, I like it short and neat - makes life easier for people using links :)
   It could perhaps be an option to append, but I don't really see a need to always have the list ID in the permalink.
   having a long permalink leads to a worse user experience, and using <@> chars etc often leads to encoding bugs. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672257012


   Email A: pibble is abcdefg1234, SHA3 of full message is 123412341234
   Email B: pibble is still abcdefg1234, SHA3 of full message is 432143214321
   Both have the same pibble, both could show up as variants when you went to look at the source.
   
   When importing from a third source (email C) into a DB from scratch, pibble would again be abcdefg1234, and SHA3 could perhaps be 111222333444, it wouldn't matter as the pibble metadata is the same, so a search would find it in the DB.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673405041


   @sbp 
   In the case of the nonce, does the additional security rely on using a variable nonce, or would a fixed nonce be sufficient?
   
   ==
   
   Not all emails in archives have List- headers.
   For example email aliases, and early emails before list managers were common.
   I think this means that the Permalink must contain the list id separately from the message hash


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-671903245


   Please note that this PR also adds an extra parameter passing the bytes of the original message to the `compute_updates` and `generator` functions in `archiver.py` and `generators.py` respectively. This is because the DKIM style ID generator works on the original bytes, and recovering these from `msg.as_bytes()` is not only slower but also more fragile in that the behaviour of that method depends on the upstream implementation in the Python standard library.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687153531


   Assuming the emails are the same with all the same elements according to the DKIM rules, I disagree that it's essential with two IDs - I think they should be the same if the origin email is the same unless it's sent to different lists. I believe de-duplicating this is the best choice for the end user.
   
   I'm not interested in duplicates of the same message in any place other than sources for "cyber-archeological digging".
   To me this is a matter of choice, I don't see it as a technically broken or incorrect measure. 
   The return path should not be a part of this algorithm.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-703070465


   The lid is not *an integral* part of the permalink. One may also say that the lid is not part of the permalink at all, if only the mandatory parts of a link intended for permanence are defined as "the permalink", but that is mere terminological choice.
   
   Commit dfd18eb implements **OL** in the presence of lid UI disambiguation. Are there any remaining objections to merging this pull request? There are no review comments on the code in the most recent commit, a month later.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672932799


   The optional nonce makes things worse as there are effectively infinite values it can take.
   At least with boolean options it would be possible to generate all the different hashes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-694796709


   The original PR included the LID in the hash. This should ensure that the target is an email on the correct list
   
   The current PR -- dfd18eb -- does not include the LID in either the hash or as a suffix.
   As such, the same hash may be generated for mails on multiple lists.
   A Permalink needs to lead directly to the correct list; it should not be necessary to select a list id from a list of options.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691655391


   > DKIM-IDs as of commit [dfd18eb](https://github.com/apache/incubator-ponymail/commit/dfd18ebc6bcb61a6130710114a58b6c64a654766) do not include data that is not present in the email source.
   
   I know.
   
   However I think that means mails without List headers won't be unique if they were sent to multiple such lists.
   For example mail aliases. Many early lists did not have standard list headers.
   
   Therefore I think the lid will either need to be appended or added back to the hash input.
   Does that make sense?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687257317


   @sebbASF 
   
   There are several improvements in the pipeline, including:
   
   * highly modularised code
   * simplified algorithms
   * better RFC compliance
   * a better reference model
   * type annotations for static verification
   * unit tests
   * property checks
   * extensive documentation
   * greater collision and preimage resistance of the output
   * removal of unnecessary features
   
   These improvements are very nearly complete, waiting to be pushed.
   
   The `dkim` generator can therefore indeed be improved, but if there are any further suggestions as to what could be better about the approach or the algorithm then it would be helpful if they were noted in this thread.
   
   The aim of the improvements is partly to make the algorithm less idiosyncratic, transforming it into a consistent method which *could hypothetically* be written up as an RFC or other standalone specification for canonicalised message identifiers.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691993650


   What if somebody has a permalink that the administrator doesn't know about? Is it vital to be able to reconstruct that?
   
   The advantage of including manual List-IDs is that it requires no additional backup. But the disadvantages are that it is unreliable, and that it does not remove duplicates across lists, which is what DKIM-ID was designed to do in the first place.
   
   Why recommend an unreliable backup strategy that forces *users* to bear the burden of storage when there is a reliable alternative that keeps that responsibility with the archive administrator?
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679033755


   I am inclined to merge this PR with the fixes required to do so. We can then add a longer version of the output ID for those that wish to include the list ID in the generated document ID.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691460489


   I think that would defeat the purpose of shortening the hash, I am not in favor of it.
   Perhaps a compromise could be to again have two generators, a dkim and a dkim_long, with the _long adding the list ID to the end.
   Speaking as both a user and freelance administrator of some lists, and having spoken to other users, there is a strong interest in as short an ID as feasible. Some might prefer a longer one if the list ID suffix helps with recovery, but that should be up to whomever runs and uses the system.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-703075274


   I think we should merge. We can work out differences in opinion later on. I'm leaning towards having two generators and leaving it up to the end user to make a decision. Can you adjust the PR so it can be merged?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679952122


   @sebbASF
   
   It is not yet documented why the command line list ID would need to be present in the permalink. Am I right in thinking that the following is the only use case?
   
   Consider an mbox archive whose messages contain no list IDs in common with the command line list ID imposed by the administrator. All of its messages in Ponymail are later lost, but the original mbox archive file is still available. Since the messages in Ponymail were lost, the command line list ID is also lost. But since the command line list ID was present in permalinks, if a user of the list has that any permalink available to them then the command line list ID can be recovered.
   
   I can think of no other use case.
   
   There are far better data recovery strategies available. One could, for example, maintain a mapping of command line list IDs to any individual DKIM IDs only contained within that list. This is suitable in the case where an entire archive is expected to be recovered. Such a mapping file would be extremely small, on the order of KiB, and would therefore be easily replicated across many systems.
   
   If only individual messages are expected to be recovered, then the mapping of command line list IDs to all DKIM IDs would be necessary. This would only require storing sixteen bytes for every email in the system, so even an archive with a million emails would only require a mapping file of about 15 MiB.
   
   Even in the original suboptimal strategy, it is not necessary to make the command line list ID a mandatory part of a permalink. It could instead be made optional, like labels used in Amazon URLs, some weblog software, and on some news sites, as the following examples demonstrate:
   
   ```
   https://www.amazon.com/Apache-Definitive-Guide-Ben-Laurie/dp/0596002033
   https://www.amazon.com/Anything-Can-Go-Here/dp/0596002033
   https://www.amazon.com/dp/0596002033
   
   https://lobste.rs/s/j7p2ow/what_are_you_doing_this_week
   https://lobste.rs/s/j7p2ow/anything_can_go_here
   https://lobste.rs/s/j7p2ow
   
   https://www.reuters.com/article/apache-moves-on-traffic-server-machine-learning-projects-idUS57202199920100504
   https://www.reuters.com/article/anything-can-go-here-idUS57202199920100504
   https://www.reuters.com/article/idUS57202199920100504
   ```
   
   Amazon and Reuters use an infix pattern, whereas Lobsters uses a suffix pattern. Users could strip the Ponymail list ID, whether command line or archive metadata derived, from the permalink:
   
   ```
   https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/dev.project.apache.org
   https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng/anything.can.go.here
   https://lists.apache.org/thread/MTIzNDU2Nzg5MDEyMzQ1Ng
   ```
   
   Or if the malleability of `anything.can.go.here` is undesirable, the UI software could ensure that the message actually appears in the list ID in the optional part of the URL. But I think that, as @rbowen noted, the first thing any user wants to do with a URL that's too long to easily share is to shorten it, either by taking out optional components or by submitting it to a link shortener.
   
   *Links are themselves UI, and they ought to be designed in a user friendly way.*
   
   Links which are too long are not user friendly, and this is why sites use IDs like `0596002033`, `j7p2ow`, and `idUS57202199920100504`, to recapitulate the actual examples mentioned above. They don't use mandatory IDs like `MTIzNDU2Nzg5MDEyMzQ1Ng_dev.project.apache.org`. Even `MTIzNDU2Nzg5MDEyMzQ1Ng` could be regarded as too long, but unlike Amazon, Lobsters, and Reuters we have the constraint that we would like to be able to generate the ID again from the content, which means using a hash, which means considering the hash security; and indeed I provided an informal analysis earlier in this thread.
   
   I would very much like wider review and more discussion of this pull request. I notice, however, that the 40 or so messages, from four contributors, currently in this thread compares rather favourably to the following number of messages in the threads of all previous PRs on Ponymail:
   
   **0, 0, 0, 5, 2, 3, 0, 2, 1, 0, 3, 3, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 2, 0, 2, 2, 1, 0, 3, 2, 10, 4**
   
   Combined, this is 54 messages across every single PR, merged or unmerged. I count that 17 out of 35 PRs were merged. I also counted the number of participants in the threads of *only the merged* PRs, giving the following figures:
   
   **1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 3, 3, 2, 3, 2, 2**
   
   I notice, therefore, that the current PR already far exceeds the amount of review of all existing PRs, almost surpassing their combined number of messages, and that the number of contributors to the thread already surpasses that of every existing merged PR.
   
   Despite this, I repeat the call for wider review. Clearly this is a substantial contribution, and many of the prior PRs were trivial. I would especially like, for example, somebody to audit the behaviour of my algorithm vs the reference algorithm in the `dkimpy` package, and to provide a more formal analysis of the security parameters of the hash.
   
   It is also clear that this PR needs to be modified before it can be accepted. As I understand it, the following modifications could aid consensus:
   
   * The hash encoding could be converted to base64
   * The hash digest length could be 128 bits, encoded as 22 characters
   * The pepper mechanism should be removed
   * The command line List ID should not be added to the message before hashing
   * The algorithm could potentially also be renamed
   
   It would also be useful if objecting participants would *concisely* state all of their remaining objections.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on a change in pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on a change in pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#discussion_r468523637



##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)
+            if (head_subset is None) or (k_lower in head_subset):
+                keep = True
+                headers.append([k, v + end])
+            else:
+                keep = False
+    # The remaining suffix is the body
+    body = (suffix or b"").replace(b"\r\n", b"\n")
+    body = body.replace(b"\n", b"\r\n")
+
+    # Optional X-Archive-List-ID augmentation
+    if (archive_list_id is not None) and (archive_list_id not in list_ids):
+        xali_value = b" " + bytes(archive_list_id, "ascii")
+        headers.append([b"X-Archive-List-ID", xali_value])
+    # Optional nonce from local config
+    if config is not None:
+        if (config.has_section("archiver") and
+            config.has_option("archiver", "nonce")):
+            nonce = config.get("archiver", "nonce")
+            headers.append([b"X-Archive-Nonce", nonce])
+    # Optional head canonicalisation (DKIM relaxed)
+    if head_canon is True:
+        for i in range(len(headers)):
+            k, v = headers[i]
+            crlf = v.endswith(b"\r\n")
+            if crlf is True:
+                v = v[:-2]
+            v = v.replace(b"\r\n", b"")
+            v = v.replace(b"\t", b" ")
+            v = v.strip(b" ")
+            v = b" ".join(vv for vv in v.split(b" ") if vv)

Review comment:
       NOTE: Used `--amend` to fix final `v` to `vv` here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on a change in pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on a change in pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#discussion_r483526906



##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)
+            if (head_subset is None) or (k_lower in head_subset):
+                keep = True
+                headers.append([k, v + end])
+            else:
+                keep = False
+    # The remaining suffix is the body
+    body = (suffix or b"").replace(b"\r\n", b"\n")
+    body = body.replace(b"\n", b"\r\n")
+
+    # Optional X-Archive-List-ID augmentation
+    if (archive_list_id is not None) and (archive_list_id not in list_ids):
+        xali_value = b" " + bytes(archive_list_id, "ascii")
+        headers.append([b"X-Archive-List-ID", xali_value])
+    # Optional nonce from local config
+    if config is not None:
+        if (config.has_section("archiver") and
+            config.has_option("archiver", "nonce")):
+            nonce = config.get("archiver", "nonce")
+            headers.append([b"X-Archive-Nonce", nonce])
+    # Optional head canonicalisation (DKIM relaxed)
+    if head_canon is True:
+        for i in range(len(headers)):
+            k, v = headers[i]
+            crlf = v.endswith(b"\r\n")
+            if crlf is True:
+                v = v[:-2]
+            v = v.replace(b"\r\n", b"")
+            v = v.replace(b"\t", b" ")
+            v = v.strip(b" ")
+            v = b" ".join(vv for vv in v.split(b" ") if vv)
+            if crlf is True:
+                v = v + b"\r\n"
+            headers[i] = [k.lower(), v]
+    # Optional body canonicalisation (DKIM simple)
+    if body_canon is True:
+        while body.endswith(b"\r\n\r\n"):
+            body = body[:-2]

Review comment:
       [As noted elsewhere](https://github.com/apache/incubator-ponymail/pull/517/files/1c720707b138fa112c58e6a9773b582aa5961bf2#r483526359), LF is normalised to CRLF and so this code is very likely to be evaluated.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691953754


   Imagine we have a small mailing list archive in an mbox file called `ancient.mbox`. This mbox archive contains three emails, none of which contain a List-ID header. We import it using the manual command line List-ID `alt.small.archive`, and use the DKIM-ID generator that appends the manual List-ID. The permalinks are like this:
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj_alt.small.archive
   https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct_alt.small.archive
   https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v_alt.small.archive
   ```
   
   Now the catastrophic scenario happens, and the Ponymail database is lost! But we still have `ancient.mbox`, and we want to restore its three emails back into the `lists.example.org` Ponymail instance. How do we know what manual List-ID to use? The three emails do not include a List-ID. We made up `alt.small.archive`, but we did not record this fact and we no longer remember it. How do we get the List-ID?
   
   The idea behind putting `_alt.small.archive` in the permalinks is that now we can send a plea to our users asking: "does anybody have any links to an email that was in the database I just lost?", or use a search engine to try to find such permalinks.
   
   There are a couple of problems with this:
   
   * If the archive is small, what if *nobody, including search engines*, ever recorded those links? Our mailing list archives may be unpopular, private, or hidden from search engines using `robots.txt`.
   * Whom do we consult to find those permalinks? In other words, how do we even know who our users are? For sites that have a community around them, there may be a straightforward answer to this. But there is not a *general answer* to this.
   
   Therefore, the only reason to put a manual List-ID in permalinks is to support an **unreliable backup strategy**. The strategy is unreliable because it depends on arbitrary users to retain copies of the permalinks that can then be consulted to restore the data in the case of catastrophic database loss. And an unreliable backup strategy is made **unacceptable** when there is a reliable alternative.
   
   One alternative is that we can just rename `archive.mbox` to `alt.small.archive.mbox` when we import it. Or we can record the hash of `archive.mbox` into a file called `alt.small.archive.mbox-sha3` and keep it alongside `archive.mbox`. But those approaches have drawbacks too, e.g. if we obtain an mbox file which is differently ordered.
   
   Instead, here is a reliable alternative:
   
   Imagine we import our three emails from `ancient.mbox`, but this time *without* a manual List-ID in the permalinks. The permalinks are like this:
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
   https://lists.example.org/t/jnb2msdg3j6of3dco2bw5fct
   https://lists.example.org/t/oxbrv2q2wbd23vfz5ttfcs7v
   ```
   
   When we performed this import, we generated the three DKIM-IDs `bnbqz6hb4gplpvtz7zmlhymj`, `jnb2msdg3j6of3dco2bw5fct`, and `oxbrv2q2wbd23vfz5ttfcs7v`. These are each encodings of 16 bytes, for a total of 48 bytes. In general this is `16 * n` bytes, where `n` is the number of emails imported. We store these bytes in a file called `alt.small.archive.dkim-ids`.
   
   We now perform *standard backup procedures* for `alt.small.archive.dkim-ids`. We replicate it across environments, storing as many copies as possible in different geographic locations using different setups. This is easy to do because the file is only *48 bytes* long. We only need to store *48 bytes*, several times, to have a reliable backup of our manual List-ID. We can even include the manual List-ID plus line feed at the start of the file, so that we're not relying on the filename itself.
   
   Does this strategy scale? Consider a very large mailing list that has a million emails in it. The manual List-ID `.dkim-ids` backup file for such a list would be `16 * n` or `16 * 1,000,000` or `16,000,000` bytes long. This is only `15.2 MiB`. As of 2020 it is trivial to widely and reliably replicate fifteen mebibytes for backup purposes.
   
   What are the problems with this strategy? Unlike the unreliable and unacceptable backup strategy described above, it does *not* rely on arbitrary users or search engines to backup our data for us. It does *not* lead to the problem of wondering who to consult to restore that data. It follows established, standard industry practices for backing up our manual List-IDs, instead of the existing *ad hoc* and *idiosyncratic* method.
   
   For that reason, I could never recommend the strategy where manual List-IDs are part of the permalinks. I could never recommend that people use it as their backup strategy, because this superior strategy is available instead and it ticks all the boxes.
   
   It is, however, sometimes necessary to include the manual List-ID in the URL somewhere for UI purposes. Consider the email `bnbqz6hb4gplpvtz7zmlhymj` above. Let's say that as well as appearing in our `ancient.mbox` archive it was also sent to five other mailing lists, for a total of six, all of which are in the `lists.example.org` Ponymail instance. What should the UI say if a user visits the following address?
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
   ```
   
   It doesn't know which List-ID to present to the user. In fact, in Ponymail and in Foal it doesn't even *retain the information* that this message was sent to six mailing lists if the DKIM-ID is the primary permalink, which is necessary in Ponymail if DKIM-ID is used at all, and is not necessary in Foal but it still possible. 
   
   This would not be a problem if the manual List-ID were part of the permalink, but it solves one problem and causes another. DKIM-IDs were designed to deduplicate emails. If List-IDs are part of the DKIM-ID permalink, this means we would have to store *six* copies of the `bnbqz6hb4gplpvtz7zmlhymj` metadata, and *six* copies of its source too. But DKIM-IDs were explicitly designed to prevent this. Therefore we should solve the problem of retaining manual List-IDs another way. If we added them to the hash input of DKIM-IDs then we would lose our reliable backup strategy presented above.
   
   Thankfully there is a simple solution.
   
   In Foal commit <code>[178b729](https://github.com/apache/incubator-ponymail-foal/commit/178b729b9084a83034c0a87f150f23fd2ca48291)</code>, the multiple ID generators feature was added with the field `permalinks`. To support multiple manual List-IDs for a DKIM-ID identified message, all that would be required is to have an analogous field called `lids` for an array of List-IDs, just like `permalinks` is an array of generated permalink IDs.
   
   Then, if the user browses the URL above:
   
   ```
   https://lists.example.org/t/bnbqz6hb4gplpvtz7zmlhymj
   ```
   
   They can be presented with a list showing all List-IDs that this message belongs to, and the option to display the message *in its context in those lists*, specialising its UI. Or, they can still browse a version that contains the List-ID:
   
   ```
   https://lists.example.org/alt.small.archive/t/bnbqz6hb4gplpvtz7zmlhymj
   ```
   
   But, importantly, `alt.small.archive` is not part of the DKIM-ID here. This means that messages are still deduplicated even when they appear in multiple mailing lists.
   
   To argue that manual List-IDs should be part of DKIM-IDs would remove *all* of the above. In particular:
   
   * Metadata and sources would be duplicated across mailing lists
   * Showing what lists a DKIM-ID appears in would require a prefix search incompatible with elasticsearch keyword arguments
   * Ponymail administrators would be induced to just rely on their users to backup the permalinks in case of catastrophic data loss instead of performing the reliable backup method described above
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-702817136


   I agree that choosing a different LID when re-importing will affect a full OL permalink.
   However, assuming that the missing message has been reloaded somewhere, it will have the same opaque part, and it is fairly simple to match abcd_lid1 against abcd_lid2.
   
   Of course, if the original LIDs are known then this won't be an issue.
   
   The point is that an OL-style Permalink can be recovered without needing to keep a database of the imports.
   
   ===
   
   I don't know what "if the appended lid in OL were not an integral part of the permalink" means.
   The OL is the variable part of the permalink, and the appended lid is part of the OL, so I don't see how the lid can be "not part of the permalink".


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] rbowen commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
rbowen commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679135455


   I, for one, love having the ID be shorter. The current permalinks are long, cumbersome, and frequently get broken by linebreaks in mail clients, leading me to use link shorteners almost every time I need to use a Ponymail permalink - which kinda defeats the purpose.
   
   I expect I'm missing something but I don't understand the strenuous resistance to this improvement - and it is an improvement.
   
   Definitely +1 to the change.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672260461


   Furthermore, blue-skying here, this could be made backwards compatible with older databases easily.
   For all new sources, store the source document with a pibble field inside (next to the source field).
   For all old sources, the doc ID is the permalink.
   
   Thus, to figure out if we need to access directly via permalink ID or via a pibble keyword search, we'd just assess whether the ID of the email being accessed is a pibble or not.
   
   Things to consider for later:
   - Not accessing documents directly via their ID has performance costs - I don't know how significant these are. For general single-source viewing, this should not be much if any. For generating mbox files, it will need some careful thought.
   - Viewing source via source.(lua|py|whatever) should only yield one result. There should perhaps be an X-Source-Alternate header if an alternate source exists in the DB, so you can look this up if debugging why an email is presumed missing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691576607


   DKIM-IDs as of commit dfd18eb do not include data that is not present in the email source.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692167013


   Sorry, but I am not convinced. I need to see actual data.
   
   I have been looking for examples in the ASF corpus, but have yet to find one where the list-ids are not added.
   However, if the list-id headers added by ezmlm are ignored, the mails are still not identical.
   See for example:
   https://github.com/apache/incubator-ponymail-unit-tests/commit/be029d9c17a26876685125c5f95d0d87a64bb928
   The Received and Delivered-To lines are different. I don't see how that can fail to be the case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692002181


   No, if the link is not known, then clearly it does not have to be reconstructed at that time.
   
   However if all the archives are available, they can all be reloaded.
   If a different lid suffix is used, that does not matter because the hash prefix will still be the same, so if the Permalink turns up later it will have the same prefix and can thus be found.
   
   ==
   
   I don't understand the part about DKIM and de-duplication across lists.
   I thought the hash included the List-id etc as input, so it will only de-duplicate mails sent to lists without such headers?
   
   I don't think it's possible to share message content across lists, because the sources will have different headers.
   This is true even if no List-* headers are involved.
   
   The current design does share attachments, but there is no attempt to share any other parts of messages. Doing so would require a redesign, and I'm not sure it would be worth the extra complication and house-keeping.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691556686


   Whilst Message-Ids are far from ideal, one big advantage is that a Permalink using them (e.g. mod_mbox) is readily recoverable.
   
   That is not the case for Permalinks that include data that is not present in the email source.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on a change in pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on a change in pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#discussion_r483522248



##########
File path: tools/generators.py
##########
@@ -19,14 +19,137 @@
 This file contains the various ID generators for Pony Mail's archivers.
 """
 
+import base64
 import hashlib
 import email.utils
 import time
 import re
 
+# For optional nonce
+config = None
+
+# Headers from RFC 4871, the precursor to RFC 6376
+rfc4871_subset = {
+    b"from", b"sender", b"reply-to", b"subject", b"date",
+    b"message-id", b"to", b"cc", b"mime-version", b"content-type",
+    b"content-transfer-encoding", b"content-id",
+    b"content-description", b"resent-date", b"resent-from",
+    b"resent-sender", b"resent-to", b"resent-cc",
+    b"resent-message-id", b"in-reply-to", b"references", b"list-id",
+    b"list-help", b"list-unsubscribe", b"list-subscribe",
+    b"list-post", b"list-owner", b"list-archive", b"dkim-signature"
+}
+
+# Authenticity headers from RFC 8617
+rfc4871_and_rfc8617_subset = rfc4871_subset | {
+    b"arc-authentication-results", b"arc-message-signature",
+    b"arc-seal"
+}
+
+def rfc822_parse_dkim(suffix,
+        head_canon = False, body_canon = False,
+        head_subset = None, archive_list_id = None):
+    headers = []
+    keep = True
+    list_ids = set()
+
+    while suffix:
+        # Edge case: headers don't end LF (add LF)
+        line, suffix = (suffix.split(b"\n", 1) + [None])[:2]
+        if line in {b"\r", b"", None}:
+            break
+        lf = line.endswith(b"\r") and (suffix is not None)
+        end = b"\n" if lf else b"\r\n"
+        if line[0] in {0x09, 0x20}:
+            # Edge case: starts with a continuation (treat like From)
+            if headers and (keep is True):
+                headers[-1][1] += line + end
+        elif not line.startswith(b"From "):
+            # Edge case: header start contains no colon (use whole line)
+            # "A field-name MUST be contained on one line." (RFC 822 B.2)
+            k, v = (line.split(b":", 1) + [b""])[:2]
+            k_lower = k.lower()
+            if k_lower == "list-id":
+                list_ids.add(k_lower)

Review comment:
       I plan to add type annotations, unit tests, and property checks to guard against this sort of error. This specific `list-id` feature is going to be removed anyway as a consequence of the discussion thread.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691349802


   On further reflection I think it's not worth normalising the From lines fully.
   The most common case would be covered.
   
   However, I think it's vital that we do not implement the a new generator until we are sure that all issues have been addressed.
   Every change to the algorithm reduces Permalink stability.
   
   The latest changes to the PR dropped the LID from the hash, so the lid needs to be provided separately in order to create a unique Permalink.
   
   One way to do this would be to append the lid as below:
   
   p8xtmy67z162x8827pk3hy3czn_dev.ponymail.apache.org
   
   There is no need for the enclosing < and >, and _ is better than @ as a separator.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687120076


   The following test:
   https://github.com/apache/incubator-ponymail-unit-tests/blob/master/yaml/gens-ponymail-dev-1079-1080.yaml
   shows that currently the generator produces the same hash for 2 different entries on the list.
   
   Admittedly they are the same original message, but it would be better if the generator could distinguish them.
   
   If the dkim generator is to be used for the database id then it is essential that it produces different results.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679070399


   I'm okay with giving it a few more days for others to review, but at some point we should acknowledge that we have hit the number of reviewers we are going to get.
   As for testing, I'll see if I can put together something using the same email from different sources. It's a bit out of the current test framework, but shouldn't be too difficult.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679278984


   If lists are renamed, then the permalinks would change on a re-import in any case, so I consider the argument moot.
   It's not the generator's job to secure against bad backup planning. We expect the same input to generate the same output - if that input changes, we are in no way guaranteeing the same output, and that holds true for all our generators. If you rename a list, the permalink doesn't change, but it would if you re-imported with the new list ID.
   
   As for the same message appearing twice, I can see cases where it _could_ cause duplicates (if included headers differ), but that isn't a desired feature for the end user. In my opinion, it should de-dup messages where the contents are the same, but allow for both *sources* to be present in the system. This needs to be solved either outside the scope of this PR, or by some general changes to all generators, wherein the source ID is distinct from the permalink. It shouldn't create two different permalinks for the same message. 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF edited a comment on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF edited a comment on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692002181


   No, if the link is not known, then clearly it does not have to be reconstructed at that time.
   
   However if all the archives are available, they can all be reloaded.
   If a different lid suffix is used (for mails with no List-Id header), that does not matter because the hash prefix will still be the same, so if the Permalink turns up later it will have the same prefix and can thus be found.
   
   ==
   
   I don't understand the part about DKIM and de-duplication across lists.
   I thought the hash included the List-id etc as input, so it will only de-duplicate mails sent to lists without such headers?
   
   I don't think it's possible to share message content across lists, because the sources will have different headers.
   This is true even if no List-* headers are involved.
   
   The current design does share attachments, but there is no attempt to share any other parts of messages. Doing so would require a redesign, and I'm not sure it would be worth the extra complication and house-keeping.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-693319064


   There is a trade-off here.
   
   Let's call the two Permalink designs O and OL, where:
   
   O = opaque hash created from the message source *plus* the LID (if it differs from the LID in the headers)
   
   OL = opaque hash created from the message source only. If the LID differs from the headers, it is appended to the opaque hash.
   
   Style O
   -------
   Advantage: fixed Permalink style
   Disadvantage: not feasible to regenerate Permalinks without a separate database to relate the LIDs to the messages
   
   Style OL
   -------
   Advantage: can regenerate Permalinks from just the mail sources
   Disadvantage: Permalinks are longer for some mails
   
   Neither is perfect; seems to me that the appropriate choice will depend on the installation.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691696493


   I think that should work for mails that are archived.
   
   For the importer, generally it makes sense to always provide the --lid override to ensure the mails are added to the expected list. In think it makes sense to pick up an idea from the original PR, which is to only add the lid if it is different from the lid (if any) in the email. However instead of including the lid in the hash input as before, now it is appended to the generated Permalink.
   
   I hope this would avoid most of the issues with reproducibility of Permalinks.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672242055


   To expand upon my previous comment:
   
   Email A comes in. It gets "pibbled" to abcdefg1234. A SHA3 digest is 123412341234
   Email B comes in, identical to A but with a different route. It gets pibbled as abcdefg1234, and the SHA3 digest is now 432143214321
   The permalink for BOTH emails would then be abcdefg1234 (as we only ever need that one copy of the contents for basic search/viewing), but because that pibble is stored in both sources (which have different SHA3 digests), you would be able to see both options when wanting to view the source.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691187392


   While I like the idea, I think that's moving beyond the scope of this PR - it sounds like something you could add a flag for, potentially a second generator that normalizes this....but ultimately something we shouldn't force upon this PR.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687645749


   Commit dfd18eb contains the improvements [previously mentioned](https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687257317).
   
   The generator has been renamed to **DKIM-ID** with the function name `dkimid`, and most of its code has been moved to a new file called `dkim_id.py`. This file is not only used as a module by `generators.py`, but can also be used as a command line script to generate the DKIM-ID of its input.
   
   Moving the code to a new file also facilitated the creation of the extensive test suite in `dkim_id_test.py`. This test suite uses a combination of doctest and hypothesis to cover many properties of the DKIM-ID code. Although hypothesis is a useful way to test code, it is not a property prover, only a fuzzer. Moreover, properties can still be missed or specified incorrectly. In short, hypothesis is not magic and bugs may remain.
   
   The code has been formatted using `isort` and `black`, and checked using `flake8` and `mypy --strict`, in addition to the doctest and hypothesis tests. The doctest documentation in `dkim_id_test.py` is formatted in reStructuredText, and exported to HTML in the file `dkim_id_test.html`.
   
   The custom base32 encoding is the same, but `X-Archive-List-ID` and `X-Archive-Nonce` have been removed. The DKIM style algorithm has been changed in a variety of minor ways for greater compliance to the relevant RFCs. The reference code that it is based on is now `libopendkim`, written in C, rather than `dkimpy` as the former has greater RFC compliance and is also likely more widely used. The DKIM-ID output is now 128 bits, up from 80 bits.
   
   Despite there being many more lines of code, mostly due to improved modularity and documentation, the DKIM-ID algorithm is in many ways simplified compared to the initial version in this pull request.
   
   Thanks to all the commenters who helped to motivate the improvement of this code.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673483655


   @Humbedooh 
   
   Collision forgery would require control over entire input messages, unless the source identifier algorithm uses a subset. It also does not enable attacks against the identifiers of existing messages. If a `Received` header was added and is used to compute the identifier, this increases the difficulty of the attack. If an unpredictable header is added by Ponymail and used, this thwarts attacks even against imported archives. But using a 256 bit CSH of the whole message means you get all the security of the hash and no longer have to threat model such collision forgeries. A 256 bit CSH is cheap and currently reliable security.
   
   It is reasonable to save this feature for the next generation as long as at least one kind of existing message source identifier has enough collision resistance to make these attacks impractical. Since a range of identifiers are available, their security levels could be noted in the documentation so that implementers can understand the security consequences and decide.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] Humbedooh commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
Humbedooh commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-679085054


   I've run 2124 tests (1062 different emails) on dkim using two different sources, and it matches on every single one.
   I've added it to the unit tests repo.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-691469821


   Well of course the id should be as short as possible -- but no shorter.
   
   The current dkim_id ignores the lid, so I think the generated hash is not guaranteed unique (not all lists have list-id headers).
   So either the lid needs to be separately appended as I suggest, or the lid must be added to the hash input as was done in a previous version of the PR.
   
   However I think including the lid in the hash input would be a mistake.
   
   I don't believe it's feasible to recover a message from an id that includes the lid as part of the hash, even if one has all the original archives readily available. It would mean processing all the messages with all possible lids, considerably increasing the workload.
   
   Whereas with an independent lid, not only would it not be necessary to try all possible lids, it would only be necessary to try the archives for the relevant list.
   
   ==
   
   It occurs to me that the lid suffix can be shortened for a particular installation if it only serves emails from a limited set of lists.
   For example:
   p8xtmy67z162x8827pk3hy3czn_dev.ponymail.apache.org
   could become
   p8xtmy67z162x8827pk3hy3czn_dev.ponymail_
   where the trailing '_' means '.apache.org' (for this installation)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-675075884


   In which case maybe dkim should do the same normalisation as GMail to avoid the issue?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF edited a comment on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF edited a comment on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-673462395


   I mean that the Permalink should include the list id, as it does at present. For example: aabbcc@&lt;lid&gt;


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-692033864


   If the link is not reconstructed before it is requested then the request fails. Whether a link is known or not to the adminstrator does not necessarily correlate with how "vital" that link is. An administrator's knowledge of link usage can only ever be *ad hoc* and incomplete. The reliable backup mechanism avoids this issue, and enables prompt restoration.
   
   Duplicate headers were tested by sending a message to two different mailing lists, both using Mailman and Ponymail. The only difference was the `List-ID` header, even when including the headers that DKIM-ID discards. Deduplication may occur in other cases too, such as when importing an email that only exists in one archive but is addressed to multiple mailing lists.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sbp commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sbp commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-703117779


   > For example by appending it to the returned id.
   
   The question was specifically about using the lid in the DKIM-ID input. Would that be acceptable?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-687288365


   Thanks very much.
   As already indicated, I think the approach is very good.
   
   There are some areas that *may* need tweaking, for example which headers are included in the hash, and how the hash is turned into a string for use as a Permalink. These should not affect the basic algorithm.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [incubator-ponymail] sebbASF commented on pull request #517: Add DKIM style ID generation

Posted by GitBox <gi...@apache.org>.
sebbASF commented on pull request #517:
URL: https://github.com/apache/incubator-ponymail/pull/517#issuecomment-672252421


   AFAICT, so long as one only takes into account the Received headers that relate to the hops before arrival at ezmlm, all recipients of the email will be able to generate the same hash. If for some reason the email does not have a list-id or another way of determining whether the Received header was added before arrival at the list server, then ignore the header.
   
   Note that this is not a question of 'debugging' ezmlm.
   It is a question of knowing why Ponymail does not have a particular email sent by the mailing list.
   Was it lost in transit, or was it a duplicate?
   
   For example, can you tell me why
   dev-return-1032-archive-asf-public=cust-asf.ponee.io@ponymail.incubator.apache.org
   is missing from Ponymail without looking at some other archive?
   
   As to your example, if email A and email B are identical, I don't see how they will get different pibbles/digests *unless* at least some of the received headers are taken into account. Please explain.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org