You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@ponymail.apache.org by hu...@apache.org on 2017/09/02 08:02:44 UTC

incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Repository: incubator-ponymail
Updated Branches:
  refs/heads/master c8f4d3b7d -> df0b7ee1c


crop out trailing whitespace for redundant archiver

This deals with spurious whitespace that can exist on
clustered setups due to corrections inside the MTAs.
This only deals with trailing whitespace, everything else
is preserved.

Also add some spacing in the code.


Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
Tree: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
Diff: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1

Branch: refs/heads/master
Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
Parents: c8f4d3b
Author: Daniel Gruno <hu...@apache.org>
Authored: Sat Sep 2 10:02:10 2017 +0200
Committer: Daniel Gruno <hu...@apache.org>
Committed: Sat Sep 2 10:02:10 2017 +0200

----------------------------------------------------------------------
 tools/generators.py | 9 +++++++++
 1 file changed, 9 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
----------------------------------------------------------------------
diff --git a/tools/generators.py b/tools/generators.py
index e320a06..b37fa8b 100644
--- a/tools/generators.py
+++ b/tools/generators.py
@@ -22,6 +22,7 @@ This file contains the various ID generators for Pony Mail's archivers.
 import hashlib
 import email.utils
 import time
+import re
 
 # Full generator: uses the entire email (including server-dependent data)
 # This is the recommended generator for single-node setups.
@@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
     if not body: # Make sure body is not None, which will fail.
         body = ""
     xbody = body if type(body) is bytes else body.encode('ascii', 'ignore')
+    
+    # Crop out any trailing whitespace in body
+    xbody = re.sub(b"\s+$", b"", xbody)
+    
     # Use List ID
     xbody += bytes(lid, encoding='ascii')
+    
     # Use Date header. Don't use archived-at, as the archiver sets this if not present.
     mdate = None
     mdatestring = "(null)" # Default to null, ONLY changed if replicable across imports
@@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
     except:
         pass
     xbody += bytes(mdatestring, encoding='ascii')
+    
     # Use sender
     sender = msg.get('from', None)
     if sender:
         xbody += bytes(sender, encoding = 'ascii')
+    
     # Use subject
     subject = msg.get('subject', None)
     if subject:
         xbody += bytes(subject, encoding = 'ascii')
+    
     # Use attachment hashes if present
     if attachments:
         for a in attachments:

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by sebb <se...@gmail.com>.

On 7 September 2017 at 11:08, Daniel Gruno <hu...@apache.org> wrote:
> On 09/07/2017 12:01 PM, sebb wrote:
>> On 7 September 2017 at 06:06, Daniel Gruno <hu...@apache.org> wrote:
>>> On 09/07/2017 12:24 AM, sebb wrote:
>>>> On 6 September 2017 at 07:32, Daniel Gruno <hu...@apache.org> wrote:
>>>>> On 09/06/2017 12:09 AM, sebb wrote:
>>>>>> On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
>>>>>>> Repository: incubator-ponymail
>>>>>>> Updated Branches:
>>>>>>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>>>>>>
>>>>>>>
>>>>>>> crop out trailing whitespace for redundant archiver
>>>>>>>
>>>>>>> This deals with spurious whitespace that can exist on
>>>>>>> clustered setups due to corrections inside the MTAs.
>>>>>>> This only deals with trailing whitespace, everything else
>>>>>>> is preserved.
>>>>>>
>>>>>> -1
>>>>>>
>>>>>> I don't think this is a good idea.
>>>>>
>>>>> Your -1 is noted, but I don't consider the reasoning valid for a veto,
>>>>> so I'll interpret this as just a plain -1.
>>>>
>>>> AIUI, that's not your call.
>>>
>>> It's not my call to determine whether technical merit is sound (that
>>> would be for the PPMC in such cases), but there has to be technical
>>> merit in -1 in the first place. Saying "this is not a good idea" does
>>> not convey technical reason. You've since elaborated on that in your
>>> reply, and _that_ I believe constitutes a technical reason.
>>
>> Ah ok.
>>
>> I guess I was too terse, I should have linked to the previous mails I
>> sent about the same issue.
>>
>>>>
>>>>> I think it's a good idea, I think it solves some real problems that have
>>>>> been spotted in clustered setup. It could also solve problems where one
>>>>> archives as mbox with an extra newline by mistake. It's also an optional
>>>>> generator, not the default. Could you elaborate on why trailing
>>>>> whitespace would matter?
>>>>
>>>> I already wrote that ignoring whitespace causes a problem because it
>>>> means two different inputs end up with the same database id.
>>>> There's no way of knowing which one was correct; the wrong one may end
>>>> up being stored.
>>>
>>> But they would both have the same sender, date, list, message,
>>> attachments etc filed under the same ID - is that not what we want? What
>>> we _don't_ want is for trailing whitespace to cause duplicates. Put in
>>> other words: Why would we at all care whether one has the added newline
>>> or two and the other one doesn't? We're dealing with showing people
>>> emails, but bit-perfect of what was sent (including duplicates as a
>>> result of bit-diversion), but rather of what was intended.
>>
>> I disagree; I think it's important to show the input email as exactly
>> as possible.
>> Whitespace trimming could damage some emails.
>>
>>> If we wanted
>>> a perfect copy, we'd use the full digest and skip clustered setups all
>>> together, hoping machines don't die on us.
>>
>> Not so, it must be possible to have perfect copies in clustered setups.
>> Otherwise clustered backup systems would be impossible.
>> It's just that the current design may make this tricky.
>>
>>> This is for those rare
>>> occasions where something _does_ go wrong, and as seen, sometimes
>>> postfix will add some extra newlines - I still don't know why it does
>>> that in every case, I only know that it does, and likely other MTAs do
>>> as well.
>>
>> That's largely my point.
>> The cause needs to be determined otherwise the generator is being used
>> to ignore what may be a bug.
>>
>> Besides, in the cases I have seen (and noted on this list), it is not
>> only a difference in trailing whitespace.
>> The archived-at header is missing in one of the copies.
>> As I have written already, that points to non-identical treatment by
>> the different cluster members.
>>
>
> The archived-at, and possibly the extra whitespace, likely stems from a
> postfix oddity (that I really can't fix :p), in that mail delivered
> locally will be handled internally, even if it's supposed to be rerouted
> to a different address than the original.
>
> The case is as follows, I think:
> - 3 nodes act as MTAs
> - Each node will receive an email (whichever node has highest priority
> and is awake) and duplicate it into 3 copies, that each go to all the
> boxes in the MX setup. So, one of these copies go to the box itself, and
> the two other copies go to the other boxes if/when they are online (this
> is to not bounce an email if a box should be down or erroring out). Now,
> since one of the copies go to the box itself (even though it goes to a
> new email address), it is somehow rerouted differently, and that causes
> the header (and possibly the whitespace fixes) to either be there or not.

I thought that the archived-at header was only added by the archiver?
Also the archiver always adds it if it is missing.

How can the postfix setup affect this?

> I'll gladly admit that we're working on figuring this out differently,
> and possibly publishing this as a "what to do and not do" guideline
> later on, when the issue is fixed. I can also accept that if we have
> this guideline, we can omit the whitespace trimming.

We are getting there.

However the trimming is unnecessary for non-clustered setups.
And it's not yet clear that it's necessary for all clustered setups.

Thus the 'fix' should not be in the generator code.

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by Daniel Gruno <hu...@apache.org>.

On 09/07/2017 12:01 PM, sebb wrote:
> On 7 September 2017 at 06:06, Daniel Gruno <hu...@apache.org> wrote:
>> On 09/07/2017 12:24 AM, sebb wrote:
>>> On 6 September 2017 at 07:32, Daniel Gruno <hu...@apache.org> wrote:
>>>> On 09/06/2017 12:09 AM, sebb wrote:
>>>>> On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
>>>>>> Repository: incubator-ponymail
>>>>>> Updated Branches:
>>>>>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>>>>>
>>>>>>
>>>>>> crop out trailing whitespace for redundant archiver
>>>>>>
>>>>>> This deals with spurious whitespace that can exist on
>>>>>> clustered setups due to corrections inside the MTAs.
>>>>>> This only deals with trailing whitespace, everything else
>>>>>> is preserved.
>>>>>
>>>>> -1
>>>>>
>>>>> I don't think this is a good idea.
>>>>
>>>> Your -1 is noted, but I don't consider the reasoning valid for a veto,
>>>> so I'll interpret this as just a plain -1.
>>>
>>> AIUI, that's not your call.
>>
>> It's not my call to determine whether technical merit is sound (that
>> would be for the PPMC in such cases), but there has to be technical
>> merit in -1 in the first place. Saying "this is not a good idea" does
>> not convey technical reason. You've since elaborated on that in your
>> reply, and _that_ I believe constitutes a technical reason.
> 
> Ah ok.
> 
> I guess I was too terse, I should have linked to the previous mails I
> sent about the same issue.
> 
>>>
>>>> I think it's a good idea, I think it solves some real problems that have
>>>> been spotted in clustered setup. It could also solve problems where one
>>>> archives as mbox with an extra newline by mistake. It's also an optional
>>>> generator, not the default. Could you elaborate on why trailing
>>>> whitespace would matter?
>>>
>>> I already wrote that ignoring whitespace causes a problem because it
>>> means two different inputs end up with the same database id.
>>> There's no way of knowing which one was correct; the wrong one may end
>>> up being stored.
>>
>> But they would both have the same sender, date, list, message,
>> attachments etc filed under the same ID - is that not what we want? What
>> we _don't_ want is for trailing whitespace to cause duplicates. Put in
>> other words: Why would we at all care whether one has the added newline
>> or two and the other one doesn't? We're dealing with showing people
>> emails, but bit-perfect of what was sent (including duplicates as a
>> result of bit-diversion), but rather of what was intended.
> 
> I disagree; I think it's important to show the input email as exactly
> as possible.
> Whitespace trimming could damage some emails.
> 
>> If we wanted
>> a perfect copy, we'd use the full digest and skip clustered setups all
>> together, hoping machines don't die on us.
> 
> Not so, it must be possible to have perfect copies in clustered setups.
> Otherwise clustered backup systems would be impossible.
> It's just that the current design may make this tricky.
> 
>> This is for those rare
>> occasions where something _does_ go wrong, and as seen, sometimes
>> postfix will add some extra newlines - I still don't know why it does
>> that in every case, I only know that it does, and likely other MTAs do
>> as well.
> 
> That's largely my point.
> The cause needs to be determined otherwise the generator is being used
> to ignore what may be a bug.
> 
> Besides, in the cases I have seen (and noted on this list), it is not
> only a difference in trailing whitespace.
> The archived-at header is missing in one of the copies.
> As I have written already, that points to non-identical treatment by
> the different cluster members.
> 

The archived-at, and possibly the extra whitespace, likely stems from a
postfix oddity (that I really can't fix :p), in that mail delivered
locally will be handled internally, even if it's supposed to be rerouted
to a different address than the original.

The case is as follows, I think:
- 3 nodes act as MTAs
- Each node will receive an email (whichever node has highest priority
and is awake) and duplicate it into 3 copies, that each go to all the
boxes in the MX setup. So, one of these copies go to the box itself, and
the two other copies go to the other boxes if/when they are online (this
is to not bounce an email if a box should be down or erroring out). Now,
since one of the copies go to the box itself (even though it goes to a
new email address), it is somehow rerouted differently, and that causes
the header (and possibly the whitespace fixes) to either be there or not.

I'll gladly admit that we're working on figuring this out differently,
and possibly publishing this as a "what to do and not do" guideline
later on, when the issue is fixed. I can also accept that if we have
this guideline, we can omit the whitespace trimming.

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by sebb <se...@gmail.com>.

On 7 September 2017 at 06:06, Daniel Gruno <hu...@apache.org> wrote:
> On 09/07/2017 12:24 AM, sebb wrote:
>> On 6 September 2017 at 07:32, Daniel Gruno <hu...@apache.org> wrote:
>>> On 09/06/2017 12:09 AM, sebb wrote:
>>>> On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
>>>>> Repository: incubator-ponymail
>>>>> Updated Branches:
>>>>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>>>>
>>>>>
>>>>> crop out trailing whitespace for redundant archiver
>>>>>
>>>>> This deals with spurious whitespace that can exist on
>>>>> clustered setups due to corrections inside the MTAs.
>>>>> This only deals with trailing whitespace, everything else
>>>>> is preserved.
>>>>
>>>> -1
>>>>
>>>> I don't think this is a good idea.
>>>
>>> Your -1 is noted, but I don't consider the reasoning valid for a veto,
>>> so I'll interpret this as just a plain -1.
>>
>> AIUI, that's not your call.
>
> It's not my call to determine whether technical merit is sound (that
> would be for the PPMC in such cases), but there has to be technical
> merit in -1 in the first place. Saying "this is not a good idea" does
> not convey technical reason. You've since elaborated on that in your
> reply, and _that_ I believe constitutes a technical reason.

Ah ok.

I guess I was too terse, I should have linked to the previous mails I
sent about the same issue.

>>
>>> I think it's a good idea, I think it solves some real problems that have
>>> been spotted in clustered setup. It could also solve problems where one
>>> archives as mbox with an extra newline by mistake. It's also an optional
>>> generator, not the default. Could you elaborate on why trailing
>>> whitespace would matter?
>>
>> I already wrote that ignoring whitespace causes a problem because it
>> means two different inputs end up with the same database id.
>> There's no way of knowing which one was correct; the wrong one may end
>> up being stored.
>
> But they would both have the same sender, date, list, message,
> attachments etc filed under the same ID - is that not what we want? What
> we _don't_ want is for trailing whitespace to cause duplicates. Put in
> other words: Why would we at all care whether one has the added newline
> or two and the other one doesn't? We're dealing with showing people
> emails, but bit-perfect of what was sent (including duplicates as a
> result of bit-diversion), but rather of what was intended.

I disagree; I think it's important to show the input email as exactly
as possible.
Whitespace trimming could damage some emails.

> If we wanted
> a perfect copy, we'd use the full digest and skip clustered setups all
> together, hoping machines don't die on us.

Not so, it must be possible to have perfect copies in clustered setups.
Otherwise clustered backup systems would be impossible.
It's just that the current design may make this tricky.

> This is for those rare
> occasions where something _does_ go wrong, and as seen, sometimes
> postfix will add some extra newlines - I still don't know why it does
> that in every case, I only know that it does, and likely other MTAs do
> as well.

That's largely my point.
The cause needs to be determined otherwise the generator is being used
to ignore what may be a bug.

Besides, in the cases I have seen (and noted on this list), it is not
only a difference in trailing whitespace.
The archived-at header is missing in one of the copies.
As I have written already, that points to non-identical treatment by
the different cluster members.

>
>>
>> If messages are being treated differently by different parts of the
>> cluster then that needs to be addressed.
>>
>>>>
>>>> Please raise an issue for significant changes such as this.
>>>
>>> I don't see it as significant changes - it's an optional generator that
>>> we haven't released, it's not gonna break any API/ABI we currently have.
>>
>> Once a generator has been released, it effectively becomes part of the
>> public API.
>> It's then not possible to fix it without breaking the API.
>> So any problems with the code must be fixed before release.
>>
>>> If I am to raise a ticket, it would be for the full generator to be
>>> admitted as an option.
>>
>> That is definitely a separate ticket.
>>
>>>>
>>>>> Also add some spacing in the code.
>>>>>
>>>>>
>>>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
>>>>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
>>>>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
>>>>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1
>>>>>
>>>>> Branch: refs/heads/master
>>>>> Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
>>>>> Parents: c8f4d3b
>>>>> Author: Daniel Gruno <hu...@apache.org>
>>>>> Authored: Sat Sep 2 10:02:10 2017 +0200
>>>>> Committer: Daniel Gruno <hu...@apache.org>
>>>>> Committed: Sat Sep 2 10:02:10 2017 +0200
>>>>>
>>>>> ----------------------------------------------------------------------
>>>>>  tools/generators.py | 9 +++++++++
>>>>>  1 file changed, 9 insertions(+)
>>>>> ----------------------------------------------------------------------
>>>>>
>>>>>
>>>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
>>>>> ----------------------------------------------------------------------
>>>>> diff --git a/tools/generators.py b/tools/generators.py
>>>>> index e320a06..b37fa8b 100644
>>>>> --- a/tools/generators.py
>>>>> +++ b/tools/generators.py
>>>>> @@ -22,6 +22,7 @@ This file contains the various ID generators for Pony Mail's archivers.
>>>>>  import hashlib
>>>>>  import email.utils
>>>>>  import time
>>>>> +import re
>>>>>
>>>>>  # Full generator: uses the entire email (including server-dependent data)
>>>>>  # This is the recommended generator for single-node setups.
>>>>> @@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
>>>>>      if not body: # Make sure body is not None, which will fail.
>>>>>          body = ""
>>>>>      xbody = body if type(body) is bytes else body.encode('ascii', 'ignore')
>>>>> +
>>>>> +    # Crop out any trailing whitespace in body
>>>>> +    xbody = re.sub(b"\s+$", b"", xbody)
>>>>> +
>>>>>      # Use List ID
>>>>>      xbody += bytes(lid, encoding='ascii')
>>>>> +
>>>>>      # Use Date header. Don't use archived-at, as the archiver sets this if not present.
>>>>>      mdate = None
>>>>>      mdatestring = "(null)" # Default to null, ONLY changed if replicable across imports
>>>>> @@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
>>>>>      except:
>>>>>          pass
>>>>>      xbody += bytes(mdatestring, encoding='ascii')
>>>>> +
>>>>>      # Use sender
>>>>>      sender = msg.get('from', None)
>>>>>      if sender:
>>>>>          xbody += bytes(sender, encoding = 'ascii')
>>>>> +
>>>>>      # Use subject
>>>>>      subject = msg.get('subject', None)
>>>>>      if subject:
>>>>>          xbody += bytes(subject, encoding = 'ascii')
>>>>> +
>>>>>      # Use attachment hashes if present
>>>>>      if attachments:
>>>>>          for a in attachments:
>>>>>
>>>
>

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by Daniel Gruno <hu...@apache.org>.

On 09/07/2017 12:24 AM, sebb wrote:
> On 6 September 2017 at 07:32, Daniel Gruno <hu...@apache.org> wrote:
>> On 09/06/2017 12:09 AM, sebb wrote:
>>> On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
>>>> Repository: incubator-ponymail
>>>> Updated Branches:
>>>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>>>
>>>>
>>>> crop out trailing whitespace for redundant archiver
>>>>
>>>> This deals with spurious whitespace that can exist on
>>>> clustered setups due to corrections inside the MTAs.
>>>> This only deals with trailing whitespace, everything else
>>>> is preserved.
>>>
>>> -1
>>>
>>> I don't think this is a good idea.
>>
>> Your -1 is noted, but I don't consider the reasoning valid for a veto,
>> so I'll interpret this as just a plain -1.
> 
> AIUI, that's not your call.

It's not my call to determine whether technical merit is sound (that
would be for the PPMC in such cases), but there has to be technical
merit in -1 in the first place. Saying "this is not a good idea" does
not convey technical reason. You've since elaborated on that in your
reply, and _that_ I believe constitutes a technical reason.

> 
>> I think it's a good idea, I think it solves some real problems that have
>> been spotted in clustered setup. It could also solve problems where one
>> archives as mbox with an extra newline by mistake. It's also an optional
>> generator, not the default. Could you elaborate on why trailing
>> whitespace would matter?
> 
> I already wrote that ignoring whitespace causes a problem because it
> means two different inputs end up with the same database id.
> There's no way of knowing which one was correct; the wrong one may end
> up being stored.

But they would both have the same sender, date, list, message,
attachments etc filed under the same ID - is that not what we want? What
we _don't_ want is for trailing whitespace to cause duplicates. Put in
other words: Why would we at all care whether one has the added newline
or two and the other one doesn't? We're dealing with showing people
emails, but bit-perfect of what was sent (including duplicates as a
result of bit-diversion), but rather of what was intended. If we wanted
a perfect copy, we'd use the full digest and skip clustered setups all
together, hoping machines don't die on us. This is for those rare
occasions where something _does_ go wrong, and as seen, sometimes
postfix will add some extra newlines - I still don't know why it does
that in every case, I only know that it does, and likely other MTAs do
as well.

> 
> If messages are being treated differently by different parts of the
> cluster then that needs to be addressed.
> 
>>>
>>> Please raise an issue for significant changes such as this.
>>
>> I don't see it as significant changes - it's an optional generator that
>> we haven't released, it's not gonna break any API/ABI we currently have.
> 
> Once a generator has been released, it effectively becomes part of the
> public API.
> It's then not possible to fix it without breaking the API.
> So any problems with the code must be fixed before release.
> 
>> If I am to raise a ticket, it would be for the full generator to be
>> admitted as an option.
> 
> That is definitely a separate ticket.
> 
>>>
>>>> Also add some spacing in the code.
>>>>
>>>>
>>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
>>>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
>>>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
>>>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1
>>>>
>>>> Branch: refs/heads/master
>>>> Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
>>>> Parents: c8f4d3b
>>>> Author: Daniel Gruno <hu...@apache.org>
>>>> Authored: Sat Sep 2 10:02:10 2017 +0200
>>>> Committer: Daniel Gruno <hu...@apache.org>
>>>> Committed: Sat Sep 2 10:02:10 2017 +0200
>>>>
>>>> ----------------------------------------------------------------------
>>>>  tools/generators.py | 9 +++++++++
>>>>  1 file changed, 9 insertions(+)
>>>> ----------------------------------------------------------------------
>>>>
>>>>
>>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
>>>> ----------------------------------------------------------------------
>>>> diff --git a/tools/generators.py b/tools/generators.py
>>>> index e320a06..b37fa8b 100644
>>>> --- a/tools/generators.py
>>>> +++ b/tools/generators.py
>>>> @@ -22,6 +22,7 @@ This file contains the various ID generators for Pony Mail's archivers.
>>>>  import hashlib
>>>>  import email.utils
>>>>  import time
>>>> +import re
>>>>
>>>>  # Full generator: uses the entire email (including server-dependent data)
>>>>  # This is the recommended generator for single-node setups.
>>>> @@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
>>>>      if not body: # Make sure body is not None, which will fail.
>>>>          body = ""
>>>>      xbody = body if type(body) is bytes else body.encode('ascii', 'ignore')
>>>> +
>>>> +    # Crop out any trailing whitespace in body
>>>> +    xbody = re.sub(b"\s+$", b"", xbody)
>>>> +
>>>>      # Use List ID
>>>>      xbody += bytes(lid, encoding='ascii')
>>>> +
>>>>      # Use Date header. Don't use archived-at, as the archiver sets this if not present.
>>>>      mdate = None
>>>>      mdatestring = "(null)" # Default to null, ONLY changed if replicable across imports
>>>> @@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
>>>>      except:
>>>>          pass
>>>>      xbody += bytes(mdatestring, encoding='ascii')
>>>> +
>>>>      # Use sender
>>>>      sender = msg.get('from', None)
>>>>      if sender:
>>>>          xbody += bytes(sender, encoding = 'ascii')
>>>> +
>>>>      # Use subject
>>>>      subject = msg.get('subject', None)
>>>>      if subject:
>>>>          xbody += bytes(subject, encoding = 'ascii')
>>>> +
>>>>      # Use attachment hashes if present
>>>>      if attachments:
>>>>          for a in attachments:
>>>>
>>

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by sebb <se...@gmail.com>.

On 6 September 2017 at 07:32, Daniel Gruno <hu...@apache.org> wrote:
> On 09/06/2017 12:09 AM, sebb wrote:
>> On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
>>> Repository: incubator-ponymail
>>> Updated Branches:
>>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>>
>>>
>>> crop out trailing whitespace for redundant archiver
>>>
>>> This deals with spurious whitespace that can exist on
>>> clustered setups due to corrections inside the MTAs.
>>> This only deals with trailing whitespace, everything else
>>> is preserved.
>>
>> -1
>>
>> I don't think this is a good idea.
>
> Your -1 is noted, but I don't consider the reasoning valid for a veto,
> so I'll interpret this as just a plain -1.

AIUI, that's not your call.

> I think it's a good idea, I think it solves some real problems that have
> been spotted in clustered setup. It could also solve problems where one
> archives as mbox with an extra newline by mistake. It's also an optional
> generator, not the default. Could you elaborate on why trailing
> whitespace would matter?

I already wrote that ignoring whitespace causes a problem because it
means two different inputs end up with the same database id.
There's no way of knowing which one was correct; the wrong one may end
up being stored.

If messages are being treated differently by different parts of the
cluster then that needs to be addressed.

>>
>> Please raise an issue for significant changes such as this.
>
> I don't see it as significant changes - it's an optional generator that
> we haven't released, it's not gonna break any API/ABI we currently have.

Once a generator has been released, it effectively becomes part of the
public API.
It's then not possible to fix it without breaking the API.
So any problems with the code must be fixed before release.

> If I am to raise a ticket, it would be for the full generator to be
> admitted as an option.

That is definitely a separate ticket.

>>
>>> Also add some spacing in the code.
>>>
>>>
>>> Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
>>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
>>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
>>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1
>>>
>>> Branch: refs/heads/master
>>> Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
>>> Parents: c8f4d3b
>>> Author: Daniel Gruno <hu...@apache.org>
>>> Authored: Sat Sep 2 10:02:10 2017 +0200
>>> Committer: Daniel Gruno <hu...@apache.org>
>>> Committed: Sat Sep 2 10:02:10 2017 +0200
>>>
>>> ----------------------------------------------------------------------
>>>  tools/generators.py | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>> ----------------------------------------------------------------------
>>>
>>>
>>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
>>> ----------------------------------------------------------------------
>>> diff --git a/tools/generators.py b/tools/generators.py
>>> index e320a06..b37fa8b 100644
>>> --- a/tools/generators.py
>>> +++ b/tools/generators.py
>>> @@ -22,6 +22,7 @@ This file contains the various ID generators for Pony Mail's archivers.
>>>  import hashlib
>>>  import email.utils
>>>  import time
>>> +import re
>>>
>>>  # Full generator: uses the entire email (including server-dependent data)
>>>  # This is the recommended generator for single-node setups.
>>> @@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
>>>      if not body: # Make sure body is not None, which will fail.
>>>          body = ""
>>>      xbody = body if type(body) is bytes else body.encode('ascii', 'ignore')
>>> +
>>> +    # Crop out any trailing whitespace in body
>>> +    xbody = re.sub(b"\s+$", b"", xbody)
>>> +
>>>      # Use List ID
>>>      xbody += bytes(lid, encoding='ascii')
>>> +
>>>      # Use Date header. Don't use archived-at, as the archiver sets this if not present.
>>>      mdate = None
>>>      mdatestring = "(null)" # Default to null, ONLY changed if replicable across imports
>>> @@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
>>>      except:
>>>          pass
>>>      xbody += bytes(mdatestring, encoding='ascii')
>>> +
>>>      # Use sender
>>>      sender = msg.get('from', None)
>>>      if sender:
>>>          xbody += bytes(sender, encoding = 'ascii')
>>> +
>>>      # Use subject
>>>      subject = msg.get('subject', None)
>>>      if subject:
>>>          xbody += bytes(subject, encoding = 'ascii')
>>> +
>>>      # Use attachment hashes if present
>>>      if attachments:
>>>          for a in attachments:
>>>
>

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by Daniel Gruno <hu...@apache.org>.

On 09/06/2017 12:09 AM, sebb wrote:
> On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
>> Repository: incubator-ponymail
>> Updated Branches:
>>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>>
>>
>> crop out trailing whitespace for redundant archiver
>>
>> This deals with spurious whitespace that can exist on
>> clustered setups due to corrections inside the MTAs.
>> This only deals with trailing whitespace, everything else
>> is preserved.
> 
> -1
> 
> I don't think this is a good idea.

Your -1 is noted, but I don't consider the reasoning valid for a veto,
so I'll interpret this as just a plain -1.

I think it's a good idea, I think it solves some real problems that have
been spotted in clustered setup. It could also solve problems where one
archives as mbox with an extra newline by mistake. It's also an optional
generator, not the default. Could you elaborate on why trailing
whitespace would matter?

> 
> Please raise an issue for significant changes such as this.

I don't see it as significant changes - it's an optional generator that
we haven't released, it's not gonna break any API/ABI we currently have.
If I am to raise a ticket, it would be for the full generator to be
admitted as an option.

> 
>> Also add some spacing in the code.
>>
>>
>> Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
>> Commit: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
>> Tree: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
>> Diff: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1
>>
>> Branch: refs/heads/master
>> Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
>> Parents: c8f4d3b
>> Author: Daniel Gruno <hu...@apache.org>
>> Authored: Sat Sep 2 10:02:10 2017 +0200
>> Committer: Daniel Gruno <hu...@apache.org>
>> Committed: Sat Sep 2 10:02:10 2017 +0200
>>
>> ----------------------------------------------------------------------
>>  tools/generators.py | 9 +++++++++
>>  1 file changed, 9 insertions(+)
>> ----------------------------------------------------------------------
>>
>>
>> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
>> ----------------------------------------------------------------------
>> diff --git a/tools/generators.py b/tools/generators.py
>> index e320a06..b37fa8b 100644
>> --- a/tools/generators.py
>> +++ b/tools/generators.py
>> @@ -22,6 +22,7 @@ This file contains the various ID generators for Pony Mail's archivers.
>>  import hashlib
>>  import email.utils
>>  import time
>> +import re
>>
>>  # Full generator: uses the entire email (including server-dependent data)
>>  # This is the recommended generator for single-node setups.
>> @@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
>>      if not body: # Make sure body is not None, which will fail.
>>          body = ""
>>      xbody = body if type(body) is bytes else body.encode('ascii', 'ignore')
>> +
>> +    # Crop out any trailing whitespace in body
>> +    xbody = re.sub(b"\s+$", b"", xbody)
>> +
>>      # Use List ID
>>      xbody += bytes(lid, encoding='ascii')
>> +
>>      # Use Date header. Don't use archived-at, as the archiver sets this if not present.
>>      mdate = None
>>      mdatestring = "(null)" # Default to null, ONLY changed if replicable across imports
>> @@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
>>      except:
>>          pass
>>      xbody += bytes(mdatestring, encoding='ascii')
>> +
>>      # Use sender
>>      sender = msg.get('from', None)
>>      if sender:
>>          xbody += bytes(sender, encoding = 'ascii')
>> +
>>      # Use subject
>>      subject = msg.get('subject', None)
>>      if subject:
>>          xbody += bytes(subject, encoding = 'ascii')
>> +
>>      # Use attachment hashes if present
>>      if attachments:
>>          for a in attachments:
>>

Re: incubator-ponymail git commit: crop out trailing whitespace for redundant archiver

Posted by sebb <se...@gmail.com>.

On 2 September 2017 at 09:02,  <hu...@apache.org> wrote:
> Repository: incubator-ponymail
> Updated Branches:
>   refs/heads/master c8f4d3b7d -> df0b7ee1c
>
>
> crop out trailing whitespace for redundant archiver
>
> This deals with spurious whitespace that can exist on
> clustered setups due to corrections inside the MTAs.
> This only deals with trailing whitespace, everything else
> is preserved.

-1

I don't think this is a good idea.

Please raise an issue for significant changes such as this.

> Also add some spacing in the code.
>
>
> Project: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/repo
> Commit: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/commit/df0b7ee1
> Tree: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/tree/df0b7ee1
> Diff: http://git-wip-us.apache.org/repos/asf/incubator-ponymail/diff/df0b7ee1
>
> Branch: refs/heads/master
> Commit: df0b7ee1c5676b3286e4b30c135f410b1e0a92da
> Parents: c8f4d3b
> Author: Daniel Gruno <hu...@apache.org>
> Authored: Sat Sep 2 10:02:10 2017 +0200
> Committer: Daniel Gruno <hu...@apache.org>
> Committed: Sat Sep 2 10:02:10 2017 +0200
>
> ----------------------------------------------------------------------
>  tools/generators.py | 9 +++++++++
>  1 file changed, 9 insertions(+)
> ----------------------------------------------------------------------
>
>
> http://git-wip-us.apache.org/repos/asf/incubator-ponymail/blob/df0b7ee1/tools/generators.py
> ----------------------------------------------------------------------
> diff --git a/tools/generators.py b/tools/generators.py
> index e320a06..b37fa8b 100644
> --- a/tools/generators.py
> +++ b/tools/generators.py
> @@ -22,6 +22,7 @@ This file contains the various ID generators for Pony Mail's archivers.
>  import hashlib
>  import email.utils
>  import time
> +import re
>
>  # Full generator: uses the entire email (including server-dependent data)
>  # This is the recommended generator for single-node setups.
> @@ -95,8 +96,13 @@ def redundant(msg, body, lid, attachments):
>      if not body: # Make sure body is not None, which will fail.
>          body = ""
>      xbody = body if type(body) is bytes else body.encode('ascii', 'ignore')
> +
> +    # Crop out any trailing whitespace in body
> +    xbody = re.sub(b"\s+$", b"", xbody)
> +
>      # Use List ID
>      xbody += bytes(lid, encoding='ascii')
> +
>      # Use Date header. Don't use archived-at, as the archiver sets this if not present.
>      mdate = None
>      mdatestring = "(null)" # Default to null, ONLY changed if replicable across imports
> @@ -106,14 +112,17 @@ def redundant(msg, body, lid, attachments):
>      except:
>          pass
>      xbody += bytes(mdatestring, encoding='ascii')
> +
>      # Use sender
>      sender = msg.get('from', None)
>      if sender:
>          xbody += bytes(sender, encoding = 'ascii')
> +
>      # Use subject
>      subject = msg.get('subject', None)
>      if subject:
>          xbody += bytes(subject, encoding = 'ascii')
> +
>      # Use attachment hashes if present
>      if attachments:
>          for a in attachments:
>