Posted to commits@community.apache.org by "Sebb (JIRA)" <ji...@apache.org> on 2015/09/25 12:57:04 UTC

[jira] [Created] (COMDEV-161) mailglomper.py may count a message multiple times

Sebb created COMDEV-161:
---------------------------

             Summary: mailglomper.py may count a message multiple times
                 Key: COMDEV-161
                 URL: https://issues.apache.org/jira/browse/COMDEV-161
             Project: Community Development
          Issue Type: Bug
          Components: Reporter Tool
            Reporter: Sebb


The mailglomper.py script counts messages by matching /Date: (.*)/.
It is looking to match header lines of the form:

Date: Thu, 01 May 2008 05:06:51 +0000

However, such lines are not guaranteed to be unique within a message.

In particular, SVN commit messages contain a "Date:" line in the body which also matches, and its parsed timestamp will be much the same as the header date. For example:

Author: cml
Date: Wed Sep 16 19:06:03 2015
New Revision: 1703436

Furthermore, the RE does not anchor the match at the start of a line, so any line that merely contains "Date: " can also match.
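
The over-counting can be reproduced with a short sketch. This is not the code from mailglomper.py; the sample message, the "Release-Date:" body line and the variable names are made up for illustration. It only assumes the pattern quoted above is applied to the message text without anchoring.

```python
import re

# Hypothetical sample message; all contents are invented for illustration.
SAMPLE = (
    "From user@example.com Wed Sep 16 19:06:10 2015\n"
    "Date: Wed, 16 Sep 2015 19:06:03 +0000\n"
    "Subject: svn commit: r1703436\n"
    "\n"
    "Author: cml\n"
    "Date: Wed Sep 16 19:06:03 2015\n"
    "New Revision: 1703436\n"
    "\n"
    "Release-Date: 2015-09-16\n"
)

# The pattern from this report, unanchored.
UNANCHORED = re.compile(r"Date: (.*)")
# Same pattern anchored at the start of a line.
ANCHORED = re.compile(r"^Date: (.*)", re.MULTILINE)

unanchored_hits = UNANCHORED.findall(SAMPLE)
anchored_hits = ANCHORED.findall(SAMPLE)

print(len(unanchored_hits))  # header, SVN body line and "Release-Date:" line
print(len(anchored_hits))    # header and SVN body line
```

Anchoring removes the "Release-Date:" match but still leaves two matches for a single message, so anchoring alone would not fix the count.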

The mailbox format currently used by the ASF guarantees that each message is prefixed with a line in the format:

From user@example.com Thu May 01 05:10:32 2008

[Lines in the message body starting "From " are prefixed as ">From "; the prefix is removed when messages are extracted]

Only lines starting "From " are guaranteed not to occur in message bodies.
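
A minimal sketch of counting by separator lines instead, assuming the mailbox text is already available as a string. The function name count_messages and the sample mailbox are hypothetical and are not taken from mailglomper.py.

```python
def count_messages(mbox_text):
    """Count messages by their 'From ' separator lines.

    Body lines that originally began with 'From ' are stored as '>From '
    in this mailbox format, so they do not start with 'From ' here.
    """
    return sum(1 for line in mbox_text.splitlines() if line.startswith("From "))


# Hypothetical two-message mailbox, invented for illustration.
SAMPLE_MBOX = (
    "From user@example.com Wed Sep 16 19:06:10 2015\n"
    "Date: Wed, 16 Sep 2015 19:06:03 +0000\n"
    "\n"
    "Author: cml\n"
    "Date: Wed Sep 16 19:06:03 2015\n"
    "New Revision: 1703436\n"
    "\n"
    "From other@example.com Thu Sep 17 08:00:00 2015\n"
    "Date: Thu, 17 Sep 2015 08:00:00 +0000\n"
    "\n"
    ">From what I can tell, this works.\n"
)

message_count = count_messages(SAMPLE_MBOX)
print(message_count)
```

With this approach the SVN commit mail and the escaped ">From " body line each contribute nothing extra, so the sample is counted as two messages.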

The problem is trivial to fix, but the fix will change the generated statistics, particularly for mailboxes that receive SVN commit messages, since SVN mails are generally counted twice at present. (Git commit mails use a different prefix for the timestamp.)


