Posted to commits@community.apache.org by "Sebb (JIRA)" <ji...@apache.org> on 2015/09/26 02:27:04 UTC

[jira] [Resolved] (COMDEV-161) mailglomper.py may count a message multiple times

     [ https://issues.apache.org/jira/browse/COMDEV-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb resolved COMDEV-161.
-------------------------
    Resolution: Fixed

URL: http://svn.apache.org/viewvc?rev=1705389&view=rev
Log:
COMDEV-161 mailglomper.py may count a message multiple times
Fixed RE to look for "From " at the start of a line
Also changed code to read data by line rather than slurping entire mailbox into memory
Added some timestamp traces to check on performance

Modified:
    comdev/reporter.apache.org/trunk/mailglomper.py
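
The change described in the log could look roughly like the sketch below. This is not the committed code (the revision URL above has that). The function name `count_messages`, the exact regex, and the trace format are all assumptions made for illustration.

```python
import re
import time

# Assumed pattern: an mbox separator is a line that begins with "From "
# followed by an address. Body lines escaped as ">From " do not match.
FROM_LINE = re.compile(r"^From \S+ ")


def count_messages(path):
    """Count messages in an mbox file, reading one line at a time."""
    start = time.time()
    count = 0
    # Iterating over the file object avoids loading the whole mailbox
    # into memory at once.
    with open(path, "r", encoding="utf-8", errors="replace") as mbox:
        for line in mbox:
            if FROM_LINE.match(line):
                count += 1
    # Simple timing trace, in the spirit of the performance checks
    # mentioned in the log.
    print("%s: %d messages in %.3fs" % (path, count, time.time() - start))
    return count
```

Matching each line separately means the pattern is anchored to the start of a line without needing `re.MULTILINE`, and memory use stays flat regardless of mailbox size.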


> mailglomper.py may count a message multiple times
> -------------------------------------------------
>
>                 Key: COMDEV-161
>                 URL: https://issues.apache.org/jira/browse/COMDEV-161
>             Project: Community Development
>          Issue Type: Bug
>          Components: Reporter Tool
>            Reporter: Sebb
>
> The mailglomper.py script counts messages by matching /Date: (.*)/.
> It is looking to match header lines of the form:
> Date: Thu, 01 May 2008 05:06:51 +0000
> However, such lines are not guaranteed to be unique within a message.
> In particular, SVN commit messages have a "Date:" line in the body that also matches, and its parsed timestamp will be close to the header date. For example:
> Author: cml
> Date: Wed Sep 16 19:06:03 2015
> New Revision: 1703436
> The mailbox format currently used by the ASF guarantees that each message is prefixed with a line in the format:
> From user@example.com Thu May 01 05:10:32 2008
> [Lines in the message body starting "From " are prefixed as ">From "; the prefix is removed when messages are extracted]
> Only lines starting "From " are guaranteed not to occur in message bodies.
> The problem is trivial to fix, but the fix will change the generated statistics, particularly for mailboxes that receive SVN commit messages (Git commits use a different prefix for the timestamp). At present, SVN mails are generally counted twice.
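
The double counting described in the issue can be reproduced on a small made-up mailbox. The sketch below is illustrative only: the sample text and the "From " pattern are assumptions, and only the `Date: (.*)` pattern is taken from the issue description.

```python
import re

# A made-up two-message mailbox. The first message is an SVN commit mail
# whose body contains its own "Date:" line.
SAMPLE = (
    "From user@example.com Thu May 01 05:10:32 2008\n"
    "Date: Thu, 01 May 2008 05:06:51 +0000\n"
    "Subject: svn commit\n"
    "\n"
    "Author: cml\n"
    "Date: Wed Sep 16 19:06:03 2015\n"
    "New Revision: 1703436\n"
    "\n"
    "From other@example.com Thu May 01 05:12:00 2008\n"
    "Date: Thu, 01 May 2008 05:11:40 +0000\n"
    "Subject: hello\n"
    "\n"
    ">From here on, body lines starting with From are escaped.\n"
)

# Counting by "Date:" also picks up the line inside the commit body.
by_date = len(re.findall(r"Date: (.*)", SAMPLE))

# Counting by "From " at the start of a line sees only the separators;
# the escaped ">From " body line is ignored.
by_from = len(re.findall(r"^From \S+ ", SAMPLE, flags=re.MULTILINE))

print(by_date)  # 3
print(by_from)  # 2
```

Here the mailbox holds two messages, so the "Date:" count is one too high because of the single SVN commit mail.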



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)