Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/09/26 21:40:07 UTC

[Bug 3824] New: Identify spam with duplicative message ids

http://bugzilla.spamassassin.org/show_bug.cgi?id=3824

           Summary: Identify spam with duplicative message ids
           Product: Spamassassin
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Learner
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: Bob@Menschel.net


My understanding is that every email a system issues should carry a unique 
message-id. In other words, if system A issues an email with message-id 
abcd@a, then no other message should have that message-id. 

Spamming systems often do not follow this requirement; multiple distinct 
spams reach our system with identical message-ids. 

I propose a (version 3.1?) enhancement to the learning systems that will track 
message ids, and identify non-duplicate messages with duplicate message-ids as 
spam. 

Possible mechanism: 

a) As each message is first seen by SA, identify its message-id, date, and 
subject. The message-id can be hashed to save space. At least the first 20 or 
so characters of the subject should be retained for later comparison. 

b) Check for this message-id in the message-id database (similar to the bayes 
database). If this is a new message-id, store the message-id, date, and 
subject into the database, and apply no score to the message. This is unique. 

c) If this is not a new message-id, compare the new date and subject to the 
stored ones. If the dates and subjects are identical, accept the message. This 
allows cc and bcc copies of the same message to be processed more than once. 

d) If the subjects are identical, but the dates are just slightly off (within 
an hour or two, or maybe within a day or two), this might have been an 
accidental or intentional multi-send of the same message. It might even be a 
valid redirect. Accept the message.

e) If this is not a new message-id, and one subject header matches the other 
except for a prefix (a [listname] prefix or something similar), and the dates 
are close, then accept the message. This will allow for messages from mailing 
lists to be processed correctly. 

f) If this is not a new message-id, and the dates are significantly different 
and/or the subject is significantly different, flag the message via a 
"NON_UNIQUE_MID" (or similarly named) rule with a corresponding score. 

g) There should be an automated purge, such that any message-id older than 
some threshold (3 months? settable via local.cf/user_prefs?) is automatically 
deleted from the database. For efficiency, this might be done once a day by a 
spawned child. 

h) Like Bayes, AWL, and net tests, this test should be switchable on/off via 
local.cf and/or user_prefs. If this is a single test (NON_UNIQUE_MID), then 
perhaps this can be done through its score (zero or non-zero). If this is best 
implemented some other way, or if the score method won't work, then it should 
be a parameter that can be specified in these files.
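
To make the lookup, comparison, and purge steps above concrete, here is a 
rough sketch in Python (not SpamAssassin's Perl, and not a proposed patch). 
Everything in it is an assumption made for illustration: the function names, 
the in-memory dict standing in for the message-id database, the 2-day 
"close dates" window, the 90-day purge age, and the choice to strip a leading 
[listname] prefix before storing the truncated subject. 

```python
import hashlib
import re
from datetime import datetime, timedelta

# Hypothetical thresholds; the proposal leaves all of these open.
CLOSE_WINDOW = timedelta(days=2)   # "within a day or two"
MAX_AGE = timedelta(days=90)       # "3 months?"
SUBJECT_KEEP = 20                  # "first 20 or so characters"

# Matches a leading mailing-list tag such as "[listname] ".
LIST_PREFIX = re.compile(r"^\s*\[[^\]]+\]\s*")


def hash_mid(message_id: str) -> str:
    """Hash the message-id so the database stores a fixed-size key."""
    return hashlib.sha1(message_id.strip().encode("utf-8")).hexdigest()


def normalize_subject(subject: str) -> str:
    """Drop a [listname] prefix, collapse whitespace, lowercase, truncate."""
    stripped = LIST_PREFIX.sub("", subject)
    return " ".join(stripped.split()).lower()[:SUBJECT_KEEP]


def check_message(db: dict, message_id: str, date: datetime, subject: str) -> bool:
    """Return True if the message should hit the NON_UNIQUE_MID rule.

    `db` maps hashed message-id -> (date, normalized subject). All dates
    must be consistently naive or consistently timezone-aware.
    """
    key = hash_mid(message_id)
    subj = normalize_subject(subject)
    seen = db.get(key)
    if seen is None:
        # New message-id: remember it and apply no score.
        db[key] = (date, subj)
        return False
    old_date, old_subj = seen
    same_subject = subj == old_subj
    close_dates = abs(date - old_date) <= CLOSE_WINDOW
    # cc/bcc copies, re-sends, and list copies: same subject, close dates.
    if same_subject and close_dates:
        return False
    # Dates and/or subject differ significantly: flag the message.
    return True


def purge(db: dict, now: datetime) -> int:
    """Delete entries older than MAX_AGE; return how many were removed."""
    stale = [key for key, (date, _) in db.items() if now - date > MAX_AGE]
    for key in stale:
        del db[key]
    return len(stale)
```

A second copy with the same date and subject, or a list copy such as 
"[list] Hello world" arriving an hour later, would be accepted here; the same 
message-id with a different subject, or with a date a month later, would be 
flagged. Exact comparison of the truncated subject is the simplest reading of 
"significantly different"; a real implementation might want something fuzzier. 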


