Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/09/26 21:40:07 UTC
[Bug 3824] New: Identify spam with duplicative message ids
http://bugzilla.spamassassin.org/show_bug.cgi?id=3824
Summary: Identify spam with duplicative message ids
Product: Spamassassin
Version: unspecified
Platform: All
OS/Version: All
Status: NEW
Severity: enhancement
Priority: P5
Component: Learner
AssignedTo: dev@spamassassin.apache.org
ReportedBy: Bob@Menschel.net
My understanding is that emails issued by systems should have unique message
ids. In other words, if system A issues an email with message-id abcd@a, then
no other message should have that message-id.
Spamming systems often do not follow this requirement; multiple spams reach
our system with identical message-ids.
I propose a (version 3.1?) enhancement to the learning systems that will track
message ids, and identify non-duplicate messages with duplicate message-ids as
spam.
Possible mechanism:
a) as each message is first seen by SA, identify its message-id, date, and
subject. Message-id can be hashed to save space. At least the first 20 or so
characters of subject should be maintained for later use.
b) Check for this message-id in the message-id database (similar to the Bayes
database). If this is a new message-id, store the message-id, date, and
subject in the database, and apply no score to the message: it is unique so far.
c) If this is not a new message-id, compare the date and new subject to the
old one. If the dates and subjects are identical, accept the message. This
will allow for multiple processing of cc and bcc copies.
d) If the subjects are identical, but the dates are just slightly off (within
an hour or two, or maybe within a day or two), this might have been an
accidental or intentional multi-send of the same message. It might even be a
valid redirect. Accept the message.
e) If this is not a new message-id, and one subject header matches the other
except for a prefix (a [listname] prefix or something similar), and the dates
are close, then accept the message. This will allow for messages from mailing
lists to be processed correctly.
f) If this is not a new message-id, and the dates are significantly different
and/or the subject is significantly different, flag the message via a
"NON_UNIQUE_MID" (or similarly named) rule with a corresponding score.
g) There should be an automated purge, such that any message-id whose age
is greater than some threshold (3 months? settable via local.cf/user_prefs?)
is automatically deleted from the database. For efficiency, this might be done
automatically by a spawned child once a day.
h) Like Bayes, AWL, and net tests, this test should be switchable on/off via
local.cf and/or user_prefs. If this is a single test (NON_UNIQUE_MID), then
perhaps this can be done through its score (zero or not-zero). If this is best
implemented some other way, or if the score method won't work, then it should
be a parameter that can be specified in these files.
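
The lookup, comparison, and purge steps above could be sketched roughly as
follows. This is only an illustration in Python, not SpamAssassin code: the
names check_message, purge, and Seen, the in-memory dict standing in for the
database, the 2-day "close" window, and the 90-day purge age are all
assumptions made for the example. The sketch folds the three "accept" cases
(same date, slightly different date, list-prefixed subject) into one test by
stripping a leading [listname] tag before comparing subjects.

```python
import hashlib
import re
from dataclasses import dataclass
from datetime import datetime, timedelta

# All thresholds below are assumptions; the proposal leaves them open.
CLOSE_WINDOW = timedelta(days=2)   # "within a day or two"
MAX_AGE = timedelta(days=90)       # "3 months?"
SUBJECT_LEN = 20                   # "first 20 or so characters of subject"
LIST_TAG = re.compile(r"^\s*\[[^\]]+\]\s*")  # leading "[listname]" prefix


@dataclass
class Seen:
    date: datetime
    subject: str  # normalized: list tag stripped, then truncated


def mid_key(message_id: str) -> str:
    """Hash the message-id so the database stores a fixed-size key."""
    return hashlib.sha1(message_id.strip().encode("utf-8")).hexdigest()


def normalize(subject: str) -> str:
    """Drop a leading [listname] tag and keep the first SUBJECT_LEN chars."""
    return LIST_TAG.sub("", subject)[:SUBJECT_LEN]


def check_message(db: dict, message_id: str, date: datetime, subject: str) -> bool:
    """Return True if the hypothetical NON_UNIQUE_MID rule should fire.

    Dates must be mutually comparable (all naive or all timezone-aware).
    """
    key = mid_key(message_id)
    old = db.get(key)
    if old is None:
        # New message-id: remember it and apply no score.
        db[key] = Seen(date, normalize(subject))
        return False
    same_subject = normalize(subject) == old.subject
    close_dates = abs(date - old.date) <= CLOSE_WINDOW
    # cc/bcc copies, re-sends, and list copies are accepted;
    # anything else reusing the message-id is flagged.
    return not (same_subject and close_dates)


def purge(db: dict, now: datetime) -> int:
    """Delete entries older than MAX_AGE; return how many were removed."""
    stale = [k for k, seen in db.items() if now - seen.date > MAX_AGE]
    for k in stale:
        del db[k]
    return len(stale)
```

In this sketch a flagged message does not overwrite the stored entry, so the
first sighting of a message-id remains the reference for later comparisons;
whether that is the right policy is an open design question.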