Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/29 00:49:21 UTC
[Bug 2981] New: inoculation support?
http://bugzilla.spamassassin.org/show_bug.cgi?id=2981
Summary: inoculation support?
Product: Spamassassin
Version: SVN Trunk (Latest Devel Version)
Platform: Other
OS/Version: other
Status: NEW
Severity: enhancement
Priority: P4
Component: Learner
AssignedTo: spamassassin-dev@incubator.apache.org
ReportedBy: jm@jmason.org
BTW, here's an interesting idea we ran into at the Spam Conf 2004.
http://lists.netsys.com/pipermail/full-disclosure/2003-November/013840.html
http://www.nuclearelephant.com/projects/dspam/draft-spamfilt-inoculation-01.txt
Basically, it's quite simple -- a standard MIME wrapper for training
spam filters.
My issue with this proposal, however, is what happens when you have
a trained db with these tokens:
SPAMCOUNT  HAMCOUNT  TOKEN
        1         3  foo
        1         3  bar
Note, both are hammy tokens.
If you have 8 friends who have you in their inoculation list, and they all
get copies of *1* single spam message containing "bar" as a token, and they all
inoculate you, that'll result in:
SPAMCOUNT  HAMCOUNT  TOKEN
        1         3  foo
        9         3  bar
hence -- "bar" becomes a strongly spammy token, even though in reality that was
a result of a single spam run.
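To make the arithmetic concrete, here is a rough sketch of the scenario above. The helper names are made up for illustration, and spam / (spam + ham) is only a simplified stand-in for a real Bayes token probability, not the formula SpamAssassin actually uses.

```python
# Toy token db: token -> (spamcount, hamcount), matching the table above.
db = {"foo": (1, 3), "bar": (1, 3)}


def learn_spam(db, tokens):
    """Count one spam message: bump spamcount once per distinct token."""
    for tok in set(tokens):
        spam, ham = db.get(tok, (0, 0))
        db[tok] = (spam + 1, ham)


def spam_fraction(db, tok):
    """Simplified spamminess: share of sightings that were spam."""
    spam, ham = db[tok]
    return spam / (spam + ham)


before = spam_fraction(db, "bar")

# Eight friends each inoculate us with their copy of the same spam.
for _ in range(8):
    learn_spam(db, ["bar"])

after = spam_fraction(db, "bar")
print(db["bar"], before, after)
```

Under this simplified measure "bar" goes from 0.25 to 0.75 on the strength of one spam run, while "foo" is untouched.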
In other words, inoculation can distort Bayes training: inoculated tokens, IMO,
are likely to end up weighted more strongly than personally-trained tokens.
This could be avoided by using a hash of the message body as a message
identifier, so that once one person inoculates you for a given spam, you
learn it once and ignore future inoculations for that spam. The issue then is:
what is a reliable message ID for spam, given that spammers routinely evade body
hashing, fake Message-ID headers, etc.?
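A minimal sketch of the learn-once idea, assuming a naive identifier: a SHA-256 digest of the lowercased, whitespace-collapsed body. The class and function names are hypothetical, and the last lines show exactly the weakness raised above: a trivially mutated copy gets a different digest and is learned again.

```python
import hashlib


def body_digest(body):
    """Naive message identifier: hash of the normalized body text."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class InoculationLearner:
    """Learns each inoculated spam body at most once (illustrative only)."""

    def __init__(self, db=None):
        self.db = dict(db or {})   # token -> (spamcount, hamcount)
        self.seen = set()          # digests of bodies already learned

    def inoculate(self, body):
        digest = body_digest(body)
        if digest in self.seen:
            return False           # duplicate inoculation, ignored
        self.seen.add(digest)
        for tok in set(body.lower().split()):
            spam, ham = self.db.get(tok, (0, 0))
            self.db[tok] = (spam + 1, ham)
        return True


learner = InoculationLearner({"bar": (1, 3)})

# Eight friends forward the same spam; only the first copy is learned.
results = [learner.inoculate("Buy  bar   NOW") for _ in range(8)]
count_after_duplicates = learner.db["bar"]

# A copy with one random word appended evades the hash and is learned again.
evaded = learner.inoculate("Buy bar now xq7z")
count_after_mutation = learner.db["bar"]
```

With exact duplicates the count for "bar" rises only from 1 to 2, but each mutated copy still adds one, so the scheme is only as good as the message identifier.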
Comments?