Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/29 00:49:21 UTC
[Bug 2981] New: inoculation support?

http://bugzilla.spamassassin.org/show_bug.cgi?id=2981

           Summary: inoculation support?
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P4
         Component: Learner
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: jm@jmason.org


BTW, here's an interesting idea we ran into at the Spam Conf 2004.

http://lists.netsys.com/pipermail/full-disclosure/2003-November/013840.html
http://www.nuclearelephant.com/projects/dspam/draft-spamfilt-inoculation-01.txt

Basically, it's quite simple -- a standard MIME wrapper for training
spam filters.

My issue with this proposal, however, is what happens when you have
a trained db with these tokens:

        SPAMCOUNT       HAMCOUNT        TOKEN
        1               3               foo
        1               3               bar

Note that both are hammy tokens: each was seen 3 times in ham and once in spam.

If you have 8 friends who have you in their inoculation list, and they all
get copies of *1* single spam message containing "bar" as a token, and they all
inoculate you, that'll result in:

        SPAMCOUNT       HAMCOUNT        TOKEN
        1               3               foo
        9               3               bar

Hence "bar" becomes a strongly spammy token, even though in reality all 8 of
the added spam counts came from a single spam run.
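The arithmetic above can be sketched in a few lines of Python. This is only an
illustration: the dict layout, the helper names, and the naive
spam/(spam+ham) ratio are assumptions made for the example, not
SpamAssassin's actual Bayes storage or probability formula.

```python
# Hypothetical token db: token -> [spam_count, ham_count],
# seeded with the counts from the table above.
db = {"foo": [1, 3], "bar": [1, 3]}

def learn_spam(db, tokens):
    """Count each distinct token once per learned spam message."""
    for tok in set(tokens):
        db.setdefault(tok, [0, 0])[0] += 1

def spam_ratio(db, token):
    """Naive spamminess estimate (assumption, not the real Bayes formula)."""
    spam, ham = db[token]
    return spam / (spam + ham)

before = spam_ratio(db, "bar")

# 8 friends each inoculate us with the same single spam containing "bar".
for _ in range(8):
    learn_spam(db, ["bar"])

after = spam_ratio(db, "bar")
print(db["bar"], before, after)  # [9, 3] 0.25 0.75
```

One spam run moves "bar" from a ratio of 0.25 to 0.75, while "foo" is untouched.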

In other words, inoculation does bad things for Bayes training: inoculated
tokens, IMO, are likely to end up "stronger" than personally-trained tokens,
because a single message gets counted once per friend who inoculates you.

This could be avoided by using a hash of the message body somehow as a message
identifier, so that once 1 person inoculates you for a given spam, you will
learn it once and ignore future inoculations for it. But then the issue is:
what is a reliable message id for spam, given that spammers routinely evade body
hashing, fake Message-Id headers, etc.?
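A rough sketch of the dedup idea, and of why it is fragile, might look like
the following. Everything here is hypothetical: the whitespace/case
normalization, the choice of SHA-256, the whitespace tokenizer, and the
function names are assumptions for illustration, not anything specified by
the inoculation draft or implemented in SpamAssassin.

```python
import hashlib

def body_id(body):
    """Hypothetical message id: SHA-256 of the case/whitespace-normalized body."""
    normalized = " ".join(body.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def inoculate(db, seen, body):
    """Learn a spam body once; return False if this body id was already seen."""
    mid = body_id(body)
    if mid in seen:
        return False
    seen.add(mid)
    for tok in set(body.lower().split()):
        db.setdefault(tok, [0, 0])[0] += 1  # bump spam count only
    return True

db = {"foo": [1, 3], "bar": [1, 3]}  # token -> [spam_count, ham_count]
seen = set()

# 8 friends forward the same spam: only the first inoculation is learned.
results = [inoculate(db, seen, "Buy  BAR now") for _ in range(8)]

# The same spam with a random suffix gets a different hash and is learned again.
evaded = inoculate(db, seen, "Buy BAR now xq7z")

print(results.count(True), evaded, db["bar"])  # 1 True [3, 3]
```

The exact-duplicate case is handled, but one trivially mutated copy is enough
to slip past the hash, which is the open question above.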

Comments?


