
Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/02/18 01:18:30 UTC

[Bug 3055] New: Bayes: use hash instead of Message-Id?

http://bugzilla.spamassassin.org/show_bug.cgi?id=3055

           Summary: Bayes: use hash instead of Message-Id?
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Learner
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: jm@jmason.org


Folks --

this has come up before, but I think we might as well raise it again ;)

Basically, Robert Menschel noted on Fri, 13 Feb 2004 20:59:56 -0800
in this mail

  Subject: Re[2]: Some real anti-bayes stuffing followup
  Date: Fri, 13 Feb 2004 20:59:56 -0800
  Cc: spamassassin-users.incubator.apache.org

the following:

'I've received multiple spams all using the same message id.

a) If a ham is sent to my domain with four recipients here, then because
of the way I run SA, I could process that email four times, once for each
mailbox. That's expected. And it's expected that each of those emails
will have identical bodies, and identical subjects.

b) I receive spam where in a given day I can receive similar spam,
identical message ids, but with different subject headers (usually random
words or letters added to a subject), and/or with different bodies
(sometimes minor random differences, sometimes very different messages).

c) I receive spam where on Jan 2 I can receive spam with a given message
ID, and I can receive spam (similar or not) with identical message ids on
Jan 14, Jan 30, Feb 12, etc.'

I think this is probably a bayes-evasion technique, since we key
our bayes_seen db on Message-ID if present.

What were the objections to using a hash of some selected headers (From, To,
Subject) and the message body, again?  Strikes me this is a more resilient
way to avoid spammers using 1 message ID for all their spam and evading
bayes learning that way.
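
To make the idea concrete, here is a minimal sketch in Python (not SpamAssassin's Perl) of keying a "seen" entry on message content rather than on Message-ID. The helper name `seen_key`, the use of SHA-1, and the exact way the three headers and the body are fed into the digest are all assumptions for illustration, not anything taken from the Bayes code.

```python
import hashlib
from email import message_from_string

# Headers suggested in the report; the Message-ID is deliberately left out.
HASHED_HEADERS = ("From", "To", "Subject")


def seen_key(raw_message: str) -> str:
    """Hypothetical helper: derive a 'seen' key from headers plus body."""
    text = raw_message.replace("\r\n", "\n")
    msg = message_from_string(text)
    # Everything after the first blank line is treated as the body.
    _, _, body = text.partition("\n\n")

    digest = hashlib.sha1()  # SHA-1 is an arbitrary choice for this sketch
    for name in HASHED_HEADERS:
        value = str(msg.get(name, ""))
        digest.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    digest.update(body.encode("utf-8", "replace"))
    return digest.hexdigest()
```

With a key like this, two spams that share a Message-ID but differ in subject or body get different keys, while the same content under a different Message-ID maps to the same key.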

--j.




Re: [Bug 3055] New: Bayes: use hash instead of Message-Id?

Posted by Daniel Quinlan <qu...@pathname.com>.

bugzilla-daemon@bugzilla.spamassassin.org writes:

> What were the objections to using a hash of some selected headers (From, To,
> Subject) and the message body, again?  Strikes me this is a more resilient
> way to avoid spammers using 1 message ID for all their spam and evading
> bayes learning that way.

I agree.  Let's move to using the hash in 3.0.

I think the main concerns (I wouldn't say objections) were (or should
be):

1. overhead of computing the hash (not a big deal, I think)

2. stability of the hash to minor changes (like whitespace in headers,
   whitespace at end of body, header sorting, Received headers, etc.)
   that could cause a mismatch in generated ID from one hashing to the
   next.

3. backward compatibility with existing Bayes databases.
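
As a rough illustration of concerns 2 and 3, the Python sketch below normalizes whitespace before hashing, hashes only a fixed set of headers (so header order and Received lines do not matter), and falls back to the old Message-ID key on lookup. The names `stable_key` and `already_seen`, the normalization rules, the use of SHA-1, and the dict standing in for the seen database are all assumptions for illustration; none of this is the actual SpamAssassin implementation.

```python
import hashlib
import re
from email import message_from_string

HASHED_HEADERS = ("From", "To", "Subject")  # fixed order, Received ignored


def normalize_header(value: str) -> str:
    # Collapse folding and repeated whitespace into single spaces.
    return re.sub(r"\s+", " ", value).strip()


def normalize_body(body: str) -> str:
    # Drop trailing whitespace on each line and trailing blank lines.
    lines = [line.rstrip() for line in body.replace("\r\n", "\n").split("\n")]
    while lines and lines[-1] == "":
        lines.pop()
    return "\n".join(lines)


def stable_key(raw_message: str) -> str:
    """Hypothetical helper: whitespace-tolerant content hash."""
    text = raw_message.replace("\r\n", "\n")
    msg = message_from_string(text)
    _, _, body = text.partition("\n\n")
    digest = hashlib.sha1()  # arbitrary digest choice for this sketch
    for name in HASHED_HEADERS:
        value = normalize_header(str(msg.get(name, "")))
        digest.update(f"{name}:{value}\n".encode("utf-8", "replace"))
    digest.update(normalize_body(body).encode("utf-8", "replace"))
    return digest.hexdigest()


def already_seen(raw_message: str, seen_db: dict) -> bool:
    """Check the new hash key first, then the legacy Message-ID key."""
    if stable_key(raw_message) in seen_db:
        return True
    legacy_id = message_from_string(raw_message).get("Message-ID")
    return legacy_id is not None and str(legacy_id).strip() in seen_db
```

The legacy fallback keeps old databases usable, but it also keeps the Message-ID reuse problem alive for entries learned before the switch, so it would probably need to be phased out or limited to a migration step.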

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting