Posted to users@spamassassin.apache.org by Robert Menschel <Ro...@Menschel.net> on 2004/02/14 05:59:56 UTC

Re[2]: Some real anti-bayes stuffing followup

Hello Bart, Devs,

Friday, February 13, 2004, 12:33:27 PM, you wrote, concerning Bayes:

BS> (I hope the use of message-id for this goes by the wayside soon,
BS> before spammers get the bright idea to steal old message-id headers
BS> from nonspam usenet or list archives and insert them into newly
BS> generated spam.)

Actually, a new spam-detecting mechanism could be to look for duplicate
message ids. I've received multiple spams all using the same message id.

a) If a ham is sent to my domain with four recipients here, then because
of the way I run SA, I could process that email four times, once for each
mailbox. That's expected. And it's expected that each of those emails
will have identical bodies, and identical subjects.

b) On a given day I can receive several similar spams with identical
message ids but different subject headers (usually random words or
letters added to a subject), and/or different bodies (sometimes minor
random differences, sometimes very different messages).

c) I can receive spam with a given message id on Jan 2, and then spam
(similar or not) with the identical message id on Jan 14, Jan 30,
Feb 12, etc.

I suggest that if we could store a record with three or four fields,
message-id, checksum(subject), checksum(body), and maybe time(firstseen),
we could use this as a database, and apply a rule (maybe named
DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
time(now) is significantly different from time(firstseen).

Does this seem like a worthwhile approach?
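
A minimal Python sketch of the record and rule proposed above, for
illustration only. It is not SpamAssassin code: the class name
MessageIdStore, the in-memory dict, the use of SHA-1 as the checksum,
and the seven-day age threshold are all assumptions; the proposal itself
only names the fields and the two conditions.

```python
import hashlib
import time

# Hypothetical threshold for "significantly different" from time(firstseen).
MAX_AGE_SECONDS = 7 * 24 * 3600


def checksum(text):
    """Checksum of a subject or body (SHA-1 is an assumed choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class MessageIdStore:
    """Maps message-id -> (checksum(subject), checksum(body), time(firstseen))."""

    def __init__(self, max_age=MAX_AGE_SECONDS):
        self.max_age = max_age
        self.records = {}

    def is_duplicate_messageid(self, msgid, subject, body, now=None):
        """Return True when the hypothetical DUPLICATE_MESSAGEID rule would fire."""
        now = time.time() if now is None else now
        subj_sum, body_sum = checksum(subject), checksum(body)
        record = self.records.get(msgid)
        if record is None:
            # First sighting: remember it, nothing to compare against yet.
            self.records[msgid] = (subj_sum, body_sum, now)
            return False
        seen_subj, seen_body, first_seen = record
        # Condition (1): same message id, but subject or body checksum differs.
        checksums_differ = (subj_sum, body_sum) != (seen_subj, seen_body)
        # Condition (2): same message id seen again long after the first sighting.
        too_old = now - first_seen > self.max_age
        return checksums_differ or too_old
```

With this sketch, case (a) does not fire (same id, same checksums, close
in time), case (b) fires on condition (1), and case (c) fires on
condition (2).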

Bob Menschel

Re: Re[2]: Some real anti-bayes stuffing followup

Posted by "Brent J. Nordquist" <b-...@bethel.edu>.

On Sat, 14 Feb 2004, Keith C. Ivey <kc...@cpcug.org> wrote:

> David B Funk <db...@engineering.uiowa.edu> wrote:
> 
> > Silly question, how does Bayes deal with a message that has -no-
> > Message-ID? Unlike NNTP, SMTP does not require a Message-ID, just
> > recommends one.
> 
> If there is no message ID, SA uses a hash of the message text 
> followed by '@sa_generated'.

Note that many MTAs will add a Message-ID on the way through if the
message didn't have one already. SA, in turn, uses that as useful
intelligence; search for MSGID_FROM_MTA_ in the *.cf rules distributed
with SA.

-- 
Brent J. Nordquist <b-...@bethel.edu> N0BJN
Other contact information: http://kepler.acns.bethel.edu/~bjn/contact.html
* Fast pipe * Always on * Get out of the way - Tim Bray http://tinyurl.com/7sti

Re: Re[2]: Some real anti-bayes stuffing followup

Posted by "Keith C. Ivey" <kc...@cpcug.org>.

David B Funk <db...@engineering.uiowa.edu> wrote:

> Silly question, how does Bayes deal with a message that has -no-
> Message-ID? Unlike NNTP, SMTP does not require a Message-ID, just
> recommends one.

If there is no message ID, SA uses a hash of the message text 
followed by '@sa_generated'.  Unfortunately that means if the 
message is modified at a later stage before delivery it won't 
be possible to correct mislearning (of course, relearning a 
modified message doesn't work completely right even if there is 
a message ID).
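
A rough Python sketch of the fallback described above, to show why a
later modification breaks relearning. This is not SpamAssassin's actual
code: the function name bayes_message_id, the choice of SHA-1, and
hashing only the body are assumptions made for the example.

```python
import hashlib


def bayes_message_id(headers, body):
    """Return the Message-ID if present, else a hash of the text + '@sa_generated'."""
    msgid = headers.get("Message-ID", "").strip()
    if msgid:
        return msgid
    # No Message-ID: derive an identifier from the message text itself.
    digest = hashlib.sha1(body.encode("utf-8")).hexdigest()
    return digest + "@sa_generated"


# The same text always yields the same generated id...
original = bayes_message_id({}, "Buy now!\n")
# ...but text modified before delivery (say, an appended footer) yields a
# different id, so the learned entry can no longer be found and corrected.
modified = bayes_message_id({}, "Buy now!\n-- list footer\n")
```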

-- 
Keith C. Ivey <kc...@cpcug.org>
Washington, DC

Re[2]: Some real anti-bayes stuffing followup

Posted by David B Funk <db...@engineering.uiowa.edu>.

On Fri, 13 Feb 2004, Robert Menschel wrote:

> Hello Bart, Devs,
>
> Friday, February 13, 2004, 12:33:27 PM, you wrote, concerning Bayes:
>
> BS> (I hope the use of message-id for this goes by the wayside soon,
> BS> before spammers get the bright idea to steal old message-id headers
> BS> from nonspam usenet or list archives and insert them into newly
> BS> generated spam.)
>
> Actually, a new spam-detecting mechanism could be to look for duplicate
> message ids. I've received multiple spams all using the same message id.

Silly question, how does Bayes deal with a message that has -no-
Message-ID? Unlike NNTP, SMTP does not require a Message-ID, just
recommends one.

I see many messages a day that come into our mail server that totally lack
a Message-ID (I use that as a spam-sign and assign a value of 1.5 to it ;).
My sendmail daemon synthesizes a Message-ID before delivery but it isn't
there during the filtering process.
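
For illustration, the missing Message-ID check can be sketched in Python
with the standard email module. The function name and constant are
hypothetical, and this is not how the rule is written on the server
described above; only the 1.5 score comes from the text.

```python
from email import message_from_string

# Score taken from the text above; the names here are made up for the example.
MISSING_MSGID_SCORE = 1.5


def missing_msgid_score(raw_message):
    """Return 1.5 if the message has no (or an empty) Message-ID, else 0.0."""
    msg = message_from_string(raw_message)
    msgid = msg.get("Message-ID", "")
    return MISSING_MSGID_SCORE if not msgid.strip() else 0.0
```

The check has to run on the message as it arrives, before the MTA
synthesizes a Message-ID, or it will never fire.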

-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: Some real anti-bayes stuffing followup

Posted by "Keith C. Ivey" <kc...@cpcug.org>.

Bart Schaefer <sc...@zanshin.com> wrote, responding to Robert
Menschel's proposal for catching duplicate message IDs:

> Just two points before I go to bed:
> 
> (1) Isn't this effectively what DCC, Razor, Pyzor, etc. already do?

How is that?  We're talking about different messages that have 
the same message ID.

> (2) Isn't most of this data already in the Bayes database, just being
> used differently?

It's true that the message IDs of learned messages are in the 
Bayes DB, so it should be possible to use that to catch the 
duplicates.  I agree with Jon that it's probably not worth the 
trouble, though.  I have seen spammers occasionally reuse 
message IDs, but it doesn't really give them much benefit, so 
it's not widespread.

-- 
Keith C. Ivey <kc...@cpcug.org>
Washington, DC

Re[4]: Some real anti-bayes stuffing followup

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Jon,

Friday, February 13, 2004, 9:11:41 PM, you wrote:

J> On Fri, 2004-02-13 at 20:59, Robert Menschel wrote:
>> I suggest that if we could store a record with three or four fields,
>> message-id, checksum(subject), checksum(body), and maybe time(firstseen),
>> we could use this as a database, and apply a rule (maybe named
>> DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
>> time(now) is significantly different from time(firstseen).
>> 
>> Does this seem like a worthwhile approach?

J> IANAD (I am not a developer) but I don't think this is a worthwhile
J> approach for two related reasons:

J> * it costs us (the mail admins) too much
J> * it costs spammers too little

J> We would need to go through the effort of implementing this in code,
J> then set aside resources (disk and CPU) to checksum and record these
J> attributes of incoming messages.

I see this resource requirement as being minimal -- a small fraction of
what we do currently with Bayes.

J> In response, spammers would only need to insert a %RND_MSG_ID to
J> render all our efforts useless.

It'd be easier to simply have their spam-mail programs generate normal,
unique message ids...

Bob Menschel

Re: Some real anti-bayes stuffing followup

Posted by Bart Schaefer <sc...@zanshin.com>.

On Fri, 13 Feb 2004, Robert Menschel wrote:

> I suggest that if we could store a record with three or four fields,
> message-id, checksum(subject), checksum(body), and maybe
> time(firstseen), we could use this as a database, and apply a rule
> (maybe named DUPLICATE_MESSAGEID) where either (1) checksums don't
> match, or (2) time(now) is significantly different from time(firstseen).

On Fri, 13 Feb 2004, Jon wrote:

> IANAD (I am not a developer) but I don't think this is a worthwhile
> approach for two related reasons:
> 
> * it costs us (the mail admins) too much
> * it costs spammers too little

Just two points before I go to bed:

(1) Isn't this effectively what DCC, Razor, Pyzor, etc. already do?

(2) Isn't most of this data already in the Bayes database, just being
used differently?

Re: Re[2]: Some real anti-bayes stuffing followup

Posted by Jon <jo...@tgpsolutions.com>.

On Fri, 2004-02-13 at 20:59, Robert Menschel wrote:
> I suggest that if we could store a record with three or four fields,
> message-id, checksum(subject), checksum(body), and maybe time(firstseen),
> we could use this as a database, and apply a rule (maybe named
> DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
> time(now) is significantly different from time(firstseen).
> 
> Does this seem like a worthwhile approach?
> 

IANAD (I am not a developer) but I don't think this is a worthwhile
approach for two related reasons:

* it costs us (the mail admins) too much
* it costs spammers too little

We would need to go through the effort of implementing this in code,
then set aside resources (disk and CPU) to checksum and record these
attributes of incoming messages.

In response, spammers would only need to insert a %RND_MSG_ID to render
all our efforts useless.  

- Jon

-- 
jon@tgpsolutions.com

Administrator, tgpsolutions
http://www.tgpsolutions.com