Posted to users@spamassassin.apache.org by Andrew Sykes <an...@sykesdevelopment.com> on 2006/11/24 16:01:12 UTC

Newbie Question

Hi,

I'm writing some code to integrate SpamAssassin with Apache JAMES.

I want to set up an address to allow me to pipe spam into sa-learn. I
have a prototype of this working fine, but would like to allow various
webmail client users to be able to forward spam messages to this
address.

As I have very limited understanding of how SA works, I don't want to
end up blocking the forwarding addresses.

If I whitelist the forwarding addresses, can I then simply pipe a
forwarded spam from that address into sa-learn or is there more to it?

Thanks a lot for your help.
-- 
Kind Regards
Andrew Sykes <an...@sykesdevelopment.com>
Sykes Development Ltd
http://www.sykesdevelopment.com


Re: Newbie Question

Posted by Matt Kettler <mk...@verizon.net>.
Andrew Sykes wrote:
> Hi,
>
> I'm writing some code to integrate SpamAssassin with Apache JAMES.
>
> I want to setup an address to allow me to pipe spam into sa-learn. I
> have a prototype of this working fine, but would like to allow various
> webmail client users to be able to forward spam messages to this
> address.
>
> As I have very limited understanding of how SA works, I don't want to
> end up blocking the forwarding addresses.
>
> If I whitelist the forwarding addresses, can I then simply pipe a
> forwarded spam from that address into sa-learn or is there more to it?
>   

There's MUCH more to it. In fact, whitelisting won't really affect what
sa-learn does at all.

Generally speaking, forwarded messages are mostly useless to sa-learn.
Exactly how useless depends a bit on the mail client.

SA tokenizes MANY mail headers, including Received:, not just From: and
To:. All the headers in a forwarded message are completely new, so
sa-learn will be learning the headers generated by the forwarding
client, not those of the spam.
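
To make the header point concrete, here is a minimal Python sketch using
only the standard library. Both sample messages, their addresses and the
X-Mailer values are made up for illustration, and the comparison is a
plain set intersection, not SpamAssassin's actual tokenizer.

```python
from email import message_from_string

# Hypothetical original spam as it arrived at the user's mailbox.
ORIGINAL = """\
Received: from mail.spammer.example by mx.example.org; Fri, 24 Nov 2006 10:00:00 +0000
From: "Cheap Pills" <promo@spammer.example>
To: victim@example.org
Subject: Buy now
X-Mailer: BulkBlaster 2.0

Buy now!
"""

# Hypothetical result of the user clicking "Forward" in a webmail client.
FORWARDED = """\
Received: from webmail.example.org by mx.example.org; Fri, 24 Nov 2006 11:30:00 +0000
From: "A. User" <user@example.org>
To: spamtrap@example.org
Subject: Fwd: Buy now
X-Mailer: SomeWebmail 1.4

---- Forwarded message ----
Buy now!
"""


def header_set(raw):
    """Return the (lower-cased name, value) pairs of a raw message."""
    msg = message_from_string(raw)
    return {(name.lower(), value) for name, value in msg.items()}


original_headers = header_set(ORIGINAL)
forwarded_headers = header_set(FORWARDED)
shared = original_headers & forwarded_headers

print("headers in original:", len(original_headers))
print("headers surviving the forward:", len(shared))
```

In this made-up pair no header survives unchanged, so anything learned
from the forwarded copy's headers describes the forwarding user and
their webmail client.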

SA also tokenizes the body of the message. However, most mail clients
substantially modify the body of the message when you forward it.
Generally speaking, they preserve only one of the MIME sections in a
multipart/alternative message. Spammers FREQUENTLY have text/plain
sections which are dissimilar from the text/html. By forwarding you're
losing all but one MIME section (generally text/html is kept).
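
As a rough illustration of the lost MIME section, the sketch below
builds a made-up multipart/alternative spam with Python's standard
email package and then imitates a client that keeps only the text/html
part. The message text and the "keep HTML only" behaviour are
assumptions for the example; real clients vary.

```python
from email.message import EmailMessage

# Made-up spam whose text/plain part differs from its text/html part.
spam = EmailMessage()
spam["Subject"] = "Hello"
spam.set_content("Weather report filler text meant to look harmless")
spam.add_alternative(
    "<html><body><b>Buy pills now</b></body></html>", subtype="html"
)


def parts_by_type(msg):
    """Map each leaf part's content type to its decoded text."""
    return {
        part.get_content_type(): part.get_content().strip()
        for part in msg.walk()
        if not part.is_multipart()
    }


original_parts = parts_by_type(spam)

# Hypothetical forwarding client: only the HTML alternative is kept.
forwarded = EmailMessage()
forwarded["Subject"] = "Fwd: Hello"
forwarded.set_content(original_parts["text/html"], subtype="html")
forwarded_parts = parts_by_type(forwarded)

lost = set(original_parts) - set(forwarded_parts)
print("content types lost by forwarding:", sorted(lost))
```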

On top of this, most mail clients also insert "Forwarded message:" type
text into the body, and add Fwd: to the subject.

SA also tokenizes the in-body MIME headers describing how the message
was encoded. However, when you forward, the mail client doing the
forward re-encodes things its own way. What might have been base64
encoded may now be quoted-printable, 8bit, or 7bit.
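
The re-encoding effect can be sketched with Python's standard email
package: the same made-up body is written once as base64 and once as
quoted-printable. The body text is hypothetical; the point is only that
the MIME headers and raw payload differ while the decoded text does not.

```python
from email.message import EmailMessage

BODY = "Buy cheap pills now\n"  # made-up body text


def build(cte):
    """Build a text/plain message with the given Content-Transfer-Encoding."""
    msg = EmailMessage()
    msg.set_content(BODY, cte=cte)
    return msg


as_base64 = build("base64")
as_qp = build("quoted-printable")

for msg in (as_base64, as_qp):
    print(msg["Content-Transfer-Encoding"], repr(msg.get_payload()))
```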

So, fundamentally, as far as bayes is concerned the forwarded message is
a completely different message than the original spam.

You can try this sometime by taking an original spam and a forwarded
version of it, and feeding them both to spamassassin or sa-learn with
"-D bayes" added. This will cause the debug output to list all the
tokens used. Take a look at the tokens. Some are the same, but many are
different.
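
One way to script that comparison is sketched below in Python. It
assumes that spamassassin reads the message on stdin, writes its debug
output to stderr, and marks the relevant lines with the text "bayes:".
The exact layout of those lines is an assumption here, so the helper
keeps only what follows "bayes:" and compares the results as sets. The
file names in the usage comment are hypothetical.

```python
import subprocess


def bayes_debug_lines(debug_text):
    """Collect the text after 'bayes:' from each matching debug line."""
    lines = set()
    for line in debug_text.splitlines():
        _, sep, rest = line.partition("bayes:")
        if sep:
            lines.add(rest.strip())
    return lines


def run_bayes_debug(message_path, command="spamassassin"):
    """Run `<command> -D bayes` on one message file and return stderr text."""
    with open(message_path, "rb") as fh:
        result = subprocess.run(
            [command, "-D", "bayes"],
            stdin=fh,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            check=False,
        )
    return result.stderr.decode("utf-8", errors="replace")


def compare(original_debug, forwarded_debug):
    """Return (shared, only_in_original, only_in_forwarded) line sets."""
    a = bayes_debug_lines(original_debug)
    b = bayes_debug_lines(forwarded_debug)
    return a & b, a - b, b - a


# Usage, with hypothetical file names:
#   shared, only_orig, only_fwd = compare(
#       run_bayes_debug("original.eml"), run_bayes_debug("forwarded.eml"))
#   print(len(shared), len(only_orig), len(only_fwd))
```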

RE: Newbie Question

Posted by Giampaolo Tomassoni <g....@libero.it>.
From: Michael W Cocke [mailto:cocke@catherders.com]
> 
> For what it's worth, on the system here I have a special directory on
> the server set up, and when the users get a spam message they do a
> 'save as ascii text file' to that directory. sa-learn runs thru that
> directory every half hour.  Just a thought.

It would be better to learn ham too, not just spam. Otherwise SA may become more prone to false positives (FPs).
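
A minimal Python sketch of such a periodic job, training on both
classes, might look like the following. The two directory paths are
hypothetical, and the script is assumed to be started by cron or a
similar scheduler; only sa-learn's --spam and --ham options are used.

```python
import subprocess

# Hypothetical drop directories where users save raw messages.
SPAM_DIR = "/srv/mail/learn/spam"
HAM_DIR = "/srv/mail/learn/ham"


def learn_commands(spam_dir, ham_dir):
    """Build one sa-learn invocation per class."""
    return [
        ["sa-learn", "--spam", spam_dir],
        ["sa-learn", "--ham", ham_dir],
    ]


def run_learning(spam_dir, ham_dir, runner=subprocess.run):
    """Run both invocations; `runner` can be swapped out for testing."""
    for cmd in learn_commands(spam_dir, ham_dir):
        runner(cmd, check=True)


if __name__ == "__main__":
    run_learning(SPAM_DIR, HAM_DIR)
```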

giampaolo



Re: Newbie Question

Posted by Michael W Cocke <co...@catherders.com>.
For what it's worth, on the system here I have a special directory set
up on the server, and when the users get a spam message they do a
'save as ASCII text file' to that directory. sa-learn runs through that
directory every half hour.  Just a thought.

Mike-


On Fri, 24 Nov 2006 15:39:35 +0000, you wrote:

>Matt,
>
>Thank you, that makes things a lot clearer, is there any way to utilise
>forwarded messages or is it a lost cause?
>
>Thanks
>Andrew
--
If you're not confused, you're not trying hard enough.
--
Please note - Due to the intense volume of spam, we have installed 
site-wide spam filters at catherders.com.  If email from you bounces,
try non-HTML, non-encoded, non-attachments,

Re: Newbie Question

Posted by Matt Kettler <mk...@verizon.net>.
Andrew Sykes wrote:
> Matt,
>
> Thank you, that makes things a lot clearer, is there any way to utilise
> forwarded messages or is it a lost cause?
>   
In general, no. In some situations you can make use of how a
particular mail client does its forwarding, but you'd need to look
closely at what that specific mail client does.

Another option is to have them forward the message as an attachment, and
have a script strip off the attachment and feed that to sa-learn.
However, not all clients can forward as an attachment.
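
A rough Python sketch of such a script is below. It assumes the client
attaches the original as a message/rfc822 part, which not every client
does, and that sa-learn accepts a single message piped on stdin as
described at the start of this thread. The function names are invented
for the example, and re-serializing the attachment may not reproduce the
original bytes exactly.

```python
import subprocess
from email import message_from_bytes


def attached_messages(raw_bytes):
    """Return each message/rfc822 attachment of a forwarded mail as bytes."""
    outer = message_from_bytes(raw_bytes)
    found = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            inner = part.get_payload(0)  # the embedded original message
            found.append(inner.as_bytes())
    return found


def learn_as_spam(raw_bytes, runner=subprocess.run):
    """Pipe every attached message to `sa-learn --spam`; return the count."""
    messages = attached_messages(raw_bytes)
    for data in messages:
        runner(["sa-learn", "--spam"], input=data, check=True)
    return len(messages)


# Usage: read the forwarded mail from stdin, e.g. when the mail server
# pipes each message delivered to the reporting address into this script:
#   import sys
#   learn_as_spam(sys.stdin.buffer.read())
```

A forward without a message/rfc822 part yields an empty list, so nothing
is learned from an inline forward by mistake.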

Re: Newbie Question

Posted by Andrew Sykes <an...@sykesdevelopment.com>.
Matt,

Thank you, that makes things a lot clearer. Is there any way to utilise
forwarded messages, or is it a lost cause?

Thanks
Andrew

-- 
Kind Regards
Andrew Sykes <an...@sykesdevelopment.com>
Sykes Development Ltd
http://www.sykesdevelopment.com