Posted to users@spamassassin.apache.org by Geert Mak <po...@verysmall.org> on 2011/07/13 09:44:54 UTC

improving the score for specific types of spam

hello,

recently we had two specific types of spam getting very low score (about 3) and going through -

a) one was about Armbanduhren (wrist watches in German) or Edelarmbanduhren (luxury wrist watches) - all in that direction.

b) the other was about all kinds of job offers

as i could not figure out how to increase their score in spamassassin and as they had repeatable keywords which are usually not in our correspondence, i currently process them on postfix level with a small header_checks and body_checks regex.

does somebody know a way to improve spamassassin so that the score for these specific types of spam (which have been quite massive in Austria in recent months) goes up?

thanks,
geert

Re: improving the score for specific types of spam

Posted by Martin Gregorie <ma...@gregorie.org>.

On Wed, 2011-07-13 at 15:29 +0200, J4K wrote:
> On 07/13/2011 02:43 PM, Martin Gregorie wrote:
> > On Wed, 2011-07-13 at 14:06 +0200, J4K wrote:
> >
> > I assume you tested it as well as running it through lint ("spamassassin
> > <spam_sample.txt"), so is it firing on samples of that type of spam?
> >
> > Comments: As written the rule won't work because __PR2 assumes that the
> > domain name starts at the beginning of the URI but you said that the
> > URIs typically contain a user name and '@'. Also, I'd probably
> > generalise __PR2 to something like:
> >
> > uri  __PR2 /(joblists.com|gb-totaljob.com)/i
> >
> > on the assumption that when you wrote 'europ-joblist.com' you meant
> > 'europ-joblists.com'. This change will probably run faster and possibly
> > catch more spam too, especially if there is a Canadian or Scandinavian
> > office.
> >
> >
> > Martin
> >
> >
> Thank-you Martin.  I modified the rule as suggested.
> 
> I ran it through spamassassin > test.txt, but the rule was not triggered
> even though the Subject was: Vacancy - apply online, and the content
> contained the email address: Trenton@totaljoblists.net
> I looked with -D, and there is no mention of PRIVATE_RULE1.
> 
Indeed. The __PR2 regex won't match "joblists.net", though it would if
you changed it to:

	/(joblists\.(com|net)|gb-totaljob\.com)/i
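A quick way to check an alternation like this before editing the rule file is to try it against a sample string. The sketch below uses Python's re module rather than Perl, on the assumption that this simple syntax (alternation, escaped dots, case-insensitive flag) behaves the same way in both. The sample URI is a hypothetical rendering of the address mentioned above, not SpamAssassin's actual parsed output.

```python
import re

# Earlier pattern from this thread: only matches ".com" domains.
old_pr2 = re.compile(r"(joblists.com|gb-totaljob.com)", re.IGNORECASE)

# Revised pattern: accepts joblists.com and joblists.net, with the dots escaped.
new_pr2 = re.compile(r"(joblists\.(com|net)|gb-totaljob\.com)", re.IGNORECASE)

# Hypothetical sample: the address from the test message, written as a mailto URI.
sample_uri = "mailto:Trenton@totaljoblists.net"

print(bool(old_pr2.search(sample_uri)))  # False
print(bool(new_pr2.search(sample_uri)))  # True
```

The old pattern fails only because of the ".net" suffix; the unescaped dots make it looser than intended, not stricter.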

I strongly suggest you learn a bit more about Perl regular expressions,
either from one of the many tutorials on the 'net or by getting a copy
of the O'Reilly "Camel" book 'Programming Perl'.

Secondly, you need to read up on SA rules and how to write them. This is
all on the SA website. Subrules whose names start with a double
underscore are never shown when they fire, so it's a good idea to run
initial tests with the subrules (and references to them) with the
underscores removed so you can see which subrules fire. When they are
working correctly, put the underscores back, re-lint and *retest* the
rule before putting it live.

Thirdly, I find it useful to have an SA installation on a second machine
that I can use for rule development without affecting my main system.
When I'm happy with the rule, I copy the affected SA configuration files
over to the live system and restart SA. 

Martin

Re: improving the score for specific types of spam

Posted by J4K <ju...@klunky.co.uk>.

On 07/13/2011 02:43 PM, Martin Gregorie wrote:
> On Wed, 2011-07-13 at 14:06 +0200, J4K wrote:
>
> I assume you tested it as well as running it through lint ("spamassassin
> <spam_sample.txt"), so is it firing on samples of that type of spam?
>
> Comments: As written the rule won't work because __PR2 assumes that the
> domain name starts at the beginning of the URI but you said that the
> URIs typically contain a user name and '@'. Also, I'd probably
> generalise __PR2 to something like:
>
> uri  __PR2 /(joblists.com|gb-totaljob.com)/i
>
> on the assumption that when you wrote 'europ-joblist.com' you meant
> 'europ-joblists.com'. This change will probably run faster and possibly
> catch more spam too, especially if there is a Canadian or Scandinavian
> office.
>
>
> Martin
>
>
Thank-you Martin.  I modified the rule as suggested.

I ran it through spamassassin > test.txt, but the rule was not triggered
even though the Subject was: Vacancy - apply online, and the content
contained the email address: Trenton@totaljoblists.net
I looked with -D, and there is no mention of PRIVATE_RULE1.


Odd.

Re: improving the score for specific types of spam

Posted by Martin Gregorie <ma...@gregorie.org>.

On Wed, 2011-07-13 at 14:06 +0200, J4K wrote:

>     I put this in to deter the wealth of job advertisements we get:
> 
> describe PRIVATE_RULE1 English language job opportunity
> body     __PR1        /(Employment opportunity|Job offer match, respond
> to apply|Employment you've been searching|Job opportunity|Career
> opportunity inside|Position opening in your area|Work offer
> inside|Vacancy - apply online|Job ad - see details! Sent through  Search
> engine|Get a New Job Today|Working Part Time)/i
> uri      __PR2       
> /^(au-joblists.com|europ-joblist.com|gb-totaljob.com|uk-joblists.com|us-joblists.com)/i
> meta     PRIVATE_RULE1 (__PR1 && __PR2)
> score    PRIVATE_RULE1 5.5
>
> The URLs are typically email addresses, e.g. fred@europ-joblist.com. Would
> this rule work?  spamassassin --lint did not complain.
> 
I assume you tested it as well as running it through lint ("spamassassin
<spam_sample.txt"), so is it firing on samples of that type of spam?

Comments: As written the rule won't work because __PR2 assumes that the
domain name starts at the beginning of the URI but you said that the
URIs typically contain a user name and '@'. Also, I'd probably
generalise __PR2 to something like:

uri  __PR2 /(joblists.com|gb-totaljob.com)/i

on the assumption that when you wrote 'europ-joblist.com' you meant
'europ-joblists.com'. This change will probably run faster and possibly
catch more spam too, especially if there is a Canadian or Scandinavian
office.
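The effect of the leading '^' can be seen by running both patterns against sample strings. This is a minimal sketch in Python rather than Perl, assuming the two regex engines agree on this simple syntax. The sample URIs and the fires() helper are hypothetical; the exact string SpamAssassin hands to a uri rule may differ.

```python
import re

# The __PR2 pattern as originally posted, anchored at the start with '^'.
anchored = re.compile(
    r"^(au-joblists.com|europ-joblist.com|gb-totaljob.com"
    r"|uk-joblists.com|us-joblists.com)",
    re.IGNORECASE,
)

# The generalised, unanchored pattern suggested above.
unanchored = re.compile(r"(joblists.com|gb-totaljob.com)", re.IGNORECASE)


def fires(pattern, uri):
    """Return True if the pattern matches anywhere it is allowed to in the URI."""
    return pattern.search(uri) is not None


# '^' forces the domain to start the string, so a user name and '@' defeat it.
print(fires(anchored, "mailto:fred@uk-joblists.com"))      # False
print(fires(unanchored, "mailto:fred@uk-joblists.com"))    # True

# The unanchored pattern still misses the singular spelling 'joblist'.
print(fires(unanchored, "mailto:fred@europ-joblist.com"))  # False
```

The last line shows why the generalisation depends on the 'joblists' spelling assumption: if 'europ-joblist.com' really is the domain in use, it needs its own alternative.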

Martin

Re: improving the score for specific types of spam

Posted by J4K <ju...@klunky.co.uk>.

On 07/13/2011 01:23 PM, Martin Gregorie wrote:
> On Wed, 2011-07-13 at 09:44 +0200, Geert Mak wrote:
>> recently we had two specific types of spam getting very low score
>> (about 3) and going through -
>>
>> a) one was about Armbanduhren (wrist watches in German) or
>> Edelarmbanduhren (luxury wrist watches) - all in that direction.
>>
>> b) the other was about all kind of Job offers
>>
>> as i could not figure out how to increase their score in spamassassin
>> and as they had repeatable keywords which are usually not in our
>> correspondence, i currently process them on postfix level with a small
>> header_checks and body_checks regex.
>>
>> does somebody know a way to improve spamassassin so that the score for
>> these specific types of spam (which are quite massive the last months
>> in Austria)?
>>
> I'd write a private rule for each type of spam, along the lines of:
>
> describe PRIVATE_RULE German language wrist watch spam
> body     __PR1        /(Armbanduhren|Edelarmbanduhren)/i
> uri      __PR2        /www\..*\.de/
> meta     PRIVATE_RULE (__PR1 && __PR2)
> score    PRIVATE_RULE 5.5
>
> The basic principle is that the first 'body' subrule(s) match words that
> mark this sort of spam and the second 'uri' subrule detects URLs for the
> shop being advertised. It might be very specific, or even less specific
> than my example, e.g. /^www/
>
> There's a hidden assumption with this type of rule that the
> *combination* of the words and matching URIs is always spam but things
> that match the subrules can legitimately appear in ham provided they
> don't both appear. 
>
> Each rule of this type needs to be carefully tested and tuned to suit
> your particular mail stream.
>
>
> Martin
Hi,

    I put this in to deter the wealth of job advertisements we get:

describe PRIVATE_RULE1 English language job opportunity
body     __PR1        /(Employment opportunity|Job offer match, respond to apply|Employment you've been searching|Job opportunity|Career opportunity inside|Position opening in your area|Work offer inside|Vacancy - apply online|Job ad - see details! Sent through  Search engine|Get a New Job Today|Working Part Time)/i
uri      __PR2        /^(au-joblists.com|europ-joblist.com|gb-totaljob.com|uk-joblists.com|us-joblists.com)/i
meta     PRIVATE_RULE1 (__PR1 && __PR2)
score    PRIVATE_RULE1 5.5

The URLs are typically email addresses, e.g. fred@europ-joblist.com. Would
this rule work?  spamassassin --lint did not complain.




S

Re: improving the score for specific types of spam

Posted by Martin Gregorie <ma...@gregorie.org>.

On Wed, 2011-07-13 at 09:44 +0200, Geert Mak wrote:
> recently we had two specific types of spam getting very low score
> (about 3) and going through -
> 
> a) one was about Armbanduhren (wrist watches in German) or
> Edelarmbanduhren (luxury wrist watches) - all in that direction.
> 
> b) the other was about all kind of Job offers
> 
> as i could not figure out how to increase their score in spamassassin
> and as they had repeatable keywords which are usually not in our
> correspondence, i currently process them on postfix level with a small
> header_checks and body_checks regex.
> 
> does somebody know a way to improve spamassassin so that the score for
> these specific types of spam (which are quite massive the last months
> in Austria)?
> 
I'd write a private rule for each type of spam, along the lines of:

describe PRIVATE_RULE German language wrist watch spam
body     __PR1        /(Armbanduhren|Edelarmbanduhren)/i
uri      __PR2        /www\..*\.de/
meta     PRIVATE_RULE (__PR1 && __PR2)
score    PRIVATE_RULE 5.5

The basic principle is that the first 'body' subrule(s) match words that
mark this sort of spam and the second 'uri' subrule detects URLs for the
shop being advertised. It might be very specific, or even less specific
than my example, e.g. /^www/

There's a hidden assumption with this type of rule that the
*combination* of the words and matching URIs is always spam but things
that match the subrules can legitimately appear in ham provided they
don't both appear. 
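To make the combination logic concrete, here is a rough sketch of what the meta rule computes, written in Python rather than as SpamAssassin configuration. The private_rule() function, the sample texts and the example.de / example.com URIs are all hypothetical; real SpamAssassin extracts the body text and the URIs from the message itself.

```python
import re

# The two subrule patterns from the example above.
body_pr1 = re.compile(r"(Armbanduhren|Edelarmbanduhren)", re.IGNORECASE)
uri_pr2 = re.compile(r"www\..*\.de")


def private_rule(body_text, uris):
    """Mimic 'meta PRIVATE_RULE (__PR1 && __PR2)': both subrules must hit."""
    pr1 = body_pr1.search(body_text) is not None
    pr2 = any(uri_pr2.search(uri) is not None for uri in uris)
    return pr1 and pr2


# Keyword and a matching .de shop URI together: the rule fires.
print(private_rule("Edelarmbanduhren zum halben Preis",
                   ["http://www.example.de/shop"]))   # True

# Keyword alone, with a non-.de URI: no hit.
print(private_rule("Edelarmbanduhren zum halben Preis",
                   ["http://www.example.com/"]))      # False

# A .de URI alone, without the keyword: no hit.
print(private_rule("Rechnung anbei",
                   ["http://www.example.de/"]))       # False
```

Only the first case would add the 5.5 points; mail that trips just one subrule is left alone, which is what keeps this kind of rule safe for ham.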

Each rule of this type needs to be carefully tested and tuned to suit
your particular mail stream.

Martin