Posted to users@spamassassin.apache.org by Marcin Krol <mr...@gmail.com> on 2008/12/11 12:52:54 UTC

(newbie question) Increasing SA effectiveness

Hello everyone,

I'm (somewhat) new to SA, and it works nicely, but now I would like 
to boost its effectiveness at finding spam. I have searched the web and 
frankly I'm disappointed with the results - apart from basic config there is 
not much info on how to fine-tune SA to get better results at 
filtering. Secret science or what? :-)

Through experimentation I have found that the following techniques are 
highly effective:

- Botnet plugin is very effective at finding spammer-like DNS records

- SURBL and URIBL are extremely effective at identifying spam

- DCC is able to find at least some spam

Is anybody here willing to share other / better techniques and tips?

Thanks in advance,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by mouss <mo...@netoyen.net>.

Ned Slider wrote:
> Genuine spam traps are great for bayes training as they should contain a
> representative sample of spam your users will be seeing plus you know
> they only contain spam so you don't need to check the contents before
> feeding them to bayes to learn :)
> 

You must be careful with traps. They can receive non-spam mail:

- bounces (backscatter). You may consider this spam, but I'm not sure
it won't simply poison your Bayes database.

- spammers can use the trap address in subscription forms. (If they can
send mail to these addresses, they can also use them in other ways; and
if they can't send mail to them, the addresses are useless as traps!) So
you should at least exclude "confirmation requests".
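
A minimal sketch of that kind of exclusion, assuming each trap message is available as raw text. The keyword list and the header heuristics are illustrative assumptions, not SpamAssassin features, and would need tuning against real trap mail:

```python
from email import message_from_string

# Hypothetical keyword list; adjust it to the confirmation mails your traps receive.
CONFIRM_HINTS = ("confirm", "verify your", "activate your", "subscription")


def is_bounce(msg):
    """Heuristic: null envelope sender, or a mailer-daemon/postmaster author."""
    return_path = (msg.get("Return-Path") or "").strip()
    sender = (msg.get("From") or "").lower()
    return return_path == "<>" or "mailer-daemon" in sender or "postmaster" in sender


def is_confirmation_request(msg):
    """Heuristic: the subject contains one of the confirmation keywords."""
    subject = (msg.get("Subject") or "").lower()
    return any(hint in subject for hint in CONFIRM_HINTS)


def safe_to_learn_as_spam(raw_message):
    """Return False for trap mail that should be reviewed instead of auto-fed to Bayes."""
    msg = message_from_string(raw_message)
    return not (is_bounce(msg) or is_confirmation_request(msg))
```

The check errs on the side of holding messages back: anything it rejects would go to manual review rather than being discarded.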

I do "whitelist" some pseudo-traps from time to time, but I manually
review the messages (quickly of course).

> I do the same - whitelist a few *good* spamtraps through all my
> different levels of filtering specifically to feed bayes. I also use
> these for statistical analysis to see which types of mail SA scores
> poorly on and then target custom rules towards those spam to help bump
> the scores.
> 
> I'm sure there's other useful stuff you can do with spamtrap mails too.
> 
> 
> 
>

Re: Spamtraps

Posted by mouss <mo...@netoyen.net>.

Marcin Krol wrote:
> Henrik K wrote:
>>> I'm sure there's other useful stuff you can do with spamtrap mails too.
>>
>> Unfortunately it takes a lot of effort to create *good* spamtraps. 
> 
> Yep.
> 
>> It's just
>> too much trouble for a normal admin, I leave it to those who have time on
>> their hands. You can do the simple grep for "mistyped" non-existent
>> addresses from logs etc, but it's just silly botnet crud that doesn't
>> represent the "real" spam coming to real users (that leak their
>> addresses in
>> all sort of ways). 
> 
> This is exactly what I have a problem with: while lots of spam is
> directed at my regular users, I get very little spam caught in my
> spamtraps.
> 
> I have published spamtrap addresses (in "hidden" HTML of course, like
> "mailto:address" in the same color as background of the page) on many
> company webpages, posted spamtraps to Usenet some 6 months ago and I
> still get very little spam caught in spamtraps.
> 
> I have a haunting suspicion that email correspondents of my users have
> trojans or smth in their Outlooks, which then leak the addresses to
> spammers. Either that, or spammers get addresses some other way. Getting
> my spamtrap addresses into spammers address lists has been a problem for
> me.
> 
> Any other ideas on how to do that?
> 

I get a lot of junk to addresses with many digits (phone style or
message-id style).

>> I don't see any point Bayes-learning simple-to-block
>> botnet mails either, since it's completely separate thing from the
>> sneakier
>> 419 and phish stuff..
> 
> What's "419" stuff?
> 

419 = Advance Fee Fraud = Nigerian scam.

419 is the number of the relevant section of the (old?) criminal code of
Nigeria.

Re: Spamtraps

Posted by Ned Slider <ne...@unixmail.co.uk>.

Marcin Krol wrote:
> Henrik K wrote:
>>> I'm sure there's other useful stuff you can do with spamtrap mails too.
>>
>> Unfortunately it takes a lot of effort to create *good* spamtraps. 
> 
> Yep.
> 
>> It's just
>> too much trouble for a normal admin, I leave it to those who have time on
>> their hands. You can do the simple grep for "mistyped" non-existent
>> addresses from logs etc, but it's just silly botnet crud that doesn't
>> represent the "real" spam coming to real users (that leak their 
>> addresses in
>> all sort of ways). 
> 
> This is exactly what I have a problem with: while lots of spam is 
> directed at my regular users, I get very little spam caught in my 
> spamtraps.
> 
> I have published spamtrap addresses (in "hidden" HTML of course, like 
> "mailto:address" in the same color as background of the page) on many 
> company webpages, posted spamtraps to Usenet some 6 months ago and I 
> still get very little spam caught in spamtraps.
> 

IMHO total volume isn't necessarily a good indicator. A few copies of 
each spam are all that's required to feed Bayes - you don't need 
thousands of copies of the *same* spam. The objective is that you get a 
copy of new spam and feed it to Bayes or a blocklist/custom 
rules/whatever *before* your users start seeing it.
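
A rough sketch of keeping only a few copies of each spam before training, assuming the trap messages are available as raw text. It only catches copies whose bodies are identical after whitespace and case normalization, which is an assumption of this example; real spam runs often vary the body, so treat it as a starting point:

```python
import hashlib
import re
from collections import Counter
from email import message_from_string


def body_fingerprint(raw_message):
    """Hash the whitespace-normalized, lower-cased body of a message."""
    msg = message_from_string(raw_message)
    payload = msg.get_payload()
    if not isinstance(payload, str):
        # Multipart message: fall back to hashing the whole raw text,
        # so only byte-for-byte similar messages will collide.
        payload = raw_message
    normalized = re.sub(r"\s+", " ", payload).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8", "replace")).hexdigest()


def limit_copies(raw_messages, max_copies=3):
    """Keep at most max_copies messages per body fingerprint, in arrival order."""
    seen = Counter()
    kept = []
    for raw in raw_messages:
        fingerprint = body_fingerprint(raw)
        if seen[fingerprint] < max_copies:
            kept.append(raw)
        seen[fingerprint] += 1
    return kept
```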

Try responding to spam or clicking unsubscribe links from your spamtrap 
addresses. Exactly the type of thing you'd tell your users *never* to 
do. Spammers love confirmed live email addresses, especially those who 
read the spam and follow the instructions (like click here to 
unsubscribe). It makes those addresses perfect candidates for more spam.

Try signing up for some newsletters from dubious sites and then 
unsubscribing - if you can't opt out after opting in then it's spam and 
they'll likely sell your address on.

Using common easy to guess addresses (bob@example.com) rather than 
difficult to guess addresses (b.smith4244532@example.com) will generate 
more spam but also has the potential for more FPs - same with using an 
old address that's no longer used - you need to make sure it's no longer 
receiving any legitimate mail.

Re: Spamtraps (was: Increasing SA effectiveness)

Posted by Kai Schaetzl <ma...@conactive.com>.

Marcin Krol wrote on Fri, 12 Dec 2008 10:43:57 +0100:

> posted spamtraps to Usenet some 6 months ago and I 
> still get very little spam caught in spamtraps.

you will have to exclude them from RBL blocking, of course ;-)

I have one Usenet spamtrap that is getting a lot of spam, although it 
hasn't been used for years. However, back then it was used and spread 
heavily, since it was my posting address in several groups for at least 
a year.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com

Spamtraps (was: Increasing SA effectiveness)

Posted by Marcin Krol <mr...@gmail.com>.

Henrik K wrote:
>> I'm sure there's other useful stuff you can do with spamtrap mails too.
> 
> Unfortunately it takes a lot of effort to create *good* spamtraps. 

Yep.

> It's just
> too much trouble for a normal admin, I leave it to those who have time on
> their hands. You can do the simple grep for "mistyped" non-existent
> addresses from logs etc, but it's just silly botnet crud that doesn't
> represent the "real" spam coming to real users (that leak their addresses in
> all sort of ways). 

This is exactly what I have a problem with: while lots of spam is 
directed at my regular users, I get very little spam caught in my 
spamtraps.

I have published spamtrap addresses (in "hidden" HTML of course, like 
"mailto:address" in the same color as background of the page) on many 
company webpages, posted spamtraps to Usenet some 6 months ago and I 
still get very little spam caught in spamtraps.

I have a haunting suspicion that email correspondents of my users have 
trojans or smth in their Outlooks, which then leak the addresses to 
spammers. Either that, or spammers get addresses some other way. Getting 
my spamtrap addresses into spammers address lists has been a problem for 
me.

Any other ideas on how to do that?

> I don't see any point Bayes-learning simple-to-block
> botnet mails either, since it's completely separate thing from the sneakier
> 419 and phish stuff..

What's "419" stuff?

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Henrik K <he...@hege.li>.

On Thu, Dec 11, 2008 at 05:57:10PM +0000, Ned Slider wrote:
>
> Genuine spam traps are great for bayes training as they should contain a  
> representative sample of spam your users will be seeing plus you know  
> they only contain spam so you don't need to check the contents before  
> feeding them to bayes to learn :)
>
> I do the same - whitelist a few *good* spamtraps through all my  
> different levels of filtering specifically to feed bayes. I also use  
> these for statistical analysis to see which types of mail SA scores  
> poorly on and then target custom rules towards those spam to help bump  
> the scores.
>
> I'm sure there's other useful stuff you can do with spamtrap mails too.

Unfortunately it takes a lot of effort to create *good* spamtraps. It's just
too much trouble for a normal admin; I leave it to those who have time on
their hands. You can do the simple grep for "mistyped" non-existent
addresses from logs etc., but it's just silly botnet crud that doesn't
represent the "real" spam coming to real users (who leak their addresses in
all sorts of ways). I don't see any point in Bayes-learning simple-to-block
botnet mails either, since it's a completely separate thing from the sneakier
419 and phish stuff.

Re: (newbie question) Increasing SA effectiveness

Posted by Ned Slider <ne...@unixmail.co.uk>.

Marcin Krol wrote:
> Matus UHLAR - fantomas wrote:
>>> - blocking at MTA by RBL or other techniques (such as graylisting)
>>>   is efficient and effective, but deprives SpamAssassin of spam samples,
>>>   so if your resources permit, it is better to let SpamAssassin deal
>>>   with all RBLs.
>>
>> I don't think so. We get "enough" of spam even if using many RBLs at SMTP
>> level.
> 
> Plus note that characteristics of spam that got through RBL "sieve" 
> *might* be different than characteristics of the spam that didn't.
> 
> If so - I have not done any tests, so I have no idea really - then Bayes 
> would be at least partially mistrained.
> 
> Having said that, I do have exceptions to my sender-verify and RBL rules 
> for spam traps. :-) Now, getting something useful done with that stuff 
> is another story.
> 

Genuine spam traps are great for bayes training as they should contain a 
representative sample of spam your users will be seeing plus you know 
they only contain spam so you don't need to check the contents before 
feeding them to bayes to learn :)

I do the same - whitelist a few *good* spamtraps through all my 
different levels of filtering specifically to feed bayes. I also use 
these for statistical analysis to see which types of mail SA scores 
poorly on and then target custom rules towards those spam to help bump 
the scores.

I'm sure there's other useful stuff you can do with spamtrap mails too.

Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

Matus UHLAR - fantomas wrote:
>> - blocking at MTA by RBL or other techniques (such as graylisting)
>>   is efficient and effective, but deprives SpamAssassin of spam samples,
>>   so if your resources permit, it is better to let SpamAssassin deal
>>   with all RBLs.
> 
> I don't think so. We get "enough" of spam even if using many RBLs at SMTP
> level.

Plus note that characteristics of spam that got through RBL "sieve" 
*might* be different than characteristics of the spam that didn't.

If so - I have not done any tests, so I have no idea really - then Bayes 
would be at least partially mistrained.

Having said that, I do have exceptions to my sender-verify and RBL rules 
for spam traps. :-) Now, getting something useful done with that stuff 
is another story.

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 11.12.08 15:47, Mark Martinec wrote:
> Quality of bayes auto-learning improves if you let all your mail
> pass through SpamAssassin:
> 
> - outbound mail is often a high-quality source of ham
>   for autolearning;

But when one of your users starts spamming (trojan or whatever), you have a
problem and can drop the Bayes DB immediately...

> - blocking at MTA by RBL or other techniques (such as graylisting)
>   is efficient and effective, but deprives SpamAssassin of spam samples,
>   so if your resources permit, it is better to let SpamAssassin deal
>   with all RBLs.

I don't think so. We get "enough" spam even when using many RBLs at the SMTP
level.
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
M$ Win's are shit, do not use it !

Re: (newbie question) Increasing SA effectiveness

Posted by Mark Martinec <Ma...@ijs.si>.

Marcin,

> >Did you manually (initially) train it
> > with your collected ham and recent (not older than 3 months) spam?
>
> No, I just waited until default 200 hams and 200 spams kicked in. As
> I mentioned elsewhere, I get a weird effect of correct positives, but
> relatively many false negatives from Bayes rules.

Quality of bayes auto-learning improves if you let all your mail
pass through SpamAssassin:

- outbound mail is often a high-quality source of ham
  for autolearning;

- blocking at MTA by RBL or other techniques (such as graylisting)
  is efficient and effective, but deprives SpamAssassin of spam samples,
  so if your resources permit, it is better to let SpamAssassin deal
  with all RBLs.


Mark

Re: (newbie question) Increasing SA effectiveness

Posted by Kai Schaetzl <ma...@conactive.com>.

Marcin Krol wrote on Fri, 12 Dec 2008 10:37:31 +0100:

> Define manual: manual picking out spams is plain too labor-intensive. If 
> we redefine "manual" to mean ham coming from authenticated mail, and 
> spam coming from spamtraps, I wholeheartedly agree.

The point is that you need to have a corpus of *guaranteed* ham and spam, 
that you feed manually right before you start scanning or a ready-made 
Bayes db containing this. Enabling Bayes and feeding it with what you get 
after that time until you reach the threshold is not a good way of 
starting.

Kai


Re: (newbie question) Increasing SA effectiveness

Posted by John Hardin <jh...@impsec.org>.

On Fri, 12 Dec 2008, Marcin Krol wrote:

> John Hardin wrote:
>>  On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:
>> 
>> >  I still recommend initial training, to give Bayes a good kick-start.
>>
>>  Initial _manual_ training.
>
> Define manual: manual picking out spams is plain too labor-intensive.

Manual training of the initial corpus is 200 hams and 200 spams. That's 
not excessive.

Past that point the decision to continue to use manual training or add or 
completely switch to autotraining is the admin's preference, based on 
volume.

I manually train the few domains I host and manage for myself and family 
members and friends, and get good results. When I was administering a 
100-user network manual training was not a burden. I can't speak for 
someone administering a 1000- or 10,000- or 100,000-user network.

I do have some distrust of autolearn given the complaints I've seen here 
that can be laid at its feet (but note, the successful users are 
understandably not complaining, so that impression is no doubt unfairly 
biased). I just like the idea of human judgement in the loop. A middle 
ground could be user spam- and ham-training folders, with manual review 
before feeding the messages to sa-learn.

But autolearn should _not_ be trusted for initial training. That will 
simply magnify small errors in the initial configuration, rather than 
helping to correct them. That's our point.
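
A small sketch of that initial manual step, assuming the hand-checked ham and spam are stored one message per file in two directories. The directory layout and function names are assumptions of this example; `sa-learn --ham` and `sa-learn --spam` are the standard training commands, and the code only builds the command lines rather than running them:

```python
import os

MIN_MESSAGES = 200  # the 200 ham / 200 spam figure discussed above


def count_messages(directory):
    """Count regular files in a directory (assumes one message per file)."""
    return sum(1 for entry in os.scandir(directory) if entry.is_file())


def training_commands(ham_dir, spam_dir, minimum=MIN_MESSAGES):
    """Return sa-learn command lines, or raise if either corpus is too small."""
    counts = {"ham": count_messages(ham_dir), "spam": count_messages(spam_dir)}
    too_small = {kind: n for kind, n in counts.items() if n < minimum}
    if too_small:
        raise ValueError(f"corpus too small for initial training: {too_small}")
    return [["sa-learn", "--ham", ham_dir], ["sa-learn", "--spam", spam_dir]]
```

The returned lists could be passed to `subprocess.run` once the corpus has been reviewed by a human.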

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Government cannot grant rights. Government can only limit, infringe
   or suppress rights.
-----------------------------------------------------------------------
  3 days until Bill of Rights day

Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

John Hardin wrote:
> On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:
> 
>> I still recommend initial training, to give Bayes a good kick-start.
> 
> Initial _manual_ training.

Define manual: manual picking out spams is plain too labor-intensive. If 
we redefine "manual" to mean ham coming from authenticated mail, and 
spam coming from spamtraps, I wholeheartedly agree.

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 08:28 -0800, John Hardin wrote:
> On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:

> >>> I still recommend initial training, to give Bayes a good kick-start.
> >>
> >> Initial _manual_ training.
> >
> > Err... Yes! :)
> 
> The reason I stressed that is it sounds like the OP turned on autolearn 
> and let that do the initial bayes training, and I think we all agree 
> that's a bad idea.

Yeah, exactly my point, I just didn't express it the way I meant to.
Thanks for pointing out the most important part, John.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: (newbie question) Increasing SA effectiveness

Posted by John Hardin <jh...@impsec.org>.

On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:

> On Thu, 2008-12-11 at 08:18 -0800, John Hardin wrote:
>> On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:
>>
>>> I still recommend initial training, to give Bayes a good kick-start.
>>
>> Initial _manual_ training.
>
> Err... Yes! :)

The reason I stressed that is it sounds like the OP turned on autolearn 
and let that do the initial bayes training, and I think we all agree 
that's a bad idea.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   You do not examine legislation in the light of the benefits it
   will convey if properly administered, but in the light of the
   wrongs it would do and the harms it would cause if improperly
   administered.                                  -- Lyndon B. Johnson
-----------------------------------------------------------------------
  4 days until Bill of Rights day

Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 08:18 -0800, John Hardin wrote:
> On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:
> 
> > I still recommend initial training, to give Bayes a good kick-start.
> 
> Initial _manual_ training.

Err... Yes! :)


Re: (newbie question) Increasing SA effectiveness

Posted by John Hardin <jh...@impsec.org>.

On Thu, 11 Dec 2008, Karsten Bräckelmann wrote:

> I still recommend initial training, to give Bayes a good kick-start.

Initial _manual_ training.


Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 16:28 +0100, Marcin Krol wrote:
> Karsten Bräckelmann wrote:
> > Do train false negatives. It does help Bayes, if you train "FN according
> > to Bayes", that is spam that has been caught, but got a low, ham-ish
> > Bayes score.
> 
> It seems that I need to brush up on specifics of SA Bayes; so far I have 
> used only DSPAM from among statistical filters.

Nah, I guess you just need to adjust your point of view. :)

We've specifically been discussing Bayes here. So set aside all the rules and
network tests, which still made the message correctly score as spam
despite Bayes claiming otherwise. The latter is what matters here.
Considering Bayes only -- if Bayes returned a score less than 0.5, the
message looks like ham to it...

With a statistical filter *only*, you would now train and re-classify that
mail, no? Do the same with Bayes in SA (regardless of other tests
overruling Bayes) -- at least for those messages SA did not auto-learn
anyway. How is that different from dspam?
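
A sketch of picking out such "false negatives according to Bayes", assuming the mail carries a stock `X-Spam-Status` header of the form `Yes, score=... tests=RULE,RULE autolearn=...`. Setups that rewrite headers differently (for example via amavisd) would need another parser, and the function names here are illustrative only:

```python
import re
from email import message_from_string

# Stock Bayes rules on the ham-ish side of the neutral BAYES_50.
HAMMISH_BAYES = {"BAYES_00", "BAYES_05", "BAYES_20", "BAYES_40"}


def parse_spam_status(raw_message):
    """Return (is_spam, set_of_tests, autolearn) or None if the header is missing."""
    status = message_from_string(raw_message).get("X-Spam-Status")
    if status is None:
        return None
    status = " ".join(status.split())  # unfold continuation lines
    is_spam = status.lower().startswith("yes")
    tests_match = re.search(r"tests=(.*?)(?=\s+\w+=|$)", status)
    tests = set()
    if tests_match:
        tests = {t.strip() for t in tests_match.group(1).split(",") if t.strip()}
    autolearn_match = re.search(r"autolearn=(\w+)", status)
    autolearn = autolearn_match.group(1) if autolearn_match else None
    return is_spam, tests, autolearn


def bayes_false_negative(raw_message):
    """True for mail tagged as spam overall that Bayes rated ham-ish and did not auto-learn as spam."""
    parsed = parse_spam_status(raw_message)
    if parsed is None:
        return False
    is_spam, tests, autolearn = parsed
    return is_spam and bool(tests & HAMMISH_BAYES) and autolearn != "spam"
```

Messages for which this returns True are candidates for `sa-learn --spam`.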

  guenther


Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

Karsten Bräckelmann wrote:
> Do train false negatives. It does help Bayes, if you train "FN according
> to Bayes", that is spam that has been caught, but got a low, ham-ish
> Bayes score.

It seems that I need to brush up on specifics of SA Bayes; so far I have 
used only DSPAM from among statistical filters.

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 16:01 +0100, Karsten Bräckelmann wrote:
> On Thu, 2008-12-11 at 15:13 +0100, Marcin Krol wrote:

Forgot to add...

> > No, I just waited until default 200 hams and 200 spams kicked in. As 
> > I mentioned elsewhere, I get a weird effect of correct positives, but 
> > relatively many false negatives from Bayes rules.

Do train false negatives. It does help Bayes, if you train "FN according
to Bayes", that is spam that has been caught, but got a low, ham-ish
Bayes score.

> I still recommend initial training, to give Bayes a good kick-start.


Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 15:13 +0100, Marcin Krol wrote:
> Karsten Bräckelmann wrote:

> > Razor is quite good, too. Also Pyzor, though it requires much more
> > resources. 
> 
> See, my friend who works at a hosting company didn't find Razor to be 
> much improvement. Perhaps he misconfigured it or smth?

That's pretty much just another way of saying, what you snipped from my
post. ;)  "Results and effectiveness vary, everyone's spam is
different." Yes, that means it might work much better for you. Did you
try it?

> >I also recommend the iXhash plugin, which is another digest
> > test that kicks some serious butt.
> 
> Now you're talking. :-)


> >Did you manually (initially) train it
> > with your collected ham and recent (not older than 3 months) spam?
> 
> No, I just waited until default 200 hams and 200 spams kicked in. As 
> I mentioned elsewhere, I get a weird effect of correct positives, but 
> relatively many false negatives from Bayes rules.

I still recommend initial training, to give Bayes a good kick-start.



Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

Karsten Bräckelmann wrote:
>> - SURBL and URIBL are extremely effective at identifying spam
> 
> They are enabled by default -- unless you are running local tests only.
> Did you (or your distro default) disable network tests? If you
> specifically had to enable these, you are likely missing more of them.

No, I have them enabled - I just found them so effective that I 
increased their scores.

>> - DCC is able to find at least some spam
> 
> Razor is quite good, too. Also Pyzor, though it requires much more
> resources. 

See, my friend who works at a hosting company didn't find Razor to be 
much improvement. Perhaps he misconfigured it or smth?

>I also recommend the iXhash plugin, which is another digest
> test that kicks some serious butt.

Now you're talking. :-)

>> Is anybody here willing to share other / better techniques and tips?
> 
> Watch the list. Every now and then additional rules, tips and even DNS
> BLs are posted and discussed here.

> Btw, do you have Bayes enabled? 

Yes.

>Did you manually (initially) train it
> with your collected ham and recent (not older than 3 months) spam?

No, I just waited until the default 200 hams and 200 spams kicked in. As 
I mentioned elsewhere, I get a weird effect: positives are correct, but 
the Bayes rules produce relatively many false negatives.

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 12:52 +0100, Marcin Krol wrote:
> Through experimentation I have found that the following techniques are 
> highly effective:

> - SURBL and URIBL are extremely effective at identifying spam

They are enabled by default -- unless you are running local tests only.
Did you (or your distro default) disable network tests? If you
specifically had to enable these, you are likely missing more of them.

Yes, network tests are highly effective with SA.

> - DCC is able to find at least some spam

Razor is quite good, too. Also Pyzor, though it requires much more
resources. I also recommend the iXhash plugin, which is another digest
test that kicks some serious butt.

Results and effectiveness vary, everyone's spam is different.

> Is anybody here willing to share other / better techniques and tips?

Watch the list. Every now and then additional rules, tips and even DNS
BLs are posted and discussed here.

Btw, do you have Bayes enabled? Did you manually (initially) train it
with your collected ham and recent (not older than 3 months) spam?

  guenther


Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

Matthias Leisi wrote:

> * If circumstances permit, make use of extensive whitelisting, so that 
> you can increase the score of rules (or maybe lower the threshold after 
> which you consider a message to be spam).

With all due respect, that's risky... My users often get legit mails out 
of the blue or e-mail new parties, and I could react to that only after 
the fact.

> * Experiment with additional blacklists (but beware of false positives).
> 
> * Consider using some blacklist(s) to actually reject messages before 
> they reach SpamAssassin (often, the Spamhaus lists are fine for that 
> purpose).

I already do that (among other reasons, it is far cheaper than SA 
scanning), just today:

Rejections by:
  RBLs                      2856
  SA permanent rejection    871
  Sender Verify failed      4627

As you can see, sender verify (a feature in Exim) is very effective at 
cutting out lots of spam, so there's little left for SA to work on. 
Granted, it's controversial, but extremely effective.

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Kai Schaetzl <ma...@conactive.com>.

Matthias Leisi wrote on Thu, 11 Dec 2008 22:05:34 +0100:

> (and
> are thus likely to be quoted in reply emails)

Correctly working email programs leave the signature out when quoting.

Kai


Re: (newbie question) Increasing SA effectiveness

Posted by Matthias Leisi <ma...@leisi.net>.

Mark Martinec wrote:

> or construct custom rules to whitelist (=add negative score points)
> based on some other specific characteristic of mail to be passed.

Your own (your company's) street address, phone number, or some hopefully
unique token which you typically add in footers of outgoing emails (and
are thus likely to be quoted in reply emails) are good candidates for
such rules.

-- Matthias

Re: (newbie question) Increasing SA effectiveness

Posted by Mark Martinec <Ma...@ijs.si>.

> * If circumstances permit, make use of extensive whitelisting, so that
> you can increase the score of rules (or maybe lower the threshold after
> which you consider a message to be spam).

When whitelisting, never whitelist just based on a plain sender or author
address (such as 'whitelist_from').

Whitelisting should only be based on reliable (or at least: likely to be true)
information, so use:
  whitelist_from_dkim
  whitelist_from_spf
  whitelist_auth
  whitelist_from_rcvd

or construct custom rules to whitelist (=add negative score points)
based on some other specific characteristic of mail to be passed.

See man pages for:
  Mail::SpamAssassin::Conf
  Mail::SpamAssassin::Plugin::DKIM
  Mail::SpamAssassin::Plugin::SPF


Mark

Re: (newbie question) Increasing SA effectiveness

Posted by Ned Slider <ne...@unixmail.co.uk>.

Karsten Bräckelmann wrote:
> On Thu, 2008-12-11 at 15:19 +0100, Marcin Krol wrote:
>> Ned Slider wrote:
>>
>>> Yes, additional DNSBLs such as psbl and uceprotect can be integrated 
>>> into SA
>> Well, isn't it better to use them before SA, provided your MTA does have
>> this feature (I recommend Exim to everyone)?
> 
> No -- unless you ultimately trust the RBL to produce a *negligible*
> amount of FPs. Every single RBL does have FPs to a highly variable
> degree. Instead of outright blocking on a hit, it is a good idea to
> assign a score for the hit only, and see what the result is after all
> tests have been performed...
> 

I agree. There are very few (well, only one actually) DNSBLs that I 
trust to outright block mail at the smtp level whereas plenty of DNSBLs 
are good enough to be useful in SA with sensible scoring where an 
occasional FP doesn't matter too much. That said, that one DNSBL 
(zen.spamhaus.org) and greylisting do block 90% of spam before it ever 
reaches SA.
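
A toy illustration of scoring a DNSBL hit instead of blocking on it. The rule name `RCVD_IN_EXAMPLE_BL` and all per-rule scores are made up for the example; 5.0 is SpamAssassin's default required score:

```python
REQUIRED_SCORE = 5.0  # SpamAssassin's default threshold

# Hypothetical per-rule scores, chosen only for illustration.
SCORES = {"RCVD_IN_EXAMPLE_BL": 2.0, "BAYES_99": 3.5, "BAYES_00": -2.6}


def total_score(hits, scores=SCORES):
    """Sum the scores of all rules that hit; unknown rules count as 0."""
    return sum(scores.get(rule, 0.0) for rule in hits)


def is_spam(hits, required=REQUIRED_SCORE):
    """A message is tagged as spam once its total reaches the threshold."""
    return total_score(hits) >= required
```

With these numbers a lone blacklist hit stays below the threshold, the same hit plus `BAYES_99` crosses it, and a hit on mail that Bayes considers ham is outweighed, which is the behaviour a hard SMTP-level block cannot give you.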

> 
>>> Also look at setting up Bayes and train it well. A well trained Bayes 
>>> setup can hit 99% plus spam (for me) and can be highly effective.
>> Except I found that while it often gets positive identification right,
>> it sometimes produces false negatives (BAYES_00 negative scoring gets
>> fired on what it should classify as spam -- I reduced BAYES_00 scoring
>> for that reason).
> 
> As mentioned a few times already -- do train Bayes instead. That's a
> mis-fire of Bayes, and needs to be corrected.
> 

Agreed - Bayes does need to be well trained. I find Bayes to be highly 
accurate - over 99% of my spam scores at BAYES_80 or above (the vast 
majority at BAYES_99) whilst non-spam scores at BAYES_00 and 
occasionally BAYES_05. Occasionally new spam not seen on my server 
before scores BAYES_50 (neutral) but that's what you'd expect. I see 
very little mail that scores between the two extremes.

Bottom line - if bayes isn't working well for you then you've not 
trained it right.
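
For reference, a rough sketch of the Bayes-related settings in local.cf.
The two thresholds are the documented defaults, shown only for
illustration, and the paths in the comments are placeholders. Note that
stock SA does not use Bayes results until it has learned at least 200
spam and 200 ham messages.

```
# Enable the Bayes classifier and its BAYES_* rules (both on by default)
use_bayes        1
use_bayes_rules  1

# Let SA learn automatically from mail that scores clearly one way or
# the other; the thresholds below are the documented defaults.
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  0.1
bayes_auto_learn_threshold_spam     12.0

# Manual training happens outside this file, for example:
#   sa-learn --spam /path/to/missed-spam/
#   sa-learn --ham  /path/to/good-mail/
# (both paths are placeholders)
```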

Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

Karsten Bräckelmann wrote:
>> Well, isn't it better to use them before SA, provided your MTA does have
>> this feature (I recommend Exim to everyone)?
> 
> No -- unless you ultimately trust the RBL to produce a *negligible*
> amount of FPs. Every single RBL does have FPs to a highly variable
> degree. Instead of outright blocking on a hit, it is a good idea to
> assign a score for the hit only, and see what the result is after all
> tests have been performed...

Actually, I think there are good reasons to reject mail based on RBLs:

First, it has a strong policing effect on the internet: nobody except 
hardcore spammers dares to send spam.

In hosting, where I worked for some time (another admin was taking care 
of SA-related issues), the few false positives we had were generally 
quickly taken care of. With literally thousands of customers, we didn't 
find RBL false positives to be any major issue.

Another positive "policing" side effect of widespread RBL-based 
rejection: the major shared hosting providers do not dare to do business 
with spammers. We all know the reality of it: if it made providers a few 
nickels of profit, they would not hesitate to host spammers. Were it not 
for the admittedly drastic practice of rejecting mail based on RBLs, 
spam would be even more of a problem.

Suppose everyone used your approach: most of the mail would be accepted, 
which is good enough for spammers (few MTAs do SA-scanning at SMTP time, 
a la sa-exim). Maybe it would be filtered, maybe it wouldn't, but the 
policing effect would mostly disappear without outright rejection of 
mail coming from RBL-damned addresses.

Second: SA-scanning is a MAJOR cost. At the hosting company we found 
that the *majority* of overall server load was generated by SA, even 
after most spam had been eliminated by RBLs and sender-verify before it 
ever reached SA!

Face it, SA is effective, but that comes at a cost: all those tests burn 
huge, and I mean huge, amounts of CPU and time. Even scanning time is a 
somewhat important issue on a hosting server, as it greatly increases 
the number of concurrent connections to your server and the number of 
forked MTA instances (memory, etc). Anything that cuts that cost down is 
worth it, even at the price of an occasional FP, especially as FPs are 
resolvable nowadays -- I have taken quite a number of addresses off RBLs 
(mostly Spamhaus and Spamcop). Sure, it was never pleasant. But IMO, 
it's well worth it.

> Exactly the SA approach. A single (or even a few) rules and RBLs can
> misfire, without affecting the overall deliverability of a particular
> mail.

With all due respect, I disagree: there are very few cases where it 
would produce an overall benefit, while the benefits described above 
would disappear and many problems would become much more common if your 
recommended approach were widely adopted.

>>> Also look at setting up Bayes and train it well. A well trained Bayes 
>>> setup can hit 99% plus spam (for me) and can be highly effective.
>> Except I found that while it often gets positive identification right,
>> it sometimes produces false negatives (BAYES_00 negative scoring gets
>> fired on what it should classify as spam -- I reduced BAYES_00 scoring
>> for that reason).
> 
> As mentioned a few times already -- do train Bayes instead. That's a
> mis-fire of Bayes, and needs to be corrected.

The problem is paradoxically the lack of spam - my spamtraps do not get 
enough spam.

Regards,
Marcin Krol

Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 18:36 +0100, Matus UHLAR - fantomas wrote:
> > > Ned Slider wrote:
> > > > Yes, additional DNSBLs such as psbl and uceprotect can be integrated 
> > > > into SA
> 
> > On Thu, 2008-12-11 at 15:19 +0100, Marcin Krol wrote:
> > > Well, isn't it better to use them before SA, provided your MTA does have
> > > this feature (I recommend Exim to everyone)?
> 
> On 11.12.08 17:55, Karsten Bräckelmann wrote:
> > No -- unless you ultimately trust the RBL to produce a *negligible*
> > amount of FPs. Every single RBL does have FPs to a highly variable
> > degree. Instead of outright blocking on a hit, it is a good idea to
> > assign a score for the hit only, and see what the result is after all
> > tests have been performed...
> 
> However, using blacklists before SA saves a lot of bandwidth and CPU time.
> Our company's servers refuse ~3x more clients per day than the number of
> mails they process per day.

That may very well be.  My point is that you had better verify *carefully*
(to avoid the word "paranoid") whether you can trust an RBL for outright
blocking at SMTP level. Hence the "unless" part. The RBLs mentioned
aren't, say, ZEN...

This branch of the thread discusses adding more RBLs, which aren't even
part of stock SA for scoring.


> > Exactly the SA approach. A single (or even a few) rules and RBLs can
> > misfire, without affecting the overall deliverability of a particular
> > mail.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: (newbie question) Increasing SA effectiveness

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

> > Ned Slider wrote:
> > > Yes, additional DNSBLs such as psbl and uceprotect can be integrated 
> > > into SA

> On Thu, 2008-12-11 at 15:19 +0100, Marcin Krol wrote:
> > Well, isn't it better to use them before SA, provided your MTA does have
> > this feature (I recommend Exim to everyone)?

On 11.12.08 17:55, Karsten Bräckelmann wrote:
> No -- unless you ultimately trust the RBL to produce a *negligible*
> amount of FPs. Every single RBL does have FPs to a highly variable
> degree. Instead of outright blocking on a hit, it is a good idea to
> assign a score for the hit only, and see what the result is after all
> tests have been performed...

However, using blacklists before SA saves a lot of bandwidth and CPU time.
Our company's servers refuse ~3x more clients per day than the number of
mails they process per day.

Being able to configure a combination of scoring and rejecting mail,
without the need to receive it as a whole, would be nice.

Good that at least postfix supports pre-data filtering...

> Exactly the SA approach. A single (or even a few) rules and RBLs can
> misfire, without affecting the overall deliverability of a particular
> mail.


-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The 3 biggest disasters: Hiroshima 45, Tschernobyl 86, Windows 95

Re: (newbie question) Increasing SA effectiveness

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Thu, 2008-12-11 at 15:19 +0100, Marcin Krol wrote:
> Ned Slider wrote:
> 
> > Yes, additional DNSBLs such as psbl and uceprotect can be integrated 
> > into SA
> 
> Well, isn't it better to use them before SA, provided your MTA does have
> this feature (I recommend Exim to everyone)?

No -- unless you ultimately trust the RBL to produce a *negligible*
amount of FPs. Every single RBL does have FPs to a highly variable
degree. Instead of outright blocking on a hit, it is a good idea to
assign a score for the hit only, and see what the result is after all
tests have been performed...

Exactly the SA approach. A single (or even a few) rules and RBLs can
misfire, without affecting the overall deliverability of a particular
mail.
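
To illustrate the scoring approach, here is a minimal sketch of a local
rule that queries one of the additional DNSBLs mentioned earlier (PSBL)
and merely adds points on a hit. The rule name LOCAL_RCVD_IN_PSBL and
the score of 1.0 are assumptions chosen for illustration, not tuned
values.

```
# Look up the last external relay in PSBL and score a hit, not reject.
# Rule name and score are illustrative; psbl.surriel.com is the PSBL zone.
header   LOCAL_RCVD_IN_PSBL  eval:check_rbl('psbl-lastexternal', 'psbl.surriel.com.')
describe LOCAL_RCVD_IN_PSBL  Last external relay is listed in PSBL
tflags   LOCAL_RCVD_IN_PSBL  net
score    LOCAL_RCVD_IN_PSBL  1.0
```

With a score well below the spam threshold, a single misfire of this
rule cannot push an otherwise clean mail over the limit on its own.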

> > Also look at setting up Bayes and train it well. A well trained Bayes 
> > setup can hit 99% plus spam (for me) and can be highly effective.
> 
> Except I found that while it often gets positive identification right,
> it sometimes produces false negatives (BAYES_00 negative scoring gets
> fired on what it should classify as spam -- I reduced BAYES_00 scoring
> for that reason).

As mentioned a few times already -- do train Bayes instead. That's a
mis-fire of Bayes, and needs to be corrected.

  guenther

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: (newbie question) Increasing SA effectiveness

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

> Ned Slider wrote:
> >Also look at setting up Bayes and train it well. A well trained Bayes 
> >setup can hit 99% plus spam (for me) and can be highly effective.

On 11.12.08 15:19, Marcin Krol wrote:
> Except I found that while it often gets positive identification right,
> it sometimes produces false negatives (BAYES_00 negative scoring gets
> fired on what it should classify as spam -- I reduced BAYES_00 scoring
> for that reason).

That's apparently a problem of a badly trained Bayes, not a problem of
Bayes itself. Train it on more spam.

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
We are but packets in the Internet of life (userfriendly.org)

Re: (newbie question) Increasing SA effectiveness

Posted by Marcin Krol <mr...@gmail.com>.

Ned Slider wrote:

> Yes, additional DNSBLs such as psbl and uceprotect can be integrated 
> into SA

Well, isn't it better to use them before SA, provided your MTA does have
this feature (I recommend Exim to everyone)?

> Also look at setting up Bayes and train it well. A well trained Bayes 
> setup can hit 99% plus spam (for me) and can be highly effective.

Except I found that while it often gets positive identification right,
it sometimes produces false negatives (BAYES_00 negative scoring gets
fired on what it should classify as spam -- I reduced BAYES_00 scoring
for that reason).

> My most effective rule classes are Bayes, DNSBLs and URIBLs plus my own 
> custom rules for stuff SA routinely misses.

> Add on 3rd party rules like JM_SOUGHT and SARE can be useful too so 
> maybe look at those as well.

That probably is stuff to look at, thanks!

Regards,
Marcin

Re: (newbie question) Increasing SA effectiveness

Posted by Ned Slider <ne...@unixmail.co.uk>.

Matthias Leisi wrote:
> Marcin Krol wrote:
> 
>> Is anybody here willing to share other / better techniques and tips?
> 
> No silver bullet, only blood, sweat and tears :-)
> 

I agree.

> * Create custom rules to match your uncaught spam (and maybe share 
> these rules back on this list).
> 

Yes, custom rules are a great way of supplementing SA's scoring. But 
score your custom rules low to start with, and ALWAYS run 'spamassassin 
--lint' to check them BEFORE restarting SA because, if you're anything 
like me, you will make typos!
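
A minimal sketch of such a low-scored custom rule, assuming a phrase
that keeps showing up in uncaught spam. The rule name LOCAL_CHEAP_MEDS,
the pattern and the score are all hypothetical examples.

```
# Hypothetical rule for a phrase seen in spam that slipped through.
# Start with a low score and raise it only once the rule has proven itself.
body     LOCAL_CHEAP_MEDS  /\bcheap\s+(?:meds|pills)\b/i
describe LOCAL_CHEAP_MEDS  Body mentions cheap meds or pills (example)
score    LOCAL_CHEAP_MEDS  0.5

# After saving, check the syntax before restarting SA:
#   spamassassin --lint
```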

> * If circumstances permit, make use of extensive whitelisting, so that 
> you can increase the score of rules (or maybe lower the threshold after 
> which you consider a message to be spam).
> 
> * Experiment with additional blacklists (but beware of false positives).
> 

Yes, additional DNSBLs such as psbl and uceprotect can be integrated into SA.

Also look at setting up Bayes and train it well. A well trained Bayes 
setup can hit 99% plus spam (for me) and can be highly effective.

My most effective rule classes are Bayes, DNSBLs and URIBLs plus my own 
custom rules for stuff SA routinely misses.

Add-on 3rd-party rules like JM_SOUGHT and SARE can be useful too, so 
maybe look at those as well.

Re: (newbie question) Increasing SA effectiveness

Posted by Matthias Leisi <ma...@leisi.net>.

Marcin Krol wrote:

> Is anybody here willing to share other / better techniques and tips?

No silver bullet, only blood, sweat and tears :-)

* Create custom rules to match your uncaught spam (and maybe share 
these rules back on this list).

* If circumstances permit, make use of extensive whitelisting, so that 
you can increase the score of rules (or maybe lower the threshold after 
which you consider a message to be spam).

* Experiment with additional blacklists (but beware of false positives).

* Consider using some blacklist(s) to actually reject messages before 
they reach SpamAssassin (often, the Spamhaus lists are fine for that 
purpose).
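
The whitelisting point above (raising rule scores or lowering the
threshold) could look roughly like this in local.cf. The numbers are
arbitrary illustrations rather than recommendations; 5.0 is the stock
threshold and URIBL_BLACK is a stock rule.

```
# Lower the spam threshold from the stock 5.0. Only sensible once
# wanted mail is reliably whitelisted. 4.5 is an arbitrary example.
required_score  4.5

# Alternatively, raise the score of a stock rule that works well locally.
# 3.5 is an arbitrary example value.
score  URIBL_BLACK  3.5
```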

-- Matthias