Posted to users@spamassassin.apache.org by Matt Kettler <mk...@evi-inc.com> on 2004/02/11 17:35:01 UTC

Re: SA Problem: spam with random words to defeat Baysian filtering ...

At 11:05 AM 2/11/2004, Robert S. Sciuk wrote:
>I've just joined the list, and requested FAQ and info from the majordomo.
>In the absence of either one, I am forced to ask the following of the list
>with no knowledge of whether it is an FAQ or not -- sorry.

The FAQ is actually a wiki web, and it's linked from the spamassassin.org 
main page.

http://wiki.spamassassin.org/w/

>As indicated in the subject line, I'm getting negative hit rates on spam
>which uses random dictionary words.  Obviously sa-learn cannot learn how
>to deal with such an approach, and my formerly brilliant
>sendmail/spamassassin configuration is now next to useless - as I'm
>getting 200 - 300 spam's per day.
>
>Can anyone point me to a solution or a counter-counter measure to kill
>this damn spam??

This is quite surprising to me. I've been getting a lot of the "random
word" spams too, but feeding them to sa-learn has been quite effective.

If you've got a lot of input to bayes, the random-word attacks wind up 
being more-or-less a wash.
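
A rough illustration of the "wash" effect, as a toy sketch only. This is NOT
SpamAssassin's actual Bayes code; the smoothing constants (s=1.0, x=0.5), the
function names and the tiny corpus are all assumptions made for the example.
It shows that tokens the filter has never seen score a neutral 0.5, so padding
a spam with unseen dictionary words leaves the combined score unchanged.

```python
import math
from collections import Counter


def train(messages):
    """Count in how many messages each lowercase token appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_prob(token, spam_counts, ham_counts, n_spam, n_ham, s=1.0, x=0.5):
    """Smoothed per-token spam probability (Robinson-style smoothing).

    A token never seen in training returns exactly x (0.5), i.e. neutral.
    """
    n = spam_counts[token] + ham_counts[token]
    if n == 0:
        return x
    spam_freq = spam_counts[token] / n_spam
    ham_freq = ham_counts[token] / n_ham
    p = spam_freq / (spam_freq + ham_freq)
    return (s * x + n * p) / (s + n)


def spam_score(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine token probabilities in log-odds space; result is in (0, 1)."""
    log_odds = 0.0
    for token in sorted(set(text.lower().split())):
        p = token_prob(token, spam_counts, ham_counts, n_spam, n_ham)
        log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    # Hypothetical miniature training corpus.
    spam = ["cheap pills online now", "cheap pills discount", "buy pills online"]
    ham = ["meeting agenda for monday", "please review the patch", "lunch on monday"]
    sc, hc = train(spam), train(ham)

    plain = "cheap pills online"
    padded = plain + " walrus cathedral umbrella quorum"
    print(spam_score(plain, sc, hc, len(spam), len(ham)))
    print(spam_score(padded, sc, hc, len(spam), len(ham)))
```

Random words that do occur in your ham pull the score down at first, which is
why learning the padded spams matters: once they are learned, those words
show up in both classes and drift back toward neutral.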

So far this month, I've had 7 false negatives and 0 false positives. Most of
the "dictionary bayes poison" spams are getting BAYES_99 for me.

For reference, and for those wondering about the full details of how I get
those results, my config consists of:

         DCC, razor2 and RBLs used.
         habeas_swe score forced down to -1.0
         bayes_ignore_header statements for all the habeas SWE headers
         bayes_auto_learn_threshold_nonspam -0.3

         A few add-on rules:
                 antidrug.cf (gee, there's a shock, since I wrote it ;)
                         http://mywebpages.comcast.net/mkettler/sa/antidrug.cf

                 A collapsed version of popcorn that's just 2 rules.
                         Based on http://www.emtinc.net/includes/popcorn.cf,
                         but edited by me to only be 2 rules

                 A few rules from http://www.merchantsoverseas.com/wwwroot/gorilla/body.txt
                         L_b_MaskedW0rds*
                 A few rules from http://www.exit0.us/index.php/FredsRules-SUBJECT
                         FVGT_s_OBFU_*

                 One of the blackholes.us blacklists added, with score set
                 fairly low to avoid FPs.
                         header   RCVD_IN_CHINA_KR  eval:check_rbl('country', 'cn-kr.blackholes.us.')
                         describe RCVD_IN_CHINA_KR  Received from China or Korea
                         score    RCVD_IN_CHINA_KR  1.0

                 About 15 negative-scoring rules which have "industry
                 specific" phrases for my company's business in them.

I feed bayes with some spamtraps and nonspamtraps each day, giving it about 
100 spams, and 25 nonspams in manual training daily.



Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Bob George <ma...@ttlexceeded.com>.
Raquel Rice <ra...@thericehouse.net> wrote:
> [...]
> That isn't what I asked.  I get over a thousand emails per day,
> personally.  Those are from all the lists I'm on, all the
> personal mail, and all the business mail.  I assume that
> Matt's email is similar.  What I'm asking is, how to select
> 125 per day out of 1000?

I just go through and delete "borderline" cases from my inbox (mbox), that is,
messages that are OK but "spammy", and occasionally run sa-learn manually on
what remains as ham. I do the same against folders for mailing lists that have
low/no spam hits. So I simply PRUNE my inbox before training on any large
amount of ham. (more below)

> (I've been going through all my messages each day, manually
> moving "ham" to a ham directory and moving "spam" to a spam
> directory ... a long and tedious job ... then using that to
> train bayes)

The key for me is keeping spam OUT of my inbox altogether for quick downloads,
reading and required daily maintenance. Perhaps set a lower spam threshold
initially, then automatically sort messages above threshold into "obvious" and
"maybe" spam folders? This would help keep your inbox spam-free (mostly), while
not dumping useful but not-as-important stuff.
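
A minimal sketch of that kind of two-level sorting, in case it helps. The
folder names and both thresholds (3.0 and 10.0) are made up for illustration,
and it assumes the message has already been through SpamAssassin so that an
X-Spam-Status header carrying "hits=" (2.x) or "score=" (3.x) is present.

```python
import re
from email import message_from_string

# Matches "hits=7.2" (SpamAssassin 2.x) or "score=7.2" (3.x) in X-Spam-Status.
_SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def header_score(raw_message):
    """Return the score found in X-Spam-Status, or None if there is none."""
    msg = message_from_string(raw_message)
    match = _SCORE_RE.search(msg.get("X-Spam-Status", ""))
    return float(match.group(1)) if match else None


def pick_folder(score, maybe_threshold=3.0, obvious_threshold=10.0):
    """Map a score to a folder name; both thresholds are hypothetical."""
    if score is None or score < maybe_threshold:
        return "inbox"
    if score >= obvious_threshold:
        return "obvious-spam"
    return "maybe-spam"


if __name__ == "__main__":
    raw = (
        "X-Spam-Status: Yes, hits=12.3 required=5.0 tests=BAYES_99\n"
        "Subject: example\n"
        "\n"
        "body\n"
    )
    print(pick_folder(header_score(raw)))
```

In practice the same split is usually done directly in procmail or the MUA's
filters; the point is only that two thresholds give three piles.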

I manually sort the false positives out of the "maybe spam" folder and just
drag messages to the "not spam" and "confirmed spam" folders. I have a cron
script that automatically runs sa-learn on those two folders several times a
day. Since anything in the "not" or "confirmed" folder has been verified, I'm
comfortable with this. This way, I don't have to worry about training daily. I
just do it as time allows, yet still enjoy a spam-free inbox.
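
For anyone wanting to script the same thing, here is a hedged sketch of what
such a cron-driven helper could look like. It is not the actual script; the
mbox paths are placeholders, and it assumes the verified folders are in mbox
format so that sa-learn's --mbox option applies. By default it only prints the
commands (dry run) rather than running them.

```python
import subprocess


def sa_learn_commands(ham_mbox, spam_mbox):
    """Build the two sa-learn invocations for verified ham and spam mboxes."""
    return [
        ["sa-learn", "--ham", "--mbox", ham_mbox],
        ["sa-learn", "--spam", "--mbox", spam_mbox],
    ]


def run_training(ham_mbox, spam_mbox, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    commands = sa_learn_commands(ham_mbox, spam_mbox)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True raises CalledProcessError if sa-learn exits non-zero.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Placeholder paths; substitute your own verified folders.
    run_training("/home/user/mail/not-spam", "/home/user/mail/confirmed-spam")
```

Run from cron a few times a day, this only ever learns from mail that has
already been sorted by hand, which is what makes unattended training safe.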

Daily use is virtually spam-free, and I just sort when convenient. Once bayes
came up to speed, I started dumping anything over the bayes auto_learn
threshold, since I had zero false positives at that level. So even the "maybe
spam" folder isn't overwhelming. If it starts to get cumbersome, I might even
crank this threshold back a couple of points, as I've yet to have a false
positive score much more than 6.

I don't get 1,000 messages personally each day, but over 500 come through
regularly. I find this quite manageable.

- Bob


Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by "Jack L. Stone" <ja...@sage-american.com>.
>
>> From: "Raquel Rice" <ra...@thericehouse.net>
>> > On Wed, 11 Feb 2004 18:54:13 -0800
>> > "jdow" <jd...@earthlink.net> wrote:
>> > 
>> > > From: "Raquel Rice" <ra...@thericehouse.net>
>> > > > On Wed, 11 Feb 2004 11:35:01 -0500
>> > > > Matt Kettler <mk...@evi-inc.com> wrote:
>> > > > 
>> > > > > I feed bayes with some spamtraps and nonspamtraps each
>> > > > > day, giving it about 100 spams, and 25 nonspams in manual
>> > > > > training daily.
>> > > > 
>> > > > How do you select, out of all your mail, 125 emails to train
>> > > > bayes with?
>> > > 
>> > > Might it be because SA seems to need 200 spams before the
>> > > Bayes filter kicks in? (It performs remarkably well here with
>> > > a corpus of some 450 spams and 700 or so hams.
>> > > 
>> > 
>> > That isn't what I asked.  I get over a thousand emails per day,
>> > personally.  Those are from all the lists I'm on, all the
>> > personal mail, and all the business mail.  I assume that Matt's
>> > email is similar.  What I'm asking is, how to select 125 per day
>> > out of 1000?
>> > 
>> > (I've been going through all my messages each day, manually
>> > moving "ham" to a ham directory and moving "spam" to a spam
>> > directory ... a long and tedious job ... then using that to
>> > train bayes)
>> 

Raquel: Don't know if this is what you want either, but sounds like it.

Right down at the very bottom of my global procmailrc, I place this recipe
to send a copy of the "HAM" to a special HAM collection folder. The other
copy is delivered to the appropriate user mbox. This figures that if the
messages made it through all of the other recipes above -- it's HAM.

Same with SPAM. When any of the recipes spots a SPAM, a copy goes to a SPAM
collection folder.

Then, at midnight, a cron job feeds both HAM & SPAM using sa-learn.

Hope this helps......

Best regards,
Jack L. Stone,
Administrator

Sage American
http://www.sage-american.com
jacks@sage-american.com

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by "Jack L. Stone" <ja...@sage-american.com>.
From my earlier message:
Right down at the very bottom of my global procmailrc, I place this recipe
to send a copy of the "HAM" to a special HAM collection folder. The other
copy is delivered to the appropriate user mbox. This figures that if the
messages made it through all of the other recipes above -- it's HAM.
-------------------------------------------------
Sorry.......
NOW, for the recipe at the bottom:
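## Send a copy to the Ham folder, then deliver normally
:0
{
  # c = work on a copy; trailing colon = use a local lockfile
  :0c:
  $HAM

  # deliver the original to the default mailbox
  :0:
  $DEFAULT
}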

Best regards,
Jack L. Stone,
Administrator

Sage American
http://www.sage-american.com
jacks@sage-american.com

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.
On Wed, 11 Feb 2004 20:01:19 -0800
"jdow" <jd...@earthlink.net> wrote:

> From: "Raquel Rice" <ra...@thericehouse.net>
> > On Wed, 11 Feb 2004 18:54:13 -0800
> > "jdow" <jd...@earthlink.net> wrote:
> > 
> > > From: "Raquel Rice" <ra...@thericehouse.net>
> > > > On Wed, 11 Feb 2004 11:35:01 -0500
> > > > Matt Kettler <mk...@evi-inc.com> wrote:
> > > > 
> > > > > I feed bayes with some spamtraps and nonspamtraps each
> > > > > day, giving it about 100 spams, and 25 nonspams in manual
> > > > > training daily.
> > > > 
> > > > How do you select, out of all your mail, 125 emails to train
> > > > bayes with?
> > > 
> > > Might it be because SA seems to need 200 spams before the
> > > Bayes filter kicks in? (It performs remarkably well here with
> > > a corpus of some 450 spams and 700 or so hams.
> > > 
> > 
> > That isn't what I asked.  I get over a thousand emails per day,
> > personally.  Those are from all the lists I'm on, all the
> > personal mail, and all the business mail.  I assume that Matt's
> > email is similar.  What I'm asking is, how to select 125 per day
> > out of 1000?
> > 
> > (I've been going through all my messages each day, manually
> > moving "ham" to a ham directory and moving "spam" to a spam
> > directory ... a long and tedious job ... then using that to
> > train bayes)
> 
> Would you believe I used lowly, antiquated, silly, old "mail" to
> do the job? It took only about a day to get enough ham and spam.
> Then I abandoned it until I had to train the Bayes filter for my
> partner, who I just talked into going through the SpamAssassin
> filter I had setup on our Linux Internet connection machine. (I
> use fetchmail to drag raw material into the system.) It was
> painful. But it was only for a day. Selecting spam to feed it was
> not hard. Even the tiny fraction of the titles shown by mail makes
> it pretty easy to select out spam. I have a pretty good idea of
> the names I see even on the kernel mailing list. (I guess I am a
> piker. I only have an honest 600-1000 a day depending on the day
> of the week and which list explodes. (Today it was fedora - which
> I am thinking of dropping in favor of Debian or Mandrake.)
> 
> {^_^}    Joanne

Please respond to the list, rather than to me personally.

You still don't answer the question, which I asked Matt.

-- 
Raquel
============================================================
Say no to racism, sexism, no to homophobia and all forms of bigotry
and discrimination and say yes to sisterhood and brotherhood of all
humankind.
  --Coretta Scott King


Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.
On Wed, 11 Feb 2004 18:54:13 -0800
"jdow" <jd...@earthlink.net> wrote:

> From: "Raquel Rice" <ra...@thericehouse.net>
> > On Wed, 11 Feb 2004 11:35:01 -0500
> > Matt Kettler <mk...@evi-inc.com> wrote:
> > 
> > > I feed bayes with some spamtraps and nonspamtraps each day,
> > > giving it about 100 spams, and 25 nonspams in manual training
> > > daily.
> > 
> > How do you select, out of all your mail, 125 emails to train
> > bayes with?
> 
> Might it be because SA seems to need 200 spams before the Bayes
> filter kicks in? (It performs remarkably well here with a corpus
> of some 450 spams and 700 or so hams.
> 

That isn't what I asked.  I get over a thousand emails per day,
personally.  Those are from all the lists I'm on, all the personal
mail, and all the business mail.  I assume that Matt's email is
similar.  What I'm asking is, how to select 125 per day out of 1000?

(I've been going through all my messages each day, manually moving
"ham" to a ham directory and moving "spam" to a spam directory ... a
long and tedious job ... then using that to train bayes)

-- 
Raquel


Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by jdow <jd...@earthlink.net>.
From: "Raquel Rice" <ra...@thericehouse.net>
> On Wed, 11 Feb 2004 11:35:01 -0500
> Matt Kettler <mk...@evi-inc.com> wrote:
> 
> > I feed bayes with some spamtraps and nonspamtraps each day, giving
> > it about 100 spams, and 25 nonspams in manual training daily.
> 
> How do you select, out of all your mail, 125 emails to train bayes
> with?

Might it be because SA seems to need 200 spams before the Bayes
filter kicks in? (It performs remarkably well here with a corpus
of some 450 spams and 700 or so hams.)

{^_-}

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.
On Wed, 11 Feb 2004 11:35:01 -0500
Matt Kettler <mk...@evi-inc.com> wrote:

> I feed bayes with some spamtraps and nonspamtraps each day, giving
> it about 100 spams, and 25 nonspams in manual training daily.

How do you select, out of all your mail, 125 emails to train bayes
with?

-- 
Raquel