Posted to users@spamassassin.apache.org by "Robert S. Sciuk" <ro...@ControlQ.com> on 2004/02/11 17:05:38 UTC

SA Problem: spam with random words to defeat Baysian filtering ...

I've just joined the list, and requested FAQ and info from the majordomo.
In the absence of either one, I am forced to ask the following of the list
with no knowledge of whether it is an FAQ or not -- sorry.

As indicated in the subject line, I'm getting negative hit rates on spam
which uses random dictionary words.  Obviously sa-learn cannot learn how
to deal with such an approach, and my formerly brilliant
sendmail/spamassassin configuration is now next to useless - as I'm
getting 200 - 300 spam's per day.

Can anyone point me to a solution or a counter-counter measure to kill
this damn spam??

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=
Robert S. Sciuk		http://www.controlq.com		259 Simcoe St. S.
Control-Q Research	tel: 905.576.8028		Oshawa, Ont.
rob@controlq.com	fax: 905.576.8386  		Canada, L1H 4H3

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Bob George <ma...@ttlexceeded.com>.

Raquel Rice <ra...@thericehouse.net> wrote:
> [...]
> That isn't what I asked.  I get over a thousand emails per day,
> personally.  Those are from all the lists I'm on, all the
> personal mail, and all the business mail.  I assume that
> Matt's email is similar.  What I'm asking is, how to select
> 125 per day out of 1000?

I just go through my inbox (an mbox file), delete the "borderline" cases (that
is, messages that are OK but "spammy"), and occasionally run sa-learn on what
remains as ham. I do the same against the folders for mailing lists that have
low/no spam hits. So I simply PRUNE my inbox before training on any large
amount of ham. (More below.)

> (I've been going through all my messages each day, manually
> moving "ham" to a ham directory and moving "spam" to a spam
> directory ... a long and tedious job ... then using that to
> train bayes)

The key for me is keeping spam OUT of my inbox altogether for quick downloads,
reading and required daily maintenance. Perhaps set a lower spam threshold
initially, then automatically sort messages above threshold into "obvious" and
"maybe" spam folders? This would help keep your inbox spam-free (mostly), while
not dumping useful but not-as-important stuff.
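
A minimal sketch of that two-tier sorting, written in Python purely for illustration. The folder names and both threshold values are assumptions, not settings taken from this post; in practice the sorting would more likely be done by procmail rules on the X-Spam headers.

```python
# Hypothetical thresholds: anything at or above OBVIOUS is filed as obvious
# spam, anything between MAYBE and OBVIOUS goes to a folder for manual review.
MAYBE_THRESHOLD = 4.0
OBVIOUS_THRESHOLD = 10.0


def triage(score, maybe=MAYBE_THRESHOLD, obvious=OBVIOUS_THRESHOLD):
    """Return the (assumed) folder name for a message with this SA score."""
    if score >= obvious:
        return "obvious-spam"
    if score >= maybe:
        return "maybe-spam"
    return "inbox"
```

Only the "maybe-spam" folder then needs a human look, which is what keeps the daily chore small.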

I manually sort the false positives out of the "maybe spam" folder, dragging
messages to the "not spam" and "confirmed spam" folders. I have a cron script
that automatically runs sa-learn on those folders several times a day. Since
anything in the "not" or "confirmed" folder has been verified, I'm comfortable
with this. This way, I don't have to worry about training daily. I just do it
as time allows, yet still enjoy a spam-free inbox.
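
As a rough sketch of what such a cron-driven job could look like, the Python below only builds the sa-learn command lines; it does not run them. The folder paths and the cron schedule are made up for illustration. The sa-learn options used (--ham, --spam, --mbox) are the standard ones, but check your installed version's manpage.

```python
import shlex

# Hypothetical paths: the post names the folders but not where they live.
VERIFIED_FOLDERS = {
    "ham": "/home/user/mail/not-spam",
    "spam": "/home/user/mail/confirmed-spam",
}


def sa_learn_commands(folders=VERIFIED_FOLDERS):
    """Build one sa-learn argv per verified mbox folder."""
    commands = []
    for kind, path in sorted(folders.items()):
        commands.append(["sa-learn", f"--{kind}", "--mbox", path])
    return commands


def as_crontab_line(argv, schedule="0 */6 * * *"):
    """Format an argv as a crontab entry (the schedule is an assumption)."""
    return f"{schedule} {shlex.join(argv)}"
```

Feeding the same folder to sa-learn repeatedly should be harmless, since it skips messages it has already learned.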

Daily use is virtually spam-free, and I just sort when convenient. Once bayes
came up to speed, I started dumping anything over the bayes auto_learn
threshold, since I had zero false positives at that level. So even the "maybe
spam" folder isn't overwhelming. If it starts to get cumbersome, I might even
crank this threshold back a couple of points, as I've yet to have a false
positive score much more than 6.

I don't get 1,000 messages personally each day, but over 500 come through
regularly. I find this quite manageable.

- Bob

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by "Jack L. Stone" <ja...@sage-american.com>.

>
>> From: "Raquel Rice" <ra...@thericehouse.net>
>> > On Wed, 11 Feb 2004 18:54:13 -0800
>> > "jdow" <jd...@earthlink.net> wrote:
>> > 
>> > > From: "Raquel Rice" <ra...@thericehouse.net>
>> > > > On Wed, 11 Feb 2004 11:35:01 -0500
>> > > > Matt Kettler <mk...@evi-inc.com> wrote:
>> > > > 
>> > > > > I feed bayes with some spamtraps and nonspamtraps each
>> > > > > day, giving it about 100 spams, and 25 nonspams in manual
>> > > > > training daily.
>> > > > 
>> > > > How do you select, out of all your mail, 125 emails to train
>> > > > bayes with?
>> > > 
>> > > Might it be because SA seems to need 200 spams before the
>> > > Bayes filter kicks in? (It performs remarkably well here with
>> > > a corpus of some 450 spams and 700 or so hams.
>> > > 
>> > 
>> > That isn't what I asked.  I get over a thousand emails per day,
>> > personally.  Those are from all the lists I'm on, all the
>> > personal mail, and all the business mail.  I assume that Matt's
>> > email is similar.  What I'm asking is, how to select 125 per day
>> > out of 1000?
>> > 
>> > (I've been going through all my messages each day, manually
>> > moving "ham" to a ham directory and moving "spam" to a spam
>> > directory ... a long and tedious job ... then using that to
>> > train bayes)
>> 

Raquel: Don't know if this is what you want either, but sounds like it.

Right down at the very bottom of my global procmailrc, I place this recipe
to send a copy of the "HAM" to a special HAM collection folder. The other
copy is delivered to the appropriate user mbox. This figures that if the
messages made it through all of the other recipes above -- it's HAM.

Same with SPAM. Whenever one of the recipes spots a SPAM, a copy goes to a SPAM
collection folder.

Then, at midnight, a cron job feeds both HAM & SPAM using sa-learn.

Hope this helps......

Best regards,
Jack L. Stone,
Administrator

Sage American
http://www.sage-american.com
jacks@sage-american.com

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by "Jack L. Stone" <ja...@sage-american.com>.

From my earlier message:
Right down at the very bottom of my global procmailrc, I place this recipe
to send a copy of the "HAM" to a special HAM collection folder. The other
copy is delivered to the appropriate user mbox. This figures that if the
messages made it through all of the other recipes above -- it's HAM.
-------------------------------------------------
Sorry.......
NOW, for the recipe at the bottom:
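## Send copy to Ham folder
## ($HAM is assumed to be set earlier in the rc file)
:0
{
  # "c" delivers a copy and keeps processing; the trailing ":" uses a lockfile
  :0c:
  $HAM
  # Deliver the original to the user's default mailbox
  :0:
  $DEFAULT
}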
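
For anyone who does not read procmail, here is a rough Python equivalent of what the recipe above does: file one copy in the ham collection and then deliver the message normally. It is only a sketch using the standard mailbox module; the function name and the paths passed to it are hypothetical, and the explicit lock stands in for procmail's lockfile.

```python
import mailbox
from email.message import EmailMessage


def deliver_with_copy(msg, ham_path, default_path):
    """Append msg to the ham collection mbox, then to the default mbox."""
    for path in (ham_path, default_path):
        box = mailbox.mbox(path)  # creates the mbox file if it is missing
        box.lock()                # rough analogue of procmail's lockfile
        try:
            box.add(msg)
            box.flush()
        finally:
            box.unlock()
            box.close()


def example_message():
    """Build a small test message (contents are made up)."""
    msg = EmailMessage()
    msg["From"] = "someone@example.com"
    msg["Subject"] = "hello"
    msg.set_content("Just an ordinary message.")
    return msg
```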

Best regards,
Jack L. Stone,
Administrator

Sage American
http://www.sage-american.com
jacks@sage-american.com

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.

On Wed, 11 Feb 2004 20:01:19 -0800
"jdow" <jd...@earthlink.net> wrote:

> From: "Raquel Rice" <ra...@thericehouse.net>
> > On Wed, 11 Feb 2004 18:54:13 -0800
> > "jdow" <jd...@earthlink.net> wrote:
> > 
> > > From: "Raquel Rice" <ra...@thericehouse.net>
> > > > On Wed, 11 Feb 2004 11:35:01 -0500
> > > > Matt Kettler <mk...@evi-inc.com> wrote:
> > > > 
> > > > > I feed bayes with some spamtraps and nonspamtraps each
> > > > > day, giving it about 100 spams, and 25 nonspams in manual
> > > > > training daily.
> > > > 
> > > > How do you select, out of all your mail, 125 emails to train
> > > > bayes with?
> > > 
> > > Might it be because SA seems to need 200 spams before the
> > > Bayes filter kicks in? (It performs remarkably well here with
> > > a corpus of some 450 spams and 700 or so hams.
> > > 
> > 
> > That isn't what I asked.  I get over a thousand emails per day,
> > personally.  Those are from all the lists I'm on, all the
> > personal mail, and all the business mail.  I assume that Matt's
> > email is similar.  What I'm asking is, how to select 125 per day
> > out of 1000?
> > 
> > (I've been going through all my messages each day, manually
> > moving "ham" to a ham directory and moving "spam" to a spam
> > directory ... a long and tedious job ... then using that to
> > train bayes)
> 
> Would you believe I used lowly, antiquated, silly, old "mail" to
> do the job? It took only about a day to get enough ham and spam.
> Then I abandoned it until I had to train the Bayes filter for my
> partner, who I just talked into going through the SpamAssassin
> filter I had setup on our Linux Internet connection machine. (I
> use fetchmail to drag raw material into the system.) It was
> painful. But it was only for a day. Selecting spam to feed it was
> not hard. Even the tiny fraction of the titles shown by mail makes
> it pretty easy to select out spam. I have a pretty good idea of
> the names I see even on the kernel mailing list. (I guess I am a
> piker. I only have an honest 600-1000 a day depending on the day
> of the week and which list explodes. (Today it was fedora - which
> I am thinking of dropping in favor of Debian or Mandrake.)
> 
> {^_^}    Joanne

Please respond to the list, rather than to me personally.

You still don't answer the question, which I asked Matt.

-- 
Raquel
============================================================
Say no to racism, sexism, no to homophobia and all forms of bigotry
and discrimination and say yes to sisterhood and brotherhood of all
humankind.
  --Coretta Scott King

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.

On Wed, 11 Feb 2004 18:54:13 -0800
"jdow" <jd...@earthlink.net> wrote:

> From: "Raquel Rice" <ra...@thericehouse.net>
> > On Wed, 11 Feb 2004 11:35:01 -0500
> > Matt Kettler <mk...@evi-inc.com> wrote:
> > 
> > > I feed bayes with some spamtraps and nonspamtraps each day,
> > > giving it about 100 spams, and 25 nonspams in manual training
> > > daily.
> > 
> > How do you select, out of all your mail, 125 emails to train
> > bayes with?
> 
> Might it be because SA seems to need 200 spams before the Bayes
> filter kicks in? (It performs remarkably well here with a corpus
> of some 450 spams and 700 or so hams.
> 

That isn't what I asked.  I get over a thousand emails per day,
personally.  Those are from all the lists I'm on, all the personal
mail, and all the business mail.  I assume that Matt's email is
similar.  What I'm asking is, how to select 125 per day out of 1000?

(I've been going through all my messages each day, manually moving
"ham" to a ham directory and moving "spam" to a spam directory ... a
long and tedious job ... then using that to train bayes)

-- 
Raquel
============================================================
Say no to racism, sexism, no to homophobia and all forms of bigotry
and discrimination and say yes to sisterhood and brotherhood of all
humankind.
  --Coretta Scott King

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by jdow <jd...@earthlink.net>.

From: "Raquel Rice" <ra...@thericehouse.net>
> On Wed, 11 Feb 2004 11:35:01 -0500
> Matt Kettler <mk...@evi-inc.com> wrote:
> 
> > I feed bayes with some spamtraps and nonspamtraps each day, giving
> > it about 100 spams, and 25 nonspams in manual training daily.
> 
> How do you select, out of all your mail, 125 emails to train bayes
> with?

Might it be because SA seems to need 200 spams before the Bayes
filter kicks in? (It performs remarkably well here with a corpus
of some 450 spams and 700 or so hams.)

{^_-}

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.

On Wed, 11 Feb 2004 11:35:01 -0500
Matt Kettler <mk...@evi-inc.com> wrote:

> I feed bayes with some spamtraps and nonspamtraps each day, giving
> it about 100 spams, and 25 nonspams in manual training daily.

How do you select, out of all your mail, 125 emails to train bayes
with?

-- 
Raquel
============================================================
Say no to racism, sexism, no to homophobia and all forms of bigotry
and discrimination and say yes to sisterhood and brotherhood of all
humankind.
  --Coretta Scott King

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Matt Kettler <mk...@evi-inc.com>.

At 11:05 AM 2/11/2004, Robert S. Sciuk wrote:
>I've just joined the list, and requested FAQ and info from the majordomo.
>In the absence of either one, I am forced to ask the following of the list
>with no knowledge of whether it is an FAQ or not -- sorry.

The FAQ is actually a wiki web, and it's linked from the spamassassin.org 
main page.

http://wiki.spamassassin.org/w/

>As indicated in the subject line, I'm getting negative hit rates on spam
>which uses random dictionary words.  Obviously sa-learn cannot learn how
>to deal with such an approach, and my formerly brilliant
>sendmail/spamassassin configuration is now next to useless - as I'm
>getting 200 - 300 spam's per day.
>
>Can anyone point me to a solution or a counter-counter measure to kill
>this damn spam??

This is quite surprising to me.. I've been getting a lot of the "random 
word" spams too, but feeding them to sa-learn has been quite effective.

If you've got a lot of input to bayes, the random-word attacks wind up 
being more-or-less a wash.
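
To illustrate why that is, here is a toy calculation in Python. It uses the simple naive-Bayes combining formula from the early Bayesian spam filters, not SpamAssassin's own combiner, which works differently, so treat the numbers as illustrative only. The per-token probabilities are invented.

```python
import math


def combine(probabilities):
    """Combine per-token spam probabilities into one message probability.

    Simplified naive-Bayes rule: P = prod(p) / (prod(p) + prod(1 - p)).
    """
    spam = math.prod(probabilities)
    ham = math.prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# Invented values: three tokens the filter has real evidence about ...
informative = [0.99, 0.90, 0.20]
# ... plus thirty random dictionary words it has seen about equally often
# in ham and spam, so each sits at a neutral 0.5.
padding = [0.5] * 30

base_score = combine(informative)
padded_score = combine(informative + padding)
```

Each neutral token multiplies both products by the same 0.5, so base_score and padded_score come out equal up to rounding. The padding only hurts when the filter has too little training data to know those words are neutral.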

So far this month, I've had 7 false negatives, 0 false positives. Most of 
the "dictionary bayes poison" spams are getting BAYES_99 for me.

For reference, and for those wondering how I get those results, my config
consists of:

         DCC, razor2 and RBLs used.
         habeas_swe score forced down to -1.0
         bayes_ignore_header statements for all the habeas SWE headers
         bayes_auto_learn_threshold_nonspam -0.3

         A few add-on rules:
                 antidrug.cf (gee, there's a shock, since I wrote it ;)
                         http://mywebpages.comcast.net/mkettler/sa/antidrug.cf

                 A collapsed version of popcorn that's just 2 rules.
                         Based on http://www.emtinc.net/includes/popcorn.cf,
                         but edited by me to only be 2 rules

                 A few rules from http://www.merchantsoverseas.com/wwwroot/gorilla/body.txt
                         L_b_MaskedW0rds*
                 A few rules from http://www.exit0.us/index.php/FredsRules-SUBJECT
                         FVGT_s_OBFU_*

                 One of the blackholes.us blacklists added, with score set
                 fairly low to avoid FPs.
                         header   RCVD_IN_CHINA_KR eval:check_rbl('country', 'cn-kr.blackholes.us.')
                         describe RCVD_IN_CHINA_KR Received from China or Korea
                         score    RCVD_IN_CHINA_KR 1.0

                 About 15 negative-scoring rules which have "industry-specific"
                 phrases for my company's business in them.

I feed bayes with some spamtraps and nonspamtraps each day, giving it about 
100 spams, and 25 nonspams in manual training daily.

Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Matthias Fuhrmann <Ma...@stud.uni-hannover.de>.

On Wed, 11 Feb 2004, Robert Menschel wrote:

> Hello Matthias,
>
> Wednesday, February 11, 2004, 5:44:16 PM, you wrote:
>
> >> body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
> >> describe RM_bpt_longwords99 Long string of long words
> >> score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) -  330s/0h of 91714 corpus (74113s/17601h) 01/23/04
>
> MF> can you give some examples for what those rules will hit?
> MF> i've been trying some emails and misc texts with it, and got no hit yet :)
>
> Attached.

thnx :)

regards,
Matthias

Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Matthias,

Wednesday, February 11, 2004, 5:44:16 PM, you wrote:

>> body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
>> describe RM_bpt_longwords99 Long string of long words
>> score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) -  330s/0h of 91714 corpus (74113s/17601h) 01/23/04

MF> can you give some examples for what those rules will hit?
MF> i've been trying some emails and misc texts with it, and got no hit yet :)

Attached.
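
The attachment is not reproduced in this plain-text copy. As a stand-in, here is a small Python sketch that applies the same regular expression to two made-up strings. Python's re syntax matches Perl's for this pattern, but SpamAssassin runs body rules against its own rendered version of the message, so real-world hits may differ from this toy check.

```python
import re

# Same pattern as RM_bpt_longwords99: nine consecutive words of nine or
# more word characters, each followed by whitespace.
LONGWORDS99 = re.compile(r"\b(?:\w{9,}\s+){9}")


def hits_longwords99(body):
    """Return True if the text contains a run that the pattern matches."""
    return LONGWORDS99.search(body) is not None


# Made-up example of "bayes fodder": nine long dictionary words in a row.
fodder = ("quatrains telephone dictionary blueberry nightfall "
          "cardboard television attention wonderful\n")

# Made-up example of ordinary prose, where long words are rarely adjacent.
ordinary = "Please send me the quarterly report before Friday afternoon."
```

Ordinary mail almost never strings nine such words together, which would explain testing it against normal emails and getting no hits.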

Bob Menschel

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Matthias Fuhrmann <Ma...@stud.uni-hannover.de>.

On Wed, 11 Feb 2004, Robert Menschel wrote:

[...]
> 1) Yes, sa-learn DOES deal with these emails, and does so exceedingly
> well here. I call them "bayes fodder", since those random words are
> teaching bayes that emails with those random words are spam.
>
> 2) I then augment bayes with the following rules:

[...]
> score    RM_bpt_longwords98 1.000  # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
> body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
> describe RM_bpt_longwords99 Long string of long words
> score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
>
> Bob Menschel

can you give some examples for what those rules will hit?
i've been trying some emails and misc texts with it, and got no hit yet :)

regards,
Matthias

Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Bob,

Wednesday, February 11, 2004, 5:53:03 PM, you wrote:

BG> Robert Menschel <Ro...@Menschel.net> wrote:
>> 1) Yes, sa-learn DOES deal with these emails, and does so
>> exceedingly well here. I call them "bayes fodder", since those random
>> words are teaching bayes that emails with those random words are spam.

BG> Just to avoid confusion, you're saying that AFTER TRAINING, bayes works quite
BG> well for those messages, right? The key is feeding any messages that DO slip
BG> through into sa-learn as spam UNTIL you get those results, no?

Correct, with one clarification: The key is feeding ANY/ALL messages to
sa-learn, whether or not they have slipped through. The great majority of
spam is caught regardless; if we sa-learn only those that slip through,
then IMO there isn't enough information for Bayes to make this
determination. If ALL confirmed spam is fed to sa-learn, then Bayes will
have enough information.

BG> The "random words" question seems to come up frequently, and TRAINED bayes
BG> seems to be a good answer.

Agreed.

Bob Menschel

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.

On Thu, 12 Feb 2004 09:14:09 -0500
"Bob George" <ma...@ttlexceeded.com> wrote:

> Raquel Rice <ra...@thericehouse.net> wrote:
> > [...]
> > All those lists you're so willing to throw away are working
> > for me. I run all the rule lists chickenpox and Tripwire.
> 
> This is an interesting issue, since I am also on a low-resource
> computing list (lots of DOS holdouts!) and they're as bedeviled by
> spam as the rest of us.
> 
> I've noticed that the add-on rules help recognize new patterns,
> which is very useful for training bayes. But once bayes has the
> patterns, it alone is more than adequate.
> 
> I'm wondering how practical it would be to "train up" a more
> powerful bayes system with the full boat of rules, then just
> transfer the bayes data files to a lower end machine. Or run the
> additional rules, then disable them for performance until new
> patterns emerge.
> 
> Would there be a problem with creating a "bayes repository" and
> share it with others? Of course, it's a shared bayes
> configuration, so there'd need to be some general consensus as to
> what constitutes spam, etc.
> 
> > It takes my poor little 466 only a few seconds to scan for
> > viruses and then for SA to do its work.  I'd be swamped by
> > spam if it weren't for the extra rulesets ... as far as I can
> > tell from all the spam that's caught. My partner downloaded
> > from our server 137 spam messages yesterday, all tagged, and
> > two false negatives ... which I fed to sa-learn.
> 
> That's the model we've discussed for the "low-end gateway" for
> users. Have a "smarter" machine capable of running tools such as SA
> do the work, then just poll for the cleaned up messages using
> whatever software the users want.
> 
> - Bob
> 

I had users who were complaining about spam getting through without
being tagged.  These users had a very low volume of mail.  I
instituted a site-wide bayes database, using bayes_path and
auto_whitelist_path.  I copied my bayes_* and auto-whitelist* files
to the new path and then retrained with about 1000 new messages. 
The number of spams getting through to those users dropped to
nothing!

You may be right about using the extra rules to catch and help train
bayes, but then to remove them.  However, I'm hesitant to do that. 
I think what causes my hesitancy is that I'm seeing terribly low
bayes scores (50 - 56% probability) on spam that is collecting high
scores on the extra rules files.  One this morning scored 0 on
bayes, but gathered 8 from tripwire and 4 from chickenpox.  It was
then marked for auto learning as spam.

I still find the extra rulesets to be invaluable.

-- 
Raquel
============================================================
Agape is disinterested love.  Agape does not begin by discriminating
between worthy and unworthy people, or any qualities people possess.
 It begins by loving others for their sakes.  It springs from the
need of the other person.
  --Martin Luther King, Jr.

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by jdow <jd...@earthlink.net>.

From: "Bob George" <ma...@ttlexceeded.com>

> > It takes my poor little 466 only a few seconds to scan for
> > viruses and then for SA to do its work.  I'd be swamped by
> > spam if it weren't for the extra rulesets ... as far as I can
> > tell from all the spam that's caught. My partner downloaded
> > from our server 137 spam messages yesterday, all tagged, and
> > two false negatives ... which I fed to sa-learn.
>
> That's the model we've discussed for the "low-end gateway" for users. Have a
> "smarter" machine capable of running tools such as SA do the work, then just
> poll for the cleaned up messages using whatever software the users want.

Bob, my trick here is a simple procmail rule to clone the messages into
a junk mailbox on the linux mailserver machine:
--8<--
:0c:
$HOME/mail/rawmbox
--8<--

Then I use "mail" as a tool for performing the quick sort into spam and
ham. It took two days to generate my current spam database. Actual time
spent doing it was about an hour or two. Now that the database is trained
I look for any emails that slip through, find them in the raw mailbox,
and toss them into the spam training file. That takes maybe 10 minutes
every few days if I get worked up when more than a couple percent escape
the scanning process. The Bayesian analysis has made me lazy about
fomenting new explicit rules here. It builds the rules for me. That's
what a computer should do for me, isn't it?

(I'm worried about when the spammers figure out how to defeat the
simple Bayesian analysis. But by then they might have learned that
a trick to survive is to make the advertising interesting. TV was
a LONG time learning this. The current spammers haven't a clue on
this one, yet. But then, it'd take real creative work on their
part. I read that as well beyond them.)

{^_^}   Joanne

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Bob George <ma...@ttlexceeded.com>.

Keith C. Ivey <kc...@cpcug.org> wrote:
> Bob George <ma...@ttlexceeded.com> wrote:
>
>> I've noticed that the add-on rules help recognize new
>> patterns, which is very useful for training bayes. But once
>> bayes has the patterns, it alone is more than adequate.
>
> I'm not sure what you mean by "patterns", but it should be
> clarified that Bayes doesn't deal with patterns like the ones
> recognized by most rules.  It deals only with the presence of
> tokens, and individual tokens at that, not even combinations.
> Rules can recognize much more general and complex patterns in
> messages than anything Bayes can (at least as Bayes is
> implemented in SA).

Ah, I hope I'm not spreading bad information. I'm hardly an SA expert, just a
very happy end-user. It seems that using the add-on rules in conjunction with
bayes has resulted in NONE of the "clever" spams getting through. I have spent
some time thinking through training bayes (including NOT feeding it this list
as ham!) and it seems to have paid off. Perhaps I'm simply benefitting from
better recognition in the basic SA rules.

Just to verify, most spam I receive -- regardless of technique used -- seems to
be tagged with BAYES lately (90+ mostly). So I thought the weird "patterns"
(more correctly, broken-word tokens) were also going into bayes, with the
result that since those odd spellings of v-drug, backhair, spammer domains and
such ONLY show in spam, bayes associates them with statistically indicating
spam. Have I misunderstood?

So if the word "quatrain" only appears in random-word spam (here at least), or
more importantly, never shows in non-spam, it won't help (nor necessarily
hinder) detecting spam. And "eeVagra" and such will ONLY be in spam.
If spammers are using common word lists, I'd think there would be some
repetition, so it *might* help.

Am I off base?

- Bob

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by "Keith C. Ivey" <kc...@cpcug.org>.

Bob George <ma...@ttlexceeded.com> wrote:

> I've noticed that the add-on rules help recognize new patterns, which is very
> useful for training bayes. But once bayes has the patterns, it alone is more
> than adequate.

I'm not sure what you mean by "patterns", but it should be 
clarified that Bayes doesn't deal with patterns like the ones 
recognized by most rules.  It deals only with the presence of 
tokens, and individual tokens at that, not even combinations.  
Rules can recognize much more general and complex patterns in 
messages than anything Bayes can (at least as Bayes is 
implemented in SA).
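
A toy sketch in Python of what "individual tokens" means here. The whitespace tokenizer, the training messages, and the probability formula are all simplifications made up for illustration; SA's real tokenizer and scoring are more involved.

```python
from collections import Counter


def tokenize(text):
    """Split on whitespace and keep the set of distinct lowercase tokens."""
    return set(text.lower().split())


def train(messages):
    """Count, per token, how many messages it appeared in."""
    counts = Counter()
    for text in messages:
        counts.update(tokenize(text))
    return counts


def token_spam_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how spammy one token is, looking at that token alone."""
    spam_rate = spam_counts[token] / n_spam if n_spam else 0.0
    ham_rate = ham_counts[token] / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5  # never seen in training: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


# Made-up training data.
spam_msgs = ["buy eeVagra now", "eeVagra cheap now"]
ham_msgs = ["meeting now at noon", "lunch at noon"]
spam_counts = train(spam_msgs)
ham_counts = train(ham_msgs)
```

Note that tokenize() throws away word order and adjacency entirely. That is the limitation described above: a rule can match "nine long words in a row", while this kind of model only knows which tokens were present.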

-- 
Keith C. Ivey <kc...@cpcug.org>
Washington, DC

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Bob George <ma...@ttlexceeded.com>.

Raquel Rice <ra...@thericehouse.net> wrote:
> [...]
> All those lists you're so willing to throw away are working
> for me. I run all the rule lists chickenpox and Tripwire.

This is an interesting issue, since I am also on a low-resource computing list
(lots of DOS holdouts!) and they're as bedeviled by spam as the rest of us.

I've noticed that the add-on rules help recognize new patterns, which is very
useful for training bayes. But once bayes has the patterns, it alone is more
than adequate.

I'm wondering how practical it would be to "train up" a more powerful bayes
system with the full boat of rules, then just transfer the bayes data files to
a lower end machine. Or run the additional rules, then disable them for
performance until new patterns emerge.

Would there be a problem with creating a "bayes repository" and share it with
others? Of course, it's a shared bayes configuration, so there'd need to be
some general consensus as to what constitutes spam, etc.

> It takes my poor little 466 only a few seconds to scan for
> viruses and then for SA to do its work.  I'd be swamped by
> spam if it weren't for the extra rulesets ... as far as I can
> tell from all the spam that's caught. My partner downloaded
> from our server 137 spam messages yesterday, all tagged, and
> two false negatives ... which I fed to sa-learn.

That's the model we've discussed for the "low-end gateway" for users. Have a
"smarter" machine capable of running tools such as SA do the work, then just
poll for the cleaned up messages using whatever software the users want.

- Bob

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Raquel Rice <ra...@thericehouse.net>.

On Wed, 11 Feb 2004 19:33:10 -0800
"jdow" <jd...@earthlink.net> wrote:

> 
> Bob, as far as I can figure it is not the words themselves that
> trigger the rules as much as the ratios of word lengths. The
> random alphabet word spammers were blocked before I ever installed
> the .cf files I found referenced here like chickenpox,
> 99_FVGT_Tripwire, or any of the others. I was having zero false
> positives and maybe 1-2% false spams. (Now that those extra
> filters are in the path some spams are going right around the
> filter. The poor 133MHz machine I am using to filter two people
> gets plain swamped so some seem to go around the spam filtering
> with no spamd available for connections. So the extra rules
> actually made things worse here. {^_-})
> 
> I am thinking of pulling out the "useless" Tripwire and chickenpox
> scans. Working too hard to achieve perfection wastes more time
> than 1-2% spam does.
> 
> {^_^}

All those lists you're so willing to throw away are working for me. 
I run all the rule lists chickenpox and Tripwire.  It takes my poor
little 466 only a few seconds to scan for viruses and then for SA to
do its work.  I'd be swamped by spam if it weren't for the extra
rulesets ... as far as I can tell from all the spam that's caught. 
My partner downloaded from our server 137 spam messages yesterday,
all tagged, and two false negatives ... which I fed to sa-learn.

-- 
Raquel
============================================================
Say no to racism, sexism, no to homophobia and all forms of bigotry
and discrimination and say yes to sisterhood and brotherhood of all
humankind.
  --Coretta Scott King

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by jdow <jd...@earthlink.net>.

From: "Bob George" <ma...@ttlexceeded.com>

> I think the reason for bayes auto-learning being useful is that the words in
> spam that DIDN'T trip the score get added as well. If those same words appear
> commonly in non-spam, they cancel out. But as was pointed out recently, if
> spammers use random dictionary words that DON'T appear in non-spam, that itself
> is a hint that it might be spammy. It adds to the "smell" of spam, which is why
> I think bayes has been so effective at catching the random-word spams that
> bypass so many rudimentary filters.
>
> Then again, this may simply be an indicator that I subscribe to low-brow lists.
> :)

Bob, as far as I can figure it is not the words themselves that trigger
the rules as much as the ratios of word lengths. The random alphabet
word spammers were blocked before I ever installed the .cf files I found
referenced here like chickenpox, 99_FVGT_Tripwire, or any of the others.
I was having zero false positives and maybe 1-2% false spams. (Now that
those extra filters are in the path some spams are going right around
the filter. The poor 133MHz machine I am using to filter two people gets
plain swamped so some seem to go around the spam filtering with no
spamd available for connections. So the extra rules actually made things
worse here. {^_-})

I am thinking of pulling out the "useless" Tripwire and chickenpox
scans. Working too hard to achieve perfection wastes more time than
1-2% spam does.

{^_^}

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Bob George <ma...@ttlexceeded.com>.

jdow <jd...@earthlink.net> wrote:
> After watching the Bayes filter "learn" to auto white list
> spam when first installed I disabled the auto white list
> feature and explicitly generated lists of ham and spam.

AWL works well for me, but that may have been due to a combination of add-on rules
and luck. I've left it enabled, but scoring of spam has swung to such extremes
(a good thing) thanks to bayes and other rules that it really hasn't impacted
things much one way or the other lately.

It does seem most of the auto-whitelist options are now missing from the
manpage (Mail::SpamAssassin::Conf) so perhaps they've been deprecated as of
late? (Must search archives.)

> When the Bayes filter kicked in after it had accumulated a couple
> hundred ham and spam messages the results were dramatic.

I learned my lesson and have begun storing a collection of 'borderline' spam
for training purposes. Thankfully, I had bayes trained before some of the more
clever spams began to hit, so none have gotten through lately, despite all their
attempts.

> Before then it was somewhat discouraging. I do believe I shall
> leave automatic learning and white listing turned off because
> it seems to false entirely too often for my tastes.

Now that I've read the latest manpage, I'm not really sure WHAT AWL is doing in
my case. I do see AWL score adjustments, but they tend to be slight... at least
in comparison to the massive scores most spam gets. Unless I'm mistaken, AWL
should NOT result in false positives, except where spammers have forged
addresses of real people I get good messages from.
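
A rough sketch of why those adjustments stay small next to a big spam score.
It assumes AWL pulls each message's score part of the way toward the sender's
historical mean, using a factor of 0.5 (which I believe is the documented
default for auto_whitelist_factor). The function name and the sample scores
are made up for illustration; this is not SpamAssassin's own code.

```python
def awl_adjusted_score(score, sender_history, factor=0.5):
    """Pull a message score toward the sender's historical mean score.

    sender_history holds earlier scores seen for this sender. An empty
    list means the sender is unknown, so the score is left untouched
    (assumed behaviour, hypothetical helper).
    """
    if not sender_history:
        return score
    mean = sum(sender_history) / len(sender_history)
    return score + (mean - score) * factor


friend_history = [0.0, 1.0, -1.0]   # mean 0.0: a sender with a clean record

unknown_spammer = awl_adjusted_score(25.0, [])              # no history
spammy_forward = awl_adjusted_score(6.0, friend_history)    # friend's forward
forged_friend = awl_adjusted_score(25.0, friend_history)    # forged address
```

Under these assumptions the unknown sender keeps its full 25.0, the friend's
spammy forward drops from 6.0 to 3.0, and spam forging the friend's address
still lands at 12.5, well above a typical threshold of 5.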

> (The concept also seems a little strange. If it already knows it's
> spam then train it that the message is spam. I'd rather teach
> it with the new spam that is not found than simply rack up
> higher scores by training it that material it knows is spam is
> indeed spam. What am I missing here?)

I think there's a difference between auto-whitelist (AWL) -- based on sender --
and bayes_auto, which trains on content. AWL makes good sense... especially for
messages from my good friend who occasionally forwards spammy stuff of
interest. I've left the defaults for bayes_auto (to autolearn high-scoring
spam), but I do augment it with training from my corpus of about 1,000
low-scoring spams that I verified by hand, and the (infrequent) false negative.

I think the reason for bayes auto-learning being useful is that the words in
spam that DIDN'T trip the score get added as well. If those same words appear
commonly in non-spam, they cancel out. But as was pointed out recently, if
spammers use random dictionary words that DON'T appear in non-spam, that itself
is a hint that it might be spammy. It adds to the "smell" of spam, which is why
I think bayes has been so effective at catching the random-word spams that
bypass so many rudimentary filters.
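
To make that reasoning concrete, here is a minimal sketch of a per-token spam
probability in the style of a simple Bayesian filter. It is NOT the algorithm
SpamAssassin actually uses; the function names, the tiny corpus, and the plain
frequency ratio are all assumptions chosen to show the effect: a word seen
only in spam scores as fully spammy, while a word shared with ham is pulled
back toward neutral.

```python
from collections import Counter


def train(messages):
    """Count, per token, how many messages contain it."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Share of the token's evidence that comes from spam, or None if unseen."""
    s = spam_counts[token] / n_spam   # fraction of spam containing the token
    h = ham_counts[token] / n_ham     # fraction of ham containing the token
    if s + h == 0:
        return None
    return s / (s + h)


# Hypothetical training corpus: "zeppelin" plays the random dictionary word.
spam = ["cheap pills zeppelin marmalade", "cheap pills quixotic zeppelin"]
ham = ["meeting notes for the list", "cheap flights for the meeting"]

spam_counts, ham_counts = train(spam), train(ham)
args = (spam_counts, ham_counts, len(spam), len(ham))

p_zeppelin = token_spam_prob("zeppelin", *args)  # only ever seen in spam
p_cheap = token_spam_prob("cheap", *args)        # seen in both
p_meeting = token_spam_prob("meeting", *args)    # only ever seen in ham
```

In this toy corpus the random word comes out at 1.0, the shared word "cheap"
at about 0.67, and "meeting" at 0.0, which is the "cancel out" effect
described above.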

Then again, this may simply be an indicator that I subscribe to low-brow lists.
:)

- Bob

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by jdow <jd...@earthlink.net>.

From: "Bob George" <ma...@ttlexceeded.com>

> Robert Menschel <Ro...@Menschel.net> wrote:
> > [...]
> > 1) Yes, sa-learn DOES deal with these emails, and does so
> > exceedingly well here. I call them "bayes fodder", since those random
> > words are teaching bayes that emails with those random words are spam.
>
> Just to avoid confusion, you're saying that AFTER TRAINING, bayes works
> quite well for those messages, right? The key is feeding any messages that
> DO slip through into sa-learn as spam UNTIL you get those results, no?
>
> The "random words" question seems to come up frequently, and TRAINED bayes
> seems to be a good answer.
>
> > 2) I then augment bayes with the following rules:
> > [...]
>
> "Add-on" rules do seem to help get bayes there quicker!

After watching the Bayes filter "learn" to auto white list spam when
first installed I disabled the auto white list feature and explicitly
generated lists of ham and spam. When the Bayes filter kicked in after
it had accumulated a couple hundred ham and spam messages the results
were dramatic. Before then it was somewhat discouraging. I do believe
I shall leave automatic learning and white listing turned off because
it seems to false entirely too often for my tastes. (The concept also
seems a little strange. If it already knows it's spam then train it
that the message is spam. I'd rather teach it with the new spam that
is not found than simply rack up higher scores by training it that
material it knows is spam is indeed spam. What am I missing here?)

{^_^}    Joanne

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Bob George <ma...@ttlexceeded.com>.

Robert Menschel <Ro...@Menschel.net> wrote:
> [...]
> 1) Yes, sa-learn DOES deal with these emails, and does so
> exceedingly well here. I call them "bayes fodder", since those random
> words are teaching bayes that emails with those random words are spam.

Just to avoid confusion, you're saying that AFTER TRAINING, bayes works quite
well for those messages, right? The key is feeding any messages that DO slip
through into sa-learn as spam UNTIL you get those results, no?

The "random words" question seems to come up frequently, and TRAINED bayes
seems to be a good answer.

> 2) I then augment bayes with the following rules:
> [...]

"Add-on" rules do seem to help get bayes there quicker!

- Bob

Re: SA Problem: spam with random words to defeat Baysian filtering ...

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Robert,

Wednesday, February 11, 2004, 8:05:38 AM, you wrote:

RSS> As indicated in the subject line, I'm getting negative hit rates on spam
RSS> which uses random dictionary words.  Obviously sa-learn cannot learn how
RSS> to deal with such an approach, and my formerly brilliant
RSS> sendmail/spamassassin configuration is now next to useless - as I'm
RSS> getting 200 - 300 spam's per day.

RSS> Can anyone point me to a solution or a counter-counter measure to kill
RSS> this damn spam??

1) Yes, sa-learn DOES deal with these emails, and does so exceedingly
well here. I call them "bayes fodder", since those random words are
teaching bayes that emails with those random words are spam.

2) I then augment bayes with the following rules:

# longwords -- possible sign of random words placed into spam to confuse anti-spam software
body     RM_bpt_longwords68a /\b(?:[a-z]{6,}\s+){8}/
describe RM_bpt_longwords68a Long string of long words
score    RM_bpt_longwords68a 1.500  # type=FP - 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list, 
                                    # "improving compatibility between computer platforms demands certain levels "
body     RM_bpt_longwords69a /\b(?:[a-z]{6,}\s+){9}/
describe RM_bpt_longwords69a Long string of long words
score    RM_bpt_longwords69a 1.000  # type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords78a /\b(?:[a-z]{7,}\s+){8}/
describe RM_bpt_longwords78a Long string of long words
score    RM_bpt_longwords78a 2.000 # type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords59a /\b(?:[a-z]{5,}\s+){9}/
describe RM_bpt_longwords59a Long string of long words
score    RM_bpt_longwords59a 1.500  # type=FP - 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords79a /\b(?:[a-z]{7,}\s+){9}/
describe RM_bpt_longwords79a Long string of long words
score    RM_bpt_longwords79a 1.000  # type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords96a /\b(?:[a-z]{9,}\s+){6}/
describe RM_bpt_longwords96a Long string of long words
score    RM_bpt_longwords96a 4.000  # 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords88a /\b(?:[a-z]{8,}\s+){8}/
describe RM_bpt_longwords88a Long string of long words
score    RM_bpt_longwords88a 4.000  # 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords89a /\b(?:[a-z]{8,}\s+){9}/
describe RM_bpt_longwords89a Long string of long words
score    RM_bpt_longwords89a 1.000  # type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords97 /\b(?:\w{9,}\s+){7}/
describe RM_bpt_longwords97 Long string of long words
score    RM_bpt_longwords97 3.000  # 545s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords98 /\b(?:\w{9,}\s+){8}/
describe RM_bpt_longwords98 Long string of long words
score    RM_bpt_longwords98 1.000  # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
describe RM_bpt_longwords99 Long string of long words
score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
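
For anyone who wants to try these patterns against sample text before loading
them into SpamAssassin, here is a small Python sketch. The three regexes are
copied from the rules above with the Perl delimiters removed; the helper name
`hits` and the sample strings are illustrative assumptions, and Python's `re`
is only an approximation of how SpamAssassin applies body rules to a rendered
message.

```python
import re

# Three of the rules above, translated to Python regex objects.
RULES = {
    "RM_bpt_longwords68a": re.compile(r"\b(?:[a-z]{6,}\s+){8}"),
    "RM_bpt_longwords96a": re.compile(r"\b(?:[a-z]{9,}\s+){6}"),
    "RM_bpt_longwords97": re.compile(r"\b(?:\w{9,}\s+){7}"),
}


def hits(text):
    """Return the sorted names of the rules whose pattern matches the text."""
    return sorted(name for name, rx in RULES.items() if rx.search(text))


# Made-up "random dictionary words" line: eight nine-letter words in a row.
random_words = ("marmalade zeppelins umbrellas harmonica labyrinth "
                "telescope pineapple butterfly end")

# The ham phrase quoted in the 68a comment above, trailing space included.
ham_phrase = ("improving compatibility between computer platforms "
              "demands certain levels ")

ordinary = "Please send me the notes from the last meeting when you get a chance."

random_hits = hits(random_words)
ham_hits = hits(ham_phrase)
ordinary_hits = hits(ordinary)
```

The made-up random-word line trips all three patterns, the quoted ham phrase
trips only 68a (consistent with it being listed as a false positive for that
rule), and the ordinary sentence trips none. Note that the "a" variants match
lowercase letters only, so a capitalised word breaks the run.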

Bob Menschel