Posted to users@spamassassin.apache.org by Adam Lanier <ad...@krusty.madoff.com> on 2004/12/09 22:04:28 UTC

Soliciting advice from the list members

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

My management has recently asked me how SpamAssassin is prepared to deal
with a number of recent trends in spam technology.  This was prompted by
a recent seminar they attended regarding spam (provided by an anti-spam
vendor who shall remain nameless).

None of these so-called recent spam trends are new to me or probably to
anyone who deals with spam on a daily basis.  However, while drafting my
reply I had the thought that perhaps my answers would carry more weight
if I could include some quotes from other people in the industry
regarding SA's ability to handle spam utilizing these techniques.  I've
done some cursory browsing through the list archives but thought I might
solicit some fresh input from the list-members.

These are the recent trends raised by my management:

Hash Busting - slightly modify each copy of message to foil
'fingerprinting' techniques

Bayes Poisoning - addition of random dictionary words

Hidden Text - using invisible text in html messages

Keyword Corruption - using obfuscated text to hide keywords

Tiny Messages - messages with only URL or image

I'd appreciate any comments on how SA handles these types of spamming
nastiness.

Thanks,

- --
Adam Lanier
Bernard L. Madoff Investment Securities LLC
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.6 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBuL3cyiJGTGcJL0cRAv3kAJ0f3ldqM0H6noaaSJ+H4z7EdyY4ewCcDnIT
7wNUFp4StciPxLaWzUIN9bs=
=Nnn/
-----END PGP SIGNATURE-----

Re: Soliciting advice from the list members

Posted by Michael Barnes <mb...@compsci.wm.edu>.
Adam,

I'm sure everyone else who replies will say basically the same thing,
but here's my input on SA and your management's questions.

> Hash Busting - slightly modify each copy of message to foil
> 'fingerprinting' techniques

AFAIK, the fingerprinting techniques are "fuzzy" and can withstand a
little bit of abuse.

> Bayes Poisoning - addition of random dictionary words

My only experience with Bayes poisoning has been from this list :)  By
that, I mean that mail on this list talking about spam got learned, and
the db got almost reversed.  I'll talk more about this later.

> Hidden Text - using invisible text in html messages

SA has specific rules for this.

> Keyword Corruption - using obfuscated text to hide keywords

SA has specific rules for this.

> Tiny Messages - messages with only URL or image

SA has specific rules for this.
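
To give a flavor of what such rules look like, here is a rough local.cf
sketch in the spirit of the stock hidden-text and obfuscation rules.
The rule names, patterns, and scores below are mine, purely for
illustration; they are not the distributed ruleset.

  # White-on-white (or near-invisible) font in an HTML part
  rawbody  LOCAL_WHITE_FONT  /<font[^>]+color=["']?#?ffffff/i
  describe LOCAL_WHITE_FONT  HTML font colored to match a white background
  score    LOCAL_WHITE_FONT  1.5

  # Obfuscated keyword, e.g. "v1agra", "vi@gra"
  body     LOCAL_OBFU_DRUG   /\bv[i1!|l][a@4]gr+[a@4]/i
  describe LOCAL_OBFU_DRUG   Obfuscated spelling of a common spam keyword
  score    LOCAL_OBFU_DRUG   2.0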


What I like about SA is that no single rule or small subset of rules
will, by itself, trigger a mail to be labeled as spam.

That held even when my Bayes db got poisoned by this list.  I was
experimenting with tagging low spam scores with '*** LIKELY SPAM ***' in
the subject because my anal-retentive users would complain very loudly
about any false positive.  What irritated me about those complaints is
that every mail labeled this way scored only barely as spam, was
_solicited_ bulk email, looked like spam to me, and used many of the
same tools and tricks that real spammers use.

What I do now is set my threshold score high (10) and use custom spam
and ham rules, as well as a 3rd-party plugin, to raise scores.  My
average spam score is 20 or above.  I don't have hard data, but the
number of missed mails is very low: fewer than 10 since SA 3.0 came
out.  I have had 0 false positives for my own mailbox; the 1st false
positive for another user came today, from a mail that was very
borderline, and the user would not have missed it had it not been
delivered.
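
If anyone wants to copy that setup, the relevant pieces are ordinary
local.cf settings, roughly like this (the example rule is a placeholder
to show the idea, not one of my actual rules):

  # Raise the spam threshold from the default 5.0
  required_score  10.0

  # A local rule used to push obvious spam well past the threshold;
  # the pattern and score are illustrative
  header   LOCAL_SUBJ_MORTGAGE  Subject =~ /\b(refinanc\w+|lowest rates?)\b/i
  describe LOCAL_SUBJ_MORTGAGE  Subject line typical of mortgage spam
  score    LOCAL_SUBJ_MORTGAGE  3.0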

My only issue with SA is that it does not appear to scale very well.  I
have not experienced this problem personally, because the domain I run
SA on does not have very high mail traffic, but it does appear to be an
issue, and there are workarounds, such as skipping some tests, at the
cost of some filtering accuracy.
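
As I understand it, the usual workarounds are things like disabling the
network tests or running spamd with local tests only and a bounded
number of children, for example:

  # local.cf: skip DNSBL lookups (cheaper, but loses the network rules)
  skip_rbl_checks  1

  # or start spamd with local tests only and a cap on worker processes
  spamd -L --max-children=5 -d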

OK, another issue, but a small one (I'm pretty picky), is that the
scores for some of the rules do not always seem right: high scores for
things that seem pretty benign, and low scores for things that look
almost exclusively like spam (such as forged headers or mismatched IPs).
I know these scores are generated objectively from a corpus of ham and
spam by a scoring algorithm, but to me some of them just seem wrong.
Maybe scores and rules could be auto-learned like Bayes.  Not sure.
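
At least the scores can be overridden locally without touching the
shipped rules; a score line in local.cf takes precedence over the
distributed one.  For example (the rule names exist in the stock set,
but the values here are made up):

  # Bump a rule you trust, damp one you find too aggressive
  score BAYES_99      4.0
  score HTML_MESSAGE  0.001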

That's my input for your managers.

Mike

-- 
/-----------------------------------------\
| Michael Barnes <mb...@compsci.wm.edu> |
| UNIX Systems Administrator              |
| College of William and Mary             |
| Phone: (757) 879-3930                   |
\-----------------------------------------/

Re: Soliciting advice from the list members

Posted by Rob Kudyba <rk...@raeinternet.com>.
Kris Deugau wrote:

>Rob Kudyba wrote:
>
>>Well there is a company that sprouted from Vipul's Razor that uses
>>the concept of collaborative filtering and adaptive learning from
>>over a million trusted users, in a type of-- if you will--"sp@m
>>net"--(ah hem) and I don't believe this 'net is incorporated into SA?
>
>That would be Cloudmark - which was co-founded by the guy that wrote
>Vipul's Razor in the first place.  <g>  (See the bottom two paragraphs
>on http://www.cloudmark.com/company/.)
>
>Their business model is (IIRC) based on selling licences for an Outlook
>plugin (and other similar largely Windows-only bits) that connect to the
>Razor servers.

That is only a part of their model now, and their desktop product, now 
called SafetyBar, connects to their own proprietary "spam net".  Sorry 
for the plug, but you can check out their mail-server-based products at 
http://raeinternet.com/cloud/, which include Cloudmark Authority; our 
MPP product now integrates it, in addition to working with SA.  I'm not 
trying to disparage SA by any means, as our company and customers use 
it, but I was simply responding to the comments that other commercial 
efforts are either using SA or have equivalent functionality to SA.

FWIW, here's a customer "testimonial" that just came in, which is 
slightly OT as it relates to performance more than functionality and is 
a bit self-serving:

"I switched from MPP/SpamAssassin to MPP/Cloudmark last night.  We have 
quarantined about 35k messages so far, and I am having a hard time 
finding a false positive among them.  We are still searching.  I got one 
false negative since 2am this morning.
But here is the important info as far we are concerned:

2x2 CGPro cluster on Debian Sarge (each frontend is a 2x Xeon w/ 4GB 
RAM) ~75k mailboxes

MPP/Spamd/Clamd (daily figures)
Average load avg: 3.53 (daily max about 10.00+)
Average CPU: 60% system; 18% user (93% max system; 31% max user)

MPP/Cloudmark/Clamd
Average load avg: 0.33 (current max about 1.42 - our morning peak)
Average CPU: 2.70% system; 19.14% user (note: every hour we run log 
parsers, this causes a spike of up to 80%)

I have to say I am *very* impressed.  I will report back if I encounter 
anything strange.  The transition was incredibly easy as well.   I have 
a couple of folks doing some localized spam detection.  I am going to 
see if they are noticing any increase, decrease, or nothing with their 
detection."

Cordially,

Rob
Admin for http://www.raeinternet.com/ & http://www.raeantivirus.com/

Re: Soliciting advice from the list members

Posted by Kris Deugau <kd...@vianet.ca>.
Rob Kudyba wrote:
> Well there is a company that sprouted from Vipul's Razor that uses
> the concept of collaborative filtering and adaptive learning from
> over a million trusted users, in a type of-- if you will--"sp@m
> net"--(ah hem) and I don't believe this 'net is incorporated into SA?

That would be Cloudmark - which was co-founded by the guy that wrote
Vipul's Razor in the first place.  <g>  (See the bottom two paragraphs
on http://www.cloudmark.com/company/.)

Their business model is (IIRC) based on selling licences for an Outlook
plugin (and other similar largely Windows-only bits) that connect to the
Razor servers.

-kgd
-- 
Get your mouse off of there!  You don't know where that email has been!

Re: Soliciting advice from the list members

Posted by Rob Kudyba <rk...@raeinternet.com>.
>And the key point is that if commercial anti-spam vendors
>are using points such as these to either differentiate
>themselves from other commercial efforts or SpamAssassin,
>they're either:
>
>A) Using SpamAssassin underneath
>B) Inferior to SpamAssassin
>C) Equivalent functionally to SpamAssassin
>
>Jeff C.

Well there is a company that sprouted from Vipul's Razor that uses the 
concept of collaborative filtering and adaptive learning from over a 
million trusted users, in a type of-- if you will--"sp@m net"--(ah hem) 
and I don't believe this 'net is incorporated into SA?

Rob

Admin for http://www.raeantivirus.com/ & http://www.raeinternet.com/


Re: Soliciting advice from the list members

Posted by Jeff Chan <je...@surbl.org>.
On Thursday, December 9, 2004, 8:28:47 PM, Loren Wilton wrote:
>> These are the recent trends raised by my management:
>>
>> Hash Busting - slightly modify each copy of message to foil
>> 'fingerprinting' techniques

> Since SA doesn't do fingerprinting, this doesn't have quite the desired
> effect.
> It can break a meta rule looking for particular text, but the quick answer
> is generally to modify the meta with a new term or two.  After about three
> tries you end up with a rule that will catch almost all the variations.
> And in any case Bayes doesn't care much about minor variations; it's all
> spam to it.


>> Bayes Poisoning - addition of random dictionary words

> Makes really GOOD spam identification.  I hardly ever send or receive email
> containing a page of Cicero in the original Latin.  Spammers do.

> In addition, most of these things are mispunctuated and often have other
> interesting characteristics that make them fodder for some pretty generic
> rules.  I have mine scored at 10 points each.  It isn't unusual to manage
> to hit 80 points with one of these things.


>> Hidden Text - using invisible text in html messages

> Makes really GOOD spam identification.    :-)


>> Keyword Corruption - using obfuscated text to hide keywords

> Makes really GOOD spam identification.


>> Tiny Messages - messages with only URL or image

> These are harder.  Fortunately they almost always follow one of a handful of
> standard patterns that make them amenable to catching with fairly simple
> rules.

And the key point is that if commercial anti-spam vendors
are using points such as these to either differentiate
themselves from other commercial efforts or SpamAssassin,
they're either:

A) Using SpamAssassin underneath
B) Inferior to SpamAssassin
C) Equivalent functionally to SpamAssassin

Jeff C.
-- 
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/


Re: Soliciting advice from the list members

Posted by Loren Wilton <lw...@earthlink.net>.
> These are the recent trends raised by my management:
>
> Hash Busting - slightly modify each copy of message to foil
> 'fingerprinting' techniques

Since SA doesn't do fingerprinting, this doesn't have quite the desired
effect.
It can break a meta rule looking for particular text, but the quick answer
is generally to modify the meta with a new term or two.  After about three
tries you end up with a rule that will catch almost all the variations.
And in any case Bayes doesn't care much about minor variations; it's all
spam to it.
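
To make "modify the meta with a new term or two" concrete, here is a
sketch; the rule names and phrases are invented for the example, and you
simply add another __LOCAL_PILL_* sub-rule as new variations show up:

  # sub-rules (the leading __ means they score nothing by themselves)
  body  __LOCAL_PILL_A  /cheap m[e3]ds/i
  body  __LOCAL_PILL_B  /no prescription (needed|required)/i
  body  __LOCAL_PILL_C  /overnight shipping/i

  # the meta fires if any two of the three phrasings appear
  meta     LOCAL_PILL_COMBO  ((__LOCAL_PILL_A + __LOCAL_PILL_B + __LOCAL_PILL_C) > 1)
  describe LOCAL_PILL_COMBO  Two or more pill-spam phrasings in one message
  score    LOCAL_PILL_COMBO  2.5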


> Bayes Poisoning - addition of random dictionary words

Makes really GOOD spam identification.  I hardly ever send or receive email
containing a page of Cicero in the original Latin.  Spammers do.

In addition, most of these things are mispunctuated and often have other
interesting characteristics that make them fodder for some pretty generic
rules.  I have mine scored at 10 points each.  It isn't unusual to manage
to hit 80 points with one of these things.
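
These aren't my actual rules, but to make the idea concrete, a generic
rule for a long run of unpunctuated random words might look roughly like:

  # 40+ consecutive lowercase words with no punctuation in between,
  # typical of random-word "Bayes poison" padding
  body     LOCAL_RANDOM_WORD_PAD  /(?:\b[a-z]{2,12}\b ){40,}/
  describe LOCAL_RANDOM_WORD_PAD  Long run of unpunctuated random words
  score    LOCAL_RANDOM_WORD_PAD  10.0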


> Hidden Text - using invisible text in html messages

Makes really GOOD spam identification.    :-)


> Keyword Corruption - using obfuscated text to hide keywords

Makes really GOOD spam identification.


> Tiny Messages - messages with only URL or image

These are harder.  Fortunately they almost always follow one of a handful of
standard patterns that make them amenable to catching with fairly simple
rules.
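
As an example of one of those standard patterns: a message whose HTML
part is nothing but a clickable image can be caught with a rawbody rule
along these lines (an illustration, not a stock rule):

  # An <a href=...> wrapping an <img>, with nothing but whitespace
  # between the tags
  rawbody  LOCAL_LONE_IMG_LINK  /<a[^>]+href=[^>]+>\s*<img[^>]+>\s*<\/a>/i
  describe LOCAL_LONE_IMG_LINK  Clickable image with no text inside the link
  score    LOCAL_LONE_IMG_LINK  1.5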


        Loren


Re: Soliciting advice from the list members

Posted by Matthew Romanek <sh...@gmail.com>.
On Fri, 10 Dec 2004 11:02:03 -0500 (EST), JP <sa...@b-dub.org> wrote:
> > Report Title     : SpamAssassin - Spam Statistics
> > Report Date      : 2004-12-10
> > Period Beginning : Thu 09 Dec 2004 06:00:00 AM PST
> > Period Ending    : Fri 10 Dec 2004 06:00:00 AM PST
> >
> > Reporting Period : 24.00 hrs
> > --------------------------------------------------
> 
> I would love to know how this report was generated.
> 
> Thanks,

Generally speaking, it was sa-stats.pl found in the tools
directory of the SA 3.0.1 tarball.

Specifically speaking:
sa-stats.pl -s 2004-12-09-06 -e 2004-12-10-06

Usually I do -s 'yesterday midnight', but I wanted more recent data
than that.  It also does HTML output, but honestly I like the plain
text version better.


-- 
Matthew 'Shandower' Romanek
IDS Analyst

Re: Soliciting advice from the list members

Posted by JP <sa...@b-dub.org>.
> Report Title     : SpamAssassin - Spam Statistics
> Report Date      : 2004-12-10
> Period Beginning : Thu 09 Dec 2004 06:00:00 AM PST
> Period Ending    : Fri 10 Dec 2004 06:00:00 AM PST
>
> Reporting Period : 24.00 hrs
> --------------------------------------------------

I would love to know how this report was generated.

Thanks,
JP

Re: Soliciting advice from the list members

Posted by Matthew Romanek <sh...@gmail.com>.
On Thu, 09 Dec 2004 16:04:28 -0500, Adam Lanier <ad...@krusty.madoff.com> wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> My management has recently asked me how SpamAssassin is prepared to deal
> with a number of recent trends in spam technology.  This was prompted by
> a recent seminar they attended regarding spam (provided by an anti-spam
> vendor who shall remain nameless).

Without answering a specific question, I will just provide you with live
data for a single user: me.  As I just got URIBLs working, I'm still
monitoring all the spam, so I can say that of the 2542 spam messages
that came in over the last 24 hours there were 0 false positives and
only 3 false negatives (all of which got through because of the AWL,
which had suffered a lot of bad training; it's learning, though).
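
(For anyone who hasn't turned those on yet: in SA 3.0 the URI checks
come from the URIDNSBL plugin, and the configuration looks roughly like
the sketch below.  The zone and bitmask are from memory and purely
illustrative; check the shipped 25_uribl.cf and the SURBL docs for the
real values.)

  # init.pre (shipped with SA 3.0)
  loadplugin Mail::SpamAssassin::Plugin::URIDNSBL

  # a SURBL lookup in the 25_uribl.cf style
  urirhssub  URIBL_WS_SURBL  multi.surbl.org.  A  4
  body       URIBL_WS_SURBL  eval:check_uridnsbl('URIBL_WS_SURBL')
  describe   URIBL_WS_SURBL  URL in message body listed in the WS SURBL list
  tflags     URIBL_WS_SURBL  net
  score      URIBL_WS_SURBL  1.5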

I don't know about these amazing new spammer tricks (that's irony, by
the way), but SA sure does work against what's actually out in the
wild.

Report Title     : SpamAssassin - Spam Statistics
Report Date      : 2004-12-10
Period Beginning : Thu 09 Dec 2004 06:00:00 AM PST
Period Ending    : Fri 10 Dec 2004 06:00:00 AM PST

Reporting Period : 24.00 hrs
--------------------------------------------------

Note: 'ham' = 'nonspam'

Total spam detected    :     2542 (  96.84%)
Total ham accepted     :       83 (   3.16%)
                        -------------------
Total emails processed :     2625 (  109/hr)

Average spam threshold :        5.00
Average spam score     :       28.26
Average ham score      :        1.83

Spam kbytes processed  :     9816   (  409 kb/hr)
Ham kbytes processed   :      487   (   20 kb/hr)
Total kbytes processed :    10303   (  429 kb/hr)

Spam analysis time     :    13224 s (  551 s/hr)
Ham analysis time      :      280 s (   12 s/hr)
Total analysis time    :    13505 s (  563 s/hr)


Statistics by Hour
----------------------------------------------------
Hour                          Spam               Ham
-------------    -----------------    --------------
2004-12-09 06           107 (100%)          0 (  0%)
2004-12-09 07            54 ( 96%)          2 (  3%)
2004-12-09 08            58 ( 96%)          2 (  3%)
2004-12-09 09            73 ( 97%)          2 (  2%)
2004-12-09 10           191 ( 99%)          1 (  0%)
2004-12-09 11           130 ( 97%)          4 (  2%)
2004-12-09 12            64 ( 84%)         12 ( 15%)
2004-12-09 13           234 ( 99%)          2 (  0%)
2004-12-09 14           154 ( 98%)          3 (  1%)
2004-12-09 15           173 ( 90%)         19 (  9%)
2004-12-09 16           101 ( 97%)          3 (  2%)
2004-12-09 17            79 ( 97%)          2 (  2%)
2004-12-09 18           205 ( 93%)         14 (  6%)
2004-12-09 19           228 ( 98%)          3 (  1%)
2004-12-09 20            69 ( 98%)          1 (  1%)
2004-12-09 21           109 ( 98%)          2 (  1%)
2004-12-09 22            55 ( 98%)          1 (  1%)
2004-12-09 23            57 ( 98%)          1 (  1%)
2004-12-10 00            59 (100%)          0 (  0%)
2004-12-10 01            58 ( 98%)          1 (  1%)
2004-12-10 02            66 ( 98%)          1 (  1%)
2004-12-10 03            59 ( 98%)          1 (  1%)
2004-12-10 04            90 ( 97%)          2 (  2%)
2004-12-10 05            69 ( 94%)          4 (  5%)

-- 
Matthew 'Shandower' Romanek
IDS Analyst