Posted to users@spamassassin.apache.org by Mike Jackson <mj...@barking-dog.net> on 2005/08/15 21:33:31 UTC

test for multipart/alternative discrepancies?

I've been getting quite a few spams (which slipped past SA) in the last few 
minutes with subject lines like "dies in McDonalds", so I looked at the 
message source to see how they were scoring (which I've included below). In 
all the cases, the HTML content (at least as displayed in Outlook Express) 
was fairly consistent, but the plain text version looked like typical Bayes 
poisoning text.

Would it be possible to craft a rule that roughly compares the text/plain 
and HTML-stripped text/html versions of a message and scores against them if 
the words they contain are significantly different? Or is that 
technically infeasible?
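
The comparison asked about here can be sketched in a few lines of Python. This is only an illustration of the idea, not how any SpamAssassin rule is implemented: the helper names (`words`, `part_words`, `alt_similarity`) are hypothetical, "words" are assumed to be runs of three or more ASCII letters, and Jaccard similarity is an arbitrary choice of metric.

```python
import re
from email import message_from_string
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words(text):
    """Lower-cased set of alphabetic words of three or more letters (assumption)."""
    return set(re.findall(r"[a-z]{3,}", text.lower()))


def part_words(msg):
    """Return (plain_words, html_words) for a parsed message."""
    plain, html = set(), set()
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        # decode=True undoes quoted-printable / base64 transfer encoding
        payload = part.get_payload(decode=True) or b""
        text = payload.decode(part.get_content_charset() or "ascii", "replace")
        if ctype == "text/plain":
            plain |= words(text)
        else:
            extractor = _TextExtractor()
            extractor.feed(text)
            extractor.close()
            html |= words(" ".join(extractor.chunks))
    return plain, html


def alt_similarity(raw_message):
    """Jaccard similarity of the two word sets: 1.0 identical, 0.0 disjoint."""
    plain, html = part_words(message_from_string(raw_message))
    if not plain or not html:
        return 1.0  # nothing to compare, so do not penalise the message
    return len(plain & html) / len(plain | html)
```

A rule built on this would fire when `alt_similarity()` falls below some cutoff; any particular cutoff would have to be chosen by testing against real ham and spam.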




  Content-Type: text/plain;


Hello,
     5.  Kislovodsk:  Literally  `acid  waters,  a  popular resort  in  t=
he =
    `Thats wonderful! Koroviev  yelled. Somewhat stunned by his  =
chatter,that  one  could execute  such  a man.  There  had  been  no  =
execution!  Nocloser, youll see the details.midnight moon. A greenish =
kerchief of  night-light fell from the window-sillup still more ... She =
greedily began gulping down caviar.up to the footboard of an A tram =
waiting at a stop, brazenly elbow aside a     Here he applauded, but =
quite  alone, while a confident smile  played onthat might occur at the =
time of the execution in the city of Yershalaim,  sospeaking, I had =
nothing more to do, and I lived from one meeting with her toPetrakovs. =
Placing his bulging briefcase on the table, Boba  immediately =
putposts?[6]horizon. He did not rejoice in the staggeringly beautiful =
view  which openedpaying or free, but even changes countenance at any =
theatrical conversation.what she was going to tell the neighbours the =
next day.phrase:

#########

  Content-Type: text/html;

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML><HEAD>
<META http-equiv=3DContent-Type content=3D"text/html; charset=3Dus-ascii">
<META content=3D"MSHTML 6.00.2800.1106" name=3DGENERATOR>
<STYLE></STYLE>
</HEAD>
<BODY bgColor=3D#ffffff>
<DIV><FONT face=3DArial></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial>A court has sentenced a man to life in jail for the 
=
=

bombing of a McDonald's restaurant, which left three people =
dead.</FONT></DIV>
<DIV><FONT face=3DArial></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial>The man, Agung Abdul Hamid, was found guilty of =
financing
and co-ordinating the attack.</FONT></DIV>
<DIV><FONT face=3DArial></FONT>&nbsp;</DIV>
<DIV><FONT face=3DArial><A href=3D"http://www.ildhd.lastrez.com">Read full =
=
story.</A></FONT></DIV>
<DIV>&nbsp;</DIV></BODY></HTML>

Re: test for multipart/alternative discrepancies?

Posted by Matt Kettler <mk...@evi-inc.com>.

Mike Jackson wrote:
> I've been getting quite a few spams (which slipped past SA) in the last
> few minutes with subject lines like "dies in McDonalds", so I looked at
> the message source to see how they were scoring (which I've included
> below). In all the cases, the HTML content (at least as displayed in
> Outlook Express) was fairly consistent, but the plain text version
> looked like typical Bayes poisoning text.
> 

Really, I'd be looking into why the messages got past SA. Did it get a decent
BAYES_ score? The bayes "poison" really shouldn't be a problem.

The use of chi-squared combining makes bayes poisoning pretty ineffective as
long as you're training your bayes often and training well.

And by "training well" I specifically mean you must train spam messages
containing "poison" as spam. If you're avoiding training "poison", then you
yourself are making that poison effective.

(Bayes can only be as accurate as its training. If it's not getting realistic
training, it won't do well with realistic mail.)
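
To illustrate why chi-squared combining blunts poison text, here is a minimal sketch of the Robinson/Fisher combining scheme as it is commonly described for Bayesian spam filters. The function names are hypothetical and SpamAssassin's own implementation may differ in detail. The point is that a pile of "hammy" tokens next to equally strong "spammy" tokens pulls the result toward 0.5 rather than toward ham.

```python
import math


def chi2q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees of
    freedom exceeds x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)  # underflows for very large m; fine for a sketch
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_combine(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # strength of the spam evidence
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # strength of the ham evidence
    return (s - h + 1.0) / 2.0


spammy = [0.99] * 10   # tokens learned from spam
poison = [0.01] * 10   # tokens that look like ham
print(chi2_combine(spammy))           # close to 1
print(chi2_combine(spammy + poison))  # close to 0.5, not close to 0
```

Once messages containing the poison have been trained as spam, the poison tokens stop looking hammy and the combined score moves back toward 1.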

Re: test for multipart/alternative discrepancies?

Posted by Theo Van Dinter <fe...@apache.org>.

On Mon, Aug 15, 2005 at 07:04:36PM -0700, Loren Wilton wrote:
> I just want a rule that checks the text/plain part for zero uris and the
> html part for > 0 uris.  That would catch 99+% of this trash without trying
> very hard.

FWIW, I put in a test rule for this:

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  21255    18255     3000    0.859   0.00    0.00  (all messages)
100.000  85.8857  14.1143    0.859   0.00    0.00  (all messages as %)
 21.938  25.5327   0.0667    0.997   0.00    0.01  T_URI_HTML_ONLY

nice. :)
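
For readers unfamiliar with the table above: the S/O column appears to be the spam hit rate divided by the sum of the spam and ham hit rates. That reading is an assumption, but it reproduces both figures shown. The helper names below are hypothetical.

```python
def s_over_o(spam_pct, ham_pct):
    """Spam hit rate as a fraction of the combined spam and ham hit rates."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


def hit_count(pct, corpus_size):
    """Approximate number of messages behind a percentage in the table."""
    return round(corpus_size * pct / 100.0)


# Figures taken from the table: 18255 spam and 3000 ham messages.
print(round(s_over_o(85.8857, 14.1143), 3))  # the "(all messages as %)" row
print(round(s_over_o(25.5327, 0.0667), 3))   # the T_URI_HTML_ONLY row
print(hit_count(25.5327, 18255), hit_count(0.0667, 3000))
```

Read that way, the test rule hit roughly a quarter of the spam corpus and only a couple of ham messages.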

-- 
Randomly Generated Tagline:
There are no threads in a.b.p.erotica,  so there's no  gain in using a
 threaded news reader.
 (Unknown source)

Re: test for multipart/alternative discrepancies?

Posted by Loren Wilton <lw...@earthlink.net>.

> Would it be possible to craft a rule that roughly compares the text/plain
> and HTML-stripped text/html versions of a message and scored against them
> if the words they contained were significantly different? Or is that
> technically infeasible?

I just want a rule that checks the text/plain part for zero uris and the
html part for > 0 uris.  That would catch 99+% of this trash without trying
very hard.

        Loren
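
A rough sketch of the check described above, in Python. It is not the T_URI_HTML_ONLY test rule, whose definition does not appear in this thread. The names `URI_RE` and `uri_html_only` are hypothetical, and the URI pattern only looks for `http://` and `https://` links, which is a simplifying assumption.

```python
import re
from email import message_from_string

# Simplified URI pattern (assumption): http/https only, ends at whitespace,
# quotes or angle brackets.
URI_RE = re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE)


def uri_html_only(raw_message):
    """True if the message has a text/plain part with no URIs and at least
    one URI in its text/html part(s)."""
    plain_seen = False
    plain_uris = html_uris = 0
    for part in message_from_string(raw_message).walk():
        ctype = part.get_content_type()
        if ctype not in ("text/plain", "text/html"):
            continue
        # decode=True undoes quoted-printable, so href=3D"..." becomes href="..."
        payload = part.get_payload(decode=True) or b""
        text = payload.decode(part.get_content_charset() or "ascii", "replace")
        found = len(URI_RE.findall(text))
        if ctype == "text/plain":
            plain_seen = True
            plain_uris += found
        else:
            html_uris += found
    return plain_seen and plain_uris == 0 and html_uris > 0
```

Note that a URL in an HTML DOCTYPE or namespace declaration would also count as a URI here, so a real rule would probably want to look only at link targets.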

Re: test for multipart/alternative discrepancies?

Posted by Mike Jackson <mj...@barking-dog.net>.

> On Mon, Aug 15, 2005 at 12:33:31PM -0700, Mike Jackson wrote:
> > Would it be possible to craft a rule that roughly compares the text/plain
> > and HTML-stripped text/html versions of a message and scored against them
> > if the words they contained were significantly different? Or is that
> > technically infeasible?
>
> You mean MPART_ALT_DIFF ?  ;)

Well blow me down  :) Strange that I didn't see that rule hit on this 
message though.

Re: test for multipart/alternative discrepancies?

Posted by Theo Van Dinter <fe...@apache.org>.

On Mon, Aug 15, 2005 at 12:33:31PM -0700, Mike Jackson wrote:
> Would it be possible to craft a rule that roughly compares the text/plain 
> and HTML-stripped text/html versions of a message and scored against them 
> if the words they contained were significantly different? Or is that 
> technically infeasible?

You mean MPART_ALT_DIFF ?  ;)

-- 
Randomly Generated Tagline:
"You're not significant until someone complains about you publically."
                 - Theo Van Dinter

RE: test for multipart/alternative discrepancies?

Posted by Herb Martin <He...@learnquick.com>.

> -----Original Message-----
> From: Mike Jackson [mailto:mjackson@barking-dog.net] 
> Sent: Monday, August 15, 2005 2:34 PM
> To: users@spamassassin.apache.org
> Subject: test for multipart/alternative discrepancies?
> 
> I've been getting quite a few spams (which slipped past SA) 
> in the last few minutes with subject lines like "dies in 
> McDonalds", so I looked at the message source to see how they 
> were scoring (which I've included below). In all the cases, 
> the HTML content (at least as displayed in Outlook Express) 
> was fairly consistent, but the plain text version looked like 
> typical Bayes poisoning text.
> 
> Would it be possible to craft a rule that roughly compares 
> the text/plain and HTML-stripped text/html versions of a 
> message and scored against them if the words they contained 
> were significantly different? Or is that technically infeasible?

Found one in my trap -- SpamAssassin (3.10rc1) with lots of SARE rules
and many network tests scored it:  29.2

Bayes only scored it at 50% which was good for only 0.9 points.

Content analysis details: (29.2 points, 6.0 required)
 pts rule name                 description
---- ------------------------- --------------------------------------------------
 1.1 SPF_FAIL                  SPF: sender does not match SPF record (fail)
                               [SPF failed: Please see
                               http://spf.pobox.com/why.html?sender=tiffiny%40karta.com%3E%0Atiffiny%40karta.com&ip=58.51.205.72&receiver=www.LearnQuick.Com]
 3.5 SPF_HELO_FAIL             SPF: HELO does not match SPF record (fail)
                               [SPF failed: Please see
                               http://spf.pobox.com/why.html?sender=karta.com&ip=58.51.205.72&receiver=www.LearnQuick.Com]
 0.7 MPART_ALT_DIFF_COUNT      BODY: HTML and text parts are different
 1.0 HTML_MESSAGE              BODY: HTML included in message
 0.9 BAYES_50                  BODY: Bayesian spam probability is 40 to 60%
                               [score: 0.4999]
 0.7 Y_SILLY_SALUTATION        RAW: Foobar,+ salutation
 1.5 RAZOR2_CF_RANGE_E8_51_100 Razor2 gives engine 8 confidence level above 50%
                               [cf: 100]
 0.5 RAZOR2_CHECK              Listed in Razor2 (http://razor.sf.net/)
 1.5 RAZOR2_CF_RANGE_E4_51_100 Razor2 gives engine 4 confidence level above 50%
                               [cf: 100]
 2.0 RAZOR2_CF_RANGE_51_100    Razor2 gives confidence level above 50%
                               [cf: 100]
 3.7 PYZOR_CHECK               Listed in Pyzor (http://pyzor.sf.net/)
 1.5 NO_DNS_FOR_FROM           DNS: Envelope sender has no MX or A DNS records
 1.6 URIBL_SBL                 Contains an URL listed in the SBL blocklist
                               [URIs: lastrez.com]
 2.5 URIBL_BLACK               Contains an URL listed in the URIBL blacklist
                               [URIs: lastrez.com]
 4.5 URIBL_SC2_SURBL           Has URI in SC2 at http://www.surbl.org/lists.html
                               [URIs: lastrez.com]
 1.0 DIGEST_MULTIPLE           Message hits more than one network digest check
 0.9 FM_NO_STYLE               FM_NO_STYLE

Subject: ***** SPAM *****_29.2 McDonÂld's bomber jailed

--
Herb Martin