Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2004/04/19 23:02:11 UTC

followup comparison of 2.6x vs. 3.0

Bear in mind these comparisons only work for rules with names that
haven't changed.

Here's the current list of the 10 rules with the largest drop in SPAM%.
The worst problems in the MIME rules have been fixed.

  2.155   4.2803   0.0267    0.994   0.92    0.53  HTML_IMAGE_ONLY_08:1
  1.968   3.9136   0.0200    0.995   0.92    1.90  HTML_IMAGE_ONLY_08:2

ok

 46.326  81.8188  10.7877    0.884   0.75    0.16  HTML_MESSAGE:1
 44.571  80.8787   8.2176    0.908   0.81    0.16  HTML_MESSAGE:2

Okay, I also checked each of the spams that we used to treat as HTML but
no longer do: in a suitably loose MUA, none of them rendered as HTML.
There were also 11 new hits, and they do render as HTML in said MUA, so
it appears that our MIME parsing and HTML detection are working well.

  0.607   1.1867   0.0267    0.978   0.87    0.58  HTML_TABLE_THICK_BORD:1
  0.470   0.9134   0.0267    0.972   0.85    0.58  HTML_TABLE_THICK_BORD:2

a bit disconcerting

  7.108  14.0076   0.2003    0.986   0.91    0.35  HTML_TAG_BALANCE_BODY:1
  5.427  10.8207   0.0267    0.998   0.94    0.35  HTML_TAG_BALANCE_BODY:2

good

  3.112   5.9604   0.2603    0.958   0.83    0.67  HTML_TAG_BALANCE_HTML:1
  1.691   3.3669   0.0134    0.996   0.92    0.67  HTML_TAG_BALANCE_HTML:2

good

  0.380   0.7601   0.0000    1.000   0.93    0.45  LOTS_OF_STUFF:1
  0.043   0.0867   0.0000    1.000   0.93    0.45  LOTS_OF_STUFF:2

bad

  3.756   7.3005   0.2069    0.972   0.86    1.00  MSGID_DOLLARS:1
  3.072   6.1404   0.0000    1.000   0.94    1.00  MSGID_DOLLARS:2

good

  0.510   0.9801   0.0401    0.961   0.82    0.49  TO_HAS_SPACES:1
  0.160   0.2934   0.0267    0.917   0.72    0.49  TO_HAS_SPACES:2

bad

  1.024   2.0401   0.0067    0.997   0.92    2.53  TRACKER_ID:1
  0.814   1.6201   0.0067    0.996   0.92    2.53  TRACKER_ID:2

bad

  2.125   3.7136   0.5340    0.874   0.63    0.69  UPPERCASE_25_50:1
  1.865   3.1602   0.5674    0.848   0.57    0.69  UPPERCASE_25_50:2

Probably okay, I suspect it's just the removal of "URI:" from the
rendered body.
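
For anyone who wants to reproduce this kind of ranking, here is a
minimal Python sketch. It rests on assumptions not stated in this
message: that the columns are OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE
and NAME, that ":1" marks the 2.6x run and ":2" the 3.0 run, and that
"largest drop" means the absolute drop in SPAM% (percentage points).
The function names are made up for the illustration.

```python
from collections import defaultdict

# A few rows copied from the listing above.
# Assumed column order: OVERALL%  SPAM%  HAM%  S/O  RANK  SCORE  NAME:set
ROWS = """\
  2.155   4.2803   0.0267    0.994   0.92    0.53  HTML_IMAGE_ONLY_08:1
  1.968   3.9136   0.0200    0.995   0.92    1.90  HTML_IMAGE_ONLY_08:2
  0.380   0.7601   0.0000    1.000   0.93    0.45  LOTS_OF_STUFF:1
  0.043   0.0867   0.0000    1.000   0.93    0.45  LOTS_OF_STUFF:2
  0.510   0.9801   0.0401    0.961   0.82    0.49  TO_HAS_SPACES:1
  0.160   0.2934   0.0267    0.917   0.72    0.49  TO_HAS_SPACES:2
"""


def parse_rows(text):
    """Return {rule_name: {set_id: spam_percent}} for every data row."""
    table = defaultdict(dict)
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 7:
            continue  # skip blank lines and commentary
        name, _, set_id = fields[6].partition(":")
        table[name][set_id] = float(fields[1])  # second column: SPAM%
    return table


def spam_drops(table):
    """List (rule, drop in SPAM% points from set 1 to set 2), largest first."""
    drops = [
        (name, sets["1"] - sets["2"])
        for name, sets in table.items()
        if "1" in sets and "2" in sets
    ]
    return sorted(drops, key=lambda item: item[1], reverse=True)


for name, drop in spam_drops(parse_rows(ROWS)):
    print(f"{name:24s} {drop:7.4f}")
```

Rules whose names changed between the two runs have only one of the two
sets and are skipped, which matches the caveat at the top of this
message.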

I then looked again at the largest drops in RANK from 2.6x to 3.0
(ignoring ones with tiny 2.6x SPAM% numbers).

Theo, I still think these are buglets:

  0.077   0.1533   0.0000    1.000   0.93    0.43  MIME_BASE64_ILLEGAL:1
  0.127   0.1533   0.1001    0.605   0.21    0.43  MIME_BASE64_ILLEGAL:2

Mailing-list signatures after the end of the data.  I think this rule
should ignore illegal data at the end of the message if it's within 4
lines of a line beginning with "--" or "__".
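
A rough Python sketch of that proposal, for discussion only. This is not
the SpamAssassin code (which is Perl); the character-class test for
"legal base64" is a crude stand-in, and the helper names are invented
here. Only the "within 4 lines" window and the "--" / "__" markers come
from the paragraph above.

```python
import re

# Crude legality test: base64 alphabet plus up to two '=' pad characters.
B64_LINE = re.compile(r"[A-Za-z0-9+/]*={0,2}")
SIG_WINDOW = 4  # "within 4 lines" from the proposal


def is_b64(line):
    """True if the stripped line contains only base64 characters (or is blank)."""
    return B64_LINE.fullmatch(line.strip()) is not None


def illegal_base64_lines(lines):
    """Return indexes of illegal lines, excusing a footer at the end of the data."""
    illegal = [i for i, line in enumerate(lines) if not is_b64(line)]
    # The tail starts right after the last non-blank line of real base64 data.
    data = [i for i, line in enumerate(lines) if line.strip() and is_b64(line)]
    tail_start = data[-1] + 1 if data else 0
    # Signature markers only count when they appear in that tail.
    markers = [
        i for i in range(tail_start, len(lines))
        if lines[i].startswith(("--", "__"))
    ]

    def excused(i):
        return i >= tail_start and any(abs(i - m) <= SIG_WINDOW for m in markers)

    return [i for i in illegal if not excused(i)]


list_footer = [
    "SGVsbG8gd29ybGQh",
    "VGhpcyBpcyBhIHRlc3Qu",
    "",
    "--",
    "To unsubscribe, mail list-request@example.org",
]
garbage_in_middle = ["SGVsbG8h", "!!! not base64 !!!", "d29ybGQh"]

print(illegal_base64_lines(list_footer))        # footer is excused
print(illegal_base64_lines(garbage_in_middle))  # still reported
```

One known weakness of the sketch: a footer line made up only of letters
and digits passes the character test and would be counted as data, so a
real implementation would need a stricter check.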

 14.644  29.1953   0.0734    0.997   0.96    1.06  MIME_HTML_NO_CHARSET:1
 20.134  38.8626   1.3818    0.966   0.89    1.06  MIME_HTML_NO_CHARSET:2

This one is probably a bug.

  2.959   5.7470   0.1669    0.972   0.86    0.19  MIME_BASE64_NO_NAME:1
  3.042   5.7804   0.3004    0.951   0.81    0.19  MIME_BASE64_NO_NAME:2

could be a minor parsing issue

  0.864   1.7134   0.0134    0.992   0.91    1.59  MIME_HTML_MOSTLY:1
  0.977   1.9201   0.0334    0.983   0.88    1.59  MIME_HTML_MOSTLY:2

nah

  0.447   0.7534   0.1402    0.843   0.56    0.92  MSGID_FROM_MTA_HEADER:1
  0.407   0.6734   0.1402    0.828   0.53    0.92  MSGID_FROM_MTA_HEADER:2

hrm

  4.450   8.8939   0.0000    1.000   0.94    3.67  MSGID_FROM_MTA_SHORT:1
  7.599  14.8277   0.3605    0.976   0.88    3.67  MSGID_FROM_MTA_SHORT:2

There goes a perfectly good rule for me, even in STATISTICS.txt it was
pretty good:

  4.432   6.7680   0.0560    0.992   0.94    3.67  MSGID_FROM_MTA_SHORT

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: followup comparison of 2.6x vs. 3.0

Posted by Theo Van Dinter <fe...@kluge.net>.
On Mon, Apr 19, 2004 at 02:02:11PM -0700, Dan Quinlan wrote:
> Theo, I still think these are buglets:
> 
>   0.077   0.1533   0.0000    1.000   0.93    0.43  MIME_BASE64_ILLEGAL:1
>   0.127   0.1533   0.1001    0.605   0.21    0.43  MIME_BASE64_ILLEGAL:2
> 
> Mailing-list signatures after the end of the data.  I think this rule
> should ignore illegal data at the end of the message if it's within 4
> lines of a line beginning with "--" or "__".

IMO, if a mailing list puts a footer in a message that is encoded,
that's not our problem.  The rule is doing exactly what it's supposed to,
and the ham FPs are valid.

The rule still performs ok, though not terrifically, for the total of
everyone's results BTW:

  0.638   0.7609   0.0280    0.964   0.83    0.43  MIME_BASE64_ILLEGAL
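
As a side note on reading these rows: the S/O column appears to be the
spam share of all hits, SPAM% / (SPAM% + HAM%). That is inferred from
the numbers quoted in this thread rather than stated anywhere in it, so
treat the Python sketch below as an assumption. It shows how a small
number of ham hits pulls the per-corpus figure down to 0.605.

```python
def spam_over_overall(spam_pct, ham_pct):
    """Spam share of all hits; None when the rule hit nothing at all."""
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


# SPAM% and HAM% copied from the two MIME_BASE64_ILLEGAL rows quoted above.
for label, spam_pct, ham_pct in [
    ("MIME_BASE64_ILLEGAL:1", 0.1533, 0.0000),
    ("MIME_BASE64_ILLEGAL:2", 0.1533, 0.1001),
]:
    print(label, round(spam_over_overall(spam_pct, ham_pct), 3))
```

Because the inputs are already rounded percentages, the last digit can
differ slightly from the value in the table.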


But, if others feel we should ignore the invalid text in an encoded
section, we should be able to change the rule logic pretty easily.

-- 
Randomly Generated Tagline:
"Brevity is the soul of lingerie." - Dorothy Parker

Re: followup comparison of 2.6x vs. 3.0

Posted by Theo Van Dinter <fe...@kluge.net>.
On Mon, Apr 19, 2004 at 02:02:11PM -0700, Dan Quinlan wrote:
>  14.644  29.1953   0.0734    0.997   0.96    1.06  MIME_HTML_NO_CHARSET:1
>  20.134  38.8626   1.3818    0.966   0.89    1.06  MIME_HTML_NO_CHARSET:2
> 
> This one is probably a bug.

All valid ham hits for me.

>   2.959   5.7470   0.1669    0.972   0.86    0.19  MIME_BASE64_NO_NAME:1
>   3.042   5.7804   0.3004    0.951   0.81    0.19  MIME_BASE64_NO_NAME:2
> 
> could be a minor parsing issue

All valid hits for me.  All newsletters.  Most are from FoodTV --
multipart/alternative with text/plain and text/html... for some reason,
both parts are b64 encoded.  I also had a newsletter with an attached
GIF w/ no name.

>   0.864   1.7134   0.0134    0.992   0.91    1.59  MIME_HTML_MOSTLY:1
>   0.977   1.9201   0.0334    0.983   0.88    1.59  MIME_HTML_MOSTLY:2
> 
> nah

My only ham hit is valid.  Comes from a feedback response from Intuit:
m/a message with a blank text/plain part, and a few lines in the
text/html part.


From my POV, it just looks like we do more accurate parsing now, so
the rule hits change.  If they go down in rank, etc., well, there you
go.

-- 
Randomly Generated Tagline:
"Anyone know of a buffer cleaning program for Linux?
  Netscape! (from the back of the room)" - Aeleen Frisch at LISA '99

Re: followup comparison of 2.6x vs. 3.0

Posted by Daniel Quinlan <qu...@pathname.com>.
Daniel Quinlan <qu...@pathname.com> writes:

>   0.380   0.7601   0.0000    1.000   0.93    0.45  LOTS_OF_STUFF:1
>   0.043   0.0867   0.0000    1.000   0.93    0.45  LOTS_OF_STUFF:2
>
> bad

I figured this out.  It's really a URI rule which worked as a body rule
before because we stuffed URIs into the body.  I have some test rules
that out-do the original now and I'll check 'em in.
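
To illustrate the effect, here is a small Python sketch. The pattern is
hypothetical and is NOT the real LOTS_OF_STUFF rule, and the exact
"URI:" line format in the old rendered body is assumed. The point is
only that a pattern aimed at URL text matches the body while URIs are
appended to it, stops matching once they are not, and still matches when
run against the URI list directly.

```python
import re

# Hypothetical pattern for a tracking-style query string (illustration only).
PATTERN = re.compile(r"\?id=\w+&offer=\w+")


def rendered_body(text, uris, append_uris):
    """Build the text a body rule sees, optionally with URIs stuffed in."""
    lines = [text]
    if append_uris:
        lines += [f"URI:{uri}" for uri in uris]  # assumed old-style format
    return "\n".join(lines)


text = "Click here for lots of stuff"
uris = ["http://example.com/track?id=abc123&offer=42"]

old_body_hit = bool(PATTERN.search(rendered_body(text, uris, append_uris=True)))
new_body_hit = bool(PATTERN.search(rendered_body(text, uris, append_uris=False)))
uri_rule_hit = any(PATTERN.search(uri) for uri in uris)

print(old_body_hit, new_body_hit, uri_rule_hit)
```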

>   0.510   0.9801   0.0401    0.961   0.82    0.49  TO_HAS_SPACES:1
>   0.160   0.2934   0.0267    0.917   0.72    0.49  TO_HAS_SPACES:2
>
> bad

Ah, this broke because we changed the :addr code.  I'll try to
resurrect...

>   1.024   2.0401   0.0067    0.997   0.92    2.53  TRACKER_ID:1
>   0.814   1.6201   0.0067    0.996   0.92    2.53  TRACKER_ID:2
>
> bad

Hmmm... the rule is unchanged, so I think it's just the loss of URIs in
the body again.  The difference is not as big as LOTS_OF_STUFF, though,
so I'm not as inclined to pursue the missing 0.4% of spam hits.  Maybe
it's worth it, though.

> There goes a perfectly good rule for me, even in STATISTICS.txt it was
> pretty good:
>
>   4.432   6.7680   0.0560    0.992   0.94    3.67  MSGID_FROM_MTA_SHORT

I have a replacement rule that's just about as good (possibly more
correct) and is 12 lines long instead of 96 lines for the original set
of MSGID_FROM_MTA* eval rules:

  4.473   5.2540   0.1261    0.977   0.87    3.67  MSGID_FROM_MTA_SHORT
  5.436   6.3822   0.1646    0.975   0.86    0.01  T_MSGID_FROM_MTA_1

I think the original's really high scores were mostly luck, due to
being able to parse some lines and not others.  This rule uses the
trusted Received header code, so I think it will also solve some of the
FP problems that some sites had with MSGID_FROM_MTA_SHORT.

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting