
Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2004/04/13 03:08:01 UTC

possible HTML rules to delete

These results are for the HTML_MESSAGE messages in our corpus.

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 186686   182745     3941    0.979   0.00    0.00  (all messages)
100.000  97.8890   2.1110    0.979   0.00    0.00  (all messages as %)

Anything with an S/O below 0.500 is hitting a larger share of HTML ham
than of HTML spam.  Since most HTML messages are spam, these rules do
have good overall S/O ratios, but I don't think they add much to our
accuracy.  They generally have lower scores (an average score of 0.68
for rules ranking above HTML_MESSAGE and 0.29 for rules ranking below
it).
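
For anyone checking the tables by hand: the S/O column appears to be the
spam hit rate divided by the sum of the spam and ham hit rates, which is
consistent with the rows in this post.  A minimal Python sketch under
that assumption (the function name s_o_ratio is made up for illustration
and is not part of hit-frequencies):

```python
def s_o_ratio(spam_pct, ham_pct):
    """Return spam% / (spam% + ham%), or None if the rule never hit.

    Assumption: this mirrors how the S/O column is derived; it was
    inferred from the table rows, not taken from the tool's source.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return None
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the HTML-only tables in this post.
rows = {
    "HTML_COLOR_MAGENTA": (5.9400, 2.5628),
    "HTML_COLOR_BLUE": (18.2894, 31.1850),
    "HTML_IMAGE_ONLY_04": (6.7788, 0.0254),
}

for name, (spam_pct, ham_pct) in rows.items():
    ratio = s_o_ratio(spam_pct, ham_pct)
    verdict = "above" if ratio > 0.5 else "below"
    print(f"{name:22s} S/O {ratio:.3f} ({verdict} the 0.500 line)")
```

With these inputs the three ratios round to 0.699, 0.370 and 0.996,
matching the S/O column for those rules.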

First, the color rules don't seem very effective:

  5.869   5.9400   2.5628    0.699   0.33    0.00  HTML_COLOR_MAGENTA
  5.801   5.8612   3.0195    0.660   0.28    0.06  HTML_COLOR_GREEN
100.000  100.0000  100.0000    0.500   0.26    0.16  HTML_MESSAGE
 22.040  22.1527  16.8231    0.568   0.20    0.10  HTML_COLOR_RED
  4.548   4.5380   5.0241    0.475   0.11    0.10  HTML_COLOR_UNKNOWN
  5.407   5.3791   6.6988    0.445   0.09    0.00  HTML_COLOR_CYAN
 10.938  10.8446  15.2753    0.415   0.08    0.00  HTML_COLOR_YELLOW
 18.396  18.1581  29.4342    0.382   0.08    0.10  HTML_COLOR_UNSAFE
 18.562  18.2894  31.1850    0.370   0.07    0.10  HTML_COLOR_BLUE
 13.377  13.1938  21.8726    0.376   0.07    0.00  HTML_COLOR_GRAY

Second, the image area rules seem even less effective:

  0.260   0.2632   0.1015    0.722   0.35    0.00  HTML_IMAGE_AREA_06
100.000  100.0000  100.0000    0.500   0.26    0.16  HTML_MESSAGE
  0.366   0.3672   0.3299    0.527   0.14    0.28  HTML_IMAGE_AREA_05
  0.524   0.5248   0.5075    0.508   0.12    0.00  HTML_IMAGE_AREA_04
  0.016   0.0159   0.0254    0.385   0.05    0.00  HTML_IMAGE_AREA_08
  0.047   0.0449   0.1522    0.228   0.01    1.61  HTML_IMAGE_AREA_07
  0.081   0.0717   0.5075    0.124   0.00    0.00  HTML_IMAGE_AREA_09

So, it looks like spammers stopped adding size tags to images.

The image ratio and image only rules that don't rely on size tags seem
to be working better:

  6.636   6.7788   0.0254    0.996   0.94    2.75  HTML_IMAGE_ONLY_04
  4.399   4.4904   0.1776    0.962   0.84    1.90  HTML_IMAGE_ONLY_08
  2.540   2.5899   0.2284    0.919   0.73    0.53  HTML_IMAGE_ONLY_16
  3.083   3.1399   0.4567    0.873   0.63    1.53  HTML_IMAGE_ONLY_12
  5.176   5.2664   0.9642    0.845   0.57    0.82  HTML_IMAGE_RATIO_04
  7.104   7.2215   1.6493    0.814   0.52    0.00  HTML_IMAGE_RATIO_02
  3.126   3.1689   1.1165    0.739   0.38    0.79  HTML_IMAGE_ONLY_24
  2.293   2.3174   1.1672    0.665   0.28    0.61  HTML_IMAGE_ONLY_20
  2.140   2.1631   1.0911    0.665   0.28    0.94  HTML_IMAGE_RATIO_06
100.000  100.0000  100.0000    0.500   0.26    0.16  HTML_MESSAGE
  2.669   2.6884   1.7762    0.602   0.21    0.60  HTML_IMAGE_RATIO_08
  0.564   0.5324   2.0046    0.210   0.01    0.32  HTML_IMAGE_RATIO_12
  0.666   0.6162   2.9688    0.172   0.01    0.00  HTML_IMAGE_RATIO_14
  0.267   0.2375   1.6240    0.128   0.00    0.54  HTML_IMAGE_RATIO_10

Any thoughts or objections to removing them?

Daniel

-- 
Daniel Quinlan                     anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/    and open source consulting

Re: Re[2]: possible HTML rules to delete

Posted by Daniel Quinlan <qu...@pathname.com>.

Robert Menschel <Ro...@Menschel.net> writes:

> Corpus run with 2.63 distribution rules plus the above rule.
>
> S/O 0.918 (compared to global 0.813), over 6% of spam, significant ham.

Thanks.  Can you run hit-frequencies as follows?

  $ ./hit-frequencies -xpa -M 'HTML_MESSAGE|__MIME_HTML' -m 'LW_BIG_AND_RED|HTML_FONT_BIG|HTML_FONTCOLOR_RED'

That will show the results for just HTML messages, which is a more
useful measure of whether or not a rule is helpful.  Anything below
0.500 is pretty much not useful.

I think a better rule would combine color with font size all at once (so
it would have to be integrated into HTML.pm).

Daniel

Re[2]: possible HTML rules to delete

Posted by Robert Menschel <Ro...@Menschel.net>.

Hello Daniel,

Tuesday, April 13, 2004, 6:46:10 PM, you wrote:

DQ> Loren Wilton <lw...@earthlink.net> writes:

>> meta  LW_BIG_AND_RED   (HTML_FONT_BIG && HTML_FONTCOLOR_RED)
>> describe LW_BIG_AND_RED   BIG RED TEXT
>> score  LW_BIG_AND_RED   3

DQ> Someone with a corpus could certainly give it a shot.  It's
DQ> speculative without a corpus run, though.

Corpus run with 2.63 distribution rules plus the above rule.

S/O 0.918 (compared to global 0.813), over 6% of spam, significant ham.

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
 111528    90720    20808    0.813   0.00    0.00  (all messages)
  22646    22625       21    0.996   1.00   0.75  BIZ_TLD
  15721    15720        1    1.000   1.00   4.50  DATE_SPAMWARE_Y2K
   3923     3923        0    1.000   0.98   3.61  SUBJ_ILLEGAL_CHARS
                                                  ...
   6281     6155      126    0.918   0.76   3.00  LW_BIG_AND_RED
                                                  ...
  15427    14932      495    0.874   0.67   0.27  HTML_FONT_BIG
                                                  ...
   9750     9425      325    0.869   0.66   0.10  HTML_FONTCOLOR_RED
                                                  ...

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
 111528    90720    20808    0.813   0.00    0.00  (all messages)
100.000  81.3428  18.6572    0.813   0.00    0.00  (all messages as %)
 20.305  24.9394   0.1009    0.996   1.00    0.75  BIZ_TLD
 14.096  17.3280   0.0048    1.000   1.00    4.50  DATE_SPAMWARE_Y2K
  3.518   4.3243   0.0000    1.000   0.98    3.61  SUBJ_ILLEGAL_CHARS
                                                   ...
  5.632   6.7846   0.6055    0.918   0.76    3.00  LW_BIG_AND_RED
                                                   ...
 13.832  16.4594   2.3789    0.874   0.67    0.27  HTML_FONT_BIG
                                                   ...
  8.742  10.3891   1.5619    0.869   0.66    0.10  HTML_FONTCOLOR_RED
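
A cross-check of the two tables above, as a hedged Python sketch: the
percentage table follows from the raw counts, and S/O is taken on the
per-class hit rates rather than on the raw counts (raw counts alone
would give about 0.98 for LW_BIG_AND_RED, because this corpus holds far
more spam than ham).  The helper names are hypothetical; only the
numbers come from the run.

```python
TOTAL_SPAM = 90720   # "(all messages)" SPAM column
TOTAL_HAM = 20808    # "(all messages)" HAM column

# (spam hits, ham hits) copied from the numeric frequencies table.
HITS = {
    "LW_BIG_AND_RED": (6155, 126),
    "HTML_FONT_BIG": (14932, 495),
    "HTML_FONTCOLOR_RED": (9425, 325),
}


def hit_rates(spam_hits, ham_hits, total_spam=TOTAL_SPAM, total_ham=TOTAL_HAM):
    """Return (SPAM%, HAM%): hits as a percentage of each class."""
    return 100.0 * spam_hits / total_spam, 100.0 * ham_hits / total_ham


def s_o_from_counts(spam_hits, ham_hits):
    """S/O on per-class rates; assumed to match the S/O column."""
    spam_pct, ham_pct = hit_rates(spam_hits, ham_hits)
    total = spam_pct + ham_pct
    return spam_pct / total if total else None


for name, (spam_hits, ham_hits) in HITS.items():
    spam_pct, ham_pct = hit_rates(spam_hits, ham_hits)
    ratio = s_o_from_counts(spam_hits, ham_hits)
    print(f"{name:20s} {spam_pct:8.4f} {ham_pct:8.4f} {ratio:6.3f}")
```

The meta rule trades spam coverage (6.78% against 16.46% and 10.39% for
its two components) for a lower ham hit rate (0.61% against 2.38% and
1.56%), which is where the higher S/O comes from.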

Re: possible HTML rules to delete

Posted by Daniel Quinlan <qu...@pathname.com>.

Loren Wilton <lw...@earthlink.net> writes:

> meta  LW_BIG_AND_RED   (HTML_FONT_BIG && HTML_FONTCOLOR_RED)
> describe LW_BIG_AND_RED   BIG RED TEXT
> score  LW_BIG_AND_RED   3

Someone with a corpus could certainly give it a shot.  It's speculative
without a corpus run, though.

>>> The COLOR_UNSAFE rule would be very useful if it worked better.  It
>>> seems to catch a lot of the fffffe and fefefe type colors on a white
>>> background, but it misses a lot of them also, such as when the color
>>> value is 2 lines after the keyword.

Do you have an example that is missed by our renderer, but actually
renders as said color in a mail client?

>> The main problem is that it hits more ham than spam.

> And as written it doesn't catch a lot of the bogus values, although it
> catches some of them.  I believe I said above "if it worked better".

Define "bogus values"?  Which bogus values are not being caught?  How do
you propose to reduce ham hits on the so-called unsafe HTML colors?

Even if it is missing some spam hits due to some parsing problem, the
ham hits are still rather high.

> This rule needs sharpening, not deleting.

Okay, please propose something substantive or submit some code to
sharpen.

Daniel


Re: possible HTML rules to delete

Posted by Loren Wilton <lw...@earthlink.net>.

> > The RED and BLUE tags seem moderately useful in conjunction with big
> > font checks.
>
> Perhaps, but we don't have a rule for that.

Yes, but I do.  If the font color check goes away, then I won't, which will
be a net loss for my spam checking abilities.

meta  LW_BIG_AND_RED   (HTML_FONT_BIG && HTML_FONTCOLOR_RED)
describe LW_BIG_AND_RED   BIG RED TEXT
score  LW_BIG_AND_RED   3

A similar check exists for big blue text, LW_BIG_BLUE_TEXT.


> > The COLOR_UNSAFE rule would be very useful if it worked better.  It
> > seems to catch a lot of the fffffe and fefefe type colors on a white
> > background, but it misses a lot of them also, such as when the color
> > value is 2 lines after the keyword.
>
> The main problem is that it hits more ham than spam.

And as written it doesn't catch a lot of the bogus values, although it
catches some of them.  I believe I said above "if it worked better".  This
rule needs sharpening, not deleting.

        Loren

Re: possible HTML rules to delete

Posted by Daniel Quinlan <qu...@pathname.com>.

"Loren Wilton" <lw...@earthlink.net> writes:

> The RED and BLUE tags seem moderately useful in conjunction with big
> font checks.

Perhaps, but we don't have a rule for that.

> The COLOR_UNSAFE rule would be very useful if it worked better.  It
> seems to catch a lot of the fffffe and fefefe type colors on a white
> background, but it misses a lot of them also, such as when the color
> value is 2 lines after the keyword.

The main problem is that it hits more ham than spam.
 
Daniel


Re: possible HTML rules to delete

Posted by Loren Wilton <lw...@earthlink.net>.

The RED and BLUE tags seem moderately useful in conjunction with big font
checks.  I don't know why blue text is so popular; all I can guess is that
it is the color of certain little pills.  The other color cases could
probably disappear with no great loss.

The COLOR_UNSAFE rule would be very useful if it worked better.  It seems to
catch a lot of the fffffe and fefefe type colors on a white background, but
it misses a lot of them also, such as when the color value is 2 lines after
the keyword.
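
To make the fffffe/fefefe case concrete, here is a rough Python sketch of
two separate checks.  It is not the code in HTML.pm.  It assumes that
"unsafe" means "not in the 216-color web-safe palette" (every channel a
multiple of 0x33), and it adds a hypothetical near-background test with
an arbitrary threshold to show why those two values are suspicious on a
white background.  All function names are invented for illustration.

```python
def parse_hex_color(value):
    """Parse 'fffffe' or '#fffffe' into an (r, g, b) tuple of ints."""
    text = value.strip().lstrip("#")
    if len(text) != 6:
        raise ValueError(f"expected 6 hex digits, got {value!r}")
    return tuple(int(text[i:i + 2], 16) for i in (0, 2, 4))


def is_web_safe(rgb):
    """True if every channel is a multiple of 0x33 (assumed definition)."""
    return all(channel % 0x33 == 0 for channel in rgb)


def max_channel_gap(fg, bg):
    """Largest per-channel difference between two colors."""
    return max(abs(a - b) for a, b in zip(fg, bg))


def looks_hidden(fg, bg, threshold=8):
    """True if text color is nearly the background color.

    The threshold of 8 is an arbitrary illustrative value.
    """
    return max_channel_gap(fg, bg) <= threshold


white = parse_hex_color("#ffffff")
for text_color in ("fffffe", "fefefe", "ff0000"):
    rgb = parse_hex_color(text_color)
    print(text_color, "web-safe:", is_web_safe(rgb),
          "hidden on white:", looks_hidden(rgb, white))
```

A palette test alone also flags ordinary off-palette colors that
legitimate mail uses, which would fit the high ham hit rate reported for
the rule; a contrast test against the background is a narrower signal.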

Image area is, I think, useless.  I can't remember ever seeing a hit on one
of these rules in current spam.

Image_only and occasionally image-ratio tags show up and can help classify
spam.

        Loren