Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/13 04:04:55 UTC
Re: possible HTML rules to delete
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Daniel Quinlan writes:
> These results are for the HTML_MESSAGE messages in our corpus.
>
> OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
>   186686    182745      3941  0.979  0.00   0.00  (all messages)
>  100.000   97.8890    2.1110  0.979  0.00   0.00  (all messages as %)
>
> Anything with an S/O below 0.500 is hitting on more HTML ham than HTML
> spam. Since most HTML messages are spam, these rules do have good
> overall S/O ratios, but I don't think they add too much to our accuracy.
> They generally have lower scores (average of .68 score for rules better
> than HTML_MESSAGE and .29 score for rules less than HTML_MESSAGE).
>
> First, the color rules don't seem very effective:
>
>    5.869    5.9400    2.5628  0.699  0.33   0.00  HTML_COLOR_MAGENTA
>    5.801    5.8612    3.0195  0.660  0.28   0.06  HTML_COLOR_GREEN
>  100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
>   22.040   22.1527   16.8231  0.568  0.20   0.10  HTML_COLOR_RED
>    4.548    4.5380    5.0241  0.475  0.11   0.10  HTML_COLOR_UNKNOWN
>    5.407    5.3791    6.6988  0.445  0.09   0.00  HTML_COLOR_CYAN
>   10.938   10.8446   15.2753  0.415  0.08   0.00  HTML_COLOR_YELLOW
>   18.396   18.1581   29.4342  0.382  0.08   0.10  HTML_COLOR_UNSAFE
>   18.562   18.2894   31.1850  0.370  0.07   0.10  HTML_COLOR_BLUE
>   13.377   13.1938   21.8726  0.376  0.07   0.00  HTML_COLOR_GRAY
>
> Second, the image area rules seem even less effective:
>
>    0.260    0.2632    0.1015  0.722  0.35   0.00  HTML_IMAGE_AREA_06
>  100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
>    0.366    0.3672    0.3299  0.527  0.14   0.28  HTML_IMAGE_AREA_05
>    0.524    0.5248    0.5075  0.508  0.12   0.00  HTML_IMAGE_AREA_04
>    0.016    0.0159    0.0254  0.385  0.05   0.00  HTML_IMAGE_AREA_08
>    0.047    0.0449    0.1522  0.228  0.01   1.61  HTML_IMAGE_AREA_07
>    0.081    0.0717    0.5075  0.124  0.00   0.00  HTML_IMAGE_AREA_09
>
> So, it looks like spammers stopped adding size tags to images.
agreed.
> The image ratio and image only rules that don't rely on size tags seem
> to be working better:
>
>    6.636    6.7788    0.0254  0.996  0.94   2.75  HTML_IMAGE_ONLY_04
>    4.399    4.4904    0.1776  0.962  0.84   1.90  HTML_IMAGE_ONLY_08
>    2.540    2.5899    0.2284  0.919  0.73   0.53  HTML_IMAGE_ONLY_16
>    3.083    3.1399    0.4567  0.873  0.63   1.53  HTML_IMAGE_ONLY_12
>    5.176    5.2664    0.9642  0.845  0.57   0.82  HTML_IMAGE_RATIO_04
>    7.104    7.2215    1.6493  0.814  0.52   0.00  HTML_IMAGE_RATIO_02
>    3.126    3.1689    1.1165  0.739  0.38   0.79  HTML_IMAGE_ONLY_24
>    2.293    2.3174    1.1672  0.665  0.28   0.61  HTML_IMAGE_ONLY_20
>    2.140    2.1631    1.0911  0.665  0.28   0.94  HTML_IMAGE_RATIO_06
>  100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
>    2.669    2.6884    1.7762  0.602  0.21   0.60  HTML_IMAGE_RATIO_08
>    0.564    0.5324    2.0046  0.210  0.01   0.32  HTML_IMAGE_RATIO_12
>    0.666    0.6162    2.9688  0.172  0.01   0.00  HTML_IMAGE_RATIO_14
>    0.267    0.2375    1.6240  0.128  0.00   0.54  HTML_IMAGE_RATIO_10
>
> Any thoughts or objections to removing them?
nope, sounds good to me... those colour tags have always bothered
me anyway ;)
- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS
iD8DBQFAe0rHQTcbUG5Y7woRAkfzAJ9kie5sAlG/V1en+5ao9gjmFB9BbgCfT7dI
IImHNTsWxson+OtmhAorzb4=
=CiiW
-----END PGP SIGNATURE-----
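
Note for readers checking the figures: the S/O column in the quoted
tables appears consistent with SPAM% / (SPAM% + HAM%), where each
percentage is the share of spam (or ham) messages the rule hit. That is
why HTML_MESSAGE, which hits 100% of both classes in this HTML-only
corpus, sits at 0.500, and why anything below 0.500 hits proportionally
more HTML ham than HTML spam. The sketch below is only an illustration
of that arithmetic; the function name s_over_o is made up here and is
not taken from SpamAssassin's own tooling.

```python
def s_over_o(spam_pct: float, ham_pct: float) -> float:
    """Return spam_pct / (spam_pct + ham_pct).

    Assumption: this is the ratio the S/O column reports; it matches
    the rows checked below to three decimal places.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule hit no messages; S/O is undefined")
    return spam_pct / total


# (name, SPAM%, HAM%, S/O as printed), copied from the quoted tables
ROWS = [
    ("HTML_COLOR_MAGENTA", 5.9400, 2.5628, 0.699),
    ("HTML_COLOR_BLUE", 18.2894, 31.1850, 0.370),
    ("HTML_IMAGE_AREA_09", 0.0717, 0.5075, 0.124),
    ("HTML_IMAGE_ONLY_04", 6.7788, 0.0254, 0.996),
    ("HTML_MESSAGE", 100.0, 100.0, 0.500),
]

for name, spam_pct, ham_pct, printed in ROWS:
    computed = s_over_o(spam_pct, ham_pct)
    print(f"{name:20s} computed={computed:.3f} printed={printed:.3f}")
```

The same function reproduces the "(all messages)" row when given the
raw counts: 182745 spam and 3941 ham give roughly 0.979.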