Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2004/04/13 04:04:55 UTC

Re: possible HTML rules to delete

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Daniel Quinlan writes:
> These results are for the HTML_MESSAGE messages in our corpus.
> 
> OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
>  186686   182745     3941    0.979   0.00    0.00  (all messages)
> 100.000  97.8890   2.1110    0.979   0.00    0.00  (all messages as %)
> 
> Anything with an S/O below 0.500 is hitting on more HTML ham than HTML
> spam.  Since most HTML messages are spam, these rules do have good
> overall S/O ratios, but I don't think they add too much to our accuracy.
> They generally have lower scores (an average score of 0.68 for rules
> ranked above HTML_MESSAGE and 0.29 for rules ranked below it).
> 
> First, the color rules don't seem very effective:
> 
>   5.869   5.9400   2.5628    0.699   0.33    0.00  HTML_COLOR_MAGENTA
>   5.801   5.8612   3.0195    0.660   0.28    0.06  HTML_COLOR_GREEN
> 100.000 100.0000 100.0000    0.500   0.26    0.16  HTML_MESSAGE
>  22.040  22.1527  16.8231    0.568   0.20    0.10  HTML_COLOR_RED
>   4.548   4.5380   5.0241    0.475   0.11    0.10  HTML_COLOR_UNKNOWN
>   5.407   5.3791   6.6988    0.445   0.09    0.00  HTML_COLOR_CYAN
>  10.938  10.8446  15.2753    0.415   0.08    0.00  HTML_COLOR_YELLOW
>  18.396  18.1581  29.4342    0.382   0.08    0.10  HTML_COLOR_UNSAFE
>  18.562  18.2894  31.1850    0.370   0.07    0.10  HTML_COLOR_BLUE
>  13.377  13.1938  21.8726    0.376   0.07    0.00  HTML_COLOR_GRAY
> 
> Second, the image area rules seem even less effective:
> 
>   0.260   0.2632   0.1015    0.722   0.35    0.00  HTML_IMAGE_AREA_06
> 100.000 100.0000 100.0000    0.500   0.26    0.16  HTML_MESSAGE
>   0.366   0.3672   0.3299    0.527   0.14    0.28  HTML_IMAGE_AREA_05
>   0.524   0.5248   0.5075    0.508   0.12    0.00  HTML_IMAGE_AREA_04
>   0.016   0.0159   0.0254    0.385   0.05    0.00  HTML_IMAGE_AREA_08
>   0.047   0.0449   0.1522    0.228   0.01    1.61  HTML_IMAGE_AREA_07
>   0.081   0.0717   0.5075    0.124   0.00    0.00  HTML_IMAGE_AREA_09
> 
> So, it looks like spammers stopped adding size tags to images.

agreed.

> The image ratio and image only rules that don't rely on size tags seem
> to be working better:
> 
>   6.636   6.7788   0.0254    0.996   0.94    2.75  HTML_IMAGE_ONLY_04
>   4.399   4.4904   0.1776    0.962   0.84    1.90  HTML_IMAGE_ONLY_08
>   2.540   2.5899   0.2284    0.919   0.73    0.53  HTML_IMAGE_ONLY_16
>   3.083   3.1399   0.4567    0.873   0.63    1.53  HTML_IMAGE_ONLY_12
>   5.176   5.2664   0.9642    0.845   0.57    0.82  HTML_IMAGE_RATIO_04
>   7.104   7.2215   1.6493    0.814   0.52    0.00  HTML_IMAGE_RATIO_02
>   3.126   3.1689   1.1165    0.739   0.38    0.79  HTML_IMAGE_ONLY_24
>   2.293   2.3174   1.1672    0.665   0.28    0.61  HTML_IMAGE_ONLY_20
>   2.140   2.1631   1.0911    0.665   0.28    0.94  HTML_IMAGE_RATIO_06
> 100.000 100.0000 100.0000    0.500   0.26    0.16  HTML_MESSAGE
>   2.669   2.6884   1.7762    0.602   0.21    0.60  HTML_IMAGE_RATIO_08
>   0.564   0.5324   2.0046    0.210   0.01    0.32  HTML_IMAGE_RATIO_12
>   0.666   0.6162   2.9688    0.172   0.01    0.00  HTML_IMAGE_RATIO_14
>   0.267   0.2375   1.6240    0.128   0.00    0.54  HTML_IMAGE_RATIO_10
> 
> Any thoughts or objections to removing them?

nope, sounds good to me...  those colour tags have always bothered
me anyway ;)
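
For anyone checking the numbers: the S/O column in the quoted tables looks
like SPAM% / (SPAM% + HAM%). That formula is an assumption inferred from the
rows themselves, not taken from the script that produced them, and the names
s_o_ratio and ROWS below are made up for illustration. A minimal sketch that
recomputes S/O for a few of the quoted rows and flags those below 0.500:

```python
# Sketch only: recompute the S/O column from the SPAM% and HAM% columns.
# Assumption: S/O = SPAM% / (SPAM% + HAM%), inferred from the quoted rows.

def s_o_ratio(spam_pct: float, ham_pct: float) -> float:
    """Return the spam share of a rule's hits, given its SPAM% and HAM%."""
    total = spam_pct + ham_pct
    if total == 0:
        # A rule that hits nothing has no meaningful ratio.
        raise ValueError("rule has no hits; S/O is undefined")
    return spam_pct / total

# (rule name, SPAM%, HAM%) copied from the tables quoted in this message.
ROWS = [
    ("HTML_COLOR_MAGENTA", 5.9400, 2.5628),
    ("HTML_COLOR_BLUE", 18.2894, 31.1850),
    ("HTML_IMAGE_AREA_07", 0.0449, 0.1522),
    ("HTML_IMAGE_ONLY_04", 6.7788, 0.0254),
    ("HTML_MESSAGE", 100.0, 100.0),
]

if __name__ == "__main__":
    for name, spam_pct, ham_pct in ROWS:
        ratio = s_o_ratio(spam_pct, ham_pct)
        # Below 0.500 means the rule hits proportionally more HTML ham.
        note = "more ham than spam" if ratio < 0.5 else ""
        print(f"{ratio:5.3f}  {name:<20} {note}")
```

Because every message in this subset hits HTML_MESSAGE, its ratio is 0.500
by construction, which is why it works as the dividing line in each table.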

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAe0rHQTcbUG5Y7woRAkfzAJ9kie5sAlG/V1en+5ao9gjmFB9BbgCfT7dI
IImHNTsWxson+OtmhAorzb4=
=CiiW
-----END PGP SIGNATURE-----