Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2004/04/13 03:08:01 UTC
possible HTML rules to delete
These results are for the HTML_MESSAGE messages in our corpus.
OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
  186686    182745      3941  0.979  0.00   0.00  (all messages)
 100.000   97.8890    2.1110  0.979  0.00   0.00  (all messages as %)
Anything with an S/O below 0.500 hits a larger share of HTML ham than of
HTML spam. Since most HTML messages are spam, these rules still have good
overall S/O ratios, but I don't think they add much to our accuracy.
They also tend to have lower scores (an average score of 0.68 for rules
better than HTML_MESSAGE versus 0.29 for rules worse than HTML_MESSAGE).
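For anyone checking the numbers, the S/O column in these listings appears to be the spam hit percentage divided by the sum of the spam and ham hit percentages. That formula is inferred from the figures in this message, not taken from the hit-frequencies source, and the helper name s_o is made up for illustration. A minimal Python sketch:

```python
def s_o(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's (percentage-weighted) hits that fall on spam."""
    total = spam_pct + ham_pct
    return spam_pct / total if total else 0.0


# (SPAM%, HAM%) pairs copied from the listings in this message.
rows = {
    "HTML_COLOR_MAGENTA": (5.9400, 2.5628),
    "HTML_COLOR_BLUE": (18.2894, 31.1850),
    "HTML_IMAGE_ONLY_04": (6.7788, 0.0254),
}

for name, (spam_pct, ham_pct) in rows.items():
    ratio = s_o(spam_pct, ham_pct)
    # Below 0.500 means the rule hits a larger share of HTML ham than HTML spam.
    verdict = "keep?" if ratio >= 0.5 else "below 0.500"
    print(f"{name:20s} {ratio:.3f}  {verdict}")
```

HTML_MESSAGE itself sits at exactly 0.500 because it hits 100% of both classes in this restricted corpus, which is why it works as the dividing line.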
First, the color rules don't seem very effective:
   5.869    5.9400    2.5628  0.699  0.33   0.00  HTML_COLOR_MAGENTA
   5.801    5.8612    3.0195  0.660  0.28   0.06  HTML_COLOR_GREEN
 100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
  22.040   22.1527   16.8231  0.568  0.20   0.10  HTML_COLOR_RED
   4.548    4.5380    5.0241  0.475  0.11   0.10  HTML_COLOR_UNKNOWN
   5.407    5.3791    6.6988  0.445  0.09   0.00  HTML_COLOR_CYAN
  10.938   10.8446   15.2753  0.415  0.08   0.00  HTML_COLOR_YELLOW
  18.396   18.1581   29.4342  0.382  0.08   0.10  HTML_COLOR_UNSAFE
  18.562   18.2894   31.1850  0.370  0.07   0.10  HTML_COLOR_BLUE
  13.377   13.1938   21.8726  0.376  0.07   0.00  HTML_COLOR_GRAY
Second, the image area rules seem even less effective:
   0.260    0.2632    0.1015  0.722  0.35   0.00  HTML_IMAGE_AREA_06
 100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
   0.366    0.3672    0.3299  0.527  0.14   0.28  HTML_IMAGE_AREA_05
   0.524    0.5248    0.5075  0.508  0.12   0.00  HTML_IMAGE_AREA_04
   0.016    0.0159    0.0254  0.385  0.05   0.00  HTML_IMAGE_AREA_08
   0.047    0.0449    0.1522  0.228  0.01   1.61  HTML_IMAGE_AREA_07
   0.081    0.0717    0.5075  0.124  0.00   0.00  HTML_IMAGE_AREA_09
So, it looks like spammers stopped adding size tags to images.
The image ratio and image only rules that don't rely on size tags seem
to be working better:
   6.636    6.7788    0.0254  0.996  0.94   2.75  HTML_IMAGE_ONLY_04
   4.399    4.4904    0.1776  0.962  0.84   1.90  HTML_IMAGE_ONLY_08
   2.540    2.5899    0.2284  0.919  0.73   0.53  HTML_IMAGE_ONLY_16
   3.083    3.1399    0.4567  0.873  0.63   1.53  HTML_IMAGE_ONLY_12
   5.176    5.2664    0.9642  0.845  0.57   0.82  HTML_IMAGE_RATIO_04
   7.104    7.2215    1.6493  0.814  0.52   0.00  HTML_IMAGE_RATIO_02
   3.126    3.1689    1.1165  0.739  0.38   0.79  HTML_IMAGE_ONLY_24
   2.293    2.3174    1.1672  0.665  0.28   0.61  HTML_IMAGE_ONLY_20
   2.140    2.1631    1.0911  0.665  0.28   0.94  HTML_IMAGE_RATIO_06
 100.000  100.0000  100.0000  0.500  0.26   0.16  HTML_MESSAGE
   2.669    2.6884    1.7762  0.602  0.21   0.60  HTML_IMAGE_RATIO_08
   0.564    0.5324    2.0046  0.210  0.01   0.32  HTML_IMAGE_RATIO_12
   0.666    0.6162    2.9688  0.172  0.01   0.00  HTML_IMAGE_RATIO_14
   0.267    0.2375    1.6240  0.128  0.00   0.54  HTML_IMAGE_RATIO_10
Any thoughts or objections to removing them?
Daniel
--
Daniel Quinlan anti-spam (SpamAssassin), Linux,
http://www.pathname.com/~quinlan/ and open source consulting
Re: Re[2]: possible HTML rules to delete
Posted by Daniel Quinlan <qu...@pathname.com>.
Robert Menschel <Ro...@Menschel.net> writes:
> Corpus run with 2.63 distribution rules plus the above rule.
>
> S/O 0.918 (compared to global 0.813), over 6% of spam, significant ham.
Thanks. Can you run hit-frequencies as follows?
$ ./hit-frequencies -xpa -M 'HTML_MESSAGE|__MIME_HTML' -m 'LW_BIG_AND_RED|HTML_FONT_BIG|HTML_FONTCOLOR_RED'
That will show the results for just HTML messages, which is a more useful
measure of whether or not a rule is helpful. Anything with an S/O below
0.500 is pretty much not useful.
I think a better rule would combine color with font size all at once (so
it would have to be integrated into HTML.pm).
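To illustrate the difference: the meta rule quoted in this thread fires when both sub-rules hit anywhere in a message, whereas a combined check would require a single tag to be both big and red. Below is a minimal Python sketch of the latter. It is not SpamAssassin's HTML.pm (which is Perl), and the class name BigRedFinder, the size threshold of 5, and the list of "red" spellings are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class BigRedFinder(HTMLParser):
    """Counts <font> tags that set both a large size and a red color."""

    RED = {"red", "#ff0000", "ff0000"}  # assumed spellings of "red"

    def __init__(self, min_size: int = 5):
        super().__init__()
        self.min_size = min_size  # assumed threshold for "big"
        self.hits = 0

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        # HTMLParser lowercases names; values may be None, so normalize them.
        a = {k: (v or "").strip().lower() for k, v in attrs}
        size = a.get("size", "")
        if size.startswith("+") and size[1:].isdigit():
            # Relative sizes are taken against the HTML default font size of 3.
            big = 3 + int(size[1:]) >= self.min_size
        else:
            big = size.isdigit() and int(size) >= self.min_size
        if big and a.get("color", "") in self.RED:
            self.hits += 1


def big_and_red(html: str) -> bool:
    """True if at least one <font> tag is both big and red."""
    finder = BigRedFinder()
    finder.feed(html)
    return finder.hits > 0


print(big_and_red('<font size="6" color="red">BUY NOW</font>'))
print(big_and_red('<font size="6">Title</font> <font color="red">note</font>'))
```

The second call is the interesting case: a big heading plus an unrelated red note does not count here, while a meta rule over two message-wide rules cannot tell the two apart. The sketch ignores nested font tags and CSS, so colors inherited from an enclosing tag are missed.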
Daniel
Re[2]: possible HTML rules to delete
Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Daniel,
Tuesday, April 13, 2004, 6:46:10 PM, you wrote:
DQ> Loren Wilton <lw...@earthlink.net> writes:
>> meta LW_BIG_AND_RED (HTML_FONT_BIG && HTML_FONTCOLOR_RED)
>> describe LW_BIG_AND_RED BIG RED TEXT
>> score LW_BIG_AND_RED 3
DQ> Someone with a corpus could certainly give it a shot. It's
DQ> speculative without a corpus run, though.
Corpus run with 2.63 distribution rules plus the above rule.
S/O 0.918 (compared to global 0.813), over 6% of spam, significant ham.
Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)
 OVERALL      SPAM       HAM    S/O  RANK  SCORE  NAME
  111528     90720     20808  0.813  0.00   0.00  (all messages)
   22646     22625        21  0.996  1.00   0.75  BIZ_TLD
   15721     15720         1  1.000  1.00   4.50  DATE_SPAMWARE_Y2K
    3923      3923         0  1.000  0.98   3.61  SUBJ_ILLEGAL_CHARS
...
    6281      6155       126  0.918  0.76   3.00  LW_BIG_AND_RED
...
   15427     14932       495  0.874  0.67   0.27  HTML_FONT_BIG
...
    9750      9425       325  0.869  0.66   0.10  HTML_FONTCOLOR_RED
...
OVERALL%     SPAM%      HAM%    S/O  RANK  SCORE  NAME
  111528     90720     20808  0.813  0.00   0.00  (all messages)
 100.000   81.3428   18.6572  0.813  0.00   0.00  (all messages as %)
  20.305   24.9394    0.1009  0.996  1.00   0.75  BIZ_TLD
  14.096   17.3280    0.0048  1.000  1.00   4.50  DATE_SPAMWARE_Y2K
   3.518    4.3243    0.0000  1.000  0.98   3.61  SUBJ_ILLEGAL_CHARS
...
   5.632    6.7846    0.6055  0.918  0.76   3.00  LW_BIG_AND_RED
...
  13.832   16.4594    2.3789  0.874  0.67   0.27  HTML_FONT_BIG
...
   8.742   10.3891    1.5619  0.869  0.66   0.10  HTML_FONTCOLOR_RED
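The two listings are consistent with each other if the percentages are taken per class (spam hits over all spam, ham hits over all ham) and S/O is computed from those percentages instead of from raw counts. That reading is inferred from the numbers above, not from the hit-frequencies source. A short Python check for LW_BIG_AND_RED, with variable names chosen only for illustration:

```python
SPAM_TOTAL, HAM_TOTAL = 90720, 20808  # corpus sizes from "(all messages)"
spam_hits, ham_hits = 6155, 126       # LW_BIG_AND_RED counts from the log

# Per-class hit rates, as shown in the percentage listing.
spam_pct = 100.0 * spam_hits / SPAM_TOTAL
ham_pct = 100.0 * ham_hits / HAM_TOTAL

# S/O from the per-class percentages (matches the S/O column).
normalized_so = spam_pct / (spam_pct + ham_pct)

# S/O from raw counts, shown only for contrast.
raw_so = spam_hits / (spam_hits + ham_hits)

print(f"SPAM% {spam_pct:.4f}  HAM% {ham_pct:.4f}")
print(f"S/O from percentages {normalized_so:.3f}, from raw counts {raw_so:.3f}")
```

The raw-count ratio comes out higher (about 0.980) simply because this corpus holds over four times as much spam as ham; the percentage-based figure of 0.918 removes that imbalance.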
Re: possible HTML rules to delete
Posted by Daniel Quinlan <qu...@pathname.com>.
Loren Wilton <lw...@earthlink.net> writes:
> meta LW_BIG_AND_RED (HTML_FONT_BIG && HTML_FONTCOLOR_RED)
> describe LW_BIG_AND_RED BIG RED TEXT
> score LW_BIG_AND_RED 3
Someone with a corpus could certainly give it a shot. It's speculative
without a corpus run, though.
>>> The COLOR_UNSAFE rule would be very useful if it worked better. It
>>> seems to catch a lot of the fffffe and fefefe type colors on a white
>>> background, but it misses a lot of them also, such as when the color
>>> value is 2 lines after the keyword.
Do you have an example that is missed by our renderer, but actually
renders as said color in a mail client?
>> The main problem is that it hits more ham than spam.
> And as written it doesn't catch a lot of the bogus values, although it
> catches some of them. I believe I said above "if it worked better".
Define "bogus values"? Which bogus values are not being caught? How do
you propose to reduce ham hits on the so-called unsafe HTML colors?
Even if it is missing some spam hits due to some parsing problem, the
ham hits are still rather high.
> This rule needs sharpening, not deleting.
Okay, please propose something substantive or submit some code to
sharpen.
Daniel
Re: possible HTML rules to delete
Posted by Loren Wilton <lw...@earthlink.net>.
> > The RED and BLUE tags seem moderately useful in conjunction with big
> > font checks.
>
> Perhaps, but we don't have a rule for that.
Yes, but I do. If the font color check goes away, then I won't, which will
be a net loss for my spam checking abilities.
meta LW_BIG_AND_RED (HTML_FONT_BIG && HTML_FONTCOLOR_RED)
describe LW_BIG_AND_RED BIG RED TEXT
score LW_BIG_AND_RED 3
A similar check exists as LW_BIG_BLUE_TEXT.
> > The COLOR_UNSAFE rule would be very useful if it worked better. It
> > seems to catch a lot of the fffffe and fefefe type colors on a white
> > background, but it misses a lot of them also, such as when the color
> > value is 2 lines after the keyword.
>
> The main problem is that it hits more ham than spam.
And as written it doesn't catch a lot of the bogus values, although it
catches some of them. I believe I said above "if it worked better". This
rule needs sharpening, not deleting.
Loren
Re: possible HTML rules to delete
Posted by Daniel Quinlan <qu...@pathname.com>.
"Loren Wilton" <lw...@earthlink.net> writes:
> The RED and BLUE tags seem moderately useful in conjunction with big
> font checks.
Perhaps, but we don't have a rule for that.
> The COLOR_UNSAFE rule would be very useful if it worked better. It
> seems to catch a lot of the fffffe and fefefe type colors on a white
> background, but it misses a lot of them also, such as when the color
> value is 2 lines after the keyword.
The main problem is that it hits more ham than spam.
Daniel
Re: possible HTML rules to delete
Posted by Loren Wilton <lw...@earthlink.net>.
The RED and BLUE tags seem moderately useful in conjunction with big font
checks.
I don't know why blue text is so popular; all I can guess is that it is the
color of certain little pills. The other color cases could probably disappear
with no great loss.
The COLOR_UNSAFE rule would be very useful if it worked better. It seems to
catch a lot of the fffffe and fefefe type colors on a white background, but
it misses a lot of them also, such as when the color value is 2 lines after
the keyword.
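As a rough illustration of the two ideas mixed together here, the Python sketch below separates "not in the web-safe palette" from "nearly the same as the background". It assumes that "unsafe" means outside the 216-color web-safe palette, which this thread does not spell out, and the function names and the max_delta threshold of 8 are invented for the example. It is not how the actual rule is implemented.

```python
WEB_SAFE_LEVELS = {0x00, 0x33, 0x66, 0x99, 0xCC, 0xFF}


def parse_hex(color: str) -> tuple:
    """Turn 'fffffe' or '#fffffe' into an (r, g, b) tuple of ints."""
    h = color.strip().lstrip("#")
    if len(h) != 6:
        raise ValueError(f"expected a 6-digit hex color, got {color!r}")
    return tuple(int(h[i:i + 2], 16) for i in (0, 2, 4))


def is_web_safe(color: str) -> bool:
    """True if every channel is one of the six web-safe levels (assumed meaning)."""
    return all(c in WEB_SAFE_LEVELS for c in parse_hex(color))


def near_background(color: str, background: str = "ffffff", max_delta: int = 8) -> bool:
    """True if each channel is within max_delta of the background (hypothetical test)."""
    fg, bg = parse_hex(color), parse_hex(background)
    return all(abs(f - b) <= max_delta for f, b in zip(fg, bg))


for c in ("fffffe", "fefefe", "336699", "123456"):
    print(c, "web-safe" if is_web_safe(c) else "unsafe",
          "near-white" if near_background(c) else "visible")
```

Here fffffe and fefefe are both unsafe and near-white, while 123456 is unsafe but perfectly visible on white. A contrast test of this kind is one possible way to separate hidden text from ordinary odd colors, and it would still need a corpus run before anyone could say it helps.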
Image area is, I think, useless. I can't remember ever seeing a hit on one of
these rules in current spam.
The image-only and, occasionally, image-ratio rules show up and can help
classify spam.
Loren