Posted to users@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2011/04/22 01:35:26 UTC

Darxus's LOCAL_8X_TAGS

Broken apart from previous thread to prevent confusion.

On 04/21/2011 04:18 PM, darxus@chaosreigns.com wrote:
> On 04/21, Adam Katz wrote:
>> rawbody LOCAL_5X_BR_TAGS   /(?:<br\/?>[\s\r\n]{0,4}){5}/mi
>
> I wonder if it would be useful to generalize this as:
>
> rawbody LOCAL_8X_TAGS   /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi
>
> Just a mess of tags in a row without any content.

I'm not sure about email clients specifically, but it is (or rather,
used to be -- I'm way out of date here) a common WYSIWYG foible to
create empty tags when the user plays with various formatting buttons
(like bold and italics) as they decide how something is presented.
Therefore, it is not uncommon to have strings like this:

<b></b><b>1.</b> <b><i>Example bullet</i></b><b>
</b>
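
For what it is worth, the sample above can be checked against the proposed rule outside of SpamAssassin. This is only a rough sketch in Python: it assumes Perl's /mi maps to re.M | re.I, it treats the sample as one string rather than whatever chunks rawbody actually sees, and the names TAG_RUN, TAG and tag_runs are made up for illustration. It counts how many tags sit next to each other with at most four whitespace characters between them.

```python
import re

# Darxus's proposed pattern, with Perl's /mi mapped to re.M | re.I (assumption).
LOCAL_8X_TAGS = re.compile(r'(?:<[^>]*>[\s\r\n]{0,4}){8}', re.M | re.I)

# Helper patterns (hypothetical, for illustration only): a run of adjacent
# "tag plus up to four whitespace characters" units, and a single tag.
TAG_RUN = re.compile(r'(?:<[^>]*>[\s\r\n]{0,4})+')
TAG = re.compile(r'<[^>]*>')

# The WYSIWYG-style sample from the message, as a single string.
sample = '<b></b><b>1.</b> <b><i>Example bullet</i></b><b>\n</b>'


def tag_runs(text):
    """Return the number of tags in each run of adjacent tags."""
    return [len(TAG.findall(m.group())) for m in TAG_RUN.finditer(text)]


runs = tag_runs(sample)
hits_rule = LOCAL_8X_TAGS.search(sample) is not None
print(runs, hits_rule)
```

Under those assumptions the longest run in this particular sample is four tags, so it stays under the threshold of eight. An editor that leaves more empty tags behind would get closer to it.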

I kept thinking that there was a good psychology study in there
somewhere, since good knowledge of the inner workings of a specific
WYSIWYG editor would reveal lots of information about how the document
was composed (order, revisions, etc).

HTML generators' sloppiness is so abundant that many of them actually
run their final code through a cleanser application (e.g. Wikipedia uses
HTML Tidy).


Re: Darxus's LOCAL_8X_TAGS

Posted by da...@chaosreigns.com.
On 04/21, darxus@chaosreigns.com wrote:
> rawbody MUCH_HTML_SPACE  /(?:<\s*(?:p|br)[\s\/]*>\W*){8}/is 

A little better:

rawbody MUCH_HTML_SPACE  /(?:<\s*(?:p|br)[\s\/]*>[^[:alnum:]]*){8}/is

Same results on current corpora.  Hits 15 out of 57 most recently missed
spams, and none of 5,841 hams.
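
One difference between the two versions that can be shown in isolation is the underscore: \W does not match "_", while a negated alphanumeric class does. The sketch below is Python, not SpamAssassin. Python's re module has no POSIX classes, so [^A-Za-z0-9] stands in for [^[:alnum:]] as an ASCII-only approximation, and the names V1, V2 and the sample strings are hypothetical.

```python
import re

FLAGS = re.I | re.S  # Perl's /is (assumed mapping)

# First version of the rule, using \W* between tags.
V1 = re.compile(r'(?:<\s*(?:p|br)[\s\/]*>\W*){8}', FLAGS)

# Second version. [^A-Za-z0-9] is an ASCII stand-in for [^[:alnum:]],
# because Python's re does not support POSIX character classes.
V2 = re.compile(r'(?:<\s*(?:p|br)[\s\/]*>[^A-Za-z0-9]*){8}', FLAGS)

# Hypothetical inputs: eight <br> tags separated by underscores or whitespace.
underscored = '<br>_' * 8
spaced = '<br> \n' * 8

results = {
    'v1_underscored': V1.search(underscored) is not None,
    'v2_underscored': V2.search(underscored) is not None,
    'v1_spaced': V1.search(spaced) is not None,
    'v2_spaced': V2.search(spaced) is not None,
}
print(results)
```

Only the underscore-separated input separates the two patterns here, which would be consistent with both versions giving the same results on a corpus that contains no such filler.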

-- 
"The most merciful thing in the world, I think, is the inability of the
human mind to correlate all its contents."
http://www.ChaosReigns.com

Re: Darxus's LOCAL_8X_TAGS

Posted by da...@chaosreigns.com.
On 04/21, Adam Katz wrote:
> > rawbody LOCAL_8X_TAGS   /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi

> I'm not sure about email clients specifically, but it is (or rather,
> used to be -- I'm way out of date here) a common WYSIWYG foible to
> create empty tags when the user plays with various formatting buttons

Well, you were right.  I did find scary piles of tags.  How about this?

rawbody MUCH_HTML_SPACE  /(?:<\s*(?:p|br)[\s\/]*>\W*){8}/is 

Hits 15 of the last 57 spams SA has missed (score less than 5, all from
this month), and 0 out of 5,841 of my hams.


All 15 are also hit by your:

rawbody TAGNAME /(?:<br>){5}/mi 

I'm just attempting to more robustly catch large amounts of space.
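
To illustrate what "more robustly" can mean here, the following Python sketch compares the two patterns on a few made-up inputs. It is not SpamAssassin itself: the flag mapping (/is to re.I | re.S, /mi to re.M | re.I) is an assumption, and the sample strings and the names BR_X5 and samples are hypothetical.

```python
import re

# The two rules under discussion, translated to Python's re (assumed mapping).
MUCH_HTML_SPACE = re.compile(r'(?:<\s*(?:p|br)[\s\/]*>\W*){8}', re.I | re.S)
BR_X5 = re.compile(r'(?:<br>){5}', re.M | re.I)

# Hypothetical inputs: three ways of producing a block of vertical space.
samples = {
    'bare': '<br>' * 8,              # eight bare <br> tags in a row
    'xhtml': '<br />\n' * 8,         # XHTML-style tags, one per line
    'mixed': '<P>\r\n<BR/>  ' * 4,   # paragraphs and breaks, mixed case
}

hits = {
    name: (MUCH_HTML_SPACE.search(text) is not None,
           BR_X5.search(text) is not None)
    for name, text in samples.items()
}
for name, (much, br5) in hits.items():
    print(f'{name}: MUCH_HTML_SPACE={much} BR_X5={br5}')
```

On these inputs the stricter <br>-only pattern catches just the bare run, while the broader pattern also catches the self-closing and mixed variants.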


Re: Darxus's LOCAL_8X_TAGS

Posted by Karsten Bräckelmann <gu...@rudersport.de>.
On Thu, 2011-04-21 at 16:35 -0700, Adam Katz wrote:
> Broken apart from previous thread to prevent confusion.
> 
> On 04/21/2011 04:18 PM, darxus@chaosreigns.com wrote:

> > I wonder if it would be useful to generalize this as:
> >
> > rawbody LOCAL_8X_TAGS   /(?:<[^>]*>[\s\r\n]{0,4}){8}/mi

Rawbody. Matches on plain text, too. And given the "zero or more" nature
of the quantifiers, it does match '<>' repeated 8 times. Salt with space
as you see fit.
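
A quick way to see this outside of SpamAssassin is the Python sketch below. It assumes Perl's /mi maps to re.M | re.I and feeds the pattern plain strings directly, so it says nothing about how rawbody chunks the message; the sample strings and variable names are made up.

```python
import re

# The generalized rule, translated to Python's re (assumed flag mapping).
LOCAL_8X_TAGS = re.compile(r'(?:<[^>]*>[\s\r\n]{0,4}){8}', re.M | re.I)

# Hypothetical plain-text inputs with no HTML in them at all.
plain_delimiter = '<>' * 8                   # e.g. a "fancy" sig delimiter
spaced_delimiter = '<> ' * 8                 # the same, salted with space
comparison = 'if a < b and b > c then stop'  # a single "<...>" span only

matched = {
    'plain_delimiter': LOCAL_8X_TAGS.search(plain_delimiter) is not None,
    'spaced_delimiter': LOCAL_8X_TAGS.search(spaced_delimiter) is not None,
    'comparison': LOCAL_8X_TAGS.search(comparison) is not None,
}
print(matched)
```

Because [^>]* accepts the empty string, each "<>" counts as a tag, so both delimiter strings match while the ordinary comparison text does not.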

I seem to recall such things occasionally being used in text/plain mail
as some sort of fancy [1] delimiter in sigs... :/


[1] A euphemism for ugly.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}