Posted to users@spamassassin.apache.org by "Kevin A. McGrail" <KM...@PCCC.com> on 2014/02/06 21:58:29 UTC

Re: Help with a regex to catch spam with gibberish html tags

On 1/30/2014 6:37 PM, David B Funk wrote:
> On Thu, 30 Jan 2014, Amir Caspi wrote:
>
>> On Jan 30, 2014, at 10:28 AM, Kevin A. McGrail <KM...@PCCC.com> wrote:
>>
>>       If you want to share the complete rule, I can throw it into my 
>> sandbox and see what masscheck thinks as well.
>>
>>
>> The complete rule would be something like this, assuming Andy 
>> implemented it as I wrote it:
>>
>> rawbody HTML_NONSENSE_TAGS /(?:<[A-Za-z0-9]{4,}>\s*){10,}/
>> describe HTML_NONSENSE_TAGS Many consecutive multi-letter HTML tags, likely nonsense/spam
>> score HTML_NONSENSE_TAGS 0.001
>
> Actually, that unbounded {10,} repeat can be written as an explicit {10}
> without reducing the effectiveness of the rule, and that makes it more
> CPU efficient. I.e., once you've found at least 10 consecutive
> pseudo-tags, do you care if there are more than 10? You're not looking
> for anything specific after the match, nor doing anything with the
> exact number of them.

Just an FYI: I checked a bit ago, and AC_HTML_NONSENSE_TAGS was promoted 
to a published rule scoring 0.51.
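
On David's {10,} versus {10} point quoted above, here is a rough sketch 
of why the two forms should hit the same messages. It uses Python's re 
module as a stand-in for the Perl engine SpamAssassin actually uses, and 
the sample bodies, make_body(), and fires() are made up purely for 
illustration; it is not how rawbody rules are evaluated internally.

```python
import re

# One pseudo-tag as in the quoted rule: 4+ alphanumerics inside <>,
# followed by optional whitespace.
TAG = r"<[A-Za-z0-9]{4,}>\s*"

# Open-ended repeat (as originally posted) vs. fixed repeat (as suggested).
UNBOUNDED = re.compile(r"(?:" + TAG + r"){10,}")
BOUNDED = re.compile(r"(?:" + TAG + r"){10}")


def make_body(tag_count):
    """Build a made-up message body with tag_count consecutive pseudo-tags."""
    return "Hello " + " ".join(["<abcd>"] * tag_count) + " buy now"


def fires(pattern, body):
    """Return True if the pattern matches anywhere in the body."""
    return pattern.search(body) is not None


for count in (9, 10, 15):
    body = make_body(count)
    print(count, fires(UNBOUNDED, body), fires(BOUNDED, body))
```

In this sketch both patterns agree on whether a body hits; the only 
difference is that the {10} form stops consuming input after the tenth 
tag, while {10,} keeps going to the end of the run.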

Regards,
KAM