Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/15 20:42:34 UTC

[Bug 2929] New: Suggested rule - filtering out invalid HTML tags

http://bugzilla.spamassassin.org/show_bug.cgi?id=2929

           Summary: Suggested rule - filtering out invalid HTML tags
           Product: Spamassassin
           Version: 2.61
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: seann@herdejurgen.com


A number of spam e-mails I receive contain invalid HTML tags.  For example:

We</defensible> be</squashy>lie</thou>ve</eigenspace> orde</tercel>ring me=
</xavier>dication should be</glen> as simple</bedimming> as orde</bellini>=
ring anything e</ersatz>lse</cypriot> on the</postfix> Inte</priscilla>rne=

Since there are only about 100 valid HTML tags, a rule could check every 
closing tag of the form </TAG> against a list of valid tag names.  If the 
percentage of invalid tags exceeds some threshold, say 50%, the rule would fire.
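
The check described above could be sketched roughly as follows.  This is only 
an illustration, not an actual SpamAssassin rule: the function names 
invalid_close_tag_ratio and looks_like_tag_spam are hypothetical, the tag 
whitelist is deliberately abbreviated, and the 0.5 cutoff is simply the 50% 
figure suggested above.

```perl
use strict;
use warnings;

# Assumption: an abbreviated, illustrative whitelist, not the full set of HTML tags.
my %VALID_TAG = map { $_ => 1 } qw(
    a b body br center div em font h1 h2 h3 head html i img li
    ol p pre span strong table tbody td th title tr u ul
);

# Return the fraction of closing tags (</TAG>) whose names are not whitelisted.
# Returns 0 when the text contains no closing tags at all.
sub invalid_close_tag_ratio {
    my ($text) = @_;
    my ($total, $invalid) = (0, 0);
    while ($text =~ m{</([A-Za-z][A-Za-z0-9]*)\s*>}g) {
        $total++;
        $invalid++ unless $VALID_TAG{ lc $1 };
    }
    return $total ? $invalid / $total : 0;
}

# Hypothetical rule: fire when more than half of the closing tags are invalid.
sub looks_like_tag_spam {
    my ($text) = @_;
    return invalid_close_tag_ratio($text) > 0.5;
}
```

A real rule would need the complete tag list and probably a minimum tag count, 
so that a message with a single mistyped tag does not trigger it.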

Another possibility would be to strip the HTML from ('de-html') a message 
before checking it for words, so that words split apart by bogus tags are 
matched whole.  Here is a short dehtml script in Perl:

    # Read all input at once so that tags spanning several lines are stripped too.
    $_ = join '', <>; s/<.*?>//gs; print;


