Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/02/05 22:21:54 UTC

[Bug 1987] Rule for detecting non-HTML tags

http://bugzilla.spamassassin.org/show_bug.cgi?id=1987





------- Additional Comments From jdl@imaginenet.net  2004-02-05 13:21 -------
Rather than reinventing the wheel, how about using the Tidy project to check 
the HTML portion of an e-mail? I know that the project has a Perl module 
in addition to a library and an executable.

As a test, I extracted the HTML portion of a spam mail and ran the binary 
version of tidy against it. I ended up getting 42 warnings (see below). 
Perhaps the warning count could be multiplied by some value and the result 
used as a score for this test. If you wanted to go a step further, you could 
parse the warning log and assign a score to each entry. It appears the 
"discarding unexpected" entries should be given more weight.

line 1 column 1 - Warning: SYSTEM, PUBLIC, W3C, DTD, EN must be upper case
line 6 column 1 - Warning: <meta> unexpected or duplicate quote mark
line 6 column 1 - Warning: <meta> attribute with missing trailing quote mark
line 6 column 1 - Warning: <meta> unexpected or duplicate quote mark
line 6 column 1 - Warning: unknown attribute "text/html;"
line 6 column 1 - Warning: <meta> attribute with missing trailing quote mark
line 8 column 1 - Warning: <style> unexpected or duplicate quote mark
line 8 column 1 - Warning: <style> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> attribute with missing trailing quote mark
line 16 column 1 - Warning: <table> attribute "cellpadding" has invalid value "3D"
line 16 column 1 - Warning: <table> attribute "cellspacing" has invalid value "3D"
line 16 column 1 - Warning: <table> attribute "width" has invalid value "3D"
line 16 column 1 - Warning: <table> lacks "summary" attribute
line 19 column 25 - Warning: <div> unexpected or duplicate quote mark
line 19 column 25 - Warning: <div> attribute with missing trailing quote mark
line 32 column 26 - Warning: <a> unexpected or duplicate quote mark
line 32 column 26 - Warning: <a> attribute with missing trailing quote mark
line 38 column 1 - Warning: discarding unexpected </earthmoving>
line 38 column 15 - Warning: discarding unexpected </pomegranate>
line 38 column 29 - Warning: discarding unexpected </intimacy>
line 38 column 40 - Warning: discarding unexpected </mightn>
line 38 column 51 - Warning: discarding unexpected </coherent>
line 38 column 62 - Warning: discarding unexpected </curse>
line 39 column 1 - Warning: discarding unexpected </guilford>
line 39 column 12 - Warning: discarding unexpected </civet>
line 39 column 20 - Warning: discarding unexpected </suffragette>
line 39 column 34 - Warning: discarding unexpected </certify>
line 39 column 44 - Warning: discarding unexpected </buyer>
line 39 column 52 - Warning: discarding unexpected </czarina>
line 40 column 1 - Warning: discarding unexpected </alongside>
line 40 column 13 - Warning: discarding unexpected </bromide>
line 40 column 23 - Warning: discarding unexpected </gully>
line 40 column 31 - Warning: discarding unexpected </buff>
line 40 column 38 - Warning: discarding unexpected </waive>
line 40 column 46 - Warning: discarding unexpected </wander>
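
The per-entry scoring idea could be sketched roughly as follows. This is only 
an illustration in Python, not SpamAssassin code: the function name 
score_tidy_log and both weight values are hypothetical, and it assumes the 
log has already been captured as text in the format shown above.

```python
import re

# Hypothetical weights, chosen only for illustration. They are not taken
# from SpamAssassin or from Tidy and would need tuning against real mail.
DEFAULT_WEIGHT = 0.1   # any ordinary warning
DISCARD_WEIGHT = 0.5   # "discarding unexpected </tag>" warnings

# Matches lines such as:
#   line 38 column 1 - Warning: discarding unexpected </earthmoving>
WARNING_RE = re.compile(r"^line (\d+) column (\d+) - Warning: (.*)$")


def score_tidy_log(log_text):
    """Return (warning_count, weighted_score) for a tidy warning log."""
    count = 0
    score = 0.0
    for line in log_text.splitlines():
        match = WARNING_RE.match(line.strip())
        if not match:
            # Skip blank lines and wrapped continuation lines.
            continue
        count += 1
        message = match.group(3)
        if message.startswith("discarding unexpected"):
            score += DISCARD_WEIGHT
        else:
            score += DEFAULT_WEIGHT
    return count, score


sample = """\
line 6 column 1 - Warning: <meta> unexpected or duplicate quote mark
line 16 column 1 - Warning: <table> lacks "summary" attribute
line 38 column 1 - Warning: discarding unexpected </earthmoving>
line 38 column 15 - Warning: discarding unexpected </pomegranate>
"""

count, score = score_tidy_log(sample)
print(count, round(score, 2))
```

The simpler variant, multiplying the raw warning count by a single value, 
would just be count times one weight. Keeping the two weights separate makes 
it possible to score the bogus closing tags more heavily than ordinary 
markup errors.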




