Posted to users@spamassassin.apache.org by Kenneth Porter <sh...@sewingwitch.com> on 2006/03/11 02:08:15 UTC

HTML Validator (was: Interesting Phishing Trick)

On Wednesday, March 08, 2006 6:46 PM -0800 Kenneth Porter 
<sh...@sewingwitch.com> wrote:

> Makes me wonder about installing outbound filters that run a validator
> and reject anything that fails. I often see flame wars on mailing lists
> about allowing HTML posts to the list, but I wonder how the arguments
> would change if one allowed only *validated* HTML. I'll bet most who
> insist on using HTML would immediately be rejected by the validator.
> "Sorry, your message was rejected because your MUA vendor writes garbage
> that we can't parse, and makes you look like a spammer." ;)

Anyone know of a good validator that can be run over a MIME part to report 
on the quality of the HTML? This might be used as a go/no-go filter at 
milter level, or it could be used as an SA plugin to assign a variable 
score based on the quality of the HTML.

For mailing lists catering to newbies who love HTML and can't understand 
why us old-timers hate it, we can set the list to exclude all invalid HTML. 
"Sure, we'll accept your HTML. But only if it's really HTML. Not that crap 
that most MUA's write."
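
[Editor's sketch] The go/no-go versus variable-score idea above could look roughly like the following in Python. This is only an illustration under loud assumptions: it uses the standard library's `email` and `html.parser` modules, and it counts unbalanced tags rather than doing real DTD validation. Valid HTML may omit some closing tags (for example `</p>` or `</li>`), so a check this crude would penalize some legitimate markup. The names `TagBalanceChecker`, `html_problem_count`, and `score_message` are hypothetical.

```python
from email import message_from_string
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Counts closing tags that do not match the innermost open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.problems = 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems += 1


def html_problem_count(html):
    """Crude quality measure: mismatched closers plus tags left open."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + len(checker.stack)


def score_message(raw_message):
    """Return the total problem count over all text/html MIME parts."""
    msg = message_from_string(raw_message)
    total = 0
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            total += html_problem_count(payload.decode(charset, errors="replace"))
    return total
```

A go/no-go filter would reject any message where `score_message` returns a nonzero count, while a scoring plugin could scale its score with the count instead.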

Re: HTML Validator

Posted by Kenneth Porter <sh...@sewingwitch.com>.
--On Friday, March 10, 2006 5:08 PM -0800 Kenneth Porter 
<sh...@sewingwitch.com> wrote:

> Anyone know of a good validator that can be run over a MIME part to
> report on the quality of the HTML? This might be used as a go/no-go
> filter at milter level, or it could be used as an SA plugin to assign a
> variable score based on the quality of the HTML.
>
> For mailing lists catering to newbies who love HTML and can't understand
> why us old-timers hate it, we can set the list to exclude all invalid
> HTML. "Sure, we'll accept your HTML. But only if it's really HTML. Not
> that crap that most MUA's write."

I was trying to remember a web page I found that counseled not to use 
DOCTYPE and HTML tags around email to escape spam filters (pretty weird 
advice IMO) and I ran across indications that AOL is rejecting mail that 
fails to pass validation:

<http://www.petefreitag.com/item/307.cfm>
<http://info.aol.co.uk/about/spam/mailer-daemon.adp>
<http://postmaster.info.aol.com/errors/554hvufo.html>
<http://www.clickz.com/showPage.html?page=3490146>

Re: HTML Validator

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
Eric W. Bates wrote:
> I have never used it in a mail context; but tidy (from our friends at w3
> http://www.w3.org/People/Raggett/tidy/) is a very nice validator. Might
> be too big a load for SA, tho.  I think you will also find that M$ html
> output from OE is probably full of errors anyway...

All the better.  Maybe they can be shamed into fixing it.  ;-)

And maybe pigs will grow wings...  Sigh.

-Philip



Re: HTML Validator

Posted by "Eric W. Bates" <er...@vineyard.net>.
Kenneth Porter wrote:
> On Wednesday, March 08, 2006 6:46 PM -0800 Kenneth Porter
> <sh...@sewingwitch.com> wrote:
> 
>> Makes me wonder about installing outbound filters that run a validator
>> and reject anything that fails. I often see flame wars on mailing lists
>> about allowing HTML posts to the list, but I wonder how the arguments
>> would change if one allowed only *validated* HTML. I'll bet most who
>> insist on using HTML would immediately be rejected by the validator.
>> "Sorry, your message was rejected because your MUA vendor writes garbage
>> that we can't parse, and makes you look like a spammer." ;)
> 
> 
> Anyone know of a good validator that can be run over a MIME part to
> report on the quality of the HTML? This might be used as a go/no-go
> filter at milter level, or it could be used as an SA plugin to assign a
> variable score based on the quality of the HTML.
> 
> For mailing lists catering to newbies who love HTML and can't understand
> why us old-timers hate it, we can set the list to exclude all invalid
> HTML. "Sure, we'll accept your HTML. But only if it's really HTML. Not
> that crap that most MUA's write."

I have never used it in a mail context; but tidy (from our friends at w3
http://www.w3.org/People/Raggett/tidy/) is a very nice validator. Might
be too big a load for SA, tho.  I think you will also find that M$ html
output from OE is probably full of errors anyway...
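
[Editor's sketch] Calling tidy from a filter might look something like this. It assumes a `tidy` executable is installed and on the PATH, and that it follows the usual command-line conventions: `-q` to suppress nonessential output, `-e` to report only errors and warnings, and an exit status of 0, 1, or 2 for clean, warnings, or errors. The function names are hypothetical, and none of this has been tried in a mail context.

```python
import shutil
import subprocess

# Exit statuses as documented for the tidy command-line tool.
VERDICTS = {0: "clean", 1: "warnings", 2: "errors"}


def verdict_from_exit_status(status):
    """Map tidy's exit status to a short verdict string."""
    return VERDICTS.get(status, "unknown")


def run_tidy(html_bytes, tidy_path=None):
    """Feed HTML to tidy on stdin; return (verdict, diagnostics text)."""
    tidy_path = tidy_path or shutil.which("tidy")
    if tidy_path is None:
        raise FileNotFoundError("tidy executable not found on PATH")
    # -q suppresses nonessential output, -e reports only errors and warnings.
    proc = subprocess.run([tidy_path, "-q", "-e"], input=html_bytes,
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    diagnostics = proc.stderr.decode(errors="replace")
    return verdict_from_exit_status(proc.returncode), diagnostics
```

Spawning one process per HTML part is exactly the kind of load worried about above, so this would probably fit better in a separate filter than inside SA itself.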

> 
> 


Re: HTML Validator

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Mar 15, 2006 at 08:13:48PM -0700, Philip Prindeville wrote:
> I'm wondering what would be involved in putting in an HTML parser
> that could call various rules to check things, like the case of:

Well, you wouldn't "call various rules", you'd look for a behavior while
parsing and flag it for later detection by a rule.  The current code
means modifications have to be made to HTML.pm.

> <a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>

This kind of rule actually doesn't need to be in the HTML parser,
you could easily write a plugin that uses the already parsed anchor
information.

FWIW though, this rule has previously been discussed and dismissed as
being non-useful (too many FPs).  Earlier today on this list even. ;)

-- 
Randomly Generated Tagline:
"You can lead a bigot to water, but if you don't tie him up you can't
 make him drown." - The Psychodots

Re: HTML Validator

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Mar 16, 2006 at 12:50:34PM -0700, Philip Prindeville wrote:
> Hmm.  Thanks.  Trying out the attachment, but having issues.  Using 3.1.0
> on FC3 Linux.
> 
> Updated the bug.

In general, it's bad to have the same conversation in multiple locations.
I'd prefer to discuss issues with the plugin here as opposed to bugzilla since
the plugin was put there so that people in the future can easily access it.
Debugging problems and such I'd prefer to talk about here.

I also responded to your issue in the ticket.  It essentially came down to:
yes, the plugin works fine with 3.1.0.  The errors you saw indicate that
you're not using 3.1.x.

-- 
Randomly Generated Tagline:
Diversity is God's way of amusing himself.

Re: HTML Validator

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
Theo Van Dinter wrote:

>On Wed, Mar 15, 2006 at 09:58:52PM -0700, Philip Prindeville wrote:
>  
>
>>Ok, does anyone have *recent* statistical analysis (i.e. not almost a
>>year old)
>>on this?  It could be that the people using this "boneheaded" construct have
>>realized the error of their ways, and stopped doing it.
>>    
>>
>
>Unfortunately not.  I updated the ticket
>(http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255) with new
>stats and a plugin that implements the check so people can play with it.
>The best version was comparing domains:
>
>  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
>      0    28446     5023    0.850   0.00    0.00  (all messages)
>0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
>  0.302   0.3340   0.1195    0.737   0.00    0.01  T_HTTPS_HTTP_MISMATCH
>
>If people want to play with the plugin and can improve the hit rate to
>a usable level (or if you find a bug in the code), please let us know!
>But otherwise this rule sucks pretty badly.  :(
>
>  
>
Hmm.  Thanks.  Trying out the attachment, but having issues.  Using 3.1.0
on FC3 Linux.

Updated the bug.

-Philip


Re: HTML Validator

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Mar 15, 2006 at 09:58:52PM -0700, Philip Prindeville wrote:
> Ok, does anyone have *recent* statistical analysis (i.e. not almost a
> year old)
> on this?  It could be that the people using this "boneheaded" construct have
> realized the error of their ways, and stopped doing it.

Unfortunately not.  I updated the ticket
(http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255) with new
stats and a plugin that implements the check so people can play with it.
The best version was comparing domains:

  MSECS    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0    28446     5023    0.850   0.00    0.00  (all messages)
0.00000  84.9921  15.0079    0.850   0.00    0.00  (all messages as %)
  0.302   0.3340   0.1195    0.737   0.00    0.01  T_HTTPS_HTTP_MISMATCH

If people want to play with the plugin and can improve the hit rate to
a usable level (or if you find a bug in the code), please let us know!
But otherwise this rule sucks pretty badly.  :(
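
[Editor's sketch] For readers unfamiliar with the table, the S/O column can be reproduced from the SPAM% and HAM% columns, assuming S/O is computed as spam% / (spam% + ham%). That assumption matches both rows above to within rounding of the printed percentages. The helper names are hypothetical.

```python
def s_over_o(spam_pct, ham_pct):
    """Share of a rule's hits that are spam, using per-corpus percentages."""
    return spam_pct / (spam_pct + ham_pct)


def approx_hits(pct, corpus_size):
    """Approximate message count behind a percentage of a corpus."""
    return round(pct / 100.0 * corpus_size)


# Figures copied from the table above.
overall = s_over_o(84.9921, 15.0079)      # the "(all messages as %)" row
rule = s_over_o(0.3340, 0.1195)           # the T_HTTPS_HTTP_MISMATCH row
spam_hits = approx_hits(0.3340, 28446)    # spam messages the rule hit
ham_hits = approx_hits(0.1195, 5023)      # ham messages the rule hit

print(f"overall S/O {overall:.3f}, rule S/O {rule:.4f}")
print(f"about {spam_hits} spam hits and {ham_hits} ham hits")
```

Worked through, the rule fired on roughly 95 of 28446 spam messages and 6 of 5023 ham messages. Its S/O of about 0.74 is below the corpus-wide 0.850, so a hit is weaker evidence of spam than knowing nothing about the message at all.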

-- 
Randomly Generated Tagline:
 Fry: Whoah. Check out that guy. He makes Speedy Gonzales look like 
  Regular Gonzalez.

Re: HTML Validator

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
Theo Van Dinter wrote:

>On Wed, Mar 15, 2006 at 08:40:51PM -0700, Philip Prindeville wrote:
>  
>
>>Does anyone have a way of doing a statistical analysis of ham that contains
>>http(s?):// as the beginning of the anchor text?
>>    
>>
>
>So for the second time today:
>
>http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255
>
>  
>

Ok, does anyone have *recent* statistical analysis (i.e. not almost a
year old) on this?  It could be that the people using this "boneheaded"
construct have realized the error of their ways, and stopped doing it.

-Philip


Re: HTML Validator

Posted by Theo Van Dinter <fe...@apache.org>.
On Wed, Mar 15, 2006 at 08:40:51PM -0700, Philip Prindeville wrote:
> Does anyone have a way of doing a statistical analysis of ham that contains
> http(s?):// as the beginning of the anchor text?

So for the second time today:

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4255

-- 
Randomly Generated Tagline:
We are what we pretend to be.
 		-- Kurt Vonnegut, Jr.

Re: HTML Validator

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
Craig Morrison wrote:

>Philip Prindeville wrote:
>  
>
>>I'm wondering what would be involved in putting in an HTML parser
>>that could call various rules to check things, like the case of:
>>
>><a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>
>>
>>where the link disagrees with the text between the anchor tags (yeah, you
>>could limit it to partial matches on the host-portion)...
>>    
>>
>
>This is the functional equivalent of pissing in the wind. If you are 
>downwind, you are going to get wet.
>
>Anchor text in too many/most cases will not match the HREF. grep is 
>good, but it isn't good enough to catch all cases without significant 
>overhead. Anchor text is a descriptor, nothing more than that. It is not 
>a regurgitation of the link HREF.
>
>  
>

Usually it's not.  That's the point.  It's when the anchor text tries to
look like a URL that one needs to be suspicious.  At the very least, if
the anchor text starts with "https://" but the anchor URL looks like
"http://", I'd say that this is definitely spam.

Does anyone have a way of doing a statistical analysis of ham that contains
http(s?):// as the beginning of the anchor text?
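
[Editor's sketch] The narrow check described here, anchor text that starts with "https://" over an href that is plain "http://", can be written as a small predicate. This is a Python illustration only, not SpamAssassin code, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_https_text_http_href(href, anchor_text):
    """True when the visible text claims https but the real link is http."""
    text = anchor_text.strip().lower()
    scheme = urlsplit(href.strip()).scheme.lower()
    return text.startswith("https://") and scheme == "http"
```

Descriptive anchor text such as "click here" never trips this check, which keeps it separate from the broader "text does not match the HREF" objection.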

-Philip




Re: HTML Validator

Posted by Craig Morrison <cr...@2cah.com>.
Philip Prindeville wrote:
> I'm wondering what would be involved in putting in an HTML parser
> that could call various rules to check things, like the case of:
> 
> <a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>
> 
> where the link disagrees with the text between the anchor tags (yeah, you
> could limit it to partial matches on the host-portion)...

This is the functional equivalent of pissing in the wind. If you are 
downwind, you are going to get wet.

Anchor text in too many/most cases will not match the HREF. grep is 
good, but it isn't good enough to catch all cases without significant 
overhead. Anchor text is a descriptor, nothing more than that. It is not 
a regurgitation of the link HREF.


Re: HTML Validator

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
Kenneth Porter wrote:

>On Friday, March 10, 2006 9:43 PM -0700 Philip Prindeville 
><ph...@redfish-solutions.com> wrote:
>
>  
>
>>Do you mean:
>>
>>http://validator.w3.org/source/
>>    
>>
>
>I thought that was just a web form-based validator. I'll have to look at it 
>to see if the validator can be run over an attachment (ie. an HTML MIME 
>part) from a separate mail filter (eg. MIMEDefang).
>  
>

I'm wondering what would be involved in putting in an HTML parser
that could call various rules to check things, like the case of:

<a href="http://www.foo.com/xyzzy">http://www.bar.com/aardvark</a>

where the link disagrees with the text between the anchor tags (yeah, you
could limit it to partial matches on the host-portion)...

This seems to be the Korean Chase issue that Chris encountered.
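
[Editor's sketch] The host-portion comparison suggested above might be prototyped like this in Python with the standard library's `html.parser`. It is a rough illustration, not how SpamAssassin parses HTML. It only looks at anchors whose visible text itself starts with a URL scheme, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorCollector(HTMLParser):
    """Collects (href, visible text) pairs for every <a> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def host_of(url):
    """Lower-cased host portion of a URL, or an empty string."""
    return (urlsplit(url).hostname or "").lower()


def mismatched_anchors(html):
    """Return anchors whose URL-looking text names a different host."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    flagged = []
    for href, text in collector.anchors:
        if text.lower().startswith(("http://", "https://")):
            if host_of(href) != host_of(text):
                flagged.append((href, text))
    return flagged
```

Run over the example above, this flags the www.foo.com link displayed as www.bar.com. Whether such a check produces too many false positives on real ham is a separate question that only corpus statistics can answer.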

-Philip


Re: HTML Validator

Posted by Kenneth Porter <sh...@sewingwitch.com>.
On Friday, March 10, 2006 9:43 PM -0700 Philip Prindeville 
<ph...@redfish-solutions.com> wrote:

> Do you mean:
>
> http://validator.w3.org/source/

I thought that was just a web form-based validator. I'll have to look at it 
to see if the validator can be run over an attachment (ie. an HTML MIME 
part) from a separate mail filter (eg. MIMEDefang).

Re: HTML Validator

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
Kenneth Porter wrote:

> Anyone know of a good validator that can be run over a MIME part to report 
> on the quality of the HTML? This might be used as a go/no-go filter at 
> milter level, or it could be used as an SA plugin to assign a variable 
> score based on the quality of the HTML.
> 
> For mailing lists catering to newbies who love HTML and can't understand 
> why us old-timers hate it, we can set the list to exclude all invalid HTML. 
> "Sure, we'll accept your HTML. But only if it's really HTML. Not that crap 
> that most MUA's write."

Do you mean:

http://validator.w3.org/source/

-Philip