Posted to users@spamassassin.apache.org by Theo Van Dinter <fe...@apache.org> on 2007/07/01 03:23:31 UTC

Re: URIBL_BLACK matching on messages with no URLs in them...

On Sat, Jun 30, 2007 at 12:07:04PM -0700, Jo Rhett wrote:
> There's no URL in this message.  What is it mis-matching against?

When in doubt, run through "spamassassin -D":

[9710] dbg: uridnsbl: domains to query: sync.pl svcolo.com

SA doesn't just look for full URLs, it looks for things that could be
hostnames ala "copy www.example.com into your browser".
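
[Illustration, not SpamAssassin's code] A minimal Python sketch of why a
filename such as "sync.pl" can end up in the "domains to query" list. The
pattern and the function name hostlike_tokens are assumptions made for this
example; SA's real parser is more involved. The point is only that a dotted
token ending in something shaped like a TLD (.pl is Poland's country code)
is indistinguishable from a bare hostname at the text level.

```python
import re

# Hypothetical, simplified pattern: one or more dotted labels followed by a
# short alphabetic label that could be a TLD. Not the pattern SA uses.
HOSTLIKE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,6}\b", re.IGNORECASE)


def hostlike_tokens(text):
    """Return every substring that looks like a hostname, lowercased."""
    return [m.group(0).lower() for m in HOSTLIKE.finditer(text)]


# A script path and an intentional bare hostname both produce a candidate.
print(hostlike_tokens("cron output: /usr/local/bin/sync.pl finished"))
print(hostlike_tokens("copy www.example.com into your browser"))
```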

-- 
Randomly Selected Tagline:
"If all the girls who attended the Harvard-Yale game were laid end to end,
 I wouldn't be surprised."       - Dorothy Parker

Re: URIBL_BLACK matching on messages with no URLs in them...

Posted by Theo Van Dinter <fe...@apache.org>.
On Mon, Jul 02, 2007 at 01:28:27PM -0700, Jo Rhett wrote:
> Both of these assume I know every person who needs to e-mail me, and  
> everything they will send me.  Theo, you're active in enough open  
> source projects to know better.

Well, you just said you were receiving a large amount of "system" type mails,
which for me would all be from my own/well defined set of systems.

> Well then we need to alter the code.  While bareword domain matching
> might make sense, it doesn't make sense for /a/valid/system/path/file.pl
> for "file.pl" to be checked.  Zero hits on spam corpus.

I think this is definitely a section of SA that could
use some work, so ...  Patches welcome. :)    As a start,
PerMsgStatus::_get_parsed_uri_list() is the function that goes through
the text looking for hostnames or domains.  It looks for both schemed URIs
(http://.../) and schemeless URIs, which is where you're getting hit.

Everything else, such as URIDNSBL, keys off of that.


Random thought: URIDNSBL actually has a set of priorities when figuring out
which domains to query.  I wonder if the results would be better/worse if the
rules were based on the source type -- at least HTML versus parsed, but could
also be HTML tag, etc.
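
[Illustration, not SpamAssassin's code] One possible reading of the idea
above, sketched in Python: tag each found domain with where it came from and
prefer the more trustworthy sources when deciding what to query. The source
labels, their ranking, and the function name domains_to_query are all
hypothetical and are not taken from the URIDNSBL plugin.

```python
# Hypothetical source labels; a lower number means "query this first".
SOURCE_PRIORITY = {
    "html_a": 0,           # href of an <a> tag
    "html_other": 1,       # any other HTML attribute
    "schemed_text": 2,     # http://... found in plain text
    "schemeless_text": 3,  # bare word that merely looks like a hostname
}


def domains_to_query(found, limit=20):
    """Pick up to `limit` domains, best source first.

    `found` is an iterable of (domain, source) pairs. A domain seen from
    several sources keeps its best (lowest) rank.
    """
    best = {}
    for domain, source in found:
        rank = SOURCE_PRIORITY[source]
        if domain not in best or rank < best[domain]:
            best[domain] = rank
    # Sort by rank, then by name so the result is deterministic.
    ordered = sorted(best, key=lambda d: (best[d], d))
    return ordered[:limit]
```

With a small limit, a bare word like "sync.pl" would only be queried if
nothing from a stronger source filled the list first.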

-- 
Randomly Selected Tagline:
"G: And are you using Windows or a Mac?
  T: Neither, I'm using Linux.
  G: Oh, you're a power user."            - Theo and his ex-ISP

Re: URIBL_BLACK matching on messages with no URLs in them...

Posted by Jo Rhett <jr...@netconsonance.com>.
On Jul 2, 2007, at 1:22 PM, Theo Van Dinter wrote:
> If these are from known good sources, just whitelist them (or skip SA
> altogether).  Otherwise, if the names are specific, you could always
> use uridnsbl_skip_domain to bypass URIDNSBL checks on the parsed  
> domains.

Both of these assume I know every person who needs to e-mail me, and  
everything they will send me.  Theo, you're active in enough open  
source projects to know better.

> There is no non-code-altering way of modifying the behavior (not parsing
> the domains out or having URIDNSBL not use parsed domains).

Well then we need to alter the code.  While bareword domain matching
might make sense, it doesn't make sense for /a/valid/system/path/file.pl
for "file.pl" to be checked.  Zero hits on spam corpus.

Frankly, spam doesn't work unless people can just click it.  So we  
really only need to look for things that stupid Windows programs will  
try to interpret for the user.  The file.pl example above will never  
get a user to the file.pl spam site.

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source  
and other randomness



Re: URIBL_BLACK matching on messages with no URLs in them...

Posted by Theo Van Dinter <fe...@apache.org>.
On Mon, Jul 02, 2007 at 01:05:17PM -0700, Jo Rhett wrote:
> I need to completely disable this over-opportunistic behavior.  90%  
> of my e-mails have either system output, or are concerning code  
> segments or router interfaces, etc, etc.  I need these mails to get  
> through.
>
> At the very least, common collisions like script.pl need to be disabled.

If these are from known good sources, just whitelist them (or skip SA
altogether).  Otherwise, if the names are specific, you could always
use uridnsbl_skip_domain to bypass URIDNSBL checks on the parsed domains.

There is no non-code-altering way of modifying the behavior (not parsing
the domains out or having URIDNSBL not use parsed domains).

-- 
Randomly Selected Tagline:
"If you're choking someone, and you remove your hand, you're going to get
 punched in the face."    - Hal Stern

Re: URIBL_BLACK matching on messages with no URLs in them...

Posted by Jo Rhett <jr...@netconsonance.com>.
> From: Jo Rhett [mailto:jrhett@netconsonance.com]
>> I need to completely disable this over-opportunistic behavior.  90%
>> of my e-mails have either system output, or are concerning code
>> segments or router interfaces, etc, etc.  I need these mails to get
>> through.
>>
>> At the very least, common collisions like script.pl need to be
>> disabled.

On Jul 2, 2007, at 1:13 PM, Rosenbaum, Larry M. wrote:
> uridnsbl_skip_domain script.pl

Perhaps some people can think of every possible script, filename,  
router command, etc that might ever be mailed to them but I'm not one  
of them.  So manually listing each one in my SA config isn't an option.

I don't want bare words to be checked against a URI blacklist, it  
can't be that useful.

And checking the filename at the end of a system path (which is what  
this did) would *never* match against any spam.  I just ran a test  
against several million known spam and got zero hits.

This might be valid spam URL:   www.spammy.com/a/valid/url
This will never be a spam URL:   /a/valid/url/spammy.com

The FP from this message tested the latter case.
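
[Illustration, not a SpamAssassin patch] The distinction between the two
cases above can be sketched in Python. This is only a rough heuristic under
stated assumptions: the pattern and the function name candidate_domains are
hypothetical, and a host-like token is dropped when it directly follows a
"/" unless that slash is part of "://".

```python
import re

# Hypothetical pattern: a host-like token that starts at a token boundary.
HOSTLIKE = re.compile(r"(?<![\w.-])(?:[a-z0-9-]+\.)+[a-z]{2,6}\b", re.IGNORECASE)


def candidate_domains(text):
    """Return host-like tokens, skipping ones that sit inside a path.

    "www.spammy.com/a/valid/url" keeps www.spammy.com (it leads the path);
    "/a/valid/url/spammy.com" keeps nothing (spammy.com follows a slash).
    """
    kept = []
    for m in HOSTLIKE.finditer(text):
        start = m.start()
        follows_slash = start > 0 and text[start - 1] == "/"
        follows_scheme = text[max(0, start - 3):start] == "://"
        if follows_slash and not follows_scheme:
            continue  # trailing path component, e.g. a script filename
        kept.append(m.group(0).lower())
    return kept
```

A filename mentioned without its path ("run sync.pl by hand") would still
be picked up, so this narrows the false positives rather than removing them.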

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source  
and other randomness



RE: URIBL_BLACK matching on messages with no URLs in them...

Posted by "Rosenbaum, Larry M." <ro...@ornl.gov>.
> From: Jo Rhett [mailto:jrhett@netconsonance.com]
> 
> > SA doesn't just look for full URLs, it looks for things that could be
> > hostnames ala "copy www.example.com into your browser".
> 
> This is fairly nonfunctional.  I've been chasing around all sorts of
> FPs that seem to hit pretty much every message that comes to me with
> source code inside it, and you've probably nailed every one of them
> on the head.  I didn't realize that they were related at the SA level.
> 
> I need to completely disable this over-opportunistic behavior.  90%
> of my e-mails have either system output, or are concerning code
> segments or router interfaces, etc, etc.  I need these mails to get
> through.
> 
> At the very least, common collisions like script.pl need to be
> disabled.

uridnsbl_skip_domain script.pl

Re: URIBL_BLACK matching on messages with no URLs in them...

Posted by Jo Rhett <jr...@netconsonance.com>.
On Jun 30, 2007, at 6:23 PM, Theo Van Dinter wrote:
> On Sat, Jun 30, 2007 at 12:07:04PM -0700, Jo Rhett wrote:
>> There's no URL in this message.  What is it mis-matching against?
>
> When in doubt, run through "spamassassin -D":
> [9710] dbg: uridnsbl: domains to query: sync.pl svcolo.com

Thanks for the reminder.

> SA doesn't just look for full URLs, it looks for things that could be
> hostnames ala "copy www.example.com into your browser".

This is fairly nonfunctional.  I've been chasing around all sorts of  
FPs that seem to hit pretty much every message that comes to me with  
source code inside it, and you've probably nailed every one of them  
on the head.  I didn't realize that they were related at the SA level.

I need to completely disable this over-opportunistic behavior.  90%  
of my e-mails have either system output, or are concerning code  
segments or router interfaces, etc, etc.  I need these mails to get  
through.

At the very least, common collisions like script.pl need to be disabled.

-- 
Jo Rhett
Net Consonance : consonant endings by net philanthropy, open source  
and other randomness