Posted to users@spamassassin.apache.org by Joseph Brennan <br...@columbia.edu> on 2008/05/28 15:11:53 UTC

uri rules

I was surprised that this rule...

  uri CU_CN_LINK      /http:..\w+\.cn\b/

matches not only this...

  <a href="http://foobar.cn">

but also this...

  <a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn Domain</a>


First, I did not realize that SpamAssassin's idea of "uri" includes not
only the uri, but the start tag, end tag, and all in between.  That's
useful but not real clear in Mail::SpamAssassin::Conf.

Second, I can't figure out how \w+ matches the punctuation and spaces!

Joseph Brennan
Columbia University I T



Re: uri rules

Posted by mouss <mo...@netoyen.net>.
Joseph Brennan wrote:
>
> I was surprised that this rule...
>
>  uri CU_CN_LINK      /http:..\w+\.cn\b/
>
> matches not only this...
>
>  <a href="http://foobar.cn">
>
> but also this...
>
>  <a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn 
> Domain</a>
>
>
> First, I did not realize that SpamAssassin's idea of "uri" includes not
> only the uri, but the start tag, end tag, and all in between.


It actually hits "Kuxun.cn" (not the href part). The reason is that some 
spammers put URIs in the text without the http part (and without href), 
so SA looks for those too.

The drawback is that uri checks may hit things that are not really 
domains, such as LDAP strings and program names (program.com).

>   That's
> useful but not real clear in Mail::SpamAssassin::Conf.
>
> Second, I can't figure out how \w+ matches the punctuation and spaces!

See above. Just run with -D and you'll see:
...
[73674] dbg: rules: ran uri rule CU_CN_LINK ======> got hit: 
"http://Kuxun.cn"
...
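
To make that concrete, here is a minimal sketch in Python rather than Perl (the pattern behaves the same for these ASCII inputs). It assumes, as the debug output suggests, that SA tests the rule against each extracted URI separately, with "http://" prepended to the schemeless one. The variable names are only for illustration.

```python
import re

# Python translation of the pattern from the CU_CN_LINK rule.
CU_CN_LINK = re.compile(r"http:..\w+\.cn\b")

# The raw HTML from the original message.
raw_html = ('<a href="http://www.columbia.edu/foo.html">'
            'KooXoo Buys Kuxun.cn Domain</a>')

# The two URIs SA is assumed to extract from it: the href, and the
# schemeless "Kuxun.cn" found in the text, with "http://" prepended.
candidate_uris = [
    "http://www.columbia.edu/foo.html",
    "http://Kuxun.cn",
]

# Only the second candidate matches the rule.
hits = [uri for uri in candidate_uris if CU_CN_LINK.search(uri)]
print(hits)

# Against the raw HTML the pattern finds nothing, because \w+ cannot
# cross the punctuation and spaces between "http://www" and ".cn".
print(CU_CN_LINK.search(raw_html))
```

So the hit comes from the second URI on its own, not from \w+ stretching across the tag.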


>
> Joseph Brennan
> Columbia University I T
>
>


Re: uri rules

Posted by Matt Kettler <mk...@verizon.net>.
Randy Ramsdell wrote:
>
>
> How so? How does spamassassin URI check determine Kuxun.cn  in a URI 
> as opposed to someone who forgot to add a "space" after a sentence end?
Well, CN is a rather strange word to start a sentence with, but 
SpamAssassin doesn't know the difference between an intentional domain 
and a missing space. SpamAssassin is no more selective than some email 
clients are. There's a "word" ending in a "." followed by a valid TLD, 
so it gets treated as a URI.

However, it shouldn't linkify things like "experiment.see", because 
"see" isn't a valid TLD.

> Is it because it is located within the "a" tag?
The "a" tag has nothing to do with it.

IIRC, the code that does this runs after all the HTML tags have been 
stripped out, so the tag cannot have anything to do with it (i.e., it 
runs on the same text that "body" rules see).






Re: uri rules

Posted by mouss <mo...@netoyen.net>.
Randy Ramsdell wrote:
> Matt Kettler wrote:
>> Joseph Brennan wrote:
>>>
>>> I was surprised that this rule...
>>>
>>>  uri CU_CN_LINK      /http:..\w+\.cn\b/
>>>
>>> matches not only this...
>>>
>>>  <a href="http://foobar.cn">
>>>
>>> but also this...
>>>
>>>  <a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn 
>>> Domain</a>
>>>
>>>
>>> First, I did not realize that SpamAssassin's idea of "uri" includes not
>>> only the uri, but the start tag, end tag, and all in between.  That's
>>> useful but not real clear in Mail::SpamAssassin::Conf.
>> Actually, it doesn't.. your second example has two URIs as far as 
>> SpamAssassin is concerned. "http://www.columbia.edu/foo.html" and 
>> "http://Kuxun.cn". Two separate URIs.
>>
>> Since many email clients "auto-link" domains in text portions, like 
>> www.google.com, SpamAssassin tries to find text strings that clients 
>> will treat as URIs and use them in the URI tests as well.
>>
>
> How so? How does spamassassin URI check determine Kuxun.cn  in a URI 
> as opposed to someone who forgot to add a "space" after a sentence 
> end? Is it because it is located within the "a" tag?

Try putting this
    "I often forget spaces.it happens to me all the time..."
in a message and run with -D. You'll see:

...
[74536] dbg: uridnsbl: domains to query: spaces.it
...
[74536] dbg: rules: ran uri rule __LOCAL_PP_NONPPURL ======> got hit: 
"http://spaces.it"
...

As you see, SA can't guess that a space is missing, so it checks the 
"resulting" URI anyway.

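A rough sketch of that idea, in Python for illustration only. This is not SA's actual code: the function name, the token pattern, and the tiny TLD set are all assumptions, and the real TLD list is much longer.

```python
import re

# Tiny illustrative TLD set (an assumption; the real list is far longer).
VALID_TLDS = {"com", "net", "org", "edu", "cn", "it"}

# A run of labels joined by dots, e.g. "spaces.it" or "www.example.com".
DOTTED_WORD = re.compile(r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def find_schemeless_uris(text):
    """Return 'http://'-prefixed URIs for dotted words ending in a known TLD."""
    uris = []
    for word in DOTTED_WORD.findall(text):
        tld = word.rsplit(".", 1)[1].lower()
        if tld in VALID_TLDS:
            uris.append("http://" + word)
    return uris


print(find_schemeless_uris(
    "I often forget spaces.it happens to me all the time..."))
print(find_schemeless_uris("KooXoo Buys Kuxun.cn Domain"))
print(find_schemeless_uris("Try the experiment.see what happens"))
```

The first two calls each yield one URI ("http://spaces.it" and "http://Kuxun.cn"); the third yields nothing, since "see" is not in the TLD set.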

Things get "tricky" when you want to hit things like
    Did you visit http://www.example.com/foo/bar?if so...
and you are looking for specific patterns in the "bar" part...
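
One way to read the problem, sketched in Python: if the URI is taken to run up to the next whitespace (an assumption for illustration, not necessarily what SA's parser does), the "?if" becomes part of it, and a pattern anchored on the end of "bar" no longer matches.

```python
import re
from urllib.parse import urlsplit

text = "Did you visit http://www.example.com/foo/bar?if so..."

# Naive extraction: the URI runs until the first whitespace character.
uri = re.search(r"https?://\S+", text).group(0)
parts = urlsplit(uri)

print(uri)          # the extracted URI, including "?if"
print(parts.path)   # the path component
print(parts.query)  # the accidental query string

# Anchoring on the end of the URI misses because "?if" got attached...
anchored_hit = bool(re.search(r"/bar$", uri))
# ...while allowing an optional query string still finds it.
tolerant_hit = bool(re.search(r"/bar(?:\?|$)", uri))
print(anchored_hit, tolerant_hit)
```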



>>>
>>> Second, I can't figure out how \w+ matches the punctuation and spaces!
>> It doesn't. :)
>>
>>
>


Re: uri rules

Posted by Randy Ramsdell <rr...@livedatagroup.com>.
Matt Kettler wrote:
> Joseph Brennan wrote:
>>
>> I was surprised that this rule...
>>
>>  uri CU_CN_LINK      /http:..\w+\.cn\b/
>>
>> matches not only this...
>>
>>  <a href="http://foobar.cn">
>>
>> but also this...
>>
>>  <a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn 
>> Domain</a>
>>
>>
>> First, I did not realize that SpamAssassin's idea of "uri" includes not
>> only the uri, but the start tag, end tag, and all in between.  That's
>> useful but not real clear in Mail::SpamAssassin::Conf.
> Actually, it doesn't.. your second example has two URIs as far as 
> SpamAssassin is concerned. "http://www.columbia.edu/foo.html" and 
> "http://Kuxun.cn". Two separate URIs.
>
> Since many email clients "auto-link" domains in text portions, like 
> www.google.com, SpamAssassin tries to find text strings that clients 
> will treat as URIs and use them in the URI tests as well.
>

How so? How does spamassassin URI check determine Kuxun.cn  in a URI as 
opposed to someone who forgot to add a "space" after a sentence end? Is 
it because it is located within the "a" tag?
>>
>> Second, I can't figure out how \w+ matches the punctuation and spaces!
> It doesn't. :)
>
>


Re: uri rules

Posted by mouss <mo...@netoyen.net>.
Joseph Brennan wrote:
>
> Thanks, Mouss and Matt.
>
> So a uri regexp will match a "http://" that is not there.  OK, well...
>

SA tries to check based on what MUAs do. If you write
    "please visit www.example.com"
then so-called "modern" MUAs will highlight www.example.com, and if you 
hover your mouse over it, you'll see that it points to 
http://www.example.com.

Even in the browser address bar, you can omit the "http://" part 
(browsers assume it as the default "scheme").

While this is sometimes annoying (and/or surprising), it works as 
intended most of the time, and that is what really matters.


Re: uri rules

Posted by Joseph Brennan <br...@columbia.edu>.
Thanks, Mouss and Matt.

So a uri regexp will match a "http://" that is not there.  OK, well...

Joe Brennan




Re: uri rules

Posted by Matt Kettler <mk...@verizon.net>.
Joseph Brennan wrote:
>
> I was surprised that this rule...
>
>  uri CU_CN_LINK      /http:..\w+\.cn\b/
>
> matches not only this...
>
>  <a href="http://foobar.cn">
>
> but also this...
>
>  <a href="http://www.columbia.edu/foo.html">KooXoo Buys Kuxun.cn 
> Domain</a>
>
>
> First, I did not realize that SpamAssassin's idea of "uri" includes not
> only the uri, but the start tag, end tag, and all in between.  That's
> useful but not real clear in Mail::SpamAssassin::Conf.
Actually, it doesn't. Your second example has two URIs as far as 
SpamAssassin is concerned: "http://www.columbia.edu/foo.html" and 
"http://Kuxun.cn". Two separate URIs.

Since many email clients "auto-link" domains in text portions, like 
www.google.com, SpamAssassin tries to find text strings that clients 
will treat as URIs and use them in the URI tests as well.

>
> Second, I can't figure out how \w+ matches the punctuation and spaces!
It doesn't. :)