Posted to users@spamassassin.apache.org by Craig Jackson <cj...@localsurface.com> on 2005/05/31 03:28:44 UTC

Top level domain test -- somewhat OT

Hi,
Our small business never receives mail from top level domains other than 
com, net, org, mil, edu, gov, and us -- except spam. Additionally, we never 
receive email with links containing other top level domains -- except spam. 
The logic is that we are small and do no business outside our geographic 
area. So I wrote a body test for checking links that don't have these 
top level domains:

 
m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?}

This I copied from the SpamAssassin test for odd ports. The logic is 
similar. However I have never seen some of this notation. And of course 
the test doesn't work -- too many false positives.

1) What do the enclosing {} mean?
2) What is the ?<! supposed to do?
3) Does this work with line wrapped links?
4) Shouldn't the domains be separated by | instead of all enclosed in ()?

If you would point to a tutorial that covers this I would be grateful. I 
have checked a few beginner regex sites and even read most of the regex 
book, but don't remember this particular syntax.

Thanks,
Craig Jackson

Re: Top level domain test -- somewhat OT

Posted by Matt Kettler <mk...@comcast.net>.
At 09:38 PM 5/30/2005, Craig Jackson wrote:
>Craig Jackson wrote:
>>
>>m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?} 
>>
>>This I copied from the SpamAssassin test for odd ports. The logic is 
>>similar. However I have never seen some of this notation. And of course 
>>the test doesn't work -- too many false positives.
>>1) What do the enclosing {} mean?

They are the delimiters. Instead of using a pair of / to delimit the regex 
they used curly braces. It's somewhat rare to see this done, but it's 
sometimes convenient.

When you prefix with the match operator (that m at the beginning) you can 
use almost any character you want as a delimiter instead of forward slash. 
This way you can write http:// without having to escape it as http:\/\/ like 
you would in a normal / delimited rule.



>>3) Does this work with line wrapped links?

If you do it as a uri rule, I think so. As a rawbody rule, no.


>But please questions 1) and 3) above I still haven't answered.
>Thanks


Re: Top level domain test -- somewhat OT

Posted by Loren Wilton <lw...@earthlink.net>.
> > m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?}
> >
> >
> One of the amazing things about posting to lists is that shortly after
> posting I usually find the answer to the question. Well, I've now
> learned something about negative look-ahead assertions that I did not

Actually that is a negative lookBEHIND assertion that they are using.
Negative lookAHEAD is (?!, not (?<!.
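
To make the difference concrete, here is a minimal sketch. It uses Python's
re module rather than Perl, on the assumption that the two behave the same
for these simple fixed-width assertions; the variable names and the
example.com / example.ru hosts are made up for illustration.

```python
import re

# Negative lookbehind: "the text just BEFORE this point is not '.com'".
not_after_com = re.compile(r"(?<!\.com)/")

# Negative lookahead: "the text just AFTER this point is not '.com'".
not_before_com = re.compile(r"example(?!\.com)")

print(bool(not_after_com.search("example.com/")))   # False
print(bool(not_after_com.search("example.ru/")))    # True
print(bool(not_before_com.search("example.com")))   # False
print(bool(not_before_com.search("example.ru")))    # True
```

Neither assertion consumes any characters; each only checks what sits next
to the current position.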

What this test is saying, in more or less English, is: Match 'http', possibly
followed by 's', and then followed by '://'.  Then match everything up to a
/ or space, but don't be greedy about it (which means, take as few characters
as it can, rather than as many as it can).  At the point where it stops, are
the preceding 4 characters not .com, and not .net, etc.?
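
One way to see where the false positives come from is to run the rule
against a couple of sample links. The sketch below does that with Python's
re module, assuming it treats this pattern the same way Perl does; the name
ORIGINAL and the sample URLs are only for illustration.

```python
import re

# The pattern from the original post, copied unchanged.
ORIGINAL = re.compile(
    r"https?://[^/\s]+?"
    r"(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)"
    r"(\/\[^\s])?"
)

for url in ("http://example.com/page", "http://example.ru/page"):
    m = ORIGINAL.search(url)
    # Show exactly which part of the link the rule matched, if any.
    print(url, "->", m.group(0) if m else None)
```

In this sketch both links match, and the matched text is only "http://e".
Nothing after the lookbehinds is required to match, so the non-greedy part
is satisfied after a single character, and at that point the preceding
characters are never one of the listed domains.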

Then we get to the last part, which I suspect you added, since the coding
style is different, and it does some odd things.  In fact, I'm
not at all sure exactly what the intent was here.  I think perhaps it was
trying to look for an optional / followed by a non-space character after the
url.  But we already know that there is a / or space here from the original
non-greedy match.

In any case, if that was the intent, it should have been coded as
"(?:/[^\s])?".  The ?: after the ( says that you are only using the parens
for grouping, and not as a capturing group.  This is MUCH faster, according
to the Perl pundits.  You don't need a backslash in front of the slash in
this case, because the overall delimiter characters are {} instead of the
more common //.  And you certainly don't want a backslash in front of the
[ character that starts the character class, unless you wanted to
compare a literal [ character.  In that case you would also need a backslash
in front of the ] character.
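
A small sketch of the two spellings side by side, again in Python's re on
the assumption that it reads these fragments the way Perl does (the names
as_posted and as_suggested are made up here):

```python
import re

# As posted: literal "/", literal "[", then a ^ anchor, whitespace, literal "]".
as_posted = re.compile(r"(\/\[^\s])")

# As suggested: "/" followed by one non-whitespace character, no capture.
as_suggested = re.compile(r"(?:/[^\s])")

print(as_posted.search("/page"))               # None
print(as_suggested.search("/page").group(0))   # /p
print(as_posted.groups, as_suggested.groups)   # 1 0
```

The escaped [ turns the intended character class into a literal bracket
followed by a start-of-string anchor, so in this sketch the posted form
cannot match anything that follows a host name.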

(I suspect that the appropriate match here would be simply "[/\s]" to match
the slash or space we know is here.  Or more simply, just a dot.  We don't
care what it matches, and we already have a pretty good idea of what it will
match.)
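
Putting that suggestion together, a rough sketch of the rule with "[/\s]" in
place of the final group might look like the following. It is written for
Python's re, assuming Perl treats the pattern the same way; ODD_TLD and
has_odd_tld are hypothetical names, and this has not been tried as an actual
SpamAssassin rule.

```python
import re

# The original lookbehinds, followed by "[/\s]" instead of the final group.
# Requiring a slash or whitespace here forces the non-greedy host match to
# run to the end of the host name before the lookbehinds are checked.
ODD_TLD = re.compile(
    r"https?://[^/\s]+?"
    r"(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)"
    r"[/\s]"
)

def has_odd_tld(text):
    """Return True if a link's host part does not end in a listed domain."""
    return ODD_TLD.search(text) is not None

print(has_odd_tld("see http://example.com/page"))   # False
print(has_odd_tld("see http://example.ru/page"))    # True
print(has_odd_tld("see http://example.ru"))         # False
```

The last case shows a limitation: a link at the very end of the text, with
no slash or whitespace after it, is not checked at all. A link with an
explicit port, such as http://example.com:8080/, would also be flagged,
because the characters before the slash are then the port number.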

        Loren


Re: Top level domain test -- somewhat OT

Posted by Craig Jackson <cj...@localsurface.com>.
Craig Jackson wrote:
> Hi,
> Our small business never receives mail from top level domains other than 
> com, net, org, mil, edu, gov, and us -- except spam. Additionally, we never 
> receive email with links containing other top level domains -- except spam. 
> The logic is that we are small and do no business outside our geographic 
> area. So I wrote a body test for checking links that don't have these 
> top level domains:
> 
> 
> m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?} 
> 
> 
> This I copied from the SpamAssassin test for odd ports. The logic is 
> similar. However I have never seen some of this notation. And of course 
> the test doesn't work -- too many false positives.
> 
> 1) What do the enclosing {} mean?
> 2) What is the ?<! supposed to do?
> 3) Does this work with line wrapped links?
> 4) Shouldn't the domains be separated by | instead of all enclosed in ()?
> 
> If you would point to a tutorial that covers this I would be grateful. I 
> have checked a few beginner regex sites and even read most of the regex 
> book, but don't remember this particular syntax.
> 

One of the amazing things about posting to lists is that shortly after 
posting I usually find the answer to the question. Well, I've now 
learned something about negative look-ahead assertions that I did not 
know about. But I still haven't found answers to questions 1) and 3) above.
Thanks