Posted to users@spamassassin.apache.org by Craig Jackson <cj...@localsurface.com> on 2005/05/31 03:28:44 UTC
Top level domain test -- somewhat OT
Hi,
Our small business never receives mail from top level domains other than
com,net,org,mil,edu,gov,and us -- except spam. Additionally, we never
receive email with links containing other level domains -- except spam.
The logic is that we are small and do no business outside our geographic
area. So I wrote a body test for checking links that don't have these
top level domains:
m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?}
This I copied from the Spamassassin test for odd ports. The logic is
similar. However I have never seen some of this notation. And of course
the test doesn't work -- too many false positives.
1) What do the enclosing {} mean?
2) What is the ?<! supposed to do?
3) Does this work with line wrapped links?
4) Shouldn't the domains be separated by | instead of all enclosed in ()?
If you would point to a tutorial that covers this I would be grateful. I
have checked a few beginner regex sites and even read most of the regex
book, but don't remember this particular syntax.
Thanks,
Craig Jackson
Re: Top level domain test -- somewhat OT
Posted by Matt Kettler <mk...@comcast.net>.
At 09:38 PM 5/30/2005, Craig Jackson wrote:
>Craig Jackson wrote:
>>
>>m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?}
>>
>>This I copied from the Spamassassin test for odd ports. The logic is
>>similar. However I have never seen some of this notation. And of course
>>the test doesn't work -- too many false positives.
>>1) What do the enclosing {} mean?
They are the delimiters. Instead of using a pair of / to delimit the regex
they used curly braces. It's somewhat rare to see this done, but it's
sometimes convenient.
When you prefix with the match operator (that m at the beginning) you can
use almost any character you want as a delimiter instead of forward slash.
This way you can write http:// without having to escape it as http:\/\/ like
you would in most normal / delimited rules.
>>3) Does this work with line wrapped links?
If you do it as a uri rule, I think so. As a rawbody rule, no.
>But questions 1) and 3) above still haven't been answered.
>Thanks
Re: Top level domain test -- somewhat OT
Posted by Loren Wilton <lw...@earthlink.net>.
> > m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?}
> >
> One of the amazing things about posting to lists is that shortly after
> posting I usually find the answer to the question. Well, I've now
> learned something about negative look-ahead assertions that I did not
Actually that is a negative lookBEHIND assertion that they are using.
Negative lookAHEAD is (?!, not (?<!.
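To make the difference concrete, here is a small sketch in Python, whose re module accepts the same (?<! and (?! syntax as Perl for simple cases like these. The sample URLs and variable names are made up for illustration and are not part of any SpamAssassin rule.

```python
import re

# Negative lookBEHIND: the position must NOT be preceded by ".com".
not_after_com = re.compile(r"(?<!\.com)/index")
# Negative lookAHEAD: the position must NOT be followed by ".com".
not_before_com = re.compile(r"example(?!\.com)")

# "/index" follows ".org", so the lookbehind is satisfied.
lookbehind_hit = not_after_com.search("http://example.org/index") is not None
# "/index" follows ".com", so the lookbehind rejects the only candidate.
lookbehind_miss = not_after_com.search("http://example.com/index") is not None

# "example" is followed by ".org", so the lookahead is satisfied.
lookahead_hit = not_before_com.search("http://example.org/") is not None
# "example" is followed by ".com", so the lookahead rejects it.
lookahead_miss = not_before_com.search("http://example.com/") is not None

print(lookbehind_hit, lookbehind_miss, lookahead_hit, lookahead_miss)
```

Neither assertion consumes any text; each only tests what sits on one side of the current position.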
What this test is saying, in more or less English, is: Match 'http', possibly
followed by 's', and then followed by '://'. Then match everything up to a
/ or space, but don't be greedy about it. (which means, stop on the FIRST /
or space you find, not the last one.) Now that you are pointing at a / or
space, are the preceding characters not .com, and not .net, etc.
Then we get to the last part, which I suspect you added, since the coding
style is different, and it does some odd things. In fact, I'm
not at all sure exactly what the intent was here. I think perhaps it was
trying to look for a / optionally followed by a space after the url. But we
already know that there is a / or space here from the original non-greedy
match.
In any case, if that was the intent, it should have been coded as
"(?:/[^\s])?". The ?: after the ( says that you are only using the parens
as grouping, and not as a capturing group. This is MUCH faster, according
to the Perl pundits. You don't need a backslash in front of the slash in
this case, because the overall delimiter characters are {} instead of the
more common //. And you certainly don't want a backslash in front of the
[ character that is part of the character class, unless you wanted to
compare a literal [ character. In that case you would also need a backslash
in front of the ] character.
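A short sketch of both points, again using Python's re module as a stand-in since it treats ( ), (?: ) and \[ the same way here. The test strings are invented for illustration.

```python
import re

# Capturing group: the "/a" part is remembered as group 1.
capturing = re.search(r"example\.com(/[^\s])?", "example.com/a")
# Grouping only: the same text matches, but nothing is remembered.
grouping_only = re.search(r"example\.com(?:/[^\s])?", "example.com/a")

captured_groups = capturing.groups()
uncaptured_groups = grouping_only.groups()

# With the backslash, \[ is a literal "[" and the ^ that follows is an
# anchor rather than a class negation, so "/a" cannot match.
escaped_bracket = re.search(r"/\[^\s]", "/a")
# Without it, [^\s] is a character class meaning "any non-whitespace".
real_class = re.search(r"/[^\s]", "/a")

print(captured_groups, uncaptured_groups)
print(escaped_bracket, real_class.group(0))
```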
(I suspect that the appropriate match here would be simply "[/\s]" to match
the slash or space we know is here. Or more simply, just a dot. We don't
care what it matches, and we already have a pretty good idea of what it will
match.)
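Here is a rough way to try both variants outside SpamAssassin, written in Python because its re module handles these lookbehinds the same way. One thing worth checking when you run it: in the pattern as posted, nothing after the lookbehinds actually requires the / or space, so the non-greedy match can stop after a single hostname character, which may account for the false positives. Ending the pattern with "[/\s]" makes that requirement explicit. The helper name and sample URLs are assumptions for illustration only.

```python
import re

TLDS = ("com", "net", "org", "gov", "us", "edu", "mil")
# One negative lookbehind per accepted top level domain, as in the rule.
LOOKBEHINDS = "".join(rf"(?<!\.{tld})" for tld in TLDS)

# The pattern as posted, including the odd optional group at the end.
AS_POSTED = re.compile(r"https?://[^/\s]+?" + LOOKBEHINDS + r"(\/\[^\s])?")
# The same pattern ending in an explicit slash-or-whitespace match.
WITH_TERMINATOR = re.compile(r"https?://[^/\s]+?" + LOOKBEHINDS + r"[/\s]")


def matched_text(pattern, text):
    """Return the text the pattern matched, or None if it did not match."""
    m = pattern.search(text)
    return m.group(0) if m else None


# As posted, even a .com link matches: the match ends after one character.
print(matched_text(AS_POSTED, "http://www.example.com/page"))
# With the terminator, .com and .us links are left alone...
print(matched_text(WITH_TERMINATOR, "http://www.example.com/page"))
print(matched_text(WITH_TERMINATOR, "http://example.us/ offer"))
# ...while a link under another top level domain still matches.
print(matched_text(WITH_TERMINATOR, "http://spam.example.ru/x"))
```

This sketch still misses a link that ends the text with no trailing slash or space, and a host followed by a port number would match, so treat it as a starting point rather than a finished rule.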
Loren
Re: Top level domain test -- somewhat OT
Posted by Craig Jackson <cj...@localsurface.com>.
Craig Jackson wrote:
> Hi,
> Our small business never receives mail from top level domains other than
> com,net,org,mil,edu,gov,and us -- except spam. Additionally, we never
> receive email with links containing other level domains -- except spam.
> The logic is that we are small and do no business outside our geographic
> area. So I wrote a body test for checking links that don't have these
> top level domains:
>
>
> m{https?://[^/\s]+?(?<!\.com)(?<!\.net)(?<!\.org)(?<!\.gov)(?<!\.us)(?<!\.edu)(?<!\.mil)(\/\[^\s])?}
>
>
> This I copied from the Spamassassin test for odd ports. The logic is
> similar. However I have never seen some of this notation. And of course
> the test doesn't work -- too many false positives.
>
> 1) What do the enclosing {} mean?
> 2) What is the ?<! supposed to do?
> 3) Does this work with line wrapped links?
> 4) Shouldn't the domains be separated by | instead of all enclosed in ()?
>
> If you would point to a tutorial that covers this I would be grateful. I
> have checked a few beginner regex sites and even read most of the regex
> book, but don't remember this particular syntax.
>
One of the amazing things about posting to lists is that shortly after
posting I usually find the answer to the question. Well, I've now
learned something about negative look-ahead assertions that I did not
know about. But questions 1) and 3) above still haven't been answered.
Thanks