Posted to users@spamassassin.apache.org by Terry Carmen <te...@cnysupport.com> on 2011/03/21 18:07:19 UTC

Regex Rule Help?

I'm trying to match any URL that points to a URL shortener.

They typically consist of http(s) followed by a domain name, a slash  
and a small series of alphanumeric characters, *without a trailing "/"  
or file extension*.

I seem to be having pretty good luck matching the URL, but I can't
figure out how to make the regex explicitly *not* match anything that
ends in a slash or contains an extension.

For example, I want to match "http://asdf.ghi/j2kj4l23", but not  
"http://asdf.ghi/j2kj4l23/abc.html" or "http://asdf.ghi/j2kj4l23/"

I tried using the Perl negative look-ahead as both (?!/) and (?!\/)
without success.

Can anybody toss me a clue?

Thanks!

Terry


Re: Regex Rule Help?

Posted by Adam Katz <an...@khopis.com>.
On 03/21/2011 10:07 AM, Terry Carmen wrote:
> I'm trying to match any URL that points to a URL shortener.
> 
> They typically consist of http(s) followed by a domain name,
> a slash and a small series of alphanumeric characters,
> *without a trailing "/" or file extension*.
> 
> I seem to be having pretty good luck matching the URL, however I
> can't figure out how to make the regex explicitly *not* match
> anything that ends in a slash or contains an extension.
> 
> For example, I want to match "http://asdf.ghi/j2kj4l23", but not 
> "http://asdf.ghi/j2kj4l23/abc.html" or "http://asdf.ghi/j2kj4l23/"

In this specific case, I think you want a simple end-of-string anchor ($):

uri  ASDF_GHI_SHORT  m'^http://asdf\.ghi/[\w-]{1,12}$'i

In order to match  http://asdf.ghi/j2kj4l23#mno  you might want:

uri  ASDF_GHI_SHORT  m'^http://asdf\.ghi/[\w-]{1,12}(?:[^/.\w-]|$)'i

(I used m'' instead of // so I didn't have to escape the slashes.  Other
punctuation can be used as the delimiter in the same way, but the leading
"m" can be omitted only when the delimiters are slashes, as in //.)
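
As a quick sanity check, the two patterns above can be exercised outside SpamAssassin. The sketch below uses Python's re module rather than Perl, on the assumption that these particular constructs (anchors, character classes, bounded quantifiers, case-insensitive matching) behave the same in both; the names STRICT, LOOSE and classify are made up for illustration.

```python
import re

# Python translations of the two rules above (assumed equivalent for
# these constructs; SpamAssassin itself compiles them as Perl regexes).
STRICT = re.compile(r'^http://asdf\.ghi/[\w-]{1,12}$', re.IGNORECASE)
LOOSE = re.compile(r'^http://asdf\.ghi/[\w-]{1,12}(?:[^/.\w-]|$)', re.IGNORECASE)

urls = [
    "http://asdf.ghi/j2kj4l23",           # bare short link
    "http://asdf.ghi/j2kj4l23/",          # trailing slash
    "http://asdf.ghi/j2kj4l23/abc.html",  # deeper path with extension
    "http://asdf.ghi/j2kj4l23#mno",       # short link plus fragment
]

def classify(pattern, candidates):
    """Return one boolean per URL: True where the pattern matches."""
    return [bool(pattern.search(u)) for u in candidates]

strict_hits = classify(STRICT, urls)
loose_hits = classify(LOOSE, urls)

for url, strict, loose in zip(urls, strict_hits, loose_hits):
    print(f"{url:40} strict={strict} loose={loose}")
```

Only the second pattern accepts the "#mno" form; both reject the trailing slash and the ".html" path.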

> I tried using the perl negative look-ahead as both : (?!/) and
> (?!\/) without success.

As to using a negative look-ahead operator:  Though I'm not exactly sure
about when it's needed, you sometimes have to put something after it,
like  /foo(?!bar)(?:.|$)/  ... this is not mentioned in the spec.
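
One likely reason a bare (?!/) appears to do nothing after a quantified character class is backtracking: the engine can give back the last character of the token, so the look-ahead ends up inspecting that character instead of the slash. The original rule was not posted, so the "naive" pattern below is only a guess at its shape, and the example is written in Python's re module on the assumption that look-aheads behave the same way there as in Perl.

```python
import re

url = "http://asdf.ghi/j2kj4l23/"

# Hypothetical rule with a bare look-ahead. The quantifier can shrink to
# "j2kj4l2", so the look-ahead sees "3" (not "/") and succeeds.
naive = re.compile(r'^http://asdf\.ghi/[\w-]{1,12}(?!/)', re.IGNORECASE)

# Also forbid further token characters and "." in the look-ahead, so the
# token cannot be shortened to dodge the check.
guarded = re.compile(r'^http://asdf\.ghi/[\w-]{1,12}(?![\w./-])', re.IGNORECASE)

naive_match = naive.search(url)
guarded_match = guarded.search(url)

print("naive:", naive_match.group(0) if naive_match else None)
print("guarded:", guarded_match.group(0) if guarded_match else None)
```

The guarded form still matches the bare short link and the "#mno" variant, since neither end-of-string nor "#" is in the forbidden set.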


Re: Regex Rule Help?

Posted by Bowie Bailey <Bo...@BUC.com>.
On 3/21/2011 1:07 PM, Terry Carmen wrote:
> I'm trying to match any URL that points to a URL shortener.
>
> They typically consist of http(s) followed by a domain name, a slash
> and a small series of alphanumeric characters, *without a trailing "/"
> or file extension*.
>
> I seem to be having pretty good luck matching the URL, however I can't
> figure out how to make the regex explicitly *not* match anything that
> ends in a slash or contains an extension.
>
> For example, I want to match "http://asdf.ghi/j2kj4l23", but not
> "http://asdf.ghi/j2kj4l23/abc.html" or "http://asdf.ghi/j2kj4l23/"
>
> I tried using the perl negative look-ahead as both : (?!/) and (?!\/)
> without success.
>
> Can anybody toss me a clue?

Show us your current rule and we can tell you what you are doing wrong.

-- 
Bowie

Re: Regex Rule Help?

Posted by Martin Gregorie <ma...@gregorie.org>.
On Mon, 2011-03-21 at 13:07 -0400, Terry Carmen wrote:
> I'm trying to match any URL that points to a URL shortener.
> 
> They typically consist of http(s) followed by a domain name, a slash  
> and a small series of alphanumeric characters, *without a trailing "/"  
> or file extension*.
> 
> I seem to be having pretty good luck matching the URL, however I can't
> figure out how to make the regex explicitly *not* match anything that
> ends in a slash or contains an extension.
> 
> For example, I want to match "http://asdf.ghi/j2kj4l23", but not  
> "http://asdf.ghi/j2kj4l23/abc.html" or "http://asdf.ghi/j2kj4l23/"
> 
> I tried using the perl negative look-ahead as both : (?!/) and (?!\/)  
> without success.
> 
> Can anybody toss me a clue?
Have you looked at the DecodeShortURLs plugin? That would seem to do
what you need *and* check whether the shortened URL points to anything
harmful.


Martin