Posted to users@spamassassin.apache.org by Dan Patnode <da...@patnode.net> on 2006/04/24 06:18:44 UTC

URI Basics

Another Newbie question here,

So URIs find links in the body.  I'm trying to get a handle on URI  
syntax and have found several disparate examples:


1) uri HTTP_CTRL_CHARS_HOST		/^https?\:\/\/[^\/\s]*[\x00-\x08\x0b\x0c\x0e-\x1f]/

2) uri NORMAL_HTTP_TO_IP		m{^https?://\d+\.\d+\.\d+\.\d+}i

3) uri URI_4YOU			m@^(?:https?://|mailto:)[^\/]*4you@i

4) uri HTTP_77			/http:\/\/.{0,2}\%77/

5) uri BARGAIN_URL		/bargain([sz]|-\S+)?\.(?:com|biz)/

6) uri URI_OFFERS			m/offer([sz]|-\S+)?\.(?:com|bi?z)/i

7) uri URI_AFFILIATE		/aff\w+id=/i


I have a few questions and welcome other tips.  What do m{, m/, and  
m@ mean?  Are m||, m(), and m{} interchangeable or does each mean  
something different?  Does it matter if the ^ is on the outside (3)  
or the inside (1&2) of the beginning?  I see the value of URIs with  
5-7, where an anchor is not needed.  Is there an improvement over rawbody  
when http is used as in 1-4?

Thanks,
Dan

Re: URI Basics

Posted by Ramprasad <ra...@netcore.co.in>.
> There is definitely a VERY significant performance penalty to using
> rawbody over URI, for any rule.
> 
> Consider the size of input. A rawbody regex must be run against the
> entire text of the body after QP decoding. A uri regex must be run
> against all the text of the URIs that SA found. There is likely to be at
> least a 100:1 difference in size of input. There's no "penalty" for
> using a uri rule, as SA will always extract all the URIs and build the
> input text, even if you aren't using it.
> 
> However, there are some cases where rawbody is useful, particularly when
> you want to examine the formatting of newlines inserted into an HTML tag.
> 
> rawbody is also useful when you're looking for a "new trick" that
> obfuscates URIs in such a way that SA can't parse them, but Outlook can
> still open them. This used to be common enough that most folks used
> rawbody for all their URI type rules. However, nowadays most of them are
> caught.

Will a uri rule catch a URL in a plain-text message that is not hyperlinked?
Sending a plain-text URL still serves the spammer's purpose, because I think
mail clients hyperlink plain-text URLs themselves.

Thanks
Ram


Re: URI Basics

Posted by Theo Van Dinter <fe...@apache.org>.
On Mon, Apr 24, 2006 at 05:18:23PM -0700, Dan wrote:
> Are you saying that in URIs, any character (@ in this case) can serve  
> as the delimiter, so long as it displays after the m and again at the  
> end of the entry?

Yes.  Take a look at the perlre and perlop (specifically the m// operator)
documentation.  "Mastering Regular Expressions" from O'Reilly may be a
good book to read as well. :)

-- 
Randomly Generated Tagline:
I'll just change into someone more comfortable.

Re: URI Basics

Posted by Dan <a...@patnode.net>.
Gentlemen,

Thank you for all the great input.


> Specifically, you're learning perl regular expressions, and perl is  
> a language that gives you a million different ways to skin a cat,  
> so to speak.  As the quote goes "all things are permissible, but  
> not all things are beneficial".
>
> It's also a programming language that many people tend to describe  
> as looking like line noise (if you ever used an old dial up line,  
> in terminal mode instead of as a SLIP/PPP link, you may actually  
> get the joke ... especially if you had call waiting turned on).

I'm new to regex and SA (and open source for that matter) but I'm  
actually old school tech.  I remember well, the thrill of upgrading  
from 2400 to 14.4k bps.  One of my fondest tech memories is bringing  
online my own ISDN based 56k RAS in the 90's.  And that thing has a  
CLI that would make SA blush.


> Between the two, yes, it feels very unstructured.
>
> In addition to the other book that was recommended, it might be a  
> good idea to pick up Learning Perl.  It's easier to understand a  
> thing when you know how it thinks.

I know what you mean.  Being new to both, it's been tough not knowing  
when regex ends and SA begins.  I'm used to being able to make  
systems sing so coming in cold to a system this big and well  
established (even while understanding the principles being used) is  
intimidating.  With your help, I'll have SA breaking a sweat in no time.


Dan

Re: URI Basics

Posted by Theo Van Dinter <fe...@apache.org>.
On Mon, Apr 24, 2006 at 09:27:47PM -0400, Matt Kettler wrote:
> > Is URI the way to go when tracking obfuscation, as in:
> > uri __LINKAGE_A284 m{@%77%77w}i

Yes.  The uri rules run over both the raw version and the decoded versions.

> Neither of the above will work. Both uri and rawbody rules are run
> after QP (and base 64) decoding is done.

FWIW, the character encoding (w = %77) isn't QP or base64; it's URI percent-encoding.

> There are some proposals to have a more configurable set of choices but
> right now "raw" is really "half cooked", and uri is "fully cooked" just
> like body.

uri is a large array of all the URIs found in the mail.  For each raw
one found in the mail, SA goes through and "canonicalizes" it (removes
obfuscation, finds redirector patterns, etc.), and then all of those
(raw and canonical) are run through the uri rules.
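
As a rough sketch of what that means for a rule author (the rule name
below is hypothetical, not part of any shipped ruleset): if the raw form
of each URI is in the list, a pattern written against the literal
percent-encoded text should still have something to match.

```
# Hypothetical sub-rule; the name is illustrative only.
# Assumes, per the explanation above, that the raw (still percent-encoded)
# URI is tested alongside its canonicalized form.
# Matches a literal "%77" (an encoded "w") before the first slash after the scheme.
uri __LOCAL_PCT_ENCODED_W   m{^https?://[^/]*%77}i
```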

-- 
Randomly Generated Tagline:
"Well, last time I checked, I wasn't a trout ..." - rei.com radio ad

Re: URI Basics

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
> Follow up question:
>
> Is URI the way to go when tracking obfuscation, as in:
> uri __LINKAGE_A284 m{@%77%77w}i
>
> ...or will URI's translation get in the way, requiring something more
> like?:
> rawbody __LINKAGE_A284 m{@%77%77w}i
>
Neither of the above will work. Both uri and rawbody rules are run
after QP (and base 64) decoding is done.

There are some proposals to have a more configurable set of choices but
right now "raw" is really "half cooked", and uri is "fully cooked" just
like body.



Re: URI Basics

Posted by Dan <a...@patnode.net>.
Follow up question:

Is URI the way to go when tracking obfuscation, as in:
uri __LINKAGE_A284 m{@%77%77w}i

...or will URI's translation get in the way, requiring something more  
like?:
rawbody __LINKAGE_A284 m{@%77%77w}i

Thanks,
Dan


Re: URI Basics

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
>> In 3 ^ is the first character of the regex, just as it is in 1 and 2. It
>> is also inside the delimiters, just like 1 and 2. In example 3 @ is
>> being used as a delimiter,  and ^ is the first character after it.
>
> Are you saying that in URIs, any character (@ in this case) can serve
> as the delimiter, so long as it displays after the m and again at the
> end of the entry?
Well, any non-alphanumeric, non-whitespace character can be used, i.e. any
punctuation.

Actually, this is true of ANY SA rule, not just URIs. The use of
m to set up a regex delimiter is just part of the perl regex syntax,
which SA supports all of. It's called the "match operator".

So these are all equivalent:
 /foo/
m/foo/
m!foo!

Just be wary of what you use as a delimiter. Choosing something other
than / should only be done to make things easier to read. It also
overrides that character's normal uses until the end of the regex.

You can find a lot of detail about using the match operator (m) for this
purpose in section 7.4.3 of:

http://www.unix.org.ua/orelly/perl/learn/ch07_04.htm

(note: that page is oriented toward general perl programming, so a lot of
things in there are not so relevant.)
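
To make the delimiter point concrete, here is a minimal sketch of one
pattern written three ways. The rule names and the example.com host are
made up for illustration; all three should match the same URIs.

```
# Hypothetical local rules; names and hostname are illustrative only.
# Default / delimiter: every literal slash in the pattern must be escaped.
uri LOCAL_DELIM_SLASH   /^https?:\/\/www\.example\.com\//i
# m{} delimiter: slashes need no escaping.
uri LOCAL_DELIM_BRACE   m{^https?://www\.example\.com/}i
# m!! delimiter: same pattern again; a literal ! would now need escaping.
uri LOCAL_DELIM_BANG    m!^https?://www\.example\.com/!i
```

The second and third forms are only there because they are easier to
read when the pattern is full of slashes.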


>
> I'm beginning to realize how many of my learning curve issues are
> attempts to understand the very structure of a system created with a
> bare minimum of structure.
Heh, it's not that bad, but there are a lot of advanced quirks you'll
see people using from their knowledge of heavy perl wizardry.
>
>
>> There is definitely a VERY significant performance penalty to using
>> rawbody over URI, for any rule.
>>
>> Consider the size of input. A rawbody regex must be run against the
>> entire text of the body after QP decoding. A uri regex must be run
>> against all the text of the URIs that SA found. There is likely to be at
>> least a 100:1 difference in size of input. There's no "penalty" for
>> using a uri rule, as SA will always extract all the URIs and build the
>> input text, even if you aren't using it.
>
> Great information Matt, thanks. 
No problem.

Re: URI Basics

Posted by John Rudd <jr...@ucsc.edu>.
On Apr 24, 2006, at 5:18 PM, Dan wrote:
> I'm beginning to realize how many of my learning curve issues are 
> attempts to understand the very structure of a system created with a 
> bare minimum of structure.

Specifically, you're learning perl regular expressions, and perl is a 
language that gives you a million different ways to skin a cat, so to 
speak.  As the quote goes "all things are permissible, but not all 
things are beneficial".

It's also a programming language that many people tend to describe as 
looking like line noise (if you ever used an old dial up line, in 
terminal mode instead of as a SLIP/PPP link, you may actually get the 
joke ... especially if you had call waiting turned on).

Between the two, yes, it feels very unstructured.

In addition to the other book that was recommended, it might be a good 
idea to pick up Learning Perl.  It's easier to understand a thing when 
you know how it thinks.


Re: URI Basics

Posted by Matt Kettler <mk...@comcast.net>.
Dan Patnode wrote:
> Another Newbie question here,
>
> So URIs find links in the body.  I'm trying to get a handle on URI
> syntax and have found several disparate examples:
>
>
> 1) uri HTTP_CTRL_CHARS_HOST        /^https?\:\/\/[^\/\s]*[\x00-\x08\x0b\x0c\x0e-\x1f]/
>
> 2) uri NORMAL_HTTP_TO_IP        m{^https?://\d+\.\d+\.\d+\.\d+}i
>
> 3) uri URI_4YOU            m@^(?:https?://|mailto:)[^\/]*4you@i
>
> 4) uri HTTP_77            /http:\/\/.{0,2}\%77/
>
> 5) uri BARGAIN_URL        /bargain([sz]|-\S+)?\.(?:com|biz)/
>
> 6) uri URI_OFFERS            m/offer([sz]|-\S+)?\.(?:com|bi?z)/i
>
> 7) uri URI_AFFILIATE        /aff\w+id=/i
>
>
> I have a few questions and welcome other tips.  What do m{, m/, and m@
> mean?  
Those are the "match" operator. It's basically used so you can use
something other than / to delimit the start and end of your regex. It is
very common to do this for URIs so you can do http:// instead of having
to escape it into http:\/\/, as in example 4.

Why example 6 uses m/ is beyond me, as / is the default.

> Are m||, m(), and m{} interchangeable or does each mean something
> different?  
Interchangeable
> Does it matter if the ^ is on the outside (3) or the inside (1&2) of
> the beginning?
In 3 ^ is the first character of the regex, just as it is in 1 and 2. It
is also inside the delimiters, just like 1 and 2. In example 3 @ is
being used as a delimiter,  and ^ is the first character after it. You
can't put a ^ outside your delimiter and have it act as an anchor.
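
A small sketch of what the anchor changes, using hypothetical rule names
(not from the stock ruleset) and a trimmed-down version of example 3:

```
# Anchored at the start of the URI: "4you" must appear before the first
# slash that follows the scheme, i.e. roughly in the host part.
uri LOCAL_4YOU_HOST   m@^https?://[^/]*4you@i
# Unanchored: "4you" may appear anywhere in the URI, including the path
# or query string.
uri LOCAL_4YOU_ANY    /4you/i
```

In both rules the ^ (where present) sits inside the delimiters, right
after the opening one.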
> I see the value of URIs with 5-7 so an anchor is not needed,
I don't believe the use of anchors is a significant performance penalty.
In general, they may actually cause a rule to run faster than one
without. That said, make your choice about anchors based on accuracy
needs, not performance.
> is there an improvement over rawbody when http is used as in 1-4? 

There is definitely a VERY significant performance penalty to using
rawbody over URI, for any rule.

Consider the size of input. A rawbody regex must be run against the
entire text of the body after QP decoding. A uri regex must be run
against all the text of the URIs that SA found. There is likely to be at
least a 100:1 difference in size of input. There's no "penalty" for
using a uri rule, as SA will always extract all the URIs and build the
input text, even if you aren't using it.
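
For comparison, here is the same pattern sketched as both rule types.
The rule names are hypothetical; the pattern is borrowed from example 2.

```
# Runs only against the URIs SA extracted from the message.
uri      LOCAL_HTTP_TO_IP_URI   m{^https?://\d+\.\d+\.\d+\.\d+}i
# Runs against the decoded body text, which is much larger input.
# No ^ anchor here, since the URL can sit anywhere in the text.
rawbody  LOCAL_HTTP_TO_IP_RAW   m{https?://\d+\.\d+\.\d+\.\d+}i
```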

However, there are some cases where rawbody is useful, particularly when
you want to examine the formatting of newlines inserted into an HTML tag.

rawbody is also useful when you're looking for a "new trick" that
obfuscates URIs in such a way that SA can't parse them, but Outlook can
still open them. This used to be common enough that most folks used
rawbody for all their URI type rules. However, nowadays most of them are
caught.

>
> Thanks,
> Dan
>