Posted to users@spamassassin.apache.org by Robert Boyl <ro...@gmail.com> on 2017/09/08 16:24:37 UTC

Ends with string

Hello, everyone!

Is there a way to create a SpamAssassin rule that checks for a certain URL
suffix such as .ru, but only when it is at the very end of the URI? In other
words, an "ends with string" test.

Thanks!
Rob

Re: Ends with string

Posted by Benny Pedersen <me...@junc.eu>.

Robert Boyl wrote on 2017-09-08 18:24:

> Is there a way to create a Spamassassin rule that checks for a certain
> URL suffix such as .ru but makes sure it has to be at the end of the
> URI? Ends with string.

Do you just want to match a TLD?

In that case, read:

perldoc Mail::SpamAssassin::Conf (see the enlist_* settings)

http://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Conf.html

Re: Ends with string

Posted by sh...@shanew.net.

If I recall correctly (and it's been a while), I was seeing false
positives where t.co was matching t.com (or something like that) so I
was only paying attention to the need to not allow an alpha-num.
Short-sighted, I know (and I might have forgotten that \b isn't a
character match).

The regex I use to anchor tlds these days (and please tell me if this
doesn't work the way I intend) looks like:

uri  NEWTLD_URI  /\.(accountant|beer|bid|......|win|work|xyz)\b[^\.-]/i

I have slightly different regexes to match email addresses or server
names in headers, but they all basically express the rule "I need to
see a word boundary here, but certain non-word characters don't count
because it implies the domain name may continue in the given context"

On Fri, 8 Sep 2017, RW wrote:

> On Fri, 8 Sep 2017 13:03:57 -0400
> Kevin A. McGrail wrote:
>
>> On 9/8/2017 12:24 PM, Robert Boyl wrote:
>>> Hello, everyone!
>>>
>>> Is there a way to create a Spamassassin rule that checks for a
>>> certain URL suffix such as .ru but makes sure it has to be at the
>>> end of the URI? Ends with string.
>>>
>>> Thanks!
>>> Rob
>>
>> Yes, it's called an anchor and Shane Williams a long time ago gave me
>> some advice on that I used in this rule:
>>
>> uri             __KAM_SHORT
>> /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i
>
> That doesn't look right, at least not in the context of the OP's
> question.
>
> In  (\/|$|\b)  the \b seems superfluous as it will match a boundary
> between a letter and a '.' so the rule will for example match
>
> goo.gl.example.com
>

-- 
Public key #7BBC68D9 at            |                 Shane Williams
http://pgp.mit.edu/                |      System Admin - UT CompSci
=----------------------------------+-------------------------------
All syllogisms contain three lines |              shanew@shanew.net
Therefore this is not a syllogism  | www.ischool.utexas.edu/~shanew

Re: Ends with string

Posted by RW <rw...@googlemail.com>.

On Fri, 8 Sep 2017 13:03:57 -0400
Kevin A. McGrail wrote:

> On 9/8/2017 12:24 PM, Robert Boyl wrote:
> > Hello, everyone!
> >
> > Is there a way to create a Spamassassin rule that checks for a
> > certain URL suffix such as .ru but makes sure it has to be at the
> > end of the URI? Ends with string.
> >
> > Thanks!
> > Rob  
> 
> Yes, it's called an anchor and Shane Williams a long time ago gave me 
> some advice on that I used in this rule:
> 
> uri             __KAM_SHORT 
> /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i

That doesn't look right, at least not in the context of the OP's
question. 

In  (\/|$|\b)  the \b seems superfluous as it will match a boundary
between a letter and a '.' so the rule will for example match

 goo.gl.example.com
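
To make that concrete, here is a small sketch that runs the quoted pattern through Python's re module. The assumption is that Python and Perl treat \b, ^ and $ the same way for these inputs; the helper name hits() is only for illustration.

```python
import re

# The __KAM_SHORT pattern quoted in this thread, transcribed unchanged.
KAM_SHORT = re.compile(
    r"(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com"
    r"|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)",
    re.IGNORECASE,
)

def hits(uri: str) -> bool:
    """True if the pattern matches anywhere in the URI (unanchored search)."""
    return KAM_SHORT.search(uri) is not None

print(hits("http://goo.gl/abc"))           # the intended kind of hit
print(hits("http://goo.gl.example.com/"))  # also hits: \b sits between 'l' and '.'
print(hits("http://t.com/"))               # no hit: no boundary between 'o' and 'm'
```

So the trailing \b does stop t.co from matching t.com, but it does not stop a listed name from matching as the leading part of a longer host name.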

Re: Ends with string

Posted by RW <rw...@googlemail.com>.

On Fri, 15 Sep 2017 15:46:31 -0500 (CDT)
shanew@shanew.net wrote:


> So, my rule for just matching TLDs looks like:
> 
> uri __TEST_URLS  /\.(vn|pl|my|lu|vn|ar)\b[^\.-]/i
> 
> The "\b" part excludes the letters, numbers and underscore because
> those wouldn't be a word boundary.  The "[^\.-]" part excludes the
> hyphen and literal "." from being on the right side of that word
> boundary.

note that [^\.-] has to match a character after the tld so it wouldn't
match "http://example.vn"
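
A quick way to see this, again using Python's re as a stand-in for Perl. The second pattern is a hypothetical alternative that is not from the thread: a negative lookahead rejects the same continuation characters without needing a character to follow the TLD.

```python
import re

TLDS = "vn|pl|my|lu|ar"

# Pattern from the thread: a real character must follow the TLD.
consuming = re.compile(r"\.(" + TLDS + r")\b[^\.-]", re.IGNORECASE)

# Hypothetical alternative (an assumption, not from the thread): the
# lookahead rejects word characters, '.' and '-' but also succeeds at
# the very end of the string.
lookahead = re.compile(r"\.(?:" + TLDS + r")(?![\w.-])", re.IGNORECASE)

for uri in ("http://example.vn", "http://example.vn/", "http://example.vn.com/"):
    print(uri, bool(consuming.search(uri)), bool(lookahead.search(uri)))
```

Only the first URI separates the two patterns; both accept the second and reject the third.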
 

> And now that I'm looking at it, I'm wondering if it would match a
> URI like "https://legit.domain.com/great.beer/" ("beer" being one of
> the TLDs my rule contains).  

Yes it would; you can use something like ^[a-z]+:\/\/[^\/]* at the
beginning to avoid that.
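
A rough sketch of that idea in Python's re, with the scheme separator written as "://". The two-entry TLD list and the trailing lookahead are assumptions added for illustration, not something posted in the thread.

```python
import re

# Consume the scheme and everything up to the first '/', so the TLD
# alternation can only match inside the host part of the URI.
# 'beer' and 'vn' stand in for a longer list (illustrative only).
host_tld = re.compile(r"^[a-z]+:\/\/[^\/]*\.(?:beer|vn)(?![\w.-])", re.IGNORECASE)

print(bool(host_tld.search("https://legit.domain.com/great.beer/")))  # TLD only in path
print(bool(host_tld.search("https://great.beer/")))                   # TLD in host
```

This is not watertight: a URI with a query string but no slash after the host would still let the pattern reach past the host, which is one more reason to test the host name itself rather than the whole URI.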

An alternative is to use the URIDetail plugin and just test the domain.

https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Plugin_URIDetail.html
 

> Like I said, the enlist_uri method might be worth it just to avoid
> regular expressions.

In this case it is.

Re: Ends with string

Posted by sh...@shanew.net.

On Fri, 15 Sep 2017, Robert Boyl wrote:

> uri             __KAM_SHORT /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i
> 
> Seems a bit complicated.
> 
> My goal is a rule that checks that the suffixes are at the end of the URI.
> 
> uri __TEST_URLS /\b(\.vn|\.pl|\.my|\.lu|\.vn|\.ar)\b/i
> 
> I believe this does it, correct?
> 
> uri __TEST_URLS /\b(\.vn$|\.pl$|\.my$|\.lu$|\.vn$|\.ar$)\b/i

As Paul said, if you're just looking at uris, the enlist_uri might be
the better way to go.  And it has the advantage that you don't have to
use (some might say abuse) regular expressions.

I believe URIs as collected for the uri tests consist of more than
just the server part of the URI, but maybe I'm wrong (or maybe the
list includes the server part only as well as the full URI).  If I'm
correct, then using the "$" will not work where URIs have a local part
and might not work where there's only a trailing "/".
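
The regex side of that concern can be checked without SpamAssassin. This sketch only shows how the "$" version behaves on three sample strings in Python's re; whether the uri test ever sees a bare host name is a separate question it does not answer.

```python
import re

# The end-anchored pattern quoted above, transcribed unchanged.
ends_with = re.compile(r"\b(\.vn$|\.pl$|\.my$|\.lu$|\.vn$|\.ar$)\b", re.IGNORECASE)

for uri in ("http://example.vn", "http://example.vn/", "http://example.vn/page.html"):
    print(uri, bool(ends_with.search(uri)))
```

Only the first string matches; a trailing "/" or any path after the host defeats the "$" anchor.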

In the case where you're only looking at the TLD, you don't have to
worry about the front word boundary because you're explicitly
anchoring the front of the match with the "\." part.  At the end, you
need to make sure that you're not allowing characters that would
indicate the server part of the URI continues past your intended match
(to avoid things like matching "blah.com" when you're really trying to
match ".co").  In my estimation, the characters that might indicate
continuation of the URI are letters, numbers, underscores, hyphens,
and the literal ".".

So, my rule for just matching TLDs looks like:

uri __TEST_URLS  /\.(vn|pl|my|lu|vn|ar)\b[^\.-]/i

The "\b" part excludes the letters, numbers and underscore because
those wouldn't be a word boundary.  The "[^\.-]" part excludes the
hyphen and literal "." from being on the right side of that word
boundary.

And now that I'm looking at it, I'm wondering if it would match a
URI like "https://legit.domain.com/great.beer/" ("beer" being one of
the TLDs my rule contains).  Like I said, the enlist_uri method might
be worth it just to avoid regular expressions.
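
Here is a minimal check of those cases, using Python's re as a stand-in for Perl and a three-entry list chosen only for illustration. The host names are made up.

```python
import re

# The TLD rule above, shortened to three entries for illustration.
tld_rule = re.compile(r"\.(co|vn|beer)\b[^\.-]", re.IGNORECASE)

samples = {
    "http://blah.co/": True,              # TLD followed by '/'
    "http://blah.com/": False,            # 'm' continues the word, so \b fails
    "http://blah.co.uk/": False,          # '.' is excluded by [^\.-]
    "http://blah.co-op.example/": False,  # '-' is excluded as well
    "https://legit.domain.com/great.beer/": True,  # false positive in the path
}
for uri, expected in samples.items():
    assert bool(tld_rule.search(uri)) is expected, uri
```

The last entry confirms the worry: the rule has no notion of where the host ends, so a file or directory name in the path can trigger it.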


Re: Ends with string

Posted by Paul Stead <pa...@zeninternet.co.uk>.


On 15/09/2017, 20:59, "Paul Stead" <pa...@zeninternet.co.uk> wrote:



    On 15/09/2017, 20:57, "shanew@shanew.net" <sh...@shanew.net> wrote:

        If you're only looking at uris, it probably is (though I wonder a
        little about processing time between a long list of such entries and a
        single (if also long) regular expression).  I have rules for "bad"
        tlds that look in headers as well (Received, From, Env_From being the
        main ones), so these wouldn't help with that.  If there's something
        similar for those cases, I'd love to know about it.


    The following patch works for me:

    https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7354

… though not with Received


--
Paul Stead
Systems Engineer
Zen Internet

Re: Ends with string

Posted by Paul Stead <pa...@zeninternet.co.uk>.


On 15/09/2017, 20:57, "shanew@shanew.net" <sh...@shanew.net> wrote:

    If you're only looking at uris, it probably is (though I wonder a
    little about processing time between a long list of such entries and a
    single (if also long) regular expression).  I have rules for "bad"
    tlds that look in headers as well (Received, From, Env_From being the
    main ones), so these wouldn't help with that.  If there's something
    similar for those cases, I'd love to know about it.


The following patch works for me:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7354



Re: Ends with string

Posted by sh...@shanew.net.

On Fri, 15 Sep 2017, Paul Stead wrote:

> Something along the following still seems the easiest to read approach to me
> 
> enlist_uri_host (BADTLDS) vn
> 
> enlist_uri_host (BADTLDS) pl
> 
> enlist_uri_host (BADTLDS) my
> 
> enlist_uri_host (BADTLDS) lu
> 
> enlist_uri_host (BADTLDS) ar
> 
> header __TEST_URLS eval:check_uri_host_listed('BADTLDS')

If you're only looking at uris, it probably is (though I wonder a
little about processing time between a long list of such entries and a
single (if also long) regular expression).  I have rules for "bad"
tlds that look in headers as well (Received, From, Env_From being the
main ones), so these wouldn't help with that.  If there's something
similar for those cases, I'd love to know about it.


Re: Ends with string

Posted by Paul Stead <pa...@zeninternet.co.uk>.

Something along the following still seems the easiest to read approach to me

enlist_uri_host (BADTLDS) vn
enlist_uri_host (BADTLDS) pl
enlist_uri_host (BADTLDS) my
enlist_uri_host (BADTLDS) lu
enlist_uri_host (BADTLDS) ar

header __TEST_URLS eval:check_uri_host_listed('BADTLDS')
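
For anyone wondering why a host list sidesteps the regex problems discussed earlier, here is the idea in a few lines of Python. It is only an illustration of "parse out the host, then look at its last label", not SpamAssassin's implementation; BAD_TLDS and host_has_bad_tld are made-up names.

```python
from urllib.parse import urlsplit

BAD_TLDS = {"vn", "pl", "my", "lu", "ar"}  # same entries as the BADTLDS list

def host_has_bad_tld(uri: str) -> bool:
    """Check only the last label of the URI's host, ignoring path and query."""
    host = urlsplit(uri).hostname or ""
    return host.rstrip(".").rsplit(".", 1)[-1].lower() in BAD_TLDS

print(host_has_bad_tld("http://example.vn/index.html"))
print(host_has_bad_tld("https://legit.domain.com/great.vn/"))
```

Because the path never takes part in the comparison, a ".vn" inside a file or directory name cannot cause a hit.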

Paul

From: Robert Boyl <ro...@gmail.com>
Date: Friday, 15 September 2017 at 17:48
To: "users@spamassassin.apache.org" <us...@spamassassin.apache.org>
Subject: Re: Ends with string

Hi!

Thanks! I didn't find this info in the Writing Rules tutorial.

I see

uri             __KAM_SHORT /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i

Seems a bit complicated.

My goal is a rule that checks that the suffixes are at the end of the URI.

uri __TEST_URLS /\b(\.vn|\.pl|\.my|\.lu|\.vn|\.ar)\b/i

I believe this does it, correct?

uri __TEST_URLS /\b(\.vn$|\.pl$|\.my$|\.lu$|\.vn$|\.ar$)\b/i

Thanks.
Rob


Re: Ends with string

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

On 9/15/2017 12:48 PM, Robert Boyl wrote:
> Thanks! I didn't find this info in the Writing Rules tutorial.
Yeah, I rewrote the rule a bit already.  Thanks!

It's in the latest KAM.cf.

Re: Ends with string

Posted by Robert Boyl <ro...@gmail.com>.

Hi!

Thanks! I didn't find this info in the Writing Rules tutorial.

I see

uri             __KAM_SHORT
/(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i

Seems a bit complicated.

My goal is a rule that checks that the suffixes are at the end of the URI.

uri __TEST_URLS /\b(\.vn|\.pl|\.my|\.lu|\.vn|\.ar)\b/i

I believe this does it, correct?

uri __TEST_URLS /\b(\.vn$|\.pl$|\.my$|\.lu$|\.vn$|\.ar$)\b/i

Thanks.
Rob

2017-09-08 14:03 GMT-03:00 Kevin A. McGrail <ke...@mcgrail.com>:

> On 9/8/2017 12:24 PM, Robert Boyl wrote:
>
>> Hello, everyone!
>>
>> Is there a way to create a Spamassassin rule that checks for a certain
>> URL suffix such as .ru but makes sure it has to be at the end of the URI?
>> Ends with string.
>>
>> Thanks!
>> Rob
>>
>
> Yes, it's called an anchor and Shane Williams a long time ago gave me some
> advice on that I used in this rule:
>
> uri             __KAM_SHORT /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i
>
> Regards,
> KAM
>
>

Re: Ends with string

Posted by Benny Pedersen <me...@junc.eu>.

Kevin A. McGrail wrote on 2017-09-08 19:03:

> Yes, it's called an anchor and Shane Williams a long time ago gave me
> some advice on that I used in this rule:
> 
> uri             __KAM_SHORT
> /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i

Why make it complicated?

enlist_uri_host (MYTLD) ru
enlist_uri_host (MYTLD) dk

And I have forgotten my own rules for this list :=)

Googled:

https://lists.gt.net/spamassassin/devel/154398

Example 1:

enlist_uri_host (LOW) geocities.com
enlist_uri_host (MED) geocities.yahoo.com.br
enlist_uri_host (LOW) AutoFinanceUK.co.uk
enlist_uri_host (HIGH) blasdutro buckrea.com
enlist_uri_host (MED) True.com
enlist_uri_host (LOW) imageshack.us

and the corresponding rules:

header URI_HOST_LOW eval:check_uri_host_listed('LOW')
describe URI_HOST_LOW Host or domain found in URI is listed in the LOW list
tflags URI_HOST_LOW userconf noautolearn
score URI_HOST_LOW 1.5

header URI_HOST_MED eval:check_uri_host_listed('MED')
describe URI_HOST_MED Host or domain found in URI is listed in the MED list
tflags URI_HOST_MED userconf noautolearn
score URI_HOST_MED 4

header URI_HOST_HIGH eval:check_uri_host_listed('HIGH')
describe URI_HOST_HIGH Host or domain found in URI is listed in the HIGH list
tflags URI_HOST_HIGH userconf noautolearn
score URI_HOST_HIGH 12


Example 2:

blacklist_uri_host www.need-lust.com www.crave-lust
blacklist_uri_host sommerphantasie.com klick2go.com lucymeier.com
blacklist_uri_host www.replaceftpsmtp.com www.aectransfer.org
blacklist_uri_host epsore.com www.alveal.com
blacklist_uri_host reppsetinte.com preprotissit.com
blacklist_uri_host www.weinportale.de www.fasctvideos.cn
blacklist_uri_host www.dilcasino.com www.hotgoldgambling.net
blacklist_uri_host www.antos.si www.omegaic.net www.clickonevent.com
blacklist_uri_host www.exorcism.org www.eturning.com www.piramidasunca.ba
blacklist_uri_host 64.15.147.100
blacklist_uri_host bot.tormaxusa.net www.qtechna.si www.clecle.si
blacklist_uri_host www.ninadesign.co.nr constructionfiles.net aecfiles02.com
blacklist_uri_host filetransfer00.com filetransfer01.com filetransfer02.com
blacklist_uri_host filetransfer03.com filetransfer04.com filetransfer05.com
blacklist_uri_host filetransfer06.com filetransfer07.com filetransfer08.com
blacklist_uri_host filetransfer09.com

header URI_HOST_IN_BLACKLIST eval:check_uri_host_listed('BLACK')
describe URI_HOST_IN_BLACKLIST Host or domain found in URI is blacklisted
tflags URI_HOST_IN_BLACKLIST userconf noautolearn
score URI_HOST_IN_BLACKLIST 8

header URI_HOST_IN_WHITELIST eval:check_uri_host_listed('WHITE')
describe URI_HOST_IN_WHITELIST Host or domain found in URI is whitelisted
tflags URI_HOST_IN_WHITELIST userconf nice noautolearn
score URI_HOST_IN_WHITELIST -10


Example 3:

enlist_uri_host (RCKT) ru !aaa.example.kr cn kr tr
header URI_HOST_RCKT eval:check_uri_host_listed('RCKT')
score URI_HOST_RCKT 0.1

enlist_uri_host (RU) ru
header URI_HOST_RU eval:check_uri_host_listed('RU')
score URI_HOST_RU 1.8

enlist_uri_host (CN) cn
header URI_HOST_CN eval:check_uri_host_listed('CN')
score URI_HOST_CN 1.2

enlist_uri_host (KR) kr
header URI_HOST_KR eval:check_uri_host_listed('KR')
score URI_HOST_KR 1.5

enlist_uri_host (TR) tr
header URI_HOST_TR eval:check_uri_host_listed('TR')
score URI_HOST_TR 1.5
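

The "!aaa.example.kr" entry in the RCKT list is worth a closer look. As I understand the Mail::SpamAssassin::Conf documentation, the lookup starts with the full host name, strips leading labels one at a time, and lets the first (most specific) matching entry decide, with a "!" entry yielding false. The Python sketch below models that reading; treat the behaviour as an assumption and the function names as illustrative, not as SpamAssassin code.

```python
from urllib.parse import urlsplit

def parse_list(entries: str) -> dict[str, bool]:
    """Map each entry to True, or to False when it carries a '!' prefix."""
    return {e.lstrip("!").lower(): not e.startswith("!") for e in entries.split()}

def host_listed(uri: str, listed: dict[str, bool]) -> bool:
    """Try the full host, then strip leading labels; the first match decides."""
    labels = (urlsplit(uri).hostname or "").split(".")
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in listed:
            return listed[candidate]
    return False

RCKT = parse_list("ru !aaa.example.kr cn kr tr")

print(host_listed("http://www.example.kr/", RCKT))    # falls through to 'kr'
print(host_listed("http://aaa.example.kr/x", RCKT))   # excluded by the '!' entry
```

Under this model a subdomain of aaa.example.kr is excluded too, because the negated entry is reached before the bare "kr" entry.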


Sorry for spamming with more examples; the intent was to help people write
better rules.

Re: Ends with string

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

On 9/8/2017 12:24 PM, Robert Boyl wrote:
> Hello, everyone!
>
> Is there a way to create a Spamassassin rule that checks for a certain 
> URL suffix such as .ru but makes sure it has to be at the end of the 
> URI? Ends with string.
>
> Thanks!
> Rob

Yes, it's called an anchor and Shane Williams a long time ago gave me 
some advice on that I used in this rule:

uri             __KAM_SHORT /(\/|^|\b)(?:j\.mp|bit\.ly|goo\.gl|x\.co|t\.co|t\.cn|tinyurl\.com|hop\.kz|urla\.ru|fw\.to)(\/|$|\b)/i

Regards,
KAM

Re: Ends with string

Posted by Ralph Seichter <m1...@monksofcool.net>.

On 08.09.2017 18:24, Robert Boyl wrote:

> Is there a way to create a Spamassassin rule that checks for a certain
> URL suffix such as .ru but makes sure it has to be at the end of the
> URI? Ends with string.

There is: end the pattern with the $ anchor (foo$). SpamAssassin uses Perl
regular expressions, and you can find many related examples and tutorials.
See also "WritingRules" on the SpamAssassin Wiki.

-Ralph