Posted to users@spamassassin.apache.org by Charles Gregory <cg...@hwcn.org> on 2009/06/16 19:52:55 UTC

Re: [sa] Re: Suggested Change For FS_TEEN_BAD

On Tue, 16 Jun 2009, RW wrote:
> On Tue, 16 Jun 2009 12:03:43 -0500
> Andy Dorman <ad...@ironicdesign.com> wrote:
>> ##{ FS_TEEN_BAD
>> header   FS_TEEN_BAD    Subject =~
>> /\b(?:teens?|girls?|boys?).{1,15}\b(?:pussy|sex(?:xy|ual)?|slut(?:s|ty)?|ass(?:es|fuck(?:ing|ed)?|whip(?:ing|ped)?|spank(?:ing|ed)?)?|fuck(?:ing|ed)?|rap(?:e|ed|ing)+)\b/i
>> describe FS_TEEN_BAD    Subject says something bad about teens, girls, boys
>> ##} FS_TEEN_BAD
> You aren't checking the boundary after the first word.  Since it's a
> subject test I think the .{1,15} could probably be a .+
> You might also throw in jailbait and lolita. Also it's very common for
> porn spam to use z in plurals e.g. girlz, boyz.

Two 'p's in 'whipping'. One 'x' in 'sexy'.... :)

- Charles

Re: Suggested Change For FS_TEEN_BAD

Posted by Charles Gregory <cg...@hwcn.org>.
On Tue, 16 Jun 2009, McDonald, Dan wrote:
>> Two 'p's in 'whipping'. One 'x' in 'sexy'.... :)
> I've seen sexxxy as well....

(BIG LOUD LAUGH)

(clutches head in pain) No! Not obfuscation checking code! No! Please make 
it stop! Make it stop! The pain! I can't take it!

You are, of course, correct. :)

- Charles


Re: Suggested Change For FS_TEEN_BAD

Posted by John Hardin <jh...@impsec.org>.
On Tue, 16 Jun 2009, McDonald, Dan wrote:

>>>> /\b(?:teens?|girls?|boys?...
>
> doesn't the first ?: negate that whole part of the test?

No, that means "don't capture the match", not "this is optional".
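
To illustrate the point, here is a minimal Perl sketch (the sample subject
strings are made up for illustration). It suggests that "(?:...)" still has
to match; the only thing that changes is that no capture group is filled in.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample subject, for illustration only.
my $subject = 'boys and girls';

# Capturing group: the alternation must match, and the text lands in $1
# (returned here through list-context assignment).
my ($captured) = $subject =~ /\b(teens?|girls?|boys?)\b/i;

# Non-capturing group: the alternation must still match, but no capture
# group exists, so a successful match in list context returns just (1).
my @groups = $subject =~ /\b(?:teens?|girls?|boys?)\b/i;

# The group is required, not optional: a subject containing none of the
# words does not match.
my $no_match = 'weekly report' =~ /\b(?:teens?|girls?|boys?)\b/i ? 1 : 0;

print "captured: $captured\n";
print "non-capturing match returned: @groups\n";
print "unrelated subject matched: $no_match\n";
```

Making the group optional would take an explicit "?" after the closing
parenthesis, as in "(?:s|z)?".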

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...people who insist that religion is required for morality remind
   me of hoplophobes who insist that I be disarmed because _they're_
   unsafe with a gun.                  -- MarkHB at munchkinwrangler's
-----------------------------------------------------------------------
  2 days until SWMBO's Birthday

Re: [sa] Re: Suggested Change For FS_TEEN_BAD

Posted by Charles Gregory <cg...@hwcn.org>.
On Tue, 16 Jun 2009, Andy Dorman wrote:
> ##{ FS_TEEN_BAD
>
> header   FS_TEEN_BAD Subject =~
>   /\b(?:teen(?:s|z)?|girl(?:s|z)?|boy(?:s|z)?|jailbait|lolita(?:s|z)?)
>   .*\b(?:pussy|sex(?:x{0,3}y|ual)?|slut(?:s|ty)?|
>   ass(?:es|fuck(?:ing|ed)?|whip(?:ping|ped)?|
>   spank(?:ing|ed)?)?|fuck(?:ing|ed)?|rap(?:e|ed|ing)+)\b/i
> describe FS_TEEN_BAD Subject says something bad about girls or boys
>
> ##} FS_TEEN_BAD

It's a good looking rule, but I would suggest splitting it into TWO.
One low-score version for combinations that could reasonably FP on 
expressions like 'teenage sexual health', and a high-score version for 
more distinctly spammy phrases like "teenz getting whipped"....
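
A rough Perl sketch of the two-tier idea follows. The word lists, the tier
names and the scores are all assumptions made up for illustration, not
anything proposed in this thread; in an actual SpamAssassin setup this
would presumably be two separate header rules, each with its own score.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shared first half. As in the rule under discussion there is no \b after
# the word, so "teenage" also matches.
my $who = qr/\b(?:teen|girl|boy)[sz]?/i;

# Hypothetical word lists: which terms go in which tier is an assumption.
my $low_re  = qr/$who.*\bsex(?:ual)?\b/i;
my $high_re = qr/$who.*\b(?:whip(?:ping|ped)|spank(?:ing|ed))\b/i;

# Hypothetical scores, not tuned against any corpus.
my %score = (low => 0.5, high => 3.0, none => 0);

# Return the name of the strongest tier that matches the subject.
sub classify {
    my ($subject) = @_;
    return 'high' if $subject =~ $high_re;
    return 'low'  if $subject =~ $low_re;
    return 'none';
}

for my $subject ('teenage sexual health', 'teenz getting whipped', 'board meeting') {
    my $tier = classify($subject);
    printf "%-25s => %s (%s)\n", $subject, $tier, $score{$tier};
}
```

The first sample lands in the low tier and the second in the high tier,
which is the separation the suggestion is after.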

- Charles

Re: Suggested Change For FS_TEEN_BAD

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, Jun 18, 2009 at 7:26 AM, Michael Monnerie <mi...@is.it-management.at> wrote:
> On Wednesday 17 June 2009 Theo Van Dinter wrote:
>> Yes, it matters (one path is tried then the other has to be tried, as
>> opposed to having a single path)
>
> So which is better performance wise? I guess [sz]? but I'm not sure now.

[sz] is better than (s|z). I want to say always (true from the
theoretical POV), but it depends on the RE compiler, which can optimize
(convert) one into the other (the reality POV).  IMO, it's a good habit
to just do the right thing yourself, since different RE compilers are,
well, different.

In short:
If you want to match one of several specific single characters, use a
character class "[]".  Only use "(...|...)" if you need to catch more
complicated/non-single-character things.


If you want to know more gory details, search around for "finite automata". :)
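
As a small sanity check on the advice above, this Perl sketch compares the
two spellings on a handful of made-up words (the word list is hypothetical).
It only checks that both forms accept the same strings; it says nothing
about speed.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $class_re = qr/\bgirl[sz]?\b/i;       # character class
my $alt_re   = qr/\bgirl(?:s|z)?\b/i;    # single-character alternation

# Hypothetical test words, for illustration only.
my @words = qw(girl girls girlz girlx GIRLZ);

# Count the words on which the two patterns give different answers.
my $disagreements = 0;
for my $word (@words) {
    my $by_class = $word =~ $class_re ? 1 : 0;
    my $by_alt   = $word =~ $alt_re   ? 1 : 0;
    $disagreements++ if $by_class != $by_alt;
    print "$word: class=$by_class alt=$by_alt\n";
}
print "disagreements: $disagreements\n";
```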

Re: Suggested Change For FS_TEEN_BAD

Posted by Henrik K <he...@hege.li>.
On Thu, Jun 18, 2009 at 07:26:58AM +0200, Michael Monnerie wrote:
> On Wednesday 17 June 2009 Theo Van Dinter wrote:
> > Yes, it matters (one path is tried then the other has to be tried, as
> > opposed to having a single path)
> 
> So which is better performance wise? I guess [sz]? but I'm not sure now.

It's very simple to test, here's a very unscientific one:

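#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Every pattern matches the 'b' in 'abc'; "match"/"nomatch" says whether
# the optional part also matches the following 'c'.
cmpthese(1000000, {
    # character class "[...]" (labelled "group" here)
    'group_nomatch' => sub { 'abc' =~ /b[xy]?/;     },
    'group_match'   => sub { 'abc' =~ /b[xc]?/;     },
    # non-capturing alternation "(?:...|...)"
    'alt_nomatch'   => sub { 'abc' =~ /b(?:x|y)?/;  },
    'alt_match'     => sub { 'abc' =~ /b(?:x|c)?/;  },
});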


                   Rate     alt_match group_nomatch   group_match   alt_nomatch
alt_match      534759/s            --          -75%          -75%          -77%
group_nomatch 2127660/s          298%            --           -2%           -9%
group_match   2173913/s          307%            2%            --           -7%
alt_nomatch   2325581/s          335%            9%            7%            --

Not much difference overall, but an alternation whose optional part
matches ends up very slow. Logically a character class should be better,
and it seems so. Though it always ends up slightly slower than a
non-matching alternation?

I'm sure there would be other test cases. Where are the perl über-geeks now? ;)

Cheers,
Henrik

Re: Suggested Change For FS_TEEN_BAD

Posted by Michael Monnerie <mi...@is.it-management.at>.
On Wednesday 17 June 2009 Theo Van Dinter wrote:
> Yes, it matters (one path is tried then the other has to be tried, as
> opposed to having a single path)

So which is better performance wise? I guess [sz]? but I'm not sure now.

Regards, zmi
-- 
// Michael Monnerie, Ing.BSc    -----      http://it-management.at
// Tel: 0660 / 415 65 31                      .network.your.ideas.
// PGP Key:         "curl -s http://zmi.at/zmi.asc | gpg --import"
// Fingerprint: AC19 F9D5 36ED CD8A EF38  500E CE14 91F7 1C12 09B4
// Keyserver: wwwkeys.eu.pgp.net                  Key-ID: 1C1209B4


Re: Suggested Change For FS_TEEN_BAD

Posted by Justin Mason <jm...@jmason.org>.
I'm pretty sure it still matters.

On Wed, Jun 17, 2009 at 19:16, Theo Van Dinter<fe...@apache.org> wrote:
> Yes, it matters (one path is tried then the other has to be tried, as
> opposed to having a single path), though the overall amount is
> probably negligible.  Perl's RE compiler could well optimize this away
> anyway.
>
>
> On Wed, Jun 17, 2009 at 7:45 PM, Kelson<ke...@speed.net> wrote:
>> Wouldn't it be more efficient to write all the single-letter matches like
>> "(?:s|z)?" as "[sz]?" or does it end up not making a difference when the
>> regex is actually processed?
>
>

Re: Suggested Change For FS_TEEN_BAD

Posted by Theo Van Dinter <fe...@apache.org>.
Yes, it matters (one path is tried then the other has to be tried, as
opposed to having a single path), though the overall amount is
probably negligible.  Perl's RE compiler could well optimize this away
anyway.


On Wed, Jun 17, 2009 at 7:45 PM, Kelson<ke...@speed.net> wrote:
> Wouldn't it be more efficient to write all the single-letter matches like
> "(?:s|z)?" as "[sz]?" or does it end up not making a difference when the
> regex is actually processed?

Re: Suggested Change For FS_TEEN_BAD

Posted by Kelson <ke...@speed.net>.
Wouldn't it be more efficient to write all the single-letter matches 
like "(?:s|z)?" as "[sz]?" or does it end up not making a difference 
when the regex is actually processed?

-- 
Kelson Vibber
SpeedGate Communications <www.speed.net>

Re: Suggested Change For FS_TEEN_BAD

Posted by Andy Dorman <ad...@ironicdesign.com>.
OK, I think/hope this is the final pass.  Thanks for all the good thoughts & 
ideas (and spelling corrections) from everyone.

##{ FS_TEEN_BAD

header   FS_TEEN_BAD Subject =~ 
/\b(?:teen(?:s|z)?|girl(?:s|z)?|boy(?:s|z)?|jailbait|lolita(?:s|z)?).*\b(?:pussy|sex(?:x{0,3}y|ual)?|slut(?:s|ty)?|ass(?:es|fuck(?:ing|ed)?|whip(?:ping|ped)?|spank(?:ing|ed)?)?|fuck(?:ing|ed)?|rap(?:e|ed|ing)+)\b/i

describe FS_TEEN_BAD Subject says something bad about girls or boys

##} FS_TEEN_BAD

-- 
Andy Dorman
Ironic Design, Inc.
AnteSpam.com, HomeFreeMail.com, ComeHome.net

Re: Suggested Change For FS_TEEN_BAD

Posted by "McDonald, Dan" <Da...@austinenergy.com>.
On Tue, 2009-06-16 at 13:52 -0400, Charles Gregory wrote:
> On Tue, 16 Jun 2009, RW wrote:
> > On Tue, 16 Jun 2009 12:03:43 -0500
> > Andy Dorman <ad...@ironicdesign.com> wrote:
> >> ##{ FS_TEEN_BAD
> >> header   FS_TEEN_BAD    Subject =~
> >> /\b(?:teens?|girls?|boys?...
> >> describe FS_TEEN_BAD    Subject says something bad about teens, girls, boys
> >> ##} FS_TEEN_BAD
> > You aren't checking the boundary after the first word.  Since it's a
> > subject test I think the .{1,15} could probably be a .+
> > You might also throw in jailbait and lolita. Also it's very common for
> > porn spam to use z in plurals e.g. girlz, boyz.

doesn't the first ?: negate that whole part of the test?
Seems like it should start out as
/\b(teen|girl|boy|jailbait|lolita)[sz]?.{0,15}\b
> 
> Two 'p's in 'whipping'. One 'x' in 'sexy'.... :)

I've seen sexxxy as well, maybe:
sex(?:x{0,3}y|ual)
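
A quick Perl sketch of what that fragment accepts. The \b anchors and the
list of test words are added here purely for illustration and are not part
of the suggestion itself.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The fragment above, anchored with \b on both sides for testing.
my $re = qr/\bsex(?:x{0,3}y|ual)\b/i;

# x{0,3} allows zero to three extra x's after the one already in "sex".
# Without a trailing "?" on the group, bare "sex" does not match.
my @matched = grep { $_ =~ $re }
              qw(sexy sexxy sexxxy sexxxxy sexxxxxy sexual sex);

print "matched: @matched\n";
```

Written this way, the group is mandatory; adding "?" after the closing
parenthesis makes the bare word match as well.
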
> 
> - Charles
-- 
Daniel J McDonald, CCIE # 2495, CISSP # 78281, CNX
www.austinenergy.com