Posted to users@spamassassin.apache.org by Mynabbler <my...@live.com> on 2011/10/13 13:17:48 UTC

Chickenpoxed subjects

Typically the chickenpox rules do not get a lot of love abroad, since they
tend to trip over other languages than English. However, does someone have
an idea how to use the logic in chickenpox for subjects like these:

Subject: P "0:R.N| M+0&V}l `E"
Subject: T+r &a n !s \s &e.x )u/a _l =s# P =o :r^n
Subject: A D U ~L%T) Pl %C, T{U$R^E%
Subject: S |CH. O O /L Gl_ R^L$ P~0^R |N \
Subject: SH}E"M ;AL+E :S
Subject: S C/H ,O 0=LG)l :R$L$S ) P -0 RN
Subject: T (R A _N N_Y ,S " S#E`X
Subject: P ,0 :R N^ V I, D #E. O
Subject: G)A "N \G{B A NG =
Subject: AD}U)L T P *l C +TU `R|E:
Subject: A /D)U %L ;T} P (I {C. T U,R"E =
Subject: M)as;tu "r )b %a$t l:n g ~
Subject: G_ a n g |b^a n {g (
Subject: U N&D/R ES :S |I$N G|
Subject: B:L {0 }W /J (0 +B
Subject: T*E {E;N} P #0, RN _
Subject: S e}x} P\ic^t,u {r|es =
Subject: A/D_U &L:T ! M  0{V. l E &
Subject: Bl g`d }lc :k :s `
Subject: B^l /g%d l )c =k
Subject: T R +A N/N!l :ES,  P \0 R  N
Subject: S(H +E (M ;A ^L&E)S$ P_0 R -N &
Subject: T r#a `n |n{l +e& S *e .x :
Subject: S,CH =0`0 L |G!lR#L-S % P, 0 R #N "
Subject: M AT ,U {R E _ P 0_R !N{
Subject: S C:H$O #0 ;L `G^I "R:L#S  P|0%R}N
Subject: A "d u=l t  M*o {v.i ^e s :
Subject: M:a %s_ t /u. r$b_ a)t!l.n;g .
Subject: S E&X/ V|l_ D ;E(0.S
Subject: S E\X _ M OV|I E
Subject: T `E -E;N;S , P-0_ R N
Subject: T ra{n, n^y !

... or does someone have a decent rule to tag this kind of crap?
-- 
View this message in context: http://old.nabble.com/Chickenpoxed-subjects-tp32644509p32644509.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: Chickenpoxed subjects

Posted by RW <rw...@googlemail.com>.
On Wed, 19 Oct 2011 04:43:52 -0700 (PDT)
Mynabbler wrote:

> 
> RW-15 wrote:
> 
> MN> As I explained, even if the rule would have fired, it adds a
> MN> whopping 0.1 score. It only shows teeth when combined with other
> MN> findings...
> 
> RW> So, why isn't it worth scoring if it's a useful rule?
> 
> Because mail with odd characters is not per se spam

But if you really believed your rule had merit, you wouldn't score it at
0.1

> RW>  And why score it so high with FREEMAIL?
> 
> You are kidding, right? 50% of this crap comes from FREEMAIL
> addresses,

But there should be some logic to it, and there's no real connection
between FREEMAIL and Chickenpox. If anything it should be the other way
around: your rule FPs most frequently in mailing lists, where freemail
addresses are very common.

You'd be much better off using decent chickenpox rules that are
worth scoring in their own right.

Re: Chickenpoxed subjects

Posted by Philip Prindeville <ph...@redfish-solutions.com>.
On 10/20/11 8:24 PM, Adam Katz wrote:
> On 10/19/2011 04:43 AM, Mynabbler wrote:
>> You are kidding, right? 50% of this crap comes from FREEMAIL
>> addresses, and even more specific: 44% of this crap is delivered by
>> aol.com.  The aol deliveries have about 85% unique from@aol
>> addresses, so they pretty much 'own' aol.
> 
> We're writing spam filters, not idiot filters.  The fact that there is
> so much overlap is often useful, but the overlap is not complete.  There
> is also a decent amount of overlap between the
> mostly-computer-illiterate and freemail users.  I think this drives your
> current line of thinking.
> 
> There are a lot of people that do very spammy things.  It is a testament
> to SA and other filters that such non-spam doesn't so commonly flag as spam.
> 

Sorry to come to the party late on this, was traveling a bit.

It seems to me that if you have lines like:

Subject: T R +A N/N!l :ES,  P \0 R  N
Subject: S C/H ,O 0=LG)l :R$L$S ) P -0 RN

Then the solution is to use agrep.  Make deletions of punctuation very low cost, and have the usual transformations like:

0 => O
1 => l
$ => S
...

also be low-cost.  (Of course, then you end up with the possibility of a clash between deleting $ and replacing it with 'S', but agrep is good about checking both)... then you just grep through a dictionary of the "usual offenders":

lesbian
cash
meds
porn
...

I'm not familiar with perl-String-Approx...  reading up on it, it uses Levenshtein distance just like agrep does... so it would be ideal for doing approximate matches.

http://search.cpan.org/~jhi/String-Approx-3.26/Approx.pm
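
A rough sketch of the weighted edit-distance idea described above, written in plain Python rather than with agrep or String::Approx (neither tool's interface is shown here). The cost values and the names `best_match_cost`, `skip_cost` and `subst_cost` are assumptions made for illustration, and only the three substitutions listed above are included.

```python
import string

CHEAP, FULL = 1, 10                      # assumed costs, not tuned on real mail
LOOKALIKES = {"0": "o", "1": "l", "$": "s"}
NOISE = set(string.punctuation) | set(string.whitespace)


def skip_cost(ch):
    """Cost of dropping one subject character: cheap for punctuation/space."""
    return CHEAP if ch in NOISE else FULL


def subst_cost(ch, target):
    """Cost of reading subject character `ch` as dictionary letter `target`."""
    ch = ch.lower()
    if ch == target:
        return 0
    if LOOKALIKES.get(ch) == target:
        return CHEAP
    return FULL


def best_match_cost(subject, word):
    """Lowest cost of matching `word` against any substring of `subject`."""
    word = word.lower()
    # Row for the empty word prefix: a match may start anywhere for free.
    prev = [0] * (len(subject) + 1)
    for w in word:
        cur = [prev[0] + FULL]           # word letters with no subject text left
        for j, ch in enumerate(subject, start=1):
            cur.append(min(
                prev[j - 1] + subst_cost(ch, w),  # read ch as w
                cur[j - 1] + skip_cost(ch),       # drop ch from the subject
                prev[j] + FULL,                   # w is missing from the subject
            ))
        prev = cur
    return min(prev)                     # the match may end anywhere


if __name__ == "__main__":
    for subject in ("P -0 RN", "hello world"):
        print(subject, best_match_cost(subject, "porn"))
```

Because the minimum is taken over every option at each step, a `$` can be either dropped as punctuation or read as `s`, whichever is cheaper. Any cut-off on the resulting cost would still have to be tuned against real ham.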

-Philip

Re: Chickenpoxed subjects

Posted by Adam Katz <an...@khopis.com>.
On 10/19/2011 04:43 AM, Mynabbler wrote:
> You are kidding, right? 50% of this crap comes from FREEMAIL
> addresses, and even more specific: 44% of this crap is delivered by
> aol.com.  The aol deliveries have about 85% unique from@aol
> addresses, so they pretty much 'own' aol.

We're writing spam filters, not idiot filters.  The fact that there is
so much overlap is often useful, but the overlap is not complete.  There
is also a decent amount of overlap between the
mostly-computer-illiterate and freemail users.  I think this drives your
current line of thinking.

There are a lot of people that do very spammy things.  It is a testament
to SA and other filters that such non-spam doesn't so commonly flag as spam.


Re: Chickenpoxed subjects

Posted by Mynabbler <my...@live.com>.
RW-15 wrote:

MN> As I explained, even if the rule would have fired, it adds a whopping
MN> 0.1 score. It only shows teeth when combined with other findings...

RW> So, why isn't it worth scoring if it's a useful rule?

Because mail with odd characters is not per se spam

RW>  And why score it so high with FREEMAIL?

You are kidding, right? 50% of this crap comes from FREEMAIL addresses, and
even more specific: 44% of this crap is delivered by aol.com.  The aol
deliveries have about 85% unique from@aol addresses, so they pretty much
'own' aol.

RW> The danger here is that you end-up with a lot of FREEMAIL && WEAK_RULE
RW> metas that are prone to high-scoring FPs that BAYES_00 can't save.

As most spammers try to find something other than BOTNET's at the moment, I
think it's only fair to be very critical about FREEMAIL.

RW>  If FREEMAIL_FROM is a good indicator then score it up, and score other
RW> rules on their merits.

Well... in itself FREEMAIL isn't spam a priori. It's just that chances are a
lot higher that it is. Hence my method of meta-ing FREEMAIL with fairly low
scoring rules, like links to free blogsites, free websites, tumblr, odd
punctuation in Subject rules, stuff like that.  Interestingly enough, the
most used subjects from valid freemail are "Re: " and "<none>". I don't see a
problem with being picky about freemail. The only free email provider
successfully fighting _out_going spam is gmail.com.
-- 
View this message in context: http://old.nabble.com/Chickenpoxed-subjects-tp32644509p32681681.html


Re: Chickenpoxed subjects

Posted by RW <rw...@googlemail.com>.
On Tue, 18 Oct 2011 13:07:21 -0700 (PDT)
Mynabbler wrote:

> 
> 
> RW-15 wrote:
> > 
> > It would hit:
> > Re: Did you pick-up the dry-cleaning?
> > 
> Nope. Scores just two (one ':' and a '?') and the rule needs three
> different odd characters.

OK the font I'm using makes ~ look very like a -, but the point remains.
If a subject starts with FW: or  Re: and has a [!?], which is
pretty common, you are then triggering on only one extra character. If
you look back through this list you will find numerous such replies.


> RW-15 wrote:
> > 
> > I think it needs more work, maybe combine it with tests for lots of
> > very short words or adjacent punctuation pairs.
> > 
> As I explained, even if the rule would have fired, it adds a whopping
> 0.1 score. It only shows teeth when combined with other findings...


So, why isn't it worth scoring if it's a useful rule? And why score it
so high with FREEMAIL? The danger here is that you end-up with a lot of
FREEMAIL && WEAK_RULE metas that are prone to high-scoring FPs that
BAYES_00 can't save. If FREEMAIL_FROM is a good indicator then score it
up, and score other rules on their merits.

Re: Chickenpoxed subjects

Posted by Mynabbler <my...@live.com>.

RW-15 wrote:
> 
> It would hit:
> Re: Did you pick-up the dry-cleaning?
> 
Nope. Scores just two (one ':' and a '?') and the rule needs three different
odd characters.

RW-15 wrote:
> 
> I think it needs more work, maybe combine it with tests for lots of
> very short words or adjacent punctuation pairs.
> 
As I explained, even if the rule would have fired, it adds a whopping 0.1
score. It only shows teeth when combined with other findings...
-- 
View this message in context: http://old.nabble.com/Chickenpoxed-subjects-tp32644509p32677140.html


Re: Chickenpoxed subjects

Posted by RW <rw...@googlemail.com>.
On Tue, 18 Oct 2011 01:21:36 -0700 (PDT)
Mynabbler wrote:

> 
> 
> Adam Katz wrote:
> > 
> >> On Mon, 17 Oct 2011, Adam Katz wrote:
> >>> Time for F-U-N
> >>> I like D&D and rock&roll
> >>> /var/spool/mail is full
> > 
> ... those examples don't get a hit with the rule I cooked up (since
> it needs three different odd characters), 

It would hit:

Re: Did you pick-up the dry-cleaning?

I think it needs more work, maybe combine it with tests for lots of
very short words or adjacent punctuation pairs.

Re: Chickenpoxed subjects

Posted by Mynabbler <my...@live.com>.

Adam Katz wrote:
> 
>> On Mon, 17 Oct 2011, Adam Katz wrote:
>>> Time for F-U-N
>>> I like D&D and rock&roll
>>> /var/spool/mail is full
> 
... those examples don't get a hit with the rule I cooked up (since it needs
three different odd characters), and besides, an MN_PUNCTUATION hit only
scores in meta combinations. Note I commented out [] and () since they score
too easily in valid email.

header  __MN_PUNC01 Subject =~ /~/
header  __MN_PUNC02 Subject =~ /`/
header  __MN_PUNC03 Subject =~ /\#/
header  __MN_PUNC04 Subject =~ /\$/
header  __MN_PUNC05 Subject =~ /%/
header  __MN_PUNC06 Subject =~ /\^/
header  __MN_PUNC07 Subject =~ /&/
header  __MN_PUNC08 Subject =~ /\*/
# header  __MN_PUNC09 Subject =~ /\(|\)/
header  __MN_PUNC10 Subject =~ /\?/
header  __MN_PUNC11 Subject =~ /\+/
header  __MN_PUNC12 Subject =~ /=/
header  __MN_PUNC13 Subject =~ /\{|\}/
# header  __MN_PUNC14 Subject =~ /\[|\]/
header  __MN_PUNC15 Subject =~ /\|/
header  __MN_PUNC16 Subject =~ /\"/
header  __MN_PUNC17 Subject =~ /\;/
header  __MN_PUNC18 Subject =~ /\:/
header  __MN_PUNC19 Subject =~ /\//
header  __MN_PUNC20 Subject =~ /_/
meta      MN_PUNCTUATION (__MN_PUNC01 + __MN_PUNC02 + __MN_PUNC03 + __MN_PUNC04 + __MN_PUNC05 + __MN_PUNC06 + __MN_PUNC07 + __MN_PUNC08 + __MN_PUNC10 + __MN_PUNC11 + __MN_PUNC12 + __MN_PUNC13 + __MN_PUNC15 + __MN_PUNC16 + __MN_PUNC17 + __MN_PUNC18 + __MN_PUNC19 + __MN_PUNC20 >= 3)
score     MN_PUNCTUATION 0.1
#
# Now, let's go hunt with this:
meta      MN_PUNCS1 (MN_PUNCTUATION && (FREEWEB || HAS_SHORT_URL || MN_TUMBLR))
score     MN_PUNCS1 6
describe  MN_PUNCS1 Garbled subject with free website or blogsite, SHORT_URL or tumblr link
meta      MN_PUNCS2 (MN_PUNCTUATION && FREEMAIL)
score     MN_PUNCS2 3 
describe  MN_PUNCS2 Garbled subject from a free mail address
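
For readers who want to try the counting logic outside SpamAssassin, here is a minimal Python sketch of what the MN_PUNCTUATION meta computes: one point per class of odd character present in the subject, firing at three or more. The function names `odd_char_classes` and `mn_punctuation` are made up for this illustration, and the sketch only mirrors the sub-rules listed above (including a `~` test); it does not model scoring or the FREEMAIL metas.

```python
import re

# One pattern per active __MN_PUNCxx sub-rule; () and [] are left out,
# matching the two commented-out rules.
ODD_CHAR_PATTERNS = [
    r"~", r"`", r"#", r"\$", r"%", r"\^", r"&", r"\*", r"\?", r"\+",
    r"=", r"[{}]", r"\|", r'"', r";", r":", r"/", r"_",
]


def odd_char_classes(subject):
    """Number of distinct odd-character classes found in the subject."""
    return sum(1 for pattern in ODD_CHAR_PATTERNS if re.search(pattern, subject))


def mn_punctuation(subject, threshold=3):
    """True when at least `threshold` different classes are present."""
    return odd_char_classes(subject) >= threshold


if __name__ == "__main__":
    for subject in ("Re: Did you pick-up the dry-cleaning?",
                    "SH}E\"M ;AL+E :S"):
        print(odd_char_classes(subject), mn_punctuation(subject), subject)
```

Repeated characters do not add up: a subject with five `&` signs still counts as one class, which is why something like "I like D&D and rock&roll" stays well under the threshold.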
-- 
View this message in context: http://old.nabble.com/Chickenpoxed-subjects-tp32644509p32672891.html


Re: Chickenpoxed subjects

Posted by Adam Katz <an...@khopis.com>.
On 10/17/2011 04:36 PM, John Hardin wrote:
> On Mon, 17 Oct 2011, Adam Katz wrote:
>> Time for F-U-N
>> I like D&D and rock&roll
>> /var/spool/mail is full
> 
> It must hit more than a specified number of times. __SUBJ_OBFU_PUNCT
> isn't scored, SUBJ_OBFU_PUNCT_FEW and SUBJ_OBFU_PUNCT_MANY are.

Each of my examples hits SUBJ_OBFU_PUNCT_FEW, and it wouldn't be hard
for them to hit SUBJ_OBFU_PUNCT_MANY either.

>> I think this would satisfy the original request:
>>
>> header   __SUBJ_LACKS_WORDS
>>   Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/
>>
>> (I have not checked that in, feel free if you like it.)
> 
> When I get home tonight.

See my other email, already checked in :-)


Re: Chickenpoxed subjects

Posted by John Hardin <jh...@impsec.org>.
On Mon, 17 Oct 2011, Adam Katz wrote:

> header      __SUBJ_OBFU_PUNCT      Subject =~
> /(?:[-~`"!@\#$%^&*()_+={}|\\\/?<>,.:;][a-z][-~`"!@\#$%^&*()_+={}|\\\/?<>,.:;\s]|[a-z][~`"!@\#$%^&*()_+={}|\\\/?<>,.:;][a-z])/i
>
> How does this differ from a negation, like:
>
> /[^\[\]'\w\s][a-z][^\[\]'\w]|[a-z][^\[\]'\w\s-][a-z]/i

I suppose which you'd choose would be based on how conservative you want 
to be. Matching on specific types of obfuscation (as mine does), or being 
less selective (as yours does).

> and how does this not FP all over the place with subjects like:
>
> Time for F-U-N
> I like D&D and rock&roll
> /var/spool/mail is full

It must hit more than a specified number of times. __SUBJ_OBFU_PUNCT isn't 
scored, SUBJ_OBFU_PUNCT_FEW and SUBJ_OBFU_PUNCT_MANY are.

> I think this would satisfy the original request:
>
> header   __SUBJ_LACKS_WORDS
>  Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/
>
> (I have not checked that in, feel free if you like it.)

When I get home tonight.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Insofar as the police deter by their presence, they are very, very
   good. Criminals take great pains not to commit a crime in front of
   them.                                             -- Jeffrey Snyder
-----------------------------------------------------------------------
  312 days since the first successful private orbital launch (SpaceX)

Re: Chickenpoxed subjects

Posted by Adam Katz <an...@khopis.com>.
On 10/17/2011 02:29 PM, Adam Katz wrote:
> I think this would satisfy the original request:
> 
> header   __SUBJ_LACKS_WORDS
>   Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/
> 
> (I have not checked that in, feel free if you like it.)

Okay, that needed a little work (boo to double-negatives).  Also, I
hadn't noticed the new thread (sorry).

Just checked this in:

header __SUBJ_NOT_SHORT    Subject =~ /^.{16}/
header __SUBJ_HAS_WORDS    Subject =~ /(?:^|\s)[^\W0-9_]{3,15}(?:\s|$)/
meta     SUBJ_LACKS_WORDS  __SUBJ_NOT_SHORT && !__SUBJ_HAS_WORDS && !__SUBJECT_ENCODED_B64
describe SUBJ_LACKS_WORDS  Non-short subject lacks words

Even this will hit a fair amount of ham, especially with foreign
languages (I tried to work around this with [^\W0-9_] instead of [a-z]
in the event a locale is in use).
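
A small Python sketch of the same test, for experimenting with sample subjects. It reuses the two regular expressions from the rules above; the function name `subj_lacks_words` is invented here, and the `__SUBJECT_ENCODED_B64` exclusion is deliberately left out. Note one assumption: Python's `re` treats `\W` as Unicode-aware for `str` patterns, which may not behave exactly like Perl under a given locale.

```python
import re

NOT_SHORT = re.compile(r"^.{16}")                          # at least 16 characters
HAS_WORDS = re.compile(r"(?:^|\s)[^\W0-9_]{3,15}(?:\s|$)")  # a 3-15 letter word


def subj_lacks_words(subject):
    """True for a non-short subject with no whitespace-delimited word."""
    return bool(NOT_SHORT.search(subject)) and not HAS_WORDS.search(subject)


if __name__ == "__main__":
    samples = [
        "P ,0 :R N^ V I, D #E. O",
        "Re: Did you pick-up the dry-cleaning?",
        "Time for F-U-N",
    ]
    for subject in samples:
        print(subj_lacks_words(subject), subject)
```

The short-subject guard matters: "Time for F-U-N" is only 14 characters long, so it is never flagged regardless of its contents.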


Re: Chickenpoxed subjects

Posted by Adam Katz <an...@khopis.com>.
On 10/15/2011 03:37 PM, John Hardin wrote:
> On Thu, 13 Oct 2011, Mynabbler wrote:
> 
>> Typically the chickenpox rules do not get a lot of love abroad,
>> since they tend to trip over other languages than English. However,
>> does someone have an idea how to use the logic in chickenpox for
>> subjects like these:
>> 
>> ... or does someone have a decent rule to tag this kind of crap?
> 
> I've got something in local masscheck right now, should commit later 
> today. Check my sandbox tomorrow.

header      __SUBJ_OBFU_PUNCT      Subject =~
/(?:[-~`"!@\#$%^&*()_+={}|\\\/?<>,.:;][a-z][-~`"!@\#$%^&*()_+={}|\\\/?<>,.:;\s]|[a-z][~`"!@\#$%^&*()_+={}|\\\/?<>,.:;][a-z])/i

How does this differ from a negation, like:

/[^\[\]'\w\s][a-z][^\[\]'\w]|[a-z][^\[\]'\w\s-][a-z]/i

and how does this not FP all over the place with subjects like:

Time for F-U-N
I like D&D and rock&roll
/var/spool/mail is full


I think this would satisfy the original request:

header   __SUBJ_LACKS_WORDS
  Subject !~ /(?!^.{0,15}$)(?:^|\s)[a-z]{3,15}(?:\s|$)/

(I have not checked that in, feel free if you like it.)


Re: Chickenpoxed subjects

Posted by John Hardin <jh...@impsec.org>.
On Thu, 13 Oct 2011, Mynabbler wrote:

> Typically the chickenpox rules do not get a lot of love abroad, since they
> tend to trip over other languages than English. However, does someone have
> an idea how to use the logic in chickenpox for subjects like these:
>
> ... or does someone have a decent rule to tag this kind of crap?

I've got something in local masscheck right now, should commit later 
today. Check my sandbox tomorrow.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   One death is a tragedy; thirty is a media sensation;
   a million is a statistic.              -- Joseph Stalin, modernized
-----------------------------------------------------------------------
  310 days since the first successful private orbital launch (SpaceX)