
Posted to users@spamassassin.apache.org by Ian Turner <ve...@vectro.org> on 2012/10/19 02:56:41 UTC

False negatives with distinctive punctuated subjects

Hello all,

I regularly get one class of pharmacy spam which accounts for about
75% of my total false negatives. The messages come with very
distinct subjects, like so (these are all false negatives from the last six 
weeks):

GetBiG*MED(STO~fine
New^PhaRmaCYnew_make
Like+PhaRmaCYnew!get
Best!MedSTOonline^fine
Check&PhaRmaCYnew!best
Fast@PHARMaCY#BIG%super
Better%PhaRmaCYnew!get
Get(MEdSTO^like
Best(MEdSTO_get
Best!BiG*MED(STO^best
Best@BestPharmacy}get
Offer)MedStoreOnline~fine
Fast$MedStoreOnline+make
Super&MedStoreOnline!make
New!IncreasePenisSize}likeDRmpMjfWUz
Best^PenisEnlargerPills_fine
Super$PenisEnlargerPills&best
Best_IncreasePenisSize^make
Fast~GetBigPenis_fine
Fast~EnlargeYourPenis@fine
Fast}IncreasePenisSize~best
Fast}EnlargeYourPenis~best
Best{EnlargeYourPenis%like
Best}IncreasePenisSize*super
Best{Meds+like
Best_MedStore^get
Best_MedShop:super
Super+Meds#like
Best*MedShop@fine
Offer#MedicalShop)best
New_Meds$get{rea)

The message body is generally just a URL, with no MIME. The messages hit on 
the following rules (note, this is based only on messages of this type marked 
as ham by SA in the last six weeks). This is using SpamAssassin 3.3.1.

 0.0 RCVD_IN_DNSWL_NONE (100% of messages)
 0.0 T_DKIM_INVALID (100% of messages)
 0.0 UNPARSEABLE_RELAY (100% of messages)
 0.0 FREEMAIL_FROM (100% of messages)
-0.4 RP_MATCHES_RCVD (90% of messages, score varies from -0.2 down to -2.1)
 0.0 TVD_SPACE_RATIO (63% of messages)
 0.2 FREEMAIL_ENVFROM_END_DIGIT (60% of messages)
 1.7 URIBL_BLACK (56% of messages)
 0.0 URIBL_DBL_REDIR (28% of messages)
 0.2 SUBJ_OBFU_PUNCT_FEW (6% of messages)

-1.9 BAYES_00 (40% of messages)
-0.5 BAYES_05 (3% of messages)
 0.0 BAYES_20 (13% of messages)
 0.0 BAYES_40 (3% of messages)
 0.8 BAYES_50 (40% of messages)

The very low bayes scores occur for two reasons. Either one spam makes it
through, and bayes auto-learning then allows similar ones until I can mark
the spam as such (this can occur if I receive several similar spams before
checking my inbox); or else there are just no spammy tokens (only hammy
ones, e.g. yahoo.com).

I added a rule to add score +2 for the combination of both FREEMAIL_FROM and 
UNPARSEABLE_RELAY, which combination does not occur in my ham corpus. But this 
is not enough to push much of this low-scoring spam over the threshold, and it 
makes me very nervous to put an even higher score on such a seemingly innocent 
rule.
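
For reference, a combination rule of this kind can be written roughly as
follows. This is only a sketch: the rule name LOCAL_FREEMAIL_UNPARSEABLE
and the describe text are made up for illustration, and the 2.0 score is
the +2 mentioned above.

```
# Hypothetical local rule: fires only when both existing rules hit.
meta     LOCAL_FREEMAIL_UNPARSEABLE  (FREEMAIL_FROM && UNPARSEABLE_RELAY)
describe LOCAL_FREEMAIL_UNPARSEABLE  Freemail sender with an unparseable relay
score    LOCAL_FREEMAIL_UNPARSEABLE  2.0
```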

Questions for SA folks:
1. Is anyone else seeing this type of spam?
2. Is there anything that can be done to the bayes classifier to
   improve handling of this type of subject? I notice that the message
   with subject Fast}EnlargeYourPenis~best generated Hammy token
   "0.016-1--sk:Enlarge", so maybe not. But it seems odd to me that
   bayes isn't working better here; I have, for example, never
   received ham with the word "Penis" in the subject, so I would have
   expected to see that as a spammy token, but I don't.
3. Speaking of Penis, I'm surprised there isn't already a rule
   looking for the word in subjects, let alone in combination with "Enlarge".
   Is this intentional?
4. I see there is already a rule for punctuation-obfuscated subjects;
   what about one for case-obfuscated subjects?
4. Any other advice on how to fix this?

Cheers,

--Ian Turner

Re: False negatives with distinctive punctuated subjects

Posted by Ian Turner <ve...@vectro.org>.

On Friday, October 19, 2012 01:55:33 PM John Wilcock wrote:
> On 19/10/2012 13:22, Ian Turner wrote:
> > I meant something to specifically pick out words like phArmACy.
> 
> You could try a rule with a negative lookahead to exclude the correct
> casing, something like this (untested):

Curiously, I stopped receiving these spams since ending this thread. 
Coincidence?

--Ian

Re: False negatives with distinctive punctuated subjects

Posted by John Wilcock <jo...@tradoc.fr>.

On 19/10/2012 13:22, Ian Turner wrote:
> I meant something to specifically pick out words like phArmACy.

You could try a rule with a negative lookahead to exclude the correct 
casing, something like this (untested):

header SUBJ_MIXED_CASE_PHARMACY Subject =~ /(?![Pp]harmacy)[Pp][Hh][Aa][Rr][Mm][Aa][Cc][Yy]/
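
Note that the lookahead above only excludes "pharmacy" and "Pharmacy", so
an all-caps "PHARMACY" would still hit. If that matters, a variant along
these lines might do (equally untested; the rule name and the 1.0 score
are placeholders, not tuned values):

```
# Hypothetical variant: skip the three ordinary casings, then match any
# remaining casing of the word. The rule must stay on a single line.
header   LOCAL_SUBJ_MIXED_CASE_PHARMACY  Subject =~ /(?![Pp]harmacy|PHARMACY)(?i:pharmacy)/
describe LOCAL_SUBJ_MIXED_CASE_PHARMACY  Subject has oddly-cased "pharmacy"
score    LOCAL_SUBJ_MIXED_CASE_PHARMACY  1.0
```

This should hit subjects like New^PhaRmaCYnew_make but not
Best@BestPharmacy}get, which uses the normal capitalisation.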

John.

-- 
-- Over 5000 webcams from ski resorts around the world - www.snoweye.com
-- Translate your technical documents and web pages    - www.tradoc.fr

Re: False negatives with distinctive punctuated subjects

Posted by Ian Turner <ve...@vectro.org>.

Hi Martin,

On Friday, October 19, 2012 03:04:44 AM Martin Gregorie wrote:
> > 3. Speaking of Penis, I'm surprised there isn't already a rule
> > 
> >    looking for the word in subjects, let alone in combination with
> >    "Enlarge".
> >    Is this intentional?
> 
> The rule:
> 
> header RULENAME  Subkect =~ /(penis|pharmacy|med.{0,1}s)/i
> 
> should match virtually all your subjects and, unless you're a pharmacist
> or medical doctor, you're most unlikely to get much mail with subjects
> that contain these strings.

Sadly, my ham corpus does contain many messages with "pharmacy" and "meds" in 
the subject. I'm not a pharmacist but for some reason people do seem to enjoy 
telling me about their medications.

> > 4. I see there is already a rule for punctuation-obfuscated subjects;
> > 
> >    what about one for case-obfuscated subjects?
> 
> It is not needed: just append 'i' to force a caseless match

I meant something to specifically pick out words like phArmACy.

> I'd use a meta rule that uses (3) above as one sub-rule and a RAW body
> rule that matches a URL surrounded with whitespace as the other
> sub-rule. If you're keen, consider adding in UNPARSEABLE_RELAY and write
> the metarule to fire if any two of the three subrules match.

This seems like good advice, thanks.

--Ian

Re: False negatives with distinctive punctuated subjects

Posted by Martin Gregorie <ma...@gregorie.org>.

On Fri, 2012-10-19 at 03:04 +0100, Martin Gregorie wrote:

> The rule:   
> 
> header RULENAME  Subkect =~ /(penis|pharmacy|med.{0,1}s)/i  
> 
This should, of course, be:

header RULENAME  Subject =~ /(penis|pharmacy|med.{0,1}s)/i

Sorry about the other typos etc - it was really too late to be writing.


Martin

Re: False negatives with distinctive punctuated subjects

Posted by Martin Gregorie <ma...@gregorie.org>.

On Thu, 2012-10-18 at 20:56 -0400, Ian Turner wrote:

> Questions for SA folks:
> 1. Is anyone else seeing this type of spam?

I don't see it.

> 2. Is there anything that can be done to the bayes classifier to
>    improve handling of this type of subject? I notice that the message
>    with subject Fast}EnlargeYourPenis~best generated Hammy token
>    "0.016-1--sk:Enlarge", so maybe not. But it seems odd to me that
>    bayes isn't working better here; I have, for example, never
>    received ham with the word "Penis" in the subject, so I would have
>    expected to see that as a spammy token, but I don't.

I'm not a bayes expert but I'd guess not.

> 3. Speaking of Penis, I'm surprised there isn't already a rule
>    looking for the word in subjects, let alone in combination with "Enlarge".
>    Is this intentional?
>

The rule:   

header RULENAME  Subkect =~ /(penis|pharmacy|med.{0,1}s)/i  

should match virtually all your subjects and, unless you're a pharmacist
or medical doctor, you're most unlikely to get much mail with subjects
that contain these strings.

> 4. I see there is already a rule for punctuation-obfuscated subjects;
>    what about one for case-obfuscated subjects?

It is not needed: just append 'i' to force a caseless match

> 4. Any other advice on how to fix this?
> 
Two rule 4s. Yer got summat against 5?

I'd use a meta rule that uses (3) above as one sub-rule and a RAW body
rule that matches a URL surrounded with whitespace as the other
sub-rule. If you're keen, consider adding in UNPARSEABLE_RELAY and write
the metarule to fire if any two of the three subrules match.

Any message that triggers both of these subrules is quite unlikely to be
ham, so give the metarule a score of 5.0 or more.
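
A rough, untested sketch of that arrangement is below. The LOCAL_* names
and the bare-URL pattern are assumptions made up for illustration; only
the subject regex and UNPARSEABLE_RELAY come from the discussion above.
Sub-rules whose names start with "__" carry no score of their own.

```
# Sub-rule 1: the subject rule from (3), with the header name spelled correctly.
header   __LOCAL_SUBJ_PHARMA    Subject =~ /(penis|pharmacy|med.{0,1}s)/i

# Sub-rule 2 (hypothetical pattern): a raw body line holding nothing but a URL.
rawbody  __LOCAL_BARE_URL       /^\s*https?:\/\/\S+\s*$/m

# Meta rule: fire when any two of the three conditions hit.
meta     LOCAL_PHARMA_URL_SPAM  ((__LOCAL_SUBJ_PHARMA + __LOCAL_BARE_URL + UNPARSEABLE_RELAY) >= 2)
describe LOCAL_PHARMA_URL_SPAM  Two of: pharma subject, bare URL body, unparseable relay
score    LOCAL_PHARMA_URL_SPAM  5.0
```

Running spamassassin --lint after adding rules like these will catch
syntax slips before they go live, and checking the meta rule against a
ham corpus first is wise before trusting a score this high.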

Martin