Posted to users@spamassassin.apache.org by Alan <sp...@ambitonline.com> on 2022/02/15 22:23:07 UTC

False "bad domain" positive

Here's a lovely edge case...

I've got someone who posted text from MS Office into an email (wish I 
could ban that). The text contained a numbered list. The fourth list 
item started with "Date & Time". The 4 and following period were in a 
span element with a margin to separate it from the text but no actual 
whitespace, so the plain text version comes up as (I've used {dot} to 
avoid another trigger) "4{dot}Date & Time". This then triggered :

   2.0 PDS_OTHER_BAD_TLD      Untrustworthy TLDs [URI: 4{dot}date (date)]
   5.0 KAM_SOMETLD_ARE_BAD_TLD .stream, .trade, .pw, .top, .press, .bid & .date TLD Abuse

Thus consigning a meeting agenda to the trash. I suspect this is an 
uncommon but not rare false positive.

These rules would benefit from excluding single character domain matches 
(which IIRC would be invalid domains anyway). Then this sort of FP would be 
avoided. For bonus points excluding three-character roman numerals under 
10 (iii, vii, etc.) would be useful too.
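
To make the suggested exclusion concrete, here is a minimal Python sketch of 
the check. It is not SpamAssassin rule syntax and not how the PDS or KAM 
rules are implemented; the function name and the numeral set are 
hypothetical, and the set covers all roman numerals under 10 rather than 
only the three-character ones.

```python
# Hypothetical helper for illustration; not part of SpamAssassin or KAM.
# Roman numerals for 1-9, as they might appear as list markers.
ROMAN_UNDER_10 = {"i", "ii", "iii", "iv", "v", "vi", "vii", "viii", "ix"}


def looks_like_list_marker(host: str) -> bool:
    """Return True if the label just before the TLD looks like a list number.

    A host such as "4.date" or "iii.date" is more likely a numbered list
    item that lost its whitespace than a real domain.
    """
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return False  # no "label.tld" shape to inspect
    label = labels[-2]
    # Exclude single-character labels and small roman numerals.
    return len(label) == 1 or label in ROMAN_UNDER_10
```

With this check, "4.date" and "iii.date" would be skipped, while something 
like "example.date" would still be evaluated by the TLD rules.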

--
For SpamAssassin Users List

Re: False "bad domain" positive

Posted by Greg Troxel <gd...@lexort.com>.
Alan <sp...@ambitonline.com> writes:

> I've got someone who posted text from MS Office into an email (wish I
> could ban that). The text contained a numbered list. The fourth list
> item started with "Date & Time". The 4 and following period were in a
> span element with a margin to separate it from the text but no actual
> whitespace, so the plain text version comes up as (I've used {dot} to
> avoid another trigger) "4{dot}Date & Time". This then triggered :

Wow, that's funny.  But agreed it's ham...

>   2.0 PDS_OTHER_BAD_TLD      Untrustworthy TLDs [URI: 4{dot}date (date)]

This seems reasonable.  2 points is not a killer rule and that probably
would not have messed up delivery.

>   5.0 KAM_SOMETLD_ARE_BAD_TLD .stream, .trade, .pw, .top, .press, .bid & .date TLD Abuse

That's the KAM ruleset, not base, and given that it's an add-on rule I
see that as effectively "the base rule should be scored 7" (at least for
the domains that overlap).

I suspect though that the rule/score are almost entirely right in terms
of probability, for uses of those tlds as domains.  They all sound
sketchy.

> Thus consigning a meeting agenda to the trash. I suspect this is an
> uncommon but not rare false positive.
>
> These rules would benefit from excluding single character domain
> matches (which IIRC would be invalid domains anyway). Then this sort of
> FP would be avoided. For bonus points excluding three-character roman
> numerals under 10 (iii, vii, etc.) would be useful too.

My own view is that no rule should be scored above about 3 unless it is
vanishingly unlikely that the rule will fire on legit mail (even if the
legit mail is messed up in ways that actually happen to legit mail).
That's a different opinion than the one encoded in the KAM ruleset
scores, which I interpret as saying that it's ok to have a few FPs if
that's the price of getting rid of some nasty phishing/malware and a lot
of spam.

You need to think about your own needs on how to tune that FP and
effectiveness tradeoff, and if you're not willing to live what I
consider a little dangerously on FP risk then the KAM ruleset is not for
you.  I run it personally, and I find problems with rules that have very
high scores hitting ham, maybe once a month or every few months, and I'm
accumulating downscoring config.  But it saves me from a vast amount of
spam, I think.  I would be very nervous if I were configuring it for
lots of others, but I have the luxury of not having to admin mail for
more than myself and family.

My current config follows, in case you want to look at these rules and
see what you think.  Beware that it is tuned to my personal ham; I'm on
mailing lists where people occasionally discuss voicemail and watches.  I
no longer remember the reason for each change, but surely it was that the
rule fired on ham.

score   KAM_UNIV                        2       # was 4.5
score   KAM_SOMETLD_ARE_BAD_TLD         2       # was 5
score   KAM_FAKE_DELIVER                3       # was 6.25
score   KAM_SHORT                       0.5     # was 2, can't figure out why it fires
score   KAM_LIST3_1                     3.8     # was 5.8
score   KAM_TIME                        0.1     # was 3.0, FP on time-nuts
score   KAM_SENDGRID                    0.3     # was 1.5, but now URIBL_GREY
score   KAM_ASCII_DIVIDERS              0.1     # can't figure out why it fires

score   KAM_MARKADV                     5       # was 10
score   KAM_VM                          3       # was 5
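
For the message discussed in this thread, the effect of the downscoring can
be worked through with a short Python sketch.  It assumes that only the two
rules quoted above fired and that the threshold is SpamAssassin's default
required_score of 5.0; the function names are made up for illustration.

```python
# Assumption: SpamAssassin's default required_score, unchanged locally.
REQUIRED_SCORE = 5.0


def combined_score(scores: dict) -> float:
    """Sum the scores of all rules that hit a message."""
    return sum(scores.values())


def is_spam(scores: dict, required: float = REQUIRED_SCORE) -> bool:
    """A message is treated as spam once its total reaches the threshold."""
    return combined_score(scores) >= required


# Scores as reported in the original post (assumes no other rules hit).
original = {"PDS_OTHER_BAD_TLD": 2.0, "KAM_SOMETLD_ARE_BAD_TLD": 5.0}

# Same hits with KAM_SOMETLD_ARE_BAD_TLD lowered from 5 to 2.
downscored = {**original, "KAM_SOMETLD_ARE_BAD_TLD": 2.0}

print(combined_score(original), is_spam(original))      # 7.0 True
print(combined_score(downscored), is_spam(downscored))  # 4.0 False
```

Under those assumptions the agenda scores 7.0 with the stock KAM score and
4.0 with the lowered one, which keeps it just under the threshold.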