Posted to users@spamassassin.apache.org by Keith Dunnett <ke...@dunnett.org> on 2006/05/11 19:47:15 UTC

Scoring for rule SUBJ_ILLEGAL_CHARS

I've recently had a couple of false positives caused by this rule, and think
it may be scored too highly for a single check. The e-mails in question were
in Spanish, and the Spanish word for linguistics has two accented characters,
which is enough to trigger this rule.

Admittedly, the blacklists account for 2.4 points (it was from Yahoo), but the
4.3 point score for the subject alone strikes me as excessive. I understand
that anything that is not English is inherently suspect for most users, but
giving 86% of the default spam threshold to almost *any* single rule seems
to me to be overkill.

Alternatively, is there (or should there be) a ruleset for those who wish to
receive e-mail in other languages? Ideally, a Spanish-friendly ruleset would
reduce the scores of character-based rules, while adding in rules for known 
spam in Spanish where possible. Does such a thing already exist? Should it?

The spam report from the e-mail in question follows, although the above 
pretty much sums it up.

X-Spam-Report: 
	*  0.0 DK_POLICY_SIGNSOME Domain Keys: policy says domain signs some mails
	*  0.0 DK_POLICY_TESTING Domain Keys: policy says domain is testing DK
	*  4.3 SUBJ_ILLEGAL_CHARS Subject: has too many raw illegal characters
	*  0.0 DK_SIGNED Domain Keys: message has an unverified signature
	* -0.0 DK_VERIFIED Domain Keys: signature passes verification
	*  0.5 HTML_40_50 BODY: Message is 40% to 50% HTML
	*  0.0 HTML_MESSAGE BODY: HTML included in message
	*  0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
	*      [score: 0.5000]
	*  0.2 DNS_FROM_RFC_ABUSE RBL: Envelope sender in abuse.rfc-ignorant.org
	*  1.4 DNS_FROM_RFC_WHOIS RBL: Envelope sender in whois.rfc-ignorant.org
	*  0.8 RCVD_IN_BLARS RBL: Received via a relay in block.blars.org
	*      [217.216.40.199 listed in block.blars.org]
	*      [66.163.178.160 listed in block.blars.org]
	* -0.5 AWL AWL: From: address is in the auto white-list
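
To make the arithmetic behind the report concrete, here is a minimal sketch that
parses the rule lines above and sums their scores. The threshold of 5.0 is an
assumption (it is the value implied by the "86%" figure), and the helper name
parse_scores is purely illustrative, not part of SpamAssassin.

```python
import re

# Rule lines copied from the X-Spam-Report above (detail lines omitted).
REPORT = """\
*  0.0 DK_POLICY_SIGNSOME Domain Keys: policy says domain signs some mails
*  0.0 DK_POLICY_TESTING Domain Keys: policy says domain is testing DK
*  4.3 SUBJ_ILLEGAL_CHARS Subject: has too many raw illegal characters
*  0.0 DK_SIGNED Domain Keys: message has an unverified signature
* -0.0 DK_VERIFIED Domain Keys: signature passes verification
*  0.5 HTML_40_50 BODY: Message is 40% to 50% HTML
*  0.0 HTML_MESSAGE BODY: HTML included in message
*  0.0 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
*      [score: 0.5000]
*  0.2 DNS_FROM_RFC_ABUSE RBL: Envelope sender in abuse.rfc-ignorant.org
*  1.4 DNS_FROM_RFC_WHOIS RBL: Envelope sender in whois.rfc-ignorant.org
*  0.8 RCVD_IN_BLARS RBL: Received via a relay in block.blars.org
* -0.5 AWL AWL: From: address is in the auto white-list
"""

# A rule line is "*", a signed decimal score, then an upper-case rule name.
RULE_LINE = re.compile(r"^\s*\*\s+(-?\d+\.\d+)\s+([A-Z0-9_]+)\b")

THRESHOLD = 5.0  # assumed default spam threshold


def parse_scores(report):
    """Return a {rule_name: score} dict for every rule line in the report."""
    scores = {}
    for line in report.splitlines():
        match = RULE_LINE.match(line)
        if match:
            scores[match.group(2)] = float(match.group(1))
    return scores


scores = parse_scores(REPORT)
total = round(sum(scores.values()), 1)
without_subject = round(total - scores["SUBJ_ILLEGAL_CHARS"], 1)

print("total:", total, "spam" if total >= THRESHOLD else "ham")
print("without SUBJ_ILLEGAL_CHARS:", without_subject,
      "spam" if without_subject >= THRESHOLD else "ham")
```

With these numbers the subject rule alone decides whether the message crosses
the assumed threshold. The AWL adjustment would likely differ if the other
scores changed, so the second figure is only indicative.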

Regards,

Keith


Re: Scoring for rule SUBJ_ILLEGAL_CHARS

Posted by Kai Schaetzl <ma...@conactive.com>.
Kelson wrote on Fri, 12 May 2006 14:23:55 -0700:

> I count two:  The ü in für and the ´ in MODEL´S, which is different from 
> the ASCII single quote/apostrophe: '

Ah, you are right, I missed the "ü", it's too "natural" for me.
Nevertheless, "too many" implies a bit more than *two* to me. I can't
say exactly how many, but I'd use a better description. The rule is an eval
rule, so I don't know how many characters it needs; maybe it's really just
one.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




Re: Scoring for rule SUBJ_ILLEGAL_CHARS

Posted by Kelson <ke...@speed.net>.
Kai Schaetzl wrote:
> The subject line hitting in the case of our customer was:
> Bewerbung für INS-2006-05-44444, "MODEL´S GESUCHT!!!"
> 
> I can identify only one character that is outside the ASCII range.

I count two:  The ü in für and the ´ in MODEL´S, which is different from
the ASCII single quote/apostrophe: '
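
The count above can be checked with a few lines of Python. This is only a
sketch of counting characters outside the 7-bit ASCII range; it is not how the
SpamAssassin eval rule itself is implemented, and the function name
non_ascii_chars is made up for illustration.

```python
# The subject line quoted above, with the two non-ASCII characters written
# as escapes so the example does not depend on the file's encoding.
SUBJECT = 'Bewerbung f\u00fcr INS-2006-05-44444, "MODEL\u00b4S GESUCHT!!!"'


def non_ascii_chars(text):
    """Return (character, code point) pairs for everything above U+007F."""
    return [(ch, "U+%04X" % ord(ch)) for ch in text if ord(ch) > 127]


found = non_ascii_chars(SUBJECT)
for ch, codepoint in found:
    print(ch, codepoint)
print(len(found), "non-ASCII characters")
```

The second character is the acute accent (U+00B4), which is easy to mistake
for the ASCII apostrophe (U+0027).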

-- 
Kelson Vibber
SpeedGate Communications <www.speed.net>

Re: Scoring for rule SUBJ_ILLEGAL_CHARS

Posted by jdow <jd...@earthlink.net>.
From: "Kai Schaetzl" <ma...@conactive.com>

> Theo Van Dinter wrote on Thu, 11 May 2006 13:49:11 -0400:
>
>> fwiw, the 8-bit characters ought to be encoded in base64 or quoted-printable.
>> then the rule wouldn't hit.
>
> I just found the same problem here with a whole bunch of messages coming from
> the same source. It seems the rule hits on *one* occurrence of a non-ASCII
> character; however, the description says "Subject: has too many raw illegal
> characters". At least the description is wrong then.
> And, as Keith explains, I think that score is excessive. It's fairly common that
> some mail programs, especially if webmail or form-generated, have at least one
> non-encoded character in the subject.
>
> The subject line hitting in the case of our customer was:
> Bewerbung für INS-2006-05-44444, "MODEL´S GESUCHT!!!"
>
> I can identify only one character that is outside the ASCII range.
>
> Kai

1 is too many, of course.
{^_-} 


Re: Scoring for rule SUBJ_ILLEGAL_CHARS

Posted by Kai Schaetzl <ma...@conactive.com>.
Theo Van Dinter wrote on Thu, 11 May 2006 13:49:11 -0400:

> fwiw, the 8-bit characters ought to be encoded in base64 or quoted-printable. 
> then the rule wouldn't hit.

I just found the same problem here with a whole bunch of messages coming from
the same source. It seems the rule hits on *one* occurrence of a non-ASCII
character; however, the description says "Subject: has too many raw illegal
characters". At least the description is wrong then.
And, as Keith explains, I think that score is excessive. It's fairly common that
some mail programs, especially if webmail or form-generated, have at least one
non-encoded character in the subject.

The subject line hitting in the case of our customer was:
Bewerbung für INS-2006-05-44444, "MODEL´S GESUCHT!!!"

I can identify only one character that is outside the ASCII range.

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com




Re: Scoring for rule SUBJ_ILLEGAL_CHARS

Posted by Theo Van Dinter <fe...@apache.org>.
On Thu, May 11, 2006 at 07:47:15PM +0200, Keith Dunnett wrote:
> I've recently had a couple of false positives caused by this rule, and think
> it may be scored too highly for a single check. The e-mails in question were
> in Spanish, and the Spanish word for linguistics has two accented characters
> which is enough to trigger this rule.

fwiw, the 8-bit characters ought to be encoded in base64 or quoted-printable.
then the rule wouldn't hit.

> Admittedly, the blacklists account for 2.4 points (it was from Yahoo) but the
> 4.3 point score for the subject alone strikes me as excessive. I understand
> that anything that is not English is inherently suspect for most users, but
> to give 86% of the default spam score on almost *any* single rule would seem
> to me to be overkill.

It's actually less about english vs non-english and more about messages
violating the rfc (non 7-bit ascii chars need to be encoded in the header).
however, english maps to 7-bit ascii very well, so ...
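
As an illustration of the encoding point, here is a small sketch using Python's
standard email.header module to turn a subject with raw 8-bit characters into
RFC 2047 encoded words. The sample subject is the one quoted elsewhere in this
thread, and the choice of UTF-8 is an assumption; a sending program could just
as well use ISO-8859-1.

```python
from email.header import Header, decode_header, make_header

# Subject with two non-ASCII characters, written as escapes for portability.
subject = 'Bewerbung f\u00fcr INS-2006-05-44444, "MODEL\u00b4S GESUCHT!!!"'

# Header.encode() produces RFC 2047 encoded words (base64 or
# quoted-printable), so the result contains only 7-bit ASCII.
encoded = Header(subject, charset="utf-8").encode()
print(encoded)

# A receiving client reverses the process to show the original text.
decoded = str(make_header(decode_header(encoded)))
print(decoded)
```

A header encoded this way carries no raw 8-bit characters, which is the
condition under which the rule is said above not to hit.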

-- 
Randomly Generated Tagline:
"They who can give up essential liberty to obtain a little temporary
 safety deserve neither liberty nor safety." - Benjamin Franklin