Posted to users@spamassassin.apache.org by mamalos <ma...@eng.auth.gr> on 2010/01/18 19:01:53 UTC

Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Dear all,

I have three servers running spamassassin 3.2.5, and I have noticed the
following, unwanted behavior:

If someone sends an email with a subject written in both English and Greek,
spamassassin makes the following mistake: if all *English* words are
capitalized (even if there is only one English word), and at least one of the
Greek words is not capitalized, then the rule SUBJ_ALL_CAPS still fires,
and the email gets an additional 2.1 points of spam score, which is wrong.

I am not well acquainted with spamassassin's rule-writing-syntax or which
function is involved, so I cannot send you a corrected version of the rule
SUBJ_ALL_CAPS. I assume that through this list, someone who is an editor of
these rulesets may be informed and correct this misbehavior.

Thank you all for your time, in advance.
-- 
View this message in context: http://old.nabble.com/Wrong-functionality-of-SUBJ_ALL_CAPS-in-mixed-English-and-Greek-subject-tp27214418p27214418.html
Sent from the SpamAssassin - Users mailing list archive at Nabble.com.


Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Kai Schaetzl <ma...@conactive.com>.
Mamalos wrote on Tue, 19 Jan 2010 05:19:30 -0800 (PST):

> >     $subject =~ s/[^a-zA-Z]//g;          # only look at letters

The relevant part seems to be this. It removes all other characters 
(including the Greek ones). As I said, it's hard to determine capitals 
for each and every language. So this looks like it is by design to me.
Your best bet is to disable this rule if you think it doesn't do 
you any good.

Kai

-- 
Get your web at Conactive Internet Services: http://www.conactive.com




Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by mamalos <ma...@eng.auth.gr>.


Mike Cardwell-16 wrote:
> 
> On 19/01/2010 10:07, mamalos wrote:
> 
>>> I just pasted that email into spamalyser.com and it gave this:
>>> http://spamalyser.com/v/u32d10ix/mime
>>>
>>> The subject looks fully capitalised to me when decoded? I'm not overly
>>> proficient on my Greek though.
>>>
>>> --
>>> Mike Cardwell    : UK based IT Consultant, LAMP developer, Linux admin
>>> Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
>>> Technical Blog   : Tech Blog  - https://secure.grepular.com/blog/
>>> Spamalyser       : Spam Tool  - http://spamalyser.com/
>>
>>  From the link you sent me (spamalyser), the subject is all in lower case
>> except for the word "TEST", which is written in English.
> 
> Then I don't know the Greek alphabet. The relevant subroutine from 
> SpamAssassin::Plugin::HeaderEval is below:
> 
> ================================================================================
> sub subject_is_all_caps {
>     my ($self, $pms) = @_;
>     my $subject = $pms->get('Subject');
> 
>     $subject =~ s/^\s+//;
>     $subject =~ s/\s+$//;
>     return 0 if $subject !~ /\s/;        # don't match one word subjects
>     return 0 if (length $subject < 10);  # don't match short subjects
>     $subject =~ s/[^a-zA-Z]//g;          # only look at letters
> 
>     # now, check to see if the subject is encoded using a non-ASCII charset.
>     # If so, punt on this test to avoid FPs.  We just list the known charsets
>     # this test will FP on, here.
>     my $subjraw = $pms->get('Subject:raw');
>     my $CLTFAC = Mail::SpamAssassin::Constants::CHARSETS_LIKELY_TO_FP_AS_CAPS;
>     if ($subjraw =~ /=\?${CLTFAC}\?/i) {
>       return 0;
>     }
> 
>     return length($subject) && ($subject eq uc($subject));
> }
> ================================================================================
> 
> I guess another exception needs adding?
> 
> -- 
> Mike Cardwell    : UK based IT Consultant, LAMP developer, Linux admin
> Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
> Technical Blog   : Tech Blog  - https://secure.grepular.com/blog/
> Spamalyser       : Spam Tool  - http://spamalyser.com/
> 
> 

My Perl and Perl regexes are very rusty, so I am not sure about the code you
mention above. The only thing I see that troubles me is the line
that reads:


 $subject =~ s/[^a-zA-Z]//g;          # only look at letters

which would keep only Latin characters. After I saw this, I sent an email
with a subject written entirely in Greek, where all letters were caps. The
rule did not fire, which means that the function does not check the Greek
part of the string at all, and only checks the Latin part.

Since the last line reads:

return length($subject) && ($subject eq uc($subject));

and $subject does not contain any Greek characters, the outcome returned
will probably be wrong. My problem is that I cannot see where
non-ASCII characters are handled in the above code snippet, and whether all
characters of the subject are checked correctly.
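
To make the two observations above concrete, here is a rough Python port of
the quoted Perl logic. It is only a sketch under stated assumptions: the name
subject_is_all_caps_ascii is made up for illustration, it works on an
already-decoded subject string, and it leaves out the raw-charset exemption,
which apparently did not apply to the ISO-8859-7 example in this thread.

```python
import re


def subject_is_all_caps_ascii(subject: str) -> bool:
    """Rough port of the quoted Perl check; name and signature are illustrative."""
    subject = subject.strip()
    if not re.search(r"\s", subject):   # don't match one-word subjects
        return False
    if len(subject) < 10:               # don't match short subjects
        return False
    # Keep ASCII letters only, as s/[^a-zA-Z]//g does; Greek letters vanish here.
    letters = re.sub(r"[^a-zA-Z]", "", subject)
    return bool(letters) and letters == letters.upper()


# Mixed subject from the thread: only "TEST" survives the stripping step.
print(subject_is_all_caps_ascii("τεστ από γμαιλ TEST"))   # True
# All-caps Greek subject: nothing survives, so the check returns False.
print(subject_is_all_caps_ascii("ΤΕΣΤ ΑΠΟ ΓΜΑΙΛ"))        # False
```

Both results match what was reported: the mixed subject fires the rule because
of the single word "TEST", and the all-caps Greek subject does not fire it.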

Thanks again






Re: [sa] Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Kai Schaetzl <ma...@conactive.com>.
Charles Gregory wrote on Fri, 22 Jan 2010 09:55:56 -0500 (EST):

> Yup. Lazy. Fixed now. Thanks.

Thank you, Charles. This really makes a difference for those of us who use 
a client that can display quoted text differently. Thanks 
again!

Kai

-- 
Get your web at Conactive Internet Services: http://www.conactive.com




Re: [sa] Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Charles Gregory <cg...@hwcn.org>.
On Fri, 22 Jan 2010, Matus UHLAR - fantomas wrote:
>> On Tue, 19 Jan 2010, Mike Cardwell wrote:
>> : Then I don't know the Greek alphabet. The relevant subroutine from
>> : SpamAssassin::Plugin::HeaderEval is below:
>> :    $subject =~ s/[^a-zA-Z]//g;          # only look at letters
> On 19.01.10 10:28, Charles Gregory wrote:
>> I think the 'issue' is that spamassassin *should* have some 'higher level'
>> check for the *language* of the header.
> you apparently mean "charset" :)

Yup.

> btw you have been asked to use ">" for quoting, haven't you?

Yup. Lazy. Fixed now. Thanks.

- C

Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.
> On Tue, 19 Jan 2010, Mike Cardwell wrote:
> : Then I don't know the Greek alphabet. The relevant subroutine from
> : SpamAssassin::Plugin::HeaderEval is below:
> :    $subject =~ s/[^a-zA-Z]//g;          # only look at letters

On 19.01.10 10:28, Charles Gregory wrote:
> I think the 'issue' is that spamassassin *should* have some 'higher level' 
> check for the *language* of the header.

you apparently mean "charset" :)
btw you have been asked to use ">" for quoting, haven't you?

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
If Barbie is so popular, why do you have to buy her friends? 

Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Charles Gregory <cg...@hwcn.org>.
On Tue, 19 Jan 2010, Mike Cardwell wrote:
: Then I don't know the Greek alphabet. The relevant subroutine from
: SpamAssassin::Plugin::HeaderEval is below:
:    $subject =~ s/[^a-zA-Z]//g;          # only look at letters

I think the 'issue' is that spamassassin *should* have some 'higher level' 
check for the *language* of the header. If it is 'encoded' in a non-Latin 
character set, then it should 'know' it cannot perform tests like all-caps.
I thought I had read somewhere that it *does* this. Was I wrong, or did 
this 'sanity check' somehow get omitted during upgrades?
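
A check of this kind can be sketched in Python as below. It is modelled on
the `$subjraw =~ /=\?${CLTFAC}\?/i` test in the subroutine quoted elsewhere in
this thread, but the charset fragments listed are illustrative assumptions:
the thread never shows what CHARSETS_LIKELY_TO_FP_AS_CAPS actually contains,
and the name charset_exempts is hypothetical.

```python
import re

# Illustrative fragments only; this is NOT SpamAssassin's actual
# CHARSETS_LIKELY_TO_FP_AS_CAPS constant, whose contents the thread does not show.
FP_CHARSET_FRAGMENT = r"[-_a-z0-9]*(?:koi|jis|euc|big5)[-_a-z0-9]*"
FP_CHARSET_RE = re.compile(r"=\?" + FP_CHARSET_FRAGMENT + r"\?", re.IGNORECASE)


def charset_exempts(raw_subject: str) -> bool:
    """True if the raw header names a charset on the (hypothetical) exemption list."""
    return FP_CHARSET_RE.search(raw_subject) is not None


print(charset_exempts("=?KOI8-R?B?0NLJ18XU?="))                            # True
print(charset_exempts("=?ISO-8859-7?B?9OXz9CDh8Pwg4+zh6esgVEVTVA==?="))    # False
```

With a list like this, a Greek ISO-8859-7 subject is not exempted, so the
all-caps test would still run on it.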

The 'problem' with the all-caps test is that it is designed to eliminate 
extraneous non-alphabetic characters, to get around simple spammer tricks 
like gappy or obfuscated text.

To the OP: Is it possible that the 'Greek' is being used, but not properly 
encoded, so that the sanity check I mention above would fail? I 
occasionally see non-English subjects that slip by the 'faraway' character 
set tests because they weren't encoded properly....

- C

Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Mike Cardwell <sp...@lists.grepular.com>.
On 19/01/2010 10:07, mamalos wrote:

>> I just pasted that email into spamalyser.com and it gave this:
>> http://spamalyser.com/v/u32d10ix/mime
>>
>> The subject looks fully capitalised to me when decoded? I'm not overly
>> proficient on my Greek though.
>>
>> --
>> Mike Cardwell    : UK based IT Consultant, LAMP developer, Linux admin
>> Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
>> Technical Blog   : Tech Blog  - https://secure.grepular.com/blog/
>> Spamalyser       : Spam Tool  - http://spamalyser.com/
>
>  From the link you sent me (spamalyser), the subject is all in lower case
> except for the word "TEST", which is written in English.

Then I don't know the Greek alphabet. The relevant subroutine from 
SpamAssassin::Plugin::HeaderEval is below:

================================================================================
sub subject_is_all_caps {
    my ($self, $pms) = @_;
    my $subject = $pms->get('Subject');

    $subject =~ s/^\s+//;
    $subject =~ s/\s+$//;
    return 0 if $subject !~ /\s/;        # don't match one word subjects
    return 0 if (length $subject < 10);  # don't match short subjects
    $subject =~ s/[^a-zA-Z]//g;          # only look at letters

    # now, check to see if the subject is encoded using a non-ASCII charset.
    # If so, punt on this test to avoid FPs.  We just list the known charsets
    # this test will FP on, here.
    my $subjraw = $pms->get('Subject:raw');
    my $CLTFAC = Mail::SpamAssassin::Constants::CHARSETS_LIKELY_TO_FP_AS_CAPS;
    if ($subjraw =~ /=\?${CLTFAC}\?/i) {
      return 0;
    }

    return length($subject) && ($subject eq uc($subject));
}
================================================================================

I guess another exception needs adding?
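
As an alternative to adding one more charset exception, the check could look
at all letters rather than only a-z and A-Z. The Python sketch below shows the
idea on a decoded subject string. It is a hypothetical variant, not
SpamAssassin's implementation and not a tested fix; the function name is an
assumption.

```python
def subject_is_all_caps_unicode(subject: str) -> bool:
    """Hypothetical Unicode-aware variant; not SpamAssassin's implementation."""
    subject = subject.strip()
    if len(subject.split()) < 2 or len(subject) < 10:
        return False                      # same one-word / short-subject guards
    letters = [ch for ch in subject if ch.isalpha()]
    has_upper = any(ch.isupper() for ch in letters)
    has_lower = any(ch.islower() for ch in letters)
    # Caseless scripts (e.g. CJK) have neither, so they never count as all caps.
    return has_upper and not has_lower


print(subject_is_all_caps_unicode("τεστ από γμαιλ TEST"))   # False
print(subject_is_all_caps_unicode("ΤΕΣΤ ΑΠΟ ΓΜΑΙΛ TEST"))   # True
```

Requiring at least one upper-case letter and no lower-case letter means that
scripts without case distinctions are never flagged, which is the problem the
charset list tries to avoid.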

-- 
Mike Cardwell    : UK based IT Consultant, LAMP developer, Linux admin
Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
Technical Blog   : Tech Blog  - https://secure.grepular.com/blog/
Spamalyser       : Spam Tool  - http://spamalyser.com/

Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by mamalos <ma...@eng.auth.gr>.


Mike Cardwell-16 wrote:
> 
> On 19/01/2010 09:11, mamalos wrote:
> 
>>>> and at least one of the
>>>> Greek words is not capitalized,
>>>
>>> Greek? In a subject? Encoded, unencoded?
>>>
>>>> I assume that through this list, someone who is an editor of
>>>> these rulesets may be informed and correct this misbehavior.
>>>
>>> You can submit it as a bug. But first it might be helpful to look at the
>>> subject, for instance if it is encoded or not. Why didn't you provide an
>>> example?
>>>
>>> Kai
>>>
>> The mail is encoded as well as the subject. Here is an example:
>>
>> MIME-Version: 1.0
>> Received: by 10.142.9.1 with SMTP id 1mr1347249wfi.92.1263827282289; Mon, 18
>>          Jan 2010 07:08:02 -0800 (PST)
>> Date: Mon, 18 Jan 2010 17:08:02 +0200
>> Message-ID:<39...@mail.gmail.com>
>> Subject: =?ISO-8859-7?B?9OXz9CDh8Pwg4+zh6esgVEVTVA==?=
>> From: sender<se...@example.com>
>> To: recipient@anotherexample.com
>> Content-Type: multipart/alternative; boundary=00504502af8e37e4ce047d71b85b
>> X-Virus-Scanned: ClamAV using ClamSMTP
>>
>> --00504502af8e37e4ce047d71b85b
>> Content-Type: text/plain; charset=ISO-8859-7
>> Content-Transfer-Encoding: base64
>>
>> 1MXT1CDl3/Dh7OUK
>> --00504502af8e37e4ce047d71b85b
>> Content-Type: text/html; charset=ISO-8859-7
>> Content-Transfer-Encoding: base64
>>
>> 1MXT1CDl3/Dh7OU8YnI+Cg==
>> --00504502af8e37e4ce047d71b85b--
>>
>> So, where should I report this bug?
> 
> I just pasted that email into spamalyser.com and it gave this: 
> http://spamalyser.com/v/u32d10ix/mime
> 
> The subject looks fully capitalised to me when decoded? I'm not overly 
> proficient on my Greek though.
> 
> -- 
> Mike Cardwell    : UK based IT Consultant, LAMP developer, Linux admin
> Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
> Technical Blog   : Tech Blog  - https://secure.grepular.com/blog/
> Spamalyser       : Spam Tool  - http://spamalyser.com/
> 
> 

From the link you sent me (spamalyser), the subject is all in lower case
except for the word "TEST", which is written in English.


Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Mike Cardwell <sp...@lists.grepular.com>.
On 19/01/2010 09:11, mamalos wrote:

>>> and at least one of the
>>> Greek words is not capitalized,
>>
>> Greek? In a subject? Encoded, unencoded?
>>
>>> I assume that through this list, someone who is an editor of
>>> these rulesets may be informed and correct this misbehavior.
>>
>> You can submit it as a bug. But first it might be helpful to look at the
>> subject, for instance if it is encoded or not. Why didn't you provide an
>> example?
>>
>> Kai
>>
> The mail is encoded as well as the subject. Here is an example:
>
> MIME-Version: 1.0
> Received: by 10.142.9.1 with SMTP id 1mr1347249wfi.92.1263827282289; Mon, 18
>          Jan 2010 07:08:02 -0800 (PST)
> Date: Mon, 18 Jan 2010 17:08:02 +0200
> Message-ID:<39...@mail.gmail.com>
> Subject: =?ISO-8859-7?B?9OXz9CDh8Pwg4+zh6esgVEVTVA==?=
> From: sender<se...@example.com>
> To: recipient@anotherexample.com
> Content-Type: multipart/alternative; boundary=00504502af8e37e4ce047d71b85b
> X-Virus-Scanned: ClamAV using ClamSMTP
>
> --00504502af8e37e4ce047d71b85b
> Content-Type: text/plain; charset=ISO-8859-7
> Content-Transfer-Encoding: base64
>
> 1MXT1CDl3/Dh7OUK
> --00504502af8e37e4ce047d71b85b
> Content-Type: text/html; charset=ISO-8859-7
> Content-Transfer-Encoding: base64
>
> 1MXT1CDl3/Dh7OU8YnI+Cg==
> --00504502af8e37e4ce047d71b85b--
>
> So, where should I report this bug?

I just pasted that email into spamalyser.com and it gave this: 
http://spamalyser.com/v/u32d10ix/mime

The subject looks fully capitalised to me when decoded? I'm not overly 
proficient on my Greek though.
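
For anyone who wants to check the decoding locally rather than with a web
tool, a minimal Python sketch using the standard library's
email.header.decode_header follows. The helper name decode_subject is only
illustrative.

```python
from email.header import decode_header


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a header value into one string."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unencoded chunks carry no charset; treat them as ASCII.
            parts.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            parts.append(chunk)
    return "".join(parts)


raw = "=?ISO-8859-7?B?9OXz9CDh8Pwg4+zh6esgVEVTVA==?="
print(decode_subject(raw))  # τεστ από γμαιλ TEST
```

The three Greek words decode to lower-case letters; only the Latin word
"TEST" is in capitals.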

-- 
Mike Cardwell    : UK based IT Consultant, LAMP developer, Linux admin
Cardwell IT Ltd. : UK Company - http://cardwellit.com/       #06920226
Technical Blog   : Tech Blog  - https://secure.grepular.com/blog/
Spamalyser       : Spam Tool  - http://spamalyser.com/

Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Kai Schaetzl <ma...@conactive.com>.
Mamalos wrote on Tue, 19 Jan 2010 01:11:49 -0800 (PST):

> The mail is encoded as well as the subject. Here is an example:

Yeah, and they are lower-case, I see.

> So, where should I report this bug?

I think the problem here is determining capital/non-capital letters in 
scripts other than ASCII. It's probably impossible to do that for all 
languages. This is actually an eval rule (i.e. it points to a function, not a 
simple regex), so there have probably been at least some attempts at getting 
better results than just for ASCII. In short: the way it works now may be by 
design. Anyway, if you go to spamassassin.org you will find a link to "Bugs", 
where you can report it.

Kai

-- 
Get your web at Conactive Internet Services: http://www.conactive.com




Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by mamalos <ma...@eng.auth.gr>.

Kai Schaetzl wrote:
> 
> Mamalos wrote on Mon, 18 Jan 2010 10:01:53 -0800 (PST):
> 
>> and at least one of the
>> Greek words is not capitalized,
> 
> Greek? In a subject? Encoded, unencoded?
> 
>> I assume that through this list, someone who is an editor of
>> these rulesets may be informed and correct this misbehavior.
> 
> You can submit it as a bug. But first it might be helpful to look at the 
> subject, for instance if it is encoded or not. Why didn't you provide an 
> example?
> 
> Kai
> 
The mail is encoded as well as the subject. Here is an example:

MIME-Version: 1.0
Received: by 10.142.9.1 with SMTP id 1mr1347249wfi.92.1263827282289; Mon, 18
        Jan 2010 07:08:02 -0800 (PST)
Date: Mon, 18 Jan 2010 17:08:02 +0200
Message-ID: <39...@mail.gmail.com>
Subject: =?ISO-8859-7?B?9OXz9CDh8Pwg4+zh6esgVEVTVA==?=
From: sender <se...@example.com>
To: recipient@anotherexample.com
Content-Type: multipart/alternative; boundary=00504502af8e37e4ce047d71b85b
X-Virus-Scanned: ClamAV using ClamSMTP

--00504502af8e37e4ce047d71b85b
Content-Type: text/plain; charset=ISO-8859-7
Content-Transfer-Encoding: base64

1MXT1CDl3/Dh7OUK
--00504502af8e37e4ce047d71b85b
Content-Type: text/html; charset=ISO-8859-7
Content-Transfer-Encoding: base64

1MXT1CDl3/Dh7OU8YnI+Cg==
--00504502af8e37e4ce047d71b85b--



So, where should I report this bug?


Re: Wrong functionality of SUBJ_ALL_CAPS in mixed English and Greek subject

Posted by Kai Schaetzl <ma...@conactive.com>.
Mamalos wrote on Mon, 18 Jan 2010 10:01:53 -0800 (PST):

> and at least one of the
> Greek words is not capitalized,

Greek? In a subject? Encoded, unencoded?

> I assume that through this list, someone who is an editor of
> these rulesets may be informed and correct this misbehavior.

You can submit it as a bug. But first it might be helpful to look at the 
subject, for instance if it is encoded or not. Why didn't you provide an 
example?

Kai

-- 
Get your web at Conactive Internet Services: http://www.conactive.com