Posted to users@spamassassin.apache.org by AJ Weber <aw...@comcast.net> on 2017/12/13 18:44:49 UTC

check utf-8 subjects/from?

Is there an easy way to check if the Subject or From is UTF-8 -- or 
non-ASCII -- char set?

I see in some of my recent spam, either the Subject or the From 
(sometimes both) starts with "=?UTF-8?" (in these cases the rest is 
Base64 encoded, but I don't want to qualify on that).

If I check a header with a "header ... =~" regex rule, is it the raw 
text that I will check, or is it the decoded characters I will be 
checking against?

If it's the raw text, I can probably just look for that prefix to 
indicate the UTF-8 encoding.

I do get some legitimate emails with encoded characters, emojis, etc., but 
I think I'd like a rule that counts this toward a message being spam in general.

Thanks again,
AJ

Re: check utf-8 subjects/from?

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 13 Dec 2017, at 21:08 (-0500), David B Funk wrote:

[...]
> There's nothing that prevents a mailer from using either for purely 
> 7-bit ASCII,
> even though it isn't necessary. You are more likely to see that used 
> by international clients. They may just utf-8 encode by default so not 
> to have to do special processing for non 7-bit ASCII headers.

There's even a SA rule for that: FROM_EXCESS_BASE64

--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Currently Seeking Steady Work: https://linkedin.com/in/billcole

Re: check utf-8 subjects/from?

Posted by John Hardin <jh...@impsec.org>.
On Wed, 13 Dec 2017, Alex wrote:

> We've been seeing a number of emails with subjects using UTF-8 in an
> attempt to obscure the sender by using some form of 8-bit characters.
> For example, this spells dropbox:
>
>  From: "=?utf-8?B?xJByb3Bib8+X?=" <ab...@ecacolleges.com>
>
> How would we write a header rule against that? Just use From:raw?
>
> Is it possible to write a rule using the decoded characters, like
> "dróp-bóx" or "Dṙopḇoẋ"?
>
> I've also tried variations of "dropbox" such as "dr?pb?x" etc...

There are already obfuscated-text rules, and the subject is incorporated 
in the body text so they would scan that.

Take a look at the existing FUZZY_* rules.

Possibly (untested):

     body          FUZZY_DROPBOX  /<D>(?!ropbox)<R><O><P><B><O><X>/i
     replace_rules FUZZY_DROPBOX
     describe      FUZZY_DROPBOX  Obfuscated "dropbox"
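
To illustrate the idea behind the rule above outside SpamAssassin, here is a minimal Python sketch. It assumes a small, hypothetical table of lookalike characters (LOOKALIKES is made up for this example; the real <D>, <R>, ... replace tags are defined in SpamAssassin's own rule files and are not reproduced here). As in the rule, a lookahead after the first letter rejects the plain spelling.

```python
import re

# Hypothetical lookalike sets, for illustration only.
LOOKALIKES = {
    "d": "d\u0110\u0111",          # d, D with stroke, d with stroke
    "r": "r\u1e59",                # r, r with dot above
    "o": "o0\u00f3\u00f6",         # o, zero, o acute, o diaeresis
    "b": "b\u1e07",                # b, b with line below
    "x": "x\u1e8b\u03d7",          # x, x with dot above, Greek kai symbol
}


def fuzzy_pattern(word: str) -> re.Pattern:
    """Match obfuscated spellings of `word`, but not the plain spelling."""
    classes = ["[" + re.escape(LOOKALIKES.get(c, c)) + "]" for c in word]
    # After the first character, refuse the plain remainder of the word,
    # mirroring the (?!ropbox) lookahead in the rule above.
    guard = "(?!" + re.escape(word[1:]) + ")"
    return re.compile(classes[0] + guard + "".join(classes[1:]), re.IGNORECASE)


FUZZY_DROPBOX = fuzzy_pattern("dropbox")
```

With this table, FUZZY_DROPBOX.search() finds "dr0pb0x" and the decoded display name from the example From header, but not a plain "dropbox" or "Dropbox".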



-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79

Re: check utf-8 subjects/from?

Posted by Alex <my...@gmail.com>.
Hi,

On Wed, Dec 13, 2017 at 9:08 PM, David B Funk
<db...@engineering.uiowa.edu> wrote:
> On Wed, 13 Dec 2017, AJ Weber wrote:
>
>> Is there an easy way to check if the Subject or From is UTF-8 -- or
>> non-ASCII -- char set?
>>
>> I see in some of my recent spam, either the Subject or the From (sometimes
>> both) starts with "=?UTF-8?" (in these cases the rest is Base64 encoded, but
>> I don't want to qualify on that).
>>
>> If I check a header with a "header ... =~" regex rule, is it the raw text
>> that I will check, or is it the decoded characters I will be checking
>> against?
>>
>> If it's the raw text, I can probably just look for that prefix to indicate
>> the UTF-8 encoding.
>>
>> I do get some legitimate emails with encoded chars and emojis, etc...but I
>> think I'd like a rule to support it being SPAM in general.
>
>
> As other people have said, the header ":raw" rule form will let you match on
> that.
> There are two commonly used encoding methods for UTF-8:
>  Base64 "=?utf-8?B?"
>  Quoted-Printable "=?utf-8?Q?"
>
> There's nothing that prevents a mailer from using either for purely 7-bit
> ASCII,
> even though it isn't necessary. You are more likely to see that used by
> international clients. They may just utf-8 encode by default so not to have
> to do special processing for non 7-bit ASCII headers.

We've been seeing a number of emails with subjects using UTF-8 in an
attempt to obscure the sender by using some form of 8-bit characters.
For example, this spells dropbox:

  From: "=?utf-8?B?xJByb3Bib8+X?=" <ab...@ecacolleges.com>

How would we write a header rule against that? Just use From:raw?

Is it possible to write a rule using the decoded characters, like
"dróp-bóx" or "Dṙopḇoẋ"?

I've also tried variations of "dropbox" such as "dr?pb?x" etc...
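
For anyone who wants to see which characters that From display name actually decodes to before writing a rule, here is a small Python sketch using the standard library's email.header module. It only inspects the header outside SpamAssassin; the helper name decode_display_name is made up for this example.

```python
from email.header import decode_header, make_header


def decode_display_name(raw: str) -> str:
    """Decode RFC 2047 encoded-words in a raw header value to text."""
    return str(make_header(decode_header(raw)))


raw_name = "=?utf-8?B?xJByb3Bib8+X?="
decoded = decode_display_name(raw_name)

# List the code points of the non-ASCII characters in the decoded name.
non_ascii = [f"U+{ord(c):04X}" for c in decoded if ord(c) > 127]
print(decoded, non_ascii)
```

For the example above, the only non-ASCII characters are the first one (U+0110, a D with stroke) and the last one (U+03D7, a Greek symbol standing in for "x"); the middle "ropbo" is plain ASCII.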

Re: check utf-8 subjects/from?

Posted by David B Funk <db...@engineering.uiowa.edu>.
On Wed, 13 Dec 2017, AJ Weber wrote:

> Is there an easy way to check if the Subject or From is UTF-8 -- or non-ASCII 
> -- char set?
>
> I see in some of my recent spam, either the Subject or the From (sometimes 
> both) starts with "=?UTF-8?" (in these cases the rest is Base64 encoded, but 
> I don't want to qualify on that).
>
> If I check a header with a "header ... =~" regex rule, is it the raw text 
> that I will check, or is it the decoded characters I will be checking 
> against?
>
> If it's the raw text, I can probably just look for that prefix to indicate 
> the UTF-8 encoding.
>
> I do get some legitimate emails with encoded chars and emojis, etc...but I 
> think I'd like a rule to support it being SPAM in general.

As other people have said, the header ":raw" rule form will let you match on that.
There are two commonly used encoding methods for UTF-8:
  Base64 "=?utf-8?B?"
  Quoted-Printable "=?utf-8?Q?"

There's nothing that prevents a mailer from using either for purely 7-bit ASCII,
even though it isn't necessary. You are more likely to see that used by 
international clients. They may just UTF-8 encode by default so as not to have to 
do special processing for non-7-bit-ASCII headers.
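
As a rough illustration of the two forms, the following Python sketch finds the encoded-words in a raw header value, reports whether each is Base64 (B) or Quoted-Printable (Q), and flags payloads that decode to pure 7-bit ASCII, i.e. where the encoding was unnecessary. The function name and regex are assumptions made for this example, not anything SpamAssassin provides.

```python
import re
from email.header import decode_header

# RFC 2047 encoded-word: =?charset?B-or-Q?payload?=
ENCODED_WORD = re.compile(r"=\?([^?\s]+)\?([bBqQ])\?([^?\s]*)\?=")


def describe_encoded_words(raw: str) -> list:
    """Return (charset, encoding, is_pure_ascii) for each encoded-word."""
    results = []
    for match in ENCODED_WORD.finditer(raw):
        charset = match.group(1).lower()
        encoding = match.group(2).upper()
        # decode_header yields the payload as bytes for an encoded-word.
        payload, _ = decode_header(match.group(0))[0]
        is_ascii = all(byte < 128 for byte in payload)
        results.append((charset, encoding, is_ascii))
    return results
```

A header such as "=?UTF-8?Q?Hello_World?=" comes back as ("utf-8", "Q", True): encoded, but with nothing in it that needed encoding.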


-- 
Dave Funk                                  University of Iowa
<dbfunk (at) engineering.uiowa.edu>        College of Engineering
319/335-5751   FAX: 319/384-0549           1256 Seamans Center
Sys_admin/Postmaster/cell_admin            Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Re: check utf-8 subjects/from?

Posted by RW <rw...@googlemail.com>.
On Wed, 13 Dec 2017 13:44:49 -0500
AJ Weber wrote:

> If I check a header with a "header ... =~" regex rule, is it the raw 
> text that I will check, or is it the decoded characters I will be 
> checking against?

You can use From:raw to get the raw From header; without :raw, a header
rule matches against the decoded text.


BTW if you want to ask a new question, please just send an email to the
list address rather than reply to an existing thread. 

Re: check utf-8 subjects/from?

Posted by RW <rw...@googlemail.com>.
On Wed, 13 Dec 2017 18:37:59 -0500
AJ Weber wrote:


> >>>
> >>> that tells me that roughly 10% of all ham mails would hit
> There seems to be a large disparity between your (10%) result and my 
> (2%) result.  Can you explain how that could be?

He's Austrian, so it's probably mainly due to umlauts.

Re: check utf-8 subjects/from?

Posted by AJ Weber <aw...@comcast.net>.
On 12/13/2017 6:58 PM, Reindl Harald wrote:
> > There seems to be a large disparity between your (10%) result and my
> > (2%) result.  Can you explain how that could be?
>
> sure: as soon as you have more than just English messages, it looks 
> completely different. And don't forget that the corpus I ran the quick 
> grep against is only a very small subset of the real mail flow, kept for 
> training as ham when needed
>
I'm not sure I understand what you are saying now.

Are you saying you ran a flawed/inaccurate test but sent the results 
anyway in order to make a point that no one asked you about?

Or are you saying that every mail environment is (necessarily) 
different, and whatever your opinion and results in your local 
environment are, they may not be applicable to another environment in 
another country, so you probably should not make your assumptions and 
opinions sound like facts?

In my OPINION, the aforementioned rule that I will test is likely NOT a 
good candidate for many environments - but I never promoted it as such 
in the first place.

Apologies to all whose inboxes were cluttered with this tangent.

Re: check utf-8 subjects/from?

Posted by AJ Weber <aw...@comcast.net>.
On 12/13/2017 5:18 PM, Reindl Harald wrote:
>
> my statements are based on a decade of experience with a lot of users 
> from all over the world. On your personal server you can even reject 
> anything not whitelisted; as soon as other people's mail flow is 
> affected, it's no longer that easy
It's true.  At first I noticed a pattern and decided to look into how I 
could write a rule, probably starting with a low score, to test its 
effectiveness.

However, I ran your test to determine how many emails it would actually 
affect.  In a folder of just over 5100 emails, there would be < 2% 
false-positives.  That's actually better than I expected!  If you 
offered me a rule that only anticipated 2% false positives to try, I 
would say it was worth it for sure!

>
>>> this would be a rule with a majority of false positives
>>> you really should also look at your HAM
I didn't see the basis for your "majority" of false positives.  Did you 
run your test against a spam folder as well?  What were the results there?
>>>
>>> cat *.eml | grep UTF-8 | grep -i subject | wc -l
>>> 2150
>>>
>>> that tells me that roughly 10% of all ham mails would hit
There seems to be a large disparity between your (10%) result and my 
(2%) result.  Can you explain how that could be?
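
One thing that may contribute (an assumption, not something verified against either mailbox): the grep pipeline counts matching lines across all files, including body lines and lines that merely contain both "UTF-8" and "subject", and it misses lowercase "utf-8". A per-message count of the raw Subject header gives a cleaner number. A minimal Python sketch, with a made-up function name:

```python
import email
import re

UTF8_ENCODED_WORD = re.compile(r"=\?utf-8\?[bq]\?", re.IGNORECASE)


def share_with_utf8_subject(raw_messages) -> float:
    """Fraction of messages whose raw Subject has a UTF-8 encoded-word."""
    if not raw_messages:
        return 0.0
    hits = 0
    for raw in raw_messages:
        # The default (compat32) policy returns the header undecoded.
        subject = email.message_from_bytes(raw).get("Subject", "")
        if UTF8_ENCODED_WORD.search(str(subject)):
            hits += 1
    return hits / len(raw_messages)
```

Feeding it the bytes of each file in a folder (for example via pathlib's Path.read_bytes() on every *.eml file) yields a fraction that can be compared directly between a ham folder and a spam folder.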

Thank you again!

Re: check utf-8 subjects/from?

Posted by AJ Weber <aw...@comcast.net>.
Would you be so kind as to tell me how you hacked into my mail server to 
determine the basis for your statements?



On 12/13/2017 4:52 PM, Reindl Harald wrote:
>
>
> Am 13.12.2017 um 19:44 schrieb AJ Weber:
>> Is there an easy way to check if the Subject or From is UTF-8 -- or 
>> non-ASCII -- char set?
>>
>> I see in some of my recent spam, either the Subject or the From 
>> (sometimes both) starts with "=?UTF-8?" (in these cases the rest is 
>> Base64 encoded, but I don't want to qualify on that).
>>
>> If I check a header with a "header ... =~" regex rule, is it the raw 
>> text that I will check, or is it the decoded characters I will be 
>> checking against?
>>
>> If it's the raw text, I can probably just look for that prefix to 
>> indicate the UTF-8 encoding.
>>
>> I do get some legitimate emails with encoded chars and emojis, 
>> etc...but I think I'd like a rule to support it being SPAM in general
>
> based on what?
>
> this would be a rule with a majority of false positives
> you really should also look at your HAM
>
> cat *.eml | grep UTF-8 | grep -i subject | wc -l
> 2150
>
> that tells me that roughly 10% of all ham mails would hit