Posted to users@spamassassin.apache.org by ha...@t-online.de on 2017/12/14 12:03:00 UTC

Re: check utf-8 subjects/from?

>> Hi,
>> 
>> On Wed, Dec 13, 2017 at 9:08 PM, David B Funk
>> <db...@engineering.uiowa.edu> wrote:
>> > On Wed, 13 Dec 2017, AJ Weber wrote:
>> >
>> >> Is there an easy way to check if the Subject or From is UTF-8 -- or
>> >> non-ASCII -- char set?
>> >>
>> >> I see in some of my recent spam, either the Subject or the From (sometimes
>> >> both) starts with "=?UTF-8?" (in these cases the rest is Base64 encoded, but
>> >> I don't want to qualify on that).
>> >>
>> >> If I check a header with a "header ... =~" regex rule, is it the raw text
>> >> that I will check, or is it the decoded characters I will be checking
>> >> against?
>> >>
>> >> If it's the raw text, I can probably just look for that prefix to indicate
>> >> the UTF-8 encoding.
>> >>
>> >> I do get some legitimate emails with encoded chars and emojis, etc...but I
>> >> think I'd like a rule to support it being SPAM in general.
>> >
>> >
>> > As other people have said, the header ":raw" rule form will let you match on
>> > that.
>> > There are two commonly used encoding methods for UTF-8:
>> >  Base64 "=?utf-8?B?"
>> >  Quoted-Printable "=?utf-8?Q?"
>> >
>> > There's nothing that prevents a mailer from using either for purely 7-bit
>> > ASCII,
>> > even though it isn't necessary. You are more likely to see that used by
>> > international clients. They may just utf-8 encode by default so not to have
>> > to do special processing for non 7-bit ASCII headers.
>> 
>> We've been seeing a number of emails with subjects using UTF-8 in an
>> attempt to obscure the sender by using some form of 8-bit characters.
>> For example, this spells dropbox:
>> 
>>   From: "=?utf-8?B?xJByb3Bib8+X?=" <ab...@ecacolleges.com>
>> 
>> How would we write a header rule against that? Just use From:raw?
>> 
>> Is it possible to write a rule using the decoded characters, like
>> "dr�p-b�x" or "D?op?o?"?
>> 
>> I've also tried variations of "dropbox" such as "dr?pb?x" etc...

Hi Alex,

As I live in Germany, I see nothing unusual in encoded UTF-8 headers either.
Just match against the decoded From line rather than the raw version.
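
To see what "decoded" means for the From header quoted above, here is a small sketch in Python (not SpamAssassin itself) that decodes the encoded-word with the standard library. Only the encoded display name is used, because the address in the quoted example is truncated.

```python
from email.header import decode_header, make_header

# Encoded-word copied from the quoted From header above.
raw_name = "=?utf-8?B?xJByb3Bib8+X?="

# decode_header() splits the value into (bytes, charset) chunks;
# make_header() reassembles those chunks into one Unicode string.
decoded_name = str(make_header(decode_header(raw_name)))

# Show the decoded text and the code point of each character.
print(decoded_name)
print([f"U+{ord(ch):04X}" for ch in decoded_name])
```

The decoded name is "Đropboϗ": the first character is U+0110 and the last is U+03D7, so a pattern written for the ASCII word "dropbox" will not match it.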

One thing that is certainly worth detecting is a display name that contains an email address different
from the actual sender address. (I am not sure whether such a rule already exists.)
For your example, you would probably have to write rules for the spelling variations of the purported
sender's name, plus a meta rule for the case where the _real_ name and a valid email address are detected.
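
The two checks could be sketched roughly like this in Python. This is only an illustration of the logic, not an existing SpamAssassin rule: the function names, the tiny lookalike table, and the example addresses are all made up, and a real rule set would need a far larger list of variations.

```python
import re
from email.utils import parseaddr

# Deliberately simple address pattern; an assumption, not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Hypothetical, hand-picked lookalike table covering only the quoted example.
LOOKALIKES = str.maketrans({"\u0110": "D", "\u03d7": "x"})


def name_has_other_address(from_header: str) -> bool:
    """True if the display name contains an address other than the real one."""
    display_name, address = parseaddr(from_header)
    found = EMAIL_RE.findall(display_name)
    return any(f.lower() != address.lower() for f in found)


def spoofs_brand(from_header: str, brand: str, legit_domain: str) -> bool:
    """True if the folded display name contains `brand` while the address
    does not come from `legit_domain` (the "meta" combination)."""
    display_name, address = parseaddr(from_header)
    folded = display_name.translate(LOOKALIKES).lower()
    domain = address.rpartition("@")[2].lower()
    return brand in folded and domain != legit_domain


# Example headers with placeholder addresses (not taken from the thread).
print(name_has_other_address('"billing@paypal.example" <someone@elsewhere.example>'))
print(spoofs_brand('"\u0110ropbo\u03d7" <sender@example.com>', "dropbox", "dropbox.com"))
```

Folding the lookalikes first means one pattern per brand name can cover many spellings, at the cost of maintaining the table.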

Regards
Wolfgang