Posted to users@spamassassin.apache.org by Clive Jacques <we...@gmail.com> on 2021/05/20 15:42:59 UTC

Detect Emoticons in Subject

Hi,

I've been using SA a long time.  Lately, I'm getting more and more spam
with emoticons in the subject line.  I'd say about 90% of my emails with
emoticons in the subject are spam.  I'd like to create a local rule which
scores email with emoticons in the subject.  I saw a previous discussion on
this in the archive, but it was focused on whether such emails were *always*
spam.  I think an emoticon rule, in combination with other rules, will
help my installation.  I've tried to match as follows, but it won't lint.
I'm not really a perl programmer.  I've written several other more
conventional local rules, but here I'm a bit out of my depth.  I'd
appreciate some guidance.

# Local Rule for Emoticons in subject
subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
score          EMOTICON_IN_SUBJECT      3.0
describe        EMOTICON_IN_SUBJECT     Subject Line Has Emoticons

-CJ

Re: Detect Emoticons in Subject

Posted by Henrik K <he...@hege.li>.
On Fri, May 21, 2021 at 09:53:36AM +0200, Tom Hendrikx wrote:
>
> Can someone explain why SA cannot support this type of syntax, or what would
> be needed to get it supported? IMHO it makes it a lot easier for end-users
> to understand a rule, and for rule developers to write or even contribute
> new UTF-8-related rules, so it might be worth the effort to get it
> supported?

Perl strings would have to be UTF-8 internally.  A mandatory prerequisite
would be normalize_charset 1 in SA.  There could be cases where SA can't
decode mails properly to UTF-8, so it's an open question what happens then.
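
To illustrate the distinction, here is a minimal Python sketch (an
illustration only, not SpamAssassin code; the sample subject is made up).
A character-class pattern is only meaningful once the text has been decoded
to code points, while a byte-oriented matcher needs the UTF-8 bytes spelled
out:

```python
import re

# Hypothetical subject line; U+1F600 is in the Emoticons block.
subject = "\U0001F600 Big sale"
raw = subject.encode("utf-8")  # what a byte-oriented matcher sees

# Character-level pattern: works on decoded text only.
char_re = re.compile("[\U0001F600-\U0001F64F]")
# Byte-level pattern: the same block written as UTF-8 byte sequences.
byte_re = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")

char_hit = char_re.search(subject) is not None
byte_hit = byte_re.search(raw) is not None
# Applying the character class to undecoded bytes (viewed as Latin-1)
# finds nothing, because each byte is treated as its own character.
naive_hit = char_re.search(raw.decode("latin-1")) is not None

print(char_hit, byte_hit, naive_hit)
```

The third result is the situation described above: without decoding, a
character-level class never sees the emoji as a single character.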

Some changes are coming already in 4.0, for example normalize_charset 1 will
be default.  But more complex internal/rule changes require a lot of thought
on how to maintain backwards compatibility.  I'm sure some people will still
run 3.4 for years to come.

Sorry to say but there are too few developers right now.  It's up to the
community to pick up the pace.


Re: Detect Emoticons in Subject

Posted by Tom Hendrikx <to...@whyscream.net>.
On 20-05-2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
> 
>> Hi,
>>
>> I've been using SA a long time.  Lately, I'm getting more and more
>> spam with emoticons in the subject line.  I'd say about 90% of my
>> emails with emoticons in the subject are spam.  I'd like to create a
>> local rule which scores email with emoticons in the subject.
> 
>> # Local Rule for Emoticons in subject
>> subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
> 
> The rule should start with "header", that's what's causing the lint
> failure.
> 
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.
> 

I'm not a real fan of very complex regular expressions, as they tend to 
get hard to read/understand very quickly. This thread is a perfect 
example: the syntax that the OP proposed (/\p{Emoticons}/) seems 
perfectly readable, and all the actually working alternatives are, with 
all respect to the authors, a nightmare to decipher. Especially for 
users not really proficient in regular expressions, the OP's syntax is 
perfectly understandable and all the alternatives aren't.

I'm not really into the regex engine of perl/SA, so please correct me if 
I'm wrong. The /\p{Emoticons}/ syntax seems to me a builtin feature of 
the regex spec/perl (as opposed to pseudo-code, displaying something 
that actually doesn't exist).

Can someone explain why SA cannot support this type of syntax, or what 
would be needed to get it supported? IMHO it makes it a lot easier for 
end-users to understand a rule, and for rule developers to write or even 
contribute new UTF-8-related rules, so it might be worth the effort to 
get it supported?

Thanks in advance,
	Tom

Re: Detect Emoticons in Subject

Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2021-05-20 at 18:34 +0200, Bert Van de Poel wrote:
> We've started getting lots of spam with emoji in the subject too the 
> past few weeks, so I've looked into this as well. As mentioned by RW, 
> you would need to create some kind of UTF8 regex header Subject rule. As
> I'm not too excited about writing such a regex, it's way at the bottom
> of my todo list 
>
Should be easy enough - IsASCII is just a name for [\x00-\x7f] and
IsXDigit is [0-9a-fA-F], so the same logic can be applied to define a
regex that triggers on any character within the three Unicode emoji
ranges. See Wikipedia for more detail: 

https://en.wikipedia.org/wiki/Emoticon#Unicode
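
As a rough sketch of that idea, the following Python snippet computes the
UTF-8 byte bounds of a few ranges, which is the information needed to write
byte-level character classes. The choice of blocks is an assumption (they
are the ones RW's patterns in this thread target), and utf8_hex is a
hypothetical helper name:

```python
def utf8_hex(code_point):
    """Return the UTF-8 encoding of one code point as spaced hex bytes."""
    return " ".join(f"{b:02X}" for b in chr(code_point).encode("utf-8"))

# Assumed block selection, for illustration only.
blocks = {
    "Emoticons": (0x1F600, 0x1F64F),
    "Supplemental Symbols and Pictographs": (0x1F900, 0x1F9FF),
    "Miscellaneous Symbols faces": (0x2639, 0x263B),
}

bounds = {name: (utf8_hex(lo), utf8_hex(hi)) for name, (lo, hi) in blocks.items()}

for name, (first, last) in bounds.items():
    print(f"{name}: {first} .. {last}")
```

The first and last encodings of each block show which lead bytes are fixed
and which trailing bytes need a range.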

I haven't yet seen any emojis in Subject lines, regardless of whether
the message was spam or not, or I'd probably have already written such a
rule and given it a minimal score so it can be used in a more spam-
specific meta rule.

Martin




Re: Detect Emoticons in Subject

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 19:39:06 +0100
RW wrote:

>
> /\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/


This includes the block mentioned by Bill Cole and is simplified a
bit


/\xF0\x9F[\x98-\x99\xA4-\xA7\x8C-\x97][\x80-\xBF]|\xE2\x98[\xB9-\xBB]/


However, if you don't expect to get any legitimate mail with Asian
languages in the subject, you can probably get away with including all
4-byte UTF-8. Those code points are dominated by CJK, symbols, emojis
and dead languages.


/[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]/
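
A quick way to see what the broad pattern does and doesn't catch is to run
it over a few sample strings. This is a Python sketch with made-up subjects;
Python's re module on bytes is used as a stand-in for byte-oriented
matching, not as SpamAssassin itself:

```python
import re

# The broad pattern from above, as a Python bytes regex.
four_byte = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]")

def hits(text):
    """True if the UTF-8 encoding of `text` matches the broad pattern."""
    return four_byte.search(text.encode("utf-8")) is not None

samples = {
    "emoji (U+1F389)": "\U0001F389 party",
    "smiley (U+263A)": "\u263A hello",
    "3-byte CJK": "\u65E5\u672C\u8A9E",
    "4-byte CJK Extension B (U+20000)": "\U00020000",
    "plain ASCII": "Invoice attached",
}

results = {label: hits(text) for label, text in samples.items()}
for label, matched in results.items():
    print(label, matched)
```

In this sketch the 3-byte CJK sample is not matched, while the
supplementary-plane CJK character is, so the risk of false positives
depends on which characters the legitimate mail actually uses.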

Re: Detect Emoticons in Subject

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 19:26:30 +0100
RW wrote:

> On Thu, 20 May 2021 18:44:43 +0100
> RW wrote:
>
> > On Thu, 20 May 2021 18:30:03 +0100
> > RW wrote:
> >
> >
> > > Try this:
> > >
> > >
> > > header  EMOTICON_IN_SUBJECT  Subject =~
> > > /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> > >
> >
> > Actually that's only the original block, but it probably works most
> > of the time
>
> This extends it to Supplemental Symbols and Pictographs and
> adds the three original faces from Miscellaneous Symbols
>
>
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
>
> it also fixes a minor problem with continuation bytes in the
> original.
> 
I still didn't get the continuation bytes right; I forgot that bit 6 is
always 0. It's a long time since I've done this.

/\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
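
The continuation-byte point can be checked mechanically. The Python sketch
below (an illustration, with trailing_bytes as a hypothetical helper)
encodes every code point from U+1F300 to U+1F9FF and confirms that each
byte after the lead byte has the form 10xxxxxx, i.e. lies in \x80-\xBF:

```python
def trailing_bytes(code_point):
    """Return the bytes after the lead byte in the UTF-8 encoding."""
    return chr(code_point).encode("utf-8")[1:]

checked = [b for cp in range(0x1F300, 0x1FA00) for b in trailing_bytes(cp)]

# Every continuation byte falls within 0x80..0xBF ...
all_in_range = all(0x80 <= b <= 0xBF for b in checked)
# ... which is the same as saying bit 7 is set and bit 6 is clear.
top_bits_ok = all((b & 0xC0) == 0x80 for b in checked)

print(all_in_range, top_bits_ok)
```

So an upper bound of \xFF in a continuation-byte class is harmless for
valid UTF-8 but looser than necessary; \xBF is the tight bound.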

Re: Detect Emoticons in Subject

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 18:44:43 +0100
RW wrote:

> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
>
> > Try this:
> >
> >
> > header  EMOTICON_IN_SUBJECT  Subject =~
> > /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> >   
> 
> Actually that's only the original block, but it probably works most of
> the time

This extends it to Supplemental Symbols and Pictographs and
adds the three original faces from Miscellaneous Symbols


/\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/

it also fixes a minor problem with continuation bytes in the original.


Re: Detect Emoticons in Subject

Posted by Clive Jacques <we...@gmail.com>.
That's fine - I'm not saying all email containing emojis in the subject (or
elsewhere) *is* spam - just that it's uncommon and right now, about 90% of
the time it is *for me*.  I just want to score it as part of the greater
constellation of factors (just like DKIM, SPF etc.).

On Thu, May 20, 2021 at 2:48 PM Bill Cole <
sausers-20150205@billmail.scconsult.com> wrote:

>
> People send wanted mail with all sorts of weirdness.
>
>

Re: Detect Emoticons in Subject

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 2021-05-20 at 13:44:43 UTC-0400 (Thu, 20 May 2021 18:44:43 +0100)
RW <rw...@googlemail.com>
is rumored to have said:

> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
>
>> Try this:
>>
>>
>> header  EMOTICON_IN_SUBJECT  Subject =~
>> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
>>
>
> Actually that's only the original block, but it probably works most of
> the time

Not so sure about that...

I regularly get mail from Patreon with emoji in the encoded header which 
don't match that pattern:


# grep '^Subject: ' /tmp/ham |cut -d? -f4 |decode-base64 |hexdump -C
00000000  f0 9f 8e 89 20 50 61 74  72 69 63 6b 20 57 61 72  |.... Patrick War|
00000010  64 6c 65 20 6a 75 73 74  20 73 68 61 72 65 64 20  |dle just shared |
00000020  22 f0 9f 93 9d 20 4e                              |".... N|
00000027
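
For anyone without the decode-base64 tool, roughly the same check can be
sketched in Python with the standard email.header module. The subject text
is a made-up stand-in modelled on the dump above, and the "wider" pattern is
an assumption for illustration (Emoticons plus the U+1F300 block), not a
rule posted in this thread:

```python
import re
from email.header import Header, decode_header

# Hypothetical subject starting with U+1F389, as in the hexdump.
encoded = Header("\U0001F389 Patrick Wardle just shared", "utf-8").encode()

# decode_header yields (payload, charset) pairs; join the raw payload bytes.
raw = b"".join(
    part if isinstance(part, bytes) else part.encode("ascii")
    for part, _charset in decode_header(encoded)
)

# Emoticons block only (U+1F600..U+1F64F).
emoticons_only = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")
# Assumed wider pattern that also covers third bytes 8C..97 and A4..A7.
wider = re.compile(rb"\xF0\x9F[\x8C-\x99\xA4-\xA7][\x80-\xBF]")

print(encoded)
print(raw[:4].hex(" "))
print(emoticons_only.search(raw) is not None)
print(wider.search(raw) is not None)
```

The Emoticons-only pattern misses this subject because the third byte of
the emoji is 8e, outside the 98-99 range it expects.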

People send wanted mail with all sorts of weirdness.

Looking at the full set 
(https://www.unicode.org/emoji/charts/full-emoji-list.html) I can 
understand why \p{Emoticons} would be so much better than trying to 
define them all in a regex of hex bytes in UTF-8 form.

-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

Re: Detect Emoticons in Subject

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 18:30:03 +0100
RW wrote:


> Try this:
>
>
> header  EMOTICON_IN_SUBJECT  Subject =~
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> 

Actually that's only the original block, but it probably works most of
the time

Re: Detect Emoticons in Subject

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 18:34:54 +0200
Bert Van de Poel wrote:

> We've started getting lots of spam with emoji in the subject too the 
> past few weeks, so I've looked into this as well. As mentioned by RW, 
> you would need to create some kind of UTF8 regex header Subject rule.
> As I'm not too excited about writing such a regex, it's way at the
> bottom of my todo list to contemplate whether an SA plugin could be
> written for that and to then reach out to the SA developers to see
> whether that would be something upstream would accept. But honestly,
> I won't be able to any time soon (I don't have the time). Still,
> thought I'd mention it, since it might be relevant to your question.
> If you do end up figuring out a regex that works out and isn't an
> extreme length, I think plenty of people on this list would love to
> know!

Try this:


header  EMOTICON_IN_SUBJECT  Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/


Re: Detect Emoticons in Subject

Posted by Bert Van de Poel <be...@ulyssis.org>.
We've started getting lots of spam with emoji in the subject too the 
past few weeks, so I've looked into this as well. As mentioned by RW, 
you would need to create some kind of UTF8 regex header Subject rule. As 
I'm not too excited about writing such a regex, it's way at the bottom 
of my todo list to contemplate whether an SA plugin could be written for 
that and to then reach out to the SA developers to see whether that 
would be something upstream would accept. But honestly, I won't be able 
to any time soon (I don't have the time). Still, thought I'd mention it, 
since it might be relevant to your question. If you do end up figuring 
out a regex that works out and isn't an extreme length, I think plenty 
of people on this list would love to know!

Bert

On 20/05/2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
>> Hi,
>>
>> I've been using SA a long time.  Lately, I'm getting more and more
>> spam with emoticons in the subject line.  I'd say about 90% of my
>> emails with emoticons in the subject are spam.  I'd like to create a
>> local rule which scores email with emoticons in the subject.
>> # Local Rule for Emoticons in subject
>> subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.


Re: Detect Emoticons in Subject

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 11:42:59 -0400
Clive Jacques wrote:

> Hi,
> 
> I've been using SA a long time.  Lately, I'm getting more and more
> spam with emoticons in the subject line.  I'd say about 90% of my
> emails with emoticons in the subject are spam.  I'd like to create a
> local rule which scores email with emoticons in the subject. 

> # Local Rule for Emoticons in subject
> subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/

The rule should start with "header", that's what's causing the lint
failure. 

However, AFAIK, the rule still won't work because \p{Emoticons}
isn't supported in spamassassin, which works on byte sequences. You
need to rewrite it to match UTF-8 bytes.

Re: Detect Emoticons in Subject: CHAOS

Posted by Benny Pedersen <me...@junc.eu>.
On 2021-05-20 22:33, Clive Jacques wrote:
> Here is a good example of such an email (attached, stripped of
> identifying info).

This attachment is suspicious because its type doesn't match the type 
declared in the message. If you do not trust the sender, you shouldn't 
open it in the browser because it may contain malicious contents.

Expected: text/plain (.txt); found: message/rfc822 (.eml)

Should I ignore Roundcube warnings? :)

Re: Detect Emoticons in Subject: CHAOS

Posted by Clive Jacques <we...@gmail.com>.
Here is a good example of such an email (attached, stripped of identifying
info).

On Thu, May 20, 2021 at 4:03 PM RW <rw...@googlemail.com> wrote:

> On Thu, 20 May 2021 15:35:21 -0400
> Jared Hall wrote:
>
> > Clive Jacques wrote:
>
> > > # Local Rule for Emoticons in subject
> > > subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
>
> >
> > The following regex will detect a good amount of Emojis:
> >
> > /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug
>
> That doesn't work in SA for the same reason that \p{Emoticons}
> doesn't work.
>

Re: Detect Emoticons in Subject: CHAOS

Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 15:35:21 -0400
Jared Hall wrote:

> Clive Jacques wrote:

> > # Local Rule for Emoticons in subject
> > subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/

>
> The following regex will detect a good amount of Emojis:
>
> /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug

That doesn't work in SA for the same reason that \p{Emoticons}
doesn't work.
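
One possible way to bridge the two approaches is to generate the byte-level
pattern from code-point ranges instead of writing it by hand. The Python
sketch below is an illustration under stated assumptions: utf8_byte_regex is
a hypothetical helper, the two ranges are only examples, and the output is
unoptimised (one alternative per distinct byte prefix), so a long range list
would produce a long pattern:

```python
import re

def utf8_byte_regex(ranges):
    """Build a byte-level regex source matching any code point in `ranges`."""
    alternatives = []
    for lo, hi in ranges:
        spans = {}  # prefix bytes -> [lowest last byte, highest last byte]
        for cp in range(lo, hi + 1):
            if 0xD800 <= cp <= 0xDFFF:
                continue  # surrogates have no UTF-8 encoding
            enc = chr(cp).encode("utf-8")
            prefix, last = enc[:-1], enc[-1]
            span = spans.setdefault(prefix, [last, last])
            span[0] = min(span[0], last)
            span[1] = max(span[1], last)
        for prefix, (first, final) in spans.items():
            head = "".join(f"\\x{b:02X}" for b in prefix)
            alternatives.append(f"{head}[\\x{first:02X}-\\x{final:02X}]")
    return "|".join(alternatives)

# Example ranges only: Emoticons and the three Miscellaneous Symbols faces.
pattern = utf8_byte_regex([(0x1F600, 0x1F64F), (0x2639, 0x263B)])
print(pattern)

compiled = re.compile(pattern.encode("ascii"))
print(compiled.search("\U0001F621 sale".encode("utf-8")) is not None)
```

The generated text uses only \xNN escapes and character classes, so it is
the kind of pattern that can be matched against raw UTF-8 bytes.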

Re: Detect Emoticons in Subject: CHAOS

Posted by Jared Hall <ja...@jaredsec.com>.
Clive Jacques wrote:
> Hi,
>
> I've been using SA a long time.  Lately, I'm getting more and more
> spam with emoticons in the subject line.  I'd say about 90% of my
> emails with emoticons in the subject are spam.  I'd like to create a
> local rule which scores email with emoticons in the subject.  I saw a
> previous discussion on this in the archive, but it was focused on
> whether such emails were /always/ spam.  I think an emoticon rule, in
> combination with other rules, will help my installation.  I've tried 
> to match as follows, but it won't lint.  I'm not really a perl 
> programmer.  I've written several other more conventional local rules, 
> but here I'm a bit out of my depth.  I'd appreciate some guidance.
>
> # Local Rule for Emoticons in subject
> subject        EMOTICON_IN_SUBJECT      Subject =~ /\p{Emoticons}/
> score          EMOTICON_IN_SUBJECT      3.0
> describe        EMOTICON_IN_SUBJECT     Subject Line Has Emoticons
>
> -CJ

The following regex will detect a good amount of Emojis:

/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug


Ref: 
https://stackoverflow.com/questions/43242440/javascript-unicode-emoji-regular-expressions/45138005#45138005

But it is not the greatest thing if you want to get a count out of that.

<toot>
However, I may have a solution for you with the CHAOS plugin:

https://github.com/telecom2k3/CHAOS

You can get (but shouldn't) Emojis even in From names, like this actual one:

DHL☺com

CHAOS will also help you with Unicode Character spoofs, via its 
UniBabble rulesets:

ᴀмαzσи ᴘ𝔯𝔦𝔪ё
𝘼𝔪𝔞𝘻𝙤𝘯 𝘾𝘶𝘴𝙩𝙤𝘮𝘦𝘳 𝙎𝔢𝘳𝙫𝘪𝘤𝔢
Amαzoɴ Priⅿë
🅰🅼🅰🆉🅾🅽 🆂🅴🆁🆅🅸🅲🅴
𝐀𝐦𝐚𝐳𝐨𝐧 𝐍𝐨𝐭𝐢𝐜𝐞
...
...

CHAOS will run on Perl 5.18 and later.

</toot>


-- Jared Hall