Posted to users@spamassassin.apache.org by Clive Jacques <we...@gmail.com> on 2021/05/20 15:42:59 UTC
Detect Emoticons in Subject
Hi,
I've been using SA a long time. Lately, I'm getting more and more spam
with emoticons in the subject line. I'd say about 90% of my emails with
emoticons in the subject are spam. I'd like to create a local rule which
scores email with emoticons in the subject. I saw a previous discussion on
this in the archive, but it was focused on whether such emails were *always*
spam. I think an emoticon rule, in combination with other rules, will
help my installation. I've tried to match as follows, but it won't lint.
I'm not really a perl programmer. I've written several other more
conventional local rules, but here I'm a bit out of my depth. I'd
appreciate some guidance.
# Local Rule for Emoticons in subject
subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
score EMOTICON_IN_SUBJECT 3.0
describe EMOTICON_IN_SUBJECT Subject Line Has Emoticons
-CJ
Re: Detect Emoticons in Subject
Posted by Henrik K <he...@hege.li>.
On Fri, May 21, 2021 at 09:53:36AM +0200, Tom Hendrikx wrote:
>
> Can someone explain why SA cannot support this type of syntax, or what would
> be needed to get it supported? IMHO it makes it a lot easier for end-users
> to understand a rule, and for rule developers to write or even contribute
> new UTF-8-related rules, so it might be worth the effort to get it
> supported?
Perl strings would have to be UTF-8 internally. A mandatory prerequisite would
be normalize_charset 1 in SA. There could be cases where SA can't decode
mail properly to UTF-8, so it's an open question what happens then.
Some changes are coming already in 4.0, for example normalize_charset 1 will
be default. But more complex internal/rule changes require a lot of thought
on how to maintain backwards compatibility. I'm sure some people will still
run 3.4 for years to come.
Sorry to say but there are too few developers right now. It's up to the
community to pick up the pace.
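
The distinction Henrik describes can be illustrated outside SpamAssassin. The sketch below uses Python rather than Perl, purely as an illustration: a pattern written in terms of code points only matches once the subject has been decoded into characters, while the raw UTF-8 bytes need a byte-level pattern. The sample subject and the variable names are assumptions made for this example, not anything taken from SA.

```python
import re

# Hypothetical subject line containing U+1F600 (grinning face).
subject_text = "Big sale \U0001F600"
subject_bytes = subject_text.encode("utf-8")

# Code-point pattern: only meaningful on decoded text (str).
codepoint_re = re.compile("[\U0001F600-\U0001F64F]")
# Byte pattern: the same block written as UTF-8 byte ranges.
byte_re = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")

# A byte-oriented engine sees four separate bytes, not one character.
bytes_as_chars = subject_bytes.decode("latin-1")

print(bool(codepoint_re.search(subject_text)))    # True: decoded text
print(bool(codepoint_re.search(bytes_as_chars)))  # False: undecoded bytes
print(bool(byte_re.search(subject_bytes)))        # True: byte pattern on raw bytes
```

The middle case is roughly the situation a \p{...} rule would be in when the engine is handed undecoded bytes.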
Re: Detect Emoticons in Subject
Posted by Tom Hendrikx <to...@whyscream.net>.
On 20-05-2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
>> Hi,
>>
>> I've been using SA a long time. Lately, I'm getting more and more
>> spam with emoticons in the subject line. I'd say about 90% of my
>> emails with emoticons in the subject are spam. I'd like to create a
>> local rule which scores email with emoticons in the subject.
>
>> # Local Rule for Emoticons in subject
>> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.
>
I'm not a real fan of very complex regular expressions, as they tend to
get hard to read/understand very quickly. This thread is a perfect
example: the syntax that the OP proposed (/\p{Emoticons}/) seems
perfectly readable, and all the actually working alternatives are, with
all respect to the authors, a nightmare to decipher. Especially for
users not really proficient in regular expressions, the OP's syntax is
perfectly understandable and all the alternatives aren't.
I'm not really into the regex engine of perl/SA, so please correct me if
I'm wrong. The /\p{Emoticons}/ syntax seems to me a builtin feature of
the regex spec/perl (as opposed to pseudo-code, displaying something
that actually doesn't exist).
Can someone explain why SA cannot support this type of syntax, or what
would be needed to get it supported? IMHO it makes it a lot easier for
end-users to understand a rule, and for rule developers to write or even
contribute new UTF-8-related rules, so it might be worth the effort to
get it supported?
Thanks in advance,
Tom
Re: Detect Emoticons in Subject
Posted by Martin Gregorie <ma...@gregorie.org>.
On Thu, 2021-05-20 at 18:34 +0200, Bert Van de Poel wrote:
> We've started getting lots of spam with emoji in the subject too the
> past few weeks, so I've looked into this as well. As mentioned by RW,
> you would need to create some kind of UTF8 regex header Subject rule. As
> I'm not too excited about writing such a regex, it's way at the bottom
> of my todo list
>
Should be easy enough - IsASCII is just a name for [\x00-\x7f] and
IsXDigit is [0-9a-fA-F], so the same logic can be applied to define a
regex that triggers on any character within the three Unicode emoji
ranges. See Wikipedia for more detail:
https://en.wikipedia.org/wiki/Emoticon#Unicode
I haven't yet seen any emojis in Subject lines, regardless of whether
the message was spam or not, or I'd probably have already written such a
rule and given it a minimal score so it can be used in a more spam-
specific meta rule.
Martin
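
As a sketch of the range-based approach Martin describes, the Python snippet below (illustrative only, not SpamAssassin code) tests each character of an already-decoded subject against three code point ranges. The ranges are the ones named elsewhere in this thread: the Emoticons block, Supplemental Symbols and Pictographs, and three faces from Miscellaneous Symbols. The names EMOJI_RANGES and has_emoji are invented for the example.

```python
# (first, last) code points, inclusive.
EMOJI_RANGES = (
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F900, 0x1F9FF),  # Supplemental Symbols and Pictographs
    (0x2639, 0x263B),    # three faces in Miscellaneous Symbols
)

def has_emoji(text: str) -> bool:
    """True if any character falls inside one of EMOJI_RANGES."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in EMOJI_RANGES)

print(has_emoji("Meeting at 10:00"))    # False
print(has_emoji("You won \U0001F60D"))  # True: U+1F60D is in Emoticons
print(has_emoji("Party \U0001F389"))    # False: U+1F389 is outside all three
```

The last case shows the limit of a three-range table: U+1F389 lives in Miscellaneous Symbols and Pictographs, so a wider table would be needed to catch it.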
Re: Detect Emoticons in Subject
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 19:39:06 +0100
RW wrote:
>
> /\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
This includes the block mentioned by Bill Cole and is simplified a
bit:
/\xF0\x9F[\x98-\x99\xA4-\xA7\x8C-\x97][\x80-\xBF]|\xE2\x98[\xB9-\xBB]/
However, if you don't expect to get any legitimate mail with Asian
languages in the subject, you can probably get away with including all
4-byte UTF-8. Those code points are dominated by CJK, symbols, emojis
and dead languages.
/[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]/
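
To get a feel for what the broad 4-byte pattern above catches, here is a small Python check (an illustration only; the sample characters are chosen for this example). It applies the same byte-level expression to a few UTF-8 encoded strings.

```python
import re

# The broad pattern: any 4-byte UTF-8 sequence, or the three BMP faces.
four_byte_re = re.compile(rb"[\xF0-\xF7][\x80-\xBF]{3}|\xE2\x98[\xB9-\xBB]")

samples = {
    "party popper U+1F389": "\U0001F389",
    "CJK Extension B U+20000": "\U00020000",
    "common CJK U+4E2D": "\u4E2D",
    "plain ASCII": "hello",
}

results = {}
for label, text in samples.items():
    results[label] = four_byte_re.search(text.encode("utf-8")) is not None
    print(f"{label}: {results[label]}")
```

The first two samples match and the last two do not. Everyday CJK characters sit in the Basic Multilingual Plane and encode to three bytes, so this pattern leaves them alone; it is the supplementary-plane ideographs that share the 4-byte space with emoji.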
Re: Detect Emoticons in Subject
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 19:26:30 +0100
RW wrote:
> On Thu, 20 May 2021 18:44:43 +0100
> RW wrote:
>
> > On Thu, 20 May 2021 18:30:03 +0100
> > RW wrote:
> >
> >
> > > Try this:
> > >
> > >
> > > header EMOTICON_IN_SUBJECT Subject =~
> > > /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> > >
> >
> > Actually that's only the original block, but it probably works most
> > of the time
>
> This extends it to Supplemental Symbols and Pictographs and
> adds the three original faces from Miscellaneous Symbols
>
>
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
>
> It also fixes a minor problem with the continuation bytes in the
> original.
>
I still didn't get the continuation bytes right: I forgot that bit 6 is
always 0. It's a long time since I've done this.
/\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xBF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
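
The continuation-byte rule referred to here is that every byte after the first in a multi-byte UTF-8 sequence has the form 10xxxxxx, so it always lies in 0x80-0xBF. A small Python check, written only as an illustration (the function name is hypothetical):

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have their top two bits set to 10."""
    return (byte & 0b1100_0000) == 0b1000_0000

encoded = "\U0001F64F".encode("utf-8")  # F0 9F 99 8F
lead, *rest = encoded

print(is_continuation(lead))                        # False: 0xF0 is a lead byte
print(all(is_continuation(b) for b in rest))        # True
print(is_continuation(0xBF), is_continuation(0xC0)) # True False
```

This is why an upper bound of \xFF in a character class for a trailing byte is harmless but looser than necessary: no valid sequence has a trailing byte above \xBF.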
Re: Detect Emoticons in Subject
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 18:44:43 +0100
RW wrote:
> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
>
> > Try this:
> >
> >
> > header EMOTICON_IN_SUBJECT Subject =~
> > /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
> >
>
> Actually that's only the original block, but it probably works most of
> the time
This extends it to Supplemental Symbols and Pictographs and
adds the three original faces from Miscellaneous Symbols
/\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x80-\x8F])|\xF0\x9F(?:[\xA4-\xA6][\x80-\xFF]|\xA7[\x80-\xBF])|\xE2\x98[\xB9-\xBB]/
It also fixes a minor problem with the continuation bytes in the original.
Re: Detect Emoticons in Subject
Posted by Clive Jacques <we...@gmail.com>.
That's fine - I'm not saying all email containing emojis in the subject (or
elsewhere) *is* spam - just that it's uncommon and right now, about 90% of
the time it is *for me*. I just want to score it as part of the greater
constellation of factors (just like DKIM, SPF etc.).
On Thu, May 20, 2021 at 2:48 PM Bill Cole <
sausers-20150205@billmail.scconsult.com> wrote:
>
> People send wanted mail with all sorts of weirdness.
>
>
Re: Detect Emoticons in Subject
Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 2021-05-20 at 13:44:43 UTC-0400 (Thu, 20 May 2021 18:44:43 +0100)
RW <rw...@googlemail.com>
is rumored to have said:
> On Thu, 20 May 2021 18:30:03 +0100
> RW wrote:
>
>
>> Try this:
>>
>>
>> header EMOTICON_IN_SUBJECT Subject =~
>> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
>>
>
> Actually that's only the original block, but it probably works most of
> the time
Not so sure about that...
I regularly get mail from Patreon with emoji in the encoded header which
don't match that pattern:
# grep '^Subject: ' /tmp/ham |cut -d? -f4 |decode-base64 |hexdump -C
00000000  f0 9f 8e 89 20 50 61 74  72 69 63 6b 20 57 61 72  |.... Patrick War|
00000010  64 6c 65 20 6a 75 73 74  20 73 68 61 72 65 64 20  |dle just shared |
00000020  22 f0 9f 93 9d 20 4e                              |".... N|
00000027
People send wanted mail with all sorts of weirdness.
Looking at the full set
(https://www.unicode.org/emoji/charts/full-emoji-list.html) I can
understand why \p{Emoticons} would be so much better than trying to
define them all in a regex of hex bytes in UTF-8 form.
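
The hexdump above can be reproduced and checked in a few lines of Python. This is only a sketch: it rebuilds a MIME encoded-word from the 39 bytes shown (the rest of the real Subject is not known), decodes it with the standard library, and tests it against a pattern that covers only the Emoticons block. The variable names are made up for the example.

```python
import base64
import re
from email.header import decode_header

# The 39 bytes shown in the hexdump.
raw = bytes.fromhex(
    "f09f8e89205061747269636b20576172"
    "646c65206a7573742073686172656420"
    "22f09f939d204e"
)

# Re-wrap them as an encoded-word, as they would appear in a Subject header.
encoded_word = "=?UTF-8?B?" + base64.b64encode(raw).decode("ascii") + "?="
decoded_bytes, charset = decode_header(encoded_word)[0]

# Pattern limited to the Emoticons block (U+1F600 to U+1F64F).
emoticons_only = re.compile(rb"\xF0\x9F(?:\x98[\x80-\xBF]|\x99[\x80-\x8F])")
print(emoticons_only.search(decoded_bytes) is None)  # True: nothing matches

# List the code points above the BMP that are actually present.
astral = [f"U+{ord(ch):04X}" for ch in decoded_bytes.decode("utf-8")
          if ord(ch) > 0xFFFF]
print(astral)  # ['U+1F389', 'U+1F4DD']
```

Both code points fall in Miscellaneous Symbols and Pictographs (U+1F300 to U+1F5FF), which is why a rule limited to the Emoticons block misses this subject.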
--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
Re: Detect Emoticons in Subject
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 18:30:03 +0100
RW wrote:
> Try this:
>
>
> header EMOTICON_IN_SUBJECT Subject =~
> /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
>
Actually that's only the original block, but it probably works most of
the time
Re: Detect Emoticons in Subject
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 18:34:54 +0200
Bert Van de Poel wrote:
> We've started getting lots of spam with emoji in the subject too the
> past few weeks, so I've looked into this as well. As mentioned by RW,
> you would need to create some kind of UTF8 regex header Subject rule.
> As I'm not too excited about writing such a regex, it's way at the
> bottom of my todo list to contemplate whether an SA plugin could be
> written for that and to then reach out to the SA developers to see
> whether that would be something upstream would accept. But honestly,
> I won't be able to any time soon (I don't have the time). Still,
> thought I'd mention it, since it might be relevant to your question.
> If you do end up figuring out a regex that works out and isn't an
> extreme length, I think plenty of people on this list would love to
> know!
Try this:
header EMOTICON_IN_SUBJECT Subject =~ /\xF0\x9F(?:\x98[\x80-\xFF]|\x99[\x00-\x8F])/
Re: Detect Emoticons in Subject
Posted by Bert Van de Poel <be...@ulyssis.org>.
We've started getting lots of spam with emoji in the subject too the
past few weeks, so I've looked into this as well. As mentioned by RW,
you would need to create some kind of UTF8 regex header Subject rule. As
I'm not too excited about writing such a regex, it's way at the bottom
of my todo list to contemplate whether an SA plugin could be written for
that and to then reach out to the SA developers to see whether that
would be something upstream would accept. But honestly, I won't be able
to any time soon (I don't have the time). Still, thought I'd mention it,
since it might be relevant to your question. If you do end up figuring
out a regex that works out and isn't an extreme length, I think plenty
of people on this list would love to know!
Bert
On 20/05/2021 18:19, RW wrote:
> On Thu, 20 May 2021 11:42:59 -0400
> Clive Jacques wrote:
>
>> Hi,
>>
>> I've been using SA a long time. Lately, I'm getting more and more
>> spam with emoticons in the subject line. I'd say about 90% of my
>> emails with emoticons in the subject are spam. I'd like to create a
>> local rule which scores email with emoticons in the subject.
>> # Local Rule for Emoticons in subject
>> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
> The rule should start with "header", that's what's causing the lint
> failure.
>
> However, AFAIK, the rule still won't work because \p{Emoticons}
> isn't supported in spamassassin, which works on byte sequences. You
> need to rewrite it to match UTF-8 bytes.
Re: Detect Emoticons in Subject
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 11:42:59 -0400
Clive Jacques wrote:
> Hi,
>
> I've been using SA a long time. Lately, I'm getting more and more
> spam with emoticons in the subject line. I'd say about 90% of my
> emails with emoticons in the subject are spam. I'd like to create a
> local rule which scores email with emoticons in the subject.
> # Local Rule for Emoticons in subject
> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
The rule should start with "header", that's what's causing the lint
failure.
However, AFAIK, the rule still won't work because \p{Emoticons}
isn't supported in spamassassin, which works on byte sequences. You
need to rewrite it to match UTF-8 bytes.
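
To see what "match UTF-8 bytes" means in practice, here is a minimal Python sketch (Python is used only for illustration; SpamAssassin itself is written in Perl) that prints the byte sequence for a few code points. The helper name utf8_hex is made up for this example.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of one code point as spaced hex bytes."""
    return chr(code_point).encode("utf-8").hex(" ").upper()

# U+1F600 and U+1F64F are the first and last code points of the
# Emoticons block; U+263A is one of the older faces in the BMP.
for cp in (0x1F600, 0x1F64F, 0x263A):
    print(f"U+{cp:04X} -> {utf8_hex(cp)}")
# U+1F600 -> F0 9F 98 80
# U+1F64F -> F0 9F 99 8F
# U+263A -> E2 98 BA
```

A byte-level rule is then a regular expression over sequences like these, written with \xNN escapes, rather than over the characters themselves.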
Re: Detect Emoticons in Subject: CHAOS
Posted by Benny Pedersen <me...@junc.eu>.
On 2021-05-20 22:33, Clive Jacques wrote:
> Here is a good example of such an email (attached, stripped of
> identifying info).
This attachment is suspicious because its type doesn't match the type
declared in the message. If you do not trust the sender, you shouldn't
open it in the browser because it may contain malicious contents.
Expected: text/plain (.txt); found: message/rfc822 (.eml)
Should I ignore Roundcube warnings? :)
Re: Detect Emoticons in Subject: CHAOS
Posted by Clive Jacques <we...@gmail.com>.
Here is a good example of such an email (attached, stripped of identifying
info).
On Thu, May 20, 2021 at 4:03 PM RW <rw...@googlemail.com> wrote:
> On Thu, 20 May 2021 15:35:21 -0400
> Jared Hall wrote:
>
> > Clive Jacques wrote:
>
> > > # Local Rule for Emoticons in subject
> > > subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> >
> > The following regex will detect a good amount of Emojis:
> >
> >
> > /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug
>
> That doesn't work in SA for the same reason that \p{Emoticons}
> doesn't work.
>
Re: Detect Emoticons in Subject: CHAOS
Posted by RW <rw...@googlemail.com>.
On Thu, 20 May 2021 15:35:21 -0400
Jared Hall wrote:
> Clive Jacques wrote:
> > # Local Rule for Emoticons in subject
> > subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
>
> The following regex will detect a good amount of Emojis:
>
> /[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug
That doesn't work in SA for the same reason that \p{Emoticons}
doesn't work.
Re: Detect Emoticons in Subject: CHAOS
Posted by Jared Hall <ja...@jaredsec.com>.
Clive Jacques wrote:
> Hi,
>
> I've been using SA a long time. Lately, I'm getting more and more
> spam with emoticons in the subject line. I'd say about 90% of my
> emails with emoticons in the subject are spam. I'd like to create a
> local rule which scores email with emoticons in the subject. I saw a
> previous discussion on this in the archive, but it was focused on
> whether such emails were /always/ spam. I think an emoticon rule, in
> combination with other rules, will help my installation. I've tried
> to match as follows, but it won't lint. I'm not really a perl
> programmer. I've written several other more conventional local rules,
> but here I'm a bit out of my depth. I'd appreciate some guidance.
>
> # Local Rule for Emoticons in subject
> subject EMOTICON_IN_SUBJECT Subject =~ /\p{Emoticons}/
> score EMOTICON_IN_SUBJECT 3.0
> describe EMOTICON_IN_SUBJECT Subject Line Has Emoticons
>
> -CJ
The following regex will detect a good amount of Emojis:
/[\u{1f300}-\u{1f5ff}\u{1f900}-\u{1f9ff}\u{1f600}-\u{1f64f}\u{1f680}-\u{1f6ff}\u{2600}-\u{26ff}\u{2700}-\u{27bf}\u{1f1e6}-\u{1f1ff}\u{1f191}-\u{1f251}\u{1f004}\u{1f0cf}\u{1f170}-\u{1f171}\u{1f17e}-\u{1f17f}\u{1f18e}\u{3030}\u{2b50}\u{2b55}\u{2934}-\u{2935}\u{2b05}-\u{2b07}\u{2b1b}-\u{2b1c}\u{3297}\u{3299}\u{303d}\u{00a9}\u{00ae}\u{2122}\u{23f3}\u{24c2}\u{23e9}-\u{23ef}\u{25b6}\u{23f8}-\u{23fa}]/ug
Ref:
https://stackoverflow.com/questions/43242440/javascript-unicode-emoji-regular-expressions/45138005#45138005
But it is not the greatest thing if you want to get a count out of that.
<toot>
However, I may have a solution for you with the CHAOS plugin:
https://github.com/telecom2k3/CHAOS
You can get (but shouldn't) Emojis even in From names, like this actual one:
DHL☺com
CHAOS will also help you with Unicode Character spoofs, via its
UniBabble rulesets:
ᴀмαzσи ᴘ𝔯𝔦𝔪ё
𝘼𝔪𝔞𝘻𝙤𝘯 𝘾𝘶𝘴𝙩𝙤𝘮𝘦𝘳 𝙎𝔢𝘳𝙫𝘪𝘤𝔢
Amαzoɴ Priⅿë
🅰🅼🅰🆉🅾🅽 🆂🅴🆁🆅🅸🅲🅴
𝐀𝐦𝐚𝐳𝐨𝐧 𝐍𝐨𝐭𝐢𝐜𝐞
...
...
CHAOS will run on Perl 5.18 and later.
</toot>
-- Jared Hall
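
As a rough illustration of why styled-letter spoofs like the ones above are detectable at all, the Python sketch below folds Mathematical Bold letters back to ASCII with Unicode NFKC normalization. This is not how CHAOS works (its internals are not described in this thread); it is a generic standard-library technique, and the helper math_bold exists only to build a test string.

```python
import unicodedata

def math_bold(text: str) -> str:
    """Map ASCII letters to Mathematical Bold letters (U+1D400 onward)."""
    out = []
    for ch in text:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        else:
            out.append(ch)
    return "".join(out)

spoofed = math_bold("Amazon Notice")
folded = unicodedata.normalize("NFKC", spoofed)

print(spoofed != "Amazon Notice")  # True: different code points
print(folded)                      # Amazon Notice
```

NFKC only helps with characters that have a compatibility mapping. Look-alike letters borrowed from other scripts (Greek, Cyrillic and so on) are left unchanged, so they would need a separate confusables table.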