Posted to infrastructure-issues@apache.org by "Chris Lambertus (JIRA)" <ji...@apache.org> on 2016/09/21 18:30:20 UTC

[jira] [Commented] (INFRA-11880) add charset spam filter

    [ https://issues.apache.org/jira/browse/INFRA-11880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15510772#comment-15510772 ] 

Chris Lambertus commented on INFRA-11880:
-----------------------------------------

I've been saving up some non-English spam, and I'm finding that all the samples I've looked at (about 10 so far) do NOT set any kind of language header. The closest they get is base64-encoded UTF-8. The samples have been in Spanish, Arabic, Farsi, Chinese, and Polish. They typically provide only the content-type headers:

Content-Type: multipart/alternative; charset="UTF-8"; boundary="b1_647d3116d671c1c3b62bbc49d7fc1042"
Content-Transfer-Encoding: 8bit


Unless there's a viable alternative to TextCat for language detection, I'm not seeing a way to solve this. 
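
For illustration only, here is a rough sketch of one header-independent check: decode the text parts (including base64-encoded UTF-8) and measure how much of the body is written in a non-Latin script. This is not TextCat and not anything SpamAssassin ships; the function names (body_text, non_latin_ratio, make_sample) are hypothetical, and the boundary in the sample is made up. It would only help with scripts such as Arabic, Farsi, or Chinese. Spanish and Polish use the Latin alphabet, so they would still need real language detection.

```python
import base64
import unicodedata
from email import message_from_bytes
from email.message import Message


def body_text(msg: Message) -> str:
    """Concatenate the decoded text/* parts of a message."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # decode=True undoes base64 / quoted-printable transfer encoding
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name, fall back to UTF-8
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "\n".join(chunks)


def non_latin_ratio(text: str) -> float:
    """Fraction of alphabetic characters whose Unicode name is not LATIN."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    )
    return non_latin / len(letters)


def make_sample(body: str) -> bytes:
    """Build a small message shaped like the spam samples (hypothetical data)."""
    encoded = base64.b64encode(body.encode("utf-8")).decode("ascii")
    return (
        'Content-Type: multipart/alternative; charset="UTF-8"; boundary="b1"\r\n'
        "MIME-Version: 1.0\r\n"
        "\r\n"
        "--b1\r\n"
        'Content-Type: text/plain; charset="UTF-8"\r\n'
        "Content-Transfer-Encoding: base64\r\n"
        "\r\n"
        f"{encoded}\r\n"
        "--b1--\r\n"
    ).encode("ascii")


if __name__ == "__main__":
    msg = message_from_bytes(make_sample("مرحبا بالعالم"))
    print(non_latin_ratio(body_text(msg)))
```

Any cutoff applied to the ratio (say, scoring messages above some fraction) would be a per-list policy decision, which fits the "escape hatch" idea in the ticket description.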

> add charset spam filter
> -----------------------
>
>                 Key: INFRA-11880
>                 URL: https://issues.apache.org/jira/browse/INFRA-11880
>             Project: Infrastructure
>          Issue Type: Planned Work
>          Components: Mailing Lists
>            Reporter: Greg Stein
>            Assignee: Chris Lambertus
>         Attachments: email-1.txt, email-2.txt, email-3.txt, email-4.txt, email-5.txt
>
>
> Much of the recent spam to the lists that I moderate uses Cyrillic or some Middle Eastern character set (Hebrew? Arabic?). We do not allow non-English on *most* of our mailing lists, so such messages should have a seriously high spam score. The same would apply to most/all non-Latin charsets (e.g. also Chinese/Japanese/etc.).
> AOO may have lists where such is allowed, so it may be necessary to have an "escape hatch" for the filter on a per-list basis.
> In email, Chris mentioned that SpamAssassin appears to have some language-based filters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)