Posted to dev@spamassassin.apache.org by bu...@spamassassin.apache.org on 2020/10/26 01:45:41 UTC

[Bug 7866] New: TextCat: Improper language classification on URIs in plain/text

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7866

            Bug ID: 7866
           Summary: TextCat: Improper language classification on URIs in
                    plain/text
           Product: Spamassassin
           Version: 3.4.4
          Hardware: PC
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Plugins
          Assignee: dev@spamassassin.apache.org
          Reporter: jad@aesir.com
  Target Milestone: Undefined

TextCat can misclassify the language of a message when the text/plain part
contains URIs. Here is a sample that was tagged as UNWANTED_BODY_TEXT (TextCat
classified it as sk and cs in this example):

-------

Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

A Weekly Review from AWS

Featured Announcements

Amazon Aurora enables dynamic resizing for database storage space          =
                                                                           =
                      =20
<https://email.awscloud.com/dc/KwqiTCOQ16Q1JCi3MdelD5Wf1a0xWVUzLJ8TEakWNHdl=
A7N3nIa2aQWviXRrQW0g0Nzk3qf9Jbd_7Br-VcC96_vVOrK4bJqlew1KGbdQmMLIlhsNLtVFFTo=
o0oG_f9iDbFtXhfHuZSrhIpoERCR4a4jOBbGd629KotGGay-7-sKFDTCWVGisnhbxOeaG-rBvct=
WHIpIaAIuHyhj21BdtQbvqu9vEkOLb4i9f5WJzjdvSttMrYY5mQQiiAxDzWx90K16R7A5hk3kuc=
4mmg5ogqliI9wKd7lBG1qX0Uis2H9tTvIbdKJhEU2XcxTXIVK0l2bb1qlYvipE7NL9dS516_m6n=
76Y0b_DoQp07kQfyQE3Cm-s5tpwt4oOzzjMzZvKLprHcLw3Lb7I6Pp_5WIyD-ze1ZmT5cFKEF7D=
C_c-BH24c5m2mByYrBLRlvBsCPRNQhAPdJZi-geOf6Jf9J_oeDFtxQwo1yU94FVAb3AB9MBU=3D=
/hOoW4pZ0lT000kthk000MCE>

--------
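
The sample above is shown in its quoted-printable form, where a trailing "="
marks a soft line break. Once decoded, the wrapped URI becomes a single long
token of random-looking characters, which is what the classifier then sees.
A minimal sketch of that decoding step, assuming the core MIME::QuotedPrint
module and using only the first and last lines of the URI from the sample
(the middle lines are left out for brevity):

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp);

# Shortened fragment of the sample: first line, last encoded line, and tail.
my $encoded = join("\n",
    '<https://email.awscloud.com/dc/KwqiTCOQ16Q1JCi3MdelD5Wf1a0xWVUzLJ8TEakWNHdl=',
    'C_c-BH24c5m2mByYrBLRlvBsCPRNQhAPdJZi-geOf6Jf9J_oeDFtxQwo1yU94FVAb3AB9MBU=3D=',
    '/hOoW4pZ0lT000kthk000MCE>',
) . "\n";

# decode_qp removes the "=" soft line breaks and turns =3D back into "=".
my $decoded = decode_qp($encoded);

# Split on whitespace to see how many tokens the decoded URI occupies.
my @tokens = split ' ', $decoded;
printf "%d token(s), longest is %d characters\n",
    scalar(@tokens), length($tokens[0]);
```

Because the decoded URI contains no whitespace, a pattern that consumes
non-whitespace characters after "https://" covers all of it.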

A possible fix is to strip URIs from the body text before classification:

  my $body = $msg->get_rendered_body_text_array();
  $body = join("\n", @{$body});
  $body =~ s/^Subject://i;

  # %%% Make sure that there are no URIs to be evaluated here.
  $body =~ s/https?:\/\/\S+//g;    # BUG fix
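
The effect of the substitution can be checked outside SpamAssassin. The
following is a minimal standalone sketch, not the plugin's actual code:
strip_uris is a hypothetical helper name, and the URI in the sample text is a
shortened stand-in for the one in the report. It differs from the one-line
change above in two ways that are assumptions made here: it stops at "<" and
">" so that a URI wrapped in angle brackets does not leave a stray "<" behind,
and it replaces each URI with a space so that neighbouring words do not run
together.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Mail::SpamAssassin): remove http/https
# URIs from a block of text before handing it to a language classifier.
sub strip_uris {
    my ($text) = @_;
    # Replace each URI with one space; stop at whitespace or angle brackets.
    $text =~ s{https?://[^\s<>]+}{ }gi;
    # Drop angle brackets left empty by the removal, e.g. "< >".
    $text =~ s{<\s*>}{ }g;
    # Collapse runs of spaces and tabs created by the substitutions.
    $text =~ s{[ \t]+}{ }g;
    return $text;
}

# Shortened stand-in for the rendered body of the sample message.
my $body = join("\n",
    "A Weekly Review from AWS",
    "Featured Announcements",
    "Amazon Aurora enables dynamic resizing for database storage space",
    "<https://email.awscloud.com/dc/KwqiTCOQ16Q1JCi3MdelD5Wf1a0x/hOoW4pZ0lT000kthk000MCE>",
);

print strip_uris($body), "\n";
```

Only the English sentences remain for the classifier to score; whether that is
enough text for a reliable result depends on the message.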

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7866] TextCat: Improper language classification on URIs in plain/text

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7866

Giovanni Bechis <gi...@paclan.it> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
                 CC|                            |giovanni@paclan.it
             Status|NEW                         |RESOLVED

--- Comment #1 from Giovanni Bechis <gi...@paclan.it> ---
A similar fix was already present in trunk; it has now been backported to the
3.4 tree in r1883069.
