Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2006/05/29 18:38:50 UTC

[Bug 4927] New: Suggesting a rule to test for double Subject or double From

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927

           Summary: Suggesting a rule to test for double Subject or double
                    From
           Product: Spamassassin
           Version: 3.1.2
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: minor
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: Mark.Martinec@ijs.si


After noticing a spam message with two Subject header fields that
got through, I tested all our site's mail traffic for a couple of days,
watching for messages with multiple occurrences of header fields
which (according to RFC 2822) may occur at most once.
Here is a suggested new rule:

header __DOUBLE_SUBJ  ALL =~ /^Subject:.*^Subject:/smi
header __DOUBLE_FROM  ALL =~ /^From:.*^From:/smi
meta     DOUBLE_SUBJ_OR_FROM  __DOUBLE_SUBJ || __DOUBLE_FROM
describe DOUBLE_SUBJ_OR_FROM Contains more than one Subject or From header
score    DOUBLE_SUBJ_OR_FROM 2.0
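
To try the same patterns outside SpamAssassin, here is a minimal Python
sketch. It assumes the whole header block is available as one string
(roughly what the ALL pseudo-header provides) and maps the /smi modifiers
to Python's DOTALL, MULTILINE and IGNORECASE flags. The function name
double_subj_or_from is made up for this illustration.

```python
import re

# Same patterns as the proposed rules; /s, /m, /i map to the flags below.
FLAGS = re.DOTALL | re.MULTILINE | re.IGNORECASE
DOUBLE_SUBJ = re.compile(r"^Subject:.*^Subject:", FLAGS)
DOUBLE_FROM = re.compile(r"^From:.*^From:", FLAGS)


def double_subj_or_from(headers: str) -> bool:
    """Return True if the header block has two Subject or two From fields.

    `headers` is assumed to be the raw header block as a single string,
    one field per line (hypothetical input format for this sketch).
    """
    return bool(DOUBLE_SUBJ.search(headers) or DOUBLE_FROM.search(headers))


if __name__ == "__main__":
    sample = "Subject: first\nFrom: a@example.com\nSubject: second\n"
    print(double_subj_or_from(sample))
```

With MULTILINE set, ^ matches at the start of every line, and DOTALL lets
.* span the lines between the two occurrences.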

Here is the analysis.
First, looking at message counts with multiple header fields:

  count  multiple header fields present
  -----  ------------------------------
  160    Subject  
  173    From
  122    From AND Subject
  333    From OR  Subject
  37     Subject AND NOT From
  52     From AND NOT Subject
  47     Message-ID
  6      Reply-To
  5      Sender
  6      To
  0      Cc

Seems like multiple Cc, To, Sender and Reply-To are infrequent
and probably not worth the trouble.

Multiple Message-ID fields occur more frequently, but according to the
attached diagram they seem to occur in non-spam mail as well(?), so such a
rule could trigger false positives (but it may be useful to re-evaluate this).

Presence of multiple From or multiple Subject header fields seems to be
a very good indication of spam, with not a single FP in my three-day
sample. The two messages that did score below 5 were manually re-checked
and turned out to be spam or a crippled spam message.

A remaining question is how to combine the __DOUBLE_SUBJ and __DOUBLE_FROM
tests: score each one individually, or score a metarule on some
combination of the two (OR, AND, AND NOT).

Manually checking messages that match 'Subject AND NOT From'
as well as 'From AND NOT Subject' doesn't make me believe these
two would be more useful than each rule individually.

Although 'From AND Subject' hits quite frequently, it doesn't have
fewer false positives or an improved hit rate. Seems like 'From OR Subject'
covers most cases with good quality, which makes me suggest a single
DOUBLE_SUBJ_OR_FROM metarule rather than scoring the individual
DOUBLE_SUBJ / DOUBLE_FROM rules.

It would be interesting to see how automatic score assignment evaluates the rule.

As an illustration, there are two diagrams attached; the second one
is just a magnified left-hand side detail of the first one.
The x-axis shows the distribution (centiles) of all mail which hits each
rule, and the y-axis is the score that SA assigned to a message (SA 3.1.2,
all usual network tests enabled, bayes, razor, dcc, common SARE rules).



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4927] Suggesting a rule to test for double Subject or double From

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927





------- Additional Comments From Mark.Martinec@ijs.si  2006-05-29 16:40 -------
Created an attachment (id=3528)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3528&action=view)
Diagram: spam score vs. distribution for each proposed rule





[Bug 4927] Suggesting a rule to test for double Subject or double From

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|                            |FIXED




------- Additional Comments From felicity@apache.org  2006-11-22 19:30 -------
This one already existed, for Content-Type:
  0.018   0.0205   0.0000    1.000   0.71    1.34  HEADER_COUNT_CTYPE

Here are my test rules:
  0.115   0.1333   0.0000    1.000   1.00    1.00  HEADER_COUNT_SUBJECT
  0.015   0.0000   0.1064    0.000   0.00    1.00  HEADER_COUNT_FROM

So I'm adding in the SUBJECT rule, but leaving out the FROM rule.
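
A minimal sketch of how the S/O figures above can be read. It assumes the
columns follow the usual hit-frequencies layout (overall %, spam %, ham %,
S/O, rank, score, name) and that S/O is simply spam % divided by
(spam % + ham %). Both the layout and the helper name s_o_ratio are
assumptions made for this illustration, not taken from the comment above.

```python
def s_o_ratio(spam_pct: float, ham_pct: float) -> float:
    """Share of a rule's hits that fall on spam, from per-corpus hit rates.

    Assumed formula: spam% / (spam% + ham%). Raises ValueError when the
    rule never hit, since the ratio is undefined in that case.
    """
    total = spam_pct + ham_pct
    if total == 0:
        raise ValueError("rule has no hits; S/O is undefined")
    return spam_pct / total


if __name__ == "__main__":
    # (spam %, ham %) pairs copied from the rows quoted above.
    rows = {
        "HEADER_COUNT_CTYPE": (0.0205, 0.0000),
        "HEADER_COUNT_SUBJECT": (0.1333, 0.0000),
        "HEADER_COUNT_FROM": (0.0000, 0.1064),
    }
    for name, (spam_pct, ham_pct) in rows.items():
        print(f"{s_o_ratio(spam_pct, ham_pct):.3f}  {name}")
```

Under that reading, the FROM rule hit only non-spam in this corpus, which is
consistent with it being left out.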




[Bug 4927] Suggesting a rule to test for double Subject or double From

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927


vseerror@lehigh.edu changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |vseerror@lehigh.edu




------- Additional Comments From vseerror@lehigh.edu  2006-11-02 12:54 -------
In addition, see bug 1239. Spammers are using multiple Subject lines to fool
some mail clients that don't consistently use the first Subject line.




[Bug 4927] Suggesting a rule to test for double Subject or double From

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927





------- Additional Comments From peter@unlikejam.dreamhost.com  2006-05-29 22:46 -------
Related to this rule: currently, emails that have illegal multiple "From" headers can cause dud
entries in the AWL table.  These entries may contain spaces or special characters like <.  I tracked
down the related code at one stage but don't have it handy (I think it was related to a join, with
spaces, that assumed there was only one array entry for the requested header).  I guess this should
be a separate AWL-related bug.

mysql> select email from awl where email like "% %";
meung@ultraz.dk <meung meung@ultraz.dk
_pickofthemonth2972@awplay.com <pickofthemonth2972_@inpuj.com pickofthemonth2972@awplay.com
_jason _@example.com
_cedric roman stagy_@umpire.com
_ommonwealth bankcustomersupport_@no.domain.name.given
5 rows in set (0.10 sec)

Sample headers from an offending email:
To: superjoker <Su...@ultraz.dk>
Subject: Meung@ultraz.dk
From: Meung@ultraz.dk <Meung
Content-Type: multipart/mixed; boundary=13642f531fca8abc354f6fadb0b9b531
MIME-Version: 1.0
From: Meung@ultraz.dk
Subject: The hottest end of the year issue
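
The suspected mechanism (a space join over what was assumed to be a single
header value) can be sketched in Python. This is only an illustration of the
idea, not the actual AWL code; the variable names are made up, and the header
block is a trimmed copy of the sample above with the To line left out.

```python
from email import message_from_string

# Trimmed copy of the sample headers above; note the two From fields.
RAW_HEADERS = (
    "Subject: Meung@ultraz.dk\n"
    "From: Meung@ultraz.dk <Meung\n"
    "MIME-Version: 1.0\n"
    "From: Meung@ultraz.dk\n"
    "Subject: The hottest end of the year issue\n"
    "\n"
)

msg = message_from_string(RAW_HEADERS)

# get_all() returns every occurrence of the field, in order.
from_values = msg.get_all("From", [])

# Naive key: join with spaces as if there could only be one value.
naive_key = " ".join(from_values).lower()

# A caller could detect the illegal duplicate before building the key.
has_duplicate_from = len(from_values) > 1

if __name__ == "__main__":
    print(repr(naive_key))
    print(has_duplicate_from)
```

The joined, lower-cased value has the same shape as the first row of the
query output above.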





[Bug 4927] Suggesting a rule to test for double Subject or double From

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927





------- Additional Comments From Mark.Martinec@ijs.si  2006-05-29 16:41 -------
Created an attachment (id=3529)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3529&action=view)
detail of the previous attachment, magnified low-end scores





Re: [Bug 4927] New: Suggesting a rule to test for double Subject or double From

Posted by Szalay Attila <sa...@balabit.hu>.
Hi All!

On Mon, 2006-05-29 at 16:38 +0000,
bugzilla-daemon@bugzilla.spamassassin.org wrote:
> http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927
> 
>   count  multiple header fields present
>   -----  ------------------------------
>   160    Subject                 a
>   173    From                    b
>   122    From AND Subject        c
>   333    From OR  Subject        d
>   37     Subject AND NOT From    e
>   52     From AND NOT Subject    f

I know that it's not too important (for the original discussion) but
these numbers must be wrong, because there are some relations between
them. I put a letter after each line and wrote out the relations that
should hold.

c + e = a (122 + 37 = 160 => 159 = 160)
c + f = b (122 + 52 = 173 => 174 = 173)
e + f + c = d (37 + 52 + 122 = 333 => 211 = 333)

The third one is why I am writing this letter. I could write out some other
relations too (d - a = f, etc.) but those are just restatements
of these three.
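
The same consistency checks can be run mechanically. A minimal Python
sketch, using the letters assigned to the table rows above; the variable
names checks, mismatches and union are made up for this illustration.

```python
# Counts from the quoted table, keyed by the letters assigned above.
a, b, c, d, e, f = 160, 173, 122, 333, 37, 52

# Each relation as (left-hand side, right-hand side); they should be equal
# if the counts describe the same set of messages.
checks = {
    "c + e = a": (c + e, a),
    "c + f = b": (c + f, b),
    "e + f + c = d": (e + f + c, d),
}

mismatches = {
    label: (lhs, rhs) for label, (lhs, rhs) in checks.items() if lhs != rhs
}

# Inclusion-exclusion: messages with a double From OR a double Subject.
union = a + b - c

if __name__ == "__main__":
    for label, (lhs, rhs) in checks.items():
        status = "ok" if lhs == rhs else "MISMATCH"
        print(f"{label}: {lhs} vs {rhs} -> {status}")
    print("a + b - c =", union)
    print("a + b     =", a + b)
```

Note that a + b is exactly 333, so the reported OR count may be the plain sum
of the two rows without the overlap subtracted; that is only a guess at where
the number came from.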

-- 
Szalay Attila                     BalaBit IT Biztonságtechnikai Kft.
tel:(36-1)-371-05-40              1116 Bp. Csurgoi ut 20/b
fax:(36-1)-208-08-75              http://www.balabit.hu/

[Bug 4927] Suggesting a rule to test for double Subject or double From

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4927


felicity@apache.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |3.2.0





