Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2004/01/27 09:54:00 UTC

[Bug 2970] New: "longwords" rules

http://bugzilla.spamassassin.org/show_bug.cgi?id=2970

           Summary: "longwords" rules
           Product: Spamassassin
           Version: SVN Trunk (Latest Devel Version)
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: spamassassin-dev@incubator.apache.org
        ReportedBy: jm@jmason.org


Robert Menschel <Robert at Menschel.net> said:

Date: Sun, 25 Jan 2004 22:37:03 -0800
Subject: [SAtalk] Longwords

Received an email this morning which reminded me about my longwords
rules, which apparently got lost when I migrated my mass-check system
from my mail server to my PC.

This was my exploration of the random words spammers have been including
at the bottom of their emails, or in their text portions, or in their
invisible text, to confuse some anti-spam software. (I call these words
Bayes Fodder, since over time it seems they are helping my Bayes identify
spam better and better and better.)

Anyway, I rebuilt, reran, refined, and:

Section 3 -- Frequencies Log
(First numeric frequencies, followed by percentage frequencies)

OVERALL     SPAM      HAM     S/O    RANK   SCORE  NAME
  91714    74113    17601    0.808   0.00    0.00  (all messages)
   7431     7429        2    0.999   1.00   3.00  RM_bpt_longwords68a
   6596     6595        1    0.999   0.98   1.00  RM_bpt_longwords69a
   4163     4163        0    1.000   0.71   2.00  RM_bpt_longwords78a
   8761     8753        8    0.996   0.51   3.00  RM_bpt_longwords59a
   2950     2950        0    1.000   0.48   1.00  RM_bpt_longwords79a
   1162     1162        0    1.000   0.15   4.00  RM_bpt_longwords96a
   1025     1025        0    1.000   0.13   4.00  RM_bpt_longwords88a
    590      590        0    1.000   0.05   1.00  RM_bpt_longwords89a
    545      545        0    1.000   0.04   3.00  RM_bpt_longwords97
    442      442        0    1.000   0.02   1.00  RM_bpt_longwords98
    330      330        0    1.000   0.00   1.00  RM_bpt_longwords99

OVERALL%   SPAM%     HAM%     S/O    RANK   SCORE  NAME
  91714    74113    17601    0.808   0.00    0.00  (all messages)
100.000  80.8088  19.1912    0.808   0.00    0.00  (all messages as %)
  8.102  10.0239   0.0114    0.999   1.00    3.00  RM_bpt_longwords68a
  7.192   8.8986   0.0057    0.999   0.98    1.00  RM_bpt_longwords69a
  4.539   5.6171   0.0000    1.000   0.71    2.00  RM_bpt_longwords78a
  9.553  11.8103   0.0455    0.996   0.51    3.00  RM_bpt_longwords59a
  3.217   3.9804   0.0000    1.000   0.48    1.00  RM_bpt_longwords79a
  1.267   1.5679   0.0000    1.000   0.15    4.00  RM_bpt_longwords96a
  1.118   1.3830   0.0000    1.000   0.13    4.00  RM_bpt_longwords88a
  0.643   0.7961   0.0000    1.000   0.05    1.00  RM_bpt_longwords89a
  0.594   0.7354   0.0000    1.000   0.04    3.00  RM_bpt_longwords97
  0.482   0.5964   0.0000    1.000   0.02    1.00  RM_bpt_longwords98
  0.360   0.4453   0.0000    1.000   0.00    1.00  RM_bpt_longwords99

These scores are, of course, tuned for my 9.0 required hits, so you'll
probably want to lower them. Depending on your system, an initial score of
0.5 or 1.0 for each rule might be worthwhile; you can then raise the scores
gradually if this spam continues to sneak past your system.

In my 19k corpus, one ham matches three of these rules, two of which I've
scored at 3.0, and so that ham gets a score of 7.0 of 9. I may be
reducing those rules to 2.5 or 2.0 instead of 3.0 once I complete my next
global mass-check. So yes, caution is advised.
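
The arithmetic behind that 7.0 can be sketched as follows. Only three rules
in the table have any ham hits (59a, 68a and 69a), so those are taken to be
the three rules the ham matched. This is an illustrative Python sketch; the
helper names total_score and with_reduced are hypothetical, and the scores
are the ones listed in the rules in this message.

```python
REQUIRED_HITS = 9.0  # the threshold mentioned in the message

# Scores of the three rules that show ham hits in the frequency table.
current = {
    "RM_bpt_longwords59a": 3.0,
    "RM_bpt_longwords68a": 3.0,
    "RM_bpt_longwords69a": 1.0,
}


def total_score(scores, hit_rules):
    """Sum the scores of every rule a message hit."""
    return sum(scores[name] for name in hit_rules)


def with_reduced(scores, new_value, old_value=3.0):
    """Copy of scores with every old_value entry lowered to new_value."""
    return {name: (new_value if s == old_value else s)
            for name, s in scores.items()}


hits = list(current)  # the ham hits all three rules
for label, scores in [("current", current),
                      ("3.0 -> 2.5", with_reduced(current, 2.5)),
                      ("3.0 -> 2.0", with_reduced(current, 2.0))]:
    total = total_score(scores, hits)
    print(f"{label:12s} total={total:.1f} flagged={total >= REQUIRED_HITS}")
```

With these three rules alone the ham stays under 9.0 in every case; the
reduction simply leaves more headroom for whatever other rules it hits.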

Bob Menschel

body     RM_bpt_longwords68a /\b(?:[a-z]{6,}\s+){8}/
describe RM_bpt_longwords68a Long string of long words
score    RM_bpt_longwords68a 3.000  # 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list,
                                    # "improving compatibility between computer platforms demands certain levels "
body     RM_bpt_longwords69a /\b(?:[a-z]{6,}\s+){9}/
describe RM_bpt_longwords69a Long string of long words
score    RM_bpt_longwords69a 1.000  # type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords78a /\b(?:[a-z]{7,}\s+){8}/
describe RM_bpt_longwords78a Long string of long words
score    RM_bpt_longwords78a 2.000  # type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords59a /\b(?:[a-z]{5,}\s+){9}/
describe RM_bpt_longwords59a Long string of long words
score    RM_bpt_longwords59a 3.000  # 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
                                    # ham: userid list
body     RM_bpt_longwords79a /\b(?:[a-z]{7,}\s+){9}/
describe RM_bpt_longwords79a Long string of long words
score    RM_bpt_longwords79a 1.000  # type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords96a /\b(?:[a-z]{9,}\s+){6}/
describe RM_bpt_longwords96a Long string of long words
score    RM_bpt_longwords96a 4.000  # 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords88a /\b(?:[a-z]{8,}\s+){8}/
describe RM_bpt_longwords88a Long string of long words
score    RM_bpt_longwords88a 4.000  # 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords89a /\b(?:[a-z]{8,}\s+){9}/
describe RM_bpt_longwords89a Long string of long words
score    RM_bpt_longwords89a 1.000  # type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords97 /\b(?:\w{9,}\s+){7}/
describe RM_bpt_longwords97 Long string of long words
score    RM_bpt_longwords97 3.000  # 545s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords98 /\b(?:\w{9,}\s+){8}/
describe RM_bpt_longwords98 Long string of long words
score    RM_bpt_longwords98 1.000  # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
body     RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
describe RM_bpt_longwords99 Long string of long words
score    RM_bpt_longwords99 1.000  # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
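
To see what these patterns actually catch, a few of them can be tried
against the ham phrase quoted in the 68a comment. This is a rough Python
sketch using the re module, not SpamAssassin itself: SpamAssassin
preprocesses the message body before applying body rules and uses Perl's
regex engine, so real results may differ. The helper name matching_rules is
hypothetical.

```python
import re

# Four of the patterns from the rules above, copied verbatim.
RULES = {
    "RM_bpt_longwords59a": r"\b(?:[a-z]{5,}\s+){9}",
    "RM_bpt_longwords68a": r"\b(?:[a-z]{6,}\s+){8}",
    "RM_bpt_longwords69a": r"\b(?:[a-z]{6,}\s+){9}",
    "RM_bpt_longwords78a": r"\b(?:[a-z]{7,}\s+){8}",
}
COMPILED = {name: re.compile(pattern) for name, pattern in RULES.items()}


def matching_rules(text):
    """Return the names of the rules whose pattern is found in text."""
    return [name for name, rx in COMPILED.items() if rx.search(text)]


# The ham phrase from the 68a comment: eight lowercase words of six or
# more letters, each followed by whitespace (note the trailing space).
ham_fragment = ("improving compatibility between computer "
                "platforms demands certain levels ")

print(matching_rules(ham_fragment))
print(matching_rules(ham_fragment.rstrip()))
```

The fragment has exactly eight qualifying words, which is enough for 68a
but not for the nine-word rules, and "levels" is too short for 78a. Each
word must be followed by whitespace, so stripping the trailing space leaves
only seven complete repetitions and nothing matches.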




Given a hit rate of 10% with an S/O of 0.999, we gotta apply them ;)
Adding to SVN now.

(PS: as I read the new Apache 2.0 license, we no longer need to verify CLA
receipt for patches/new rules sent by non-committers.  right?)


