Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/01/24 09:47:41 UTC

[Bug 4100] New: The ranking measure in hit-frequencies is suspicious

http://bugzilla.spamassassin.org/show_bug.cgi?id=4100

           Summary: The ranking measure in hit-frequencies is suspicious
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Masses
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: qa@ccert.edu.cn


Dear colleagues,

In the source code of masses/hit-frequencies, the ranking measure is based on 
the following formula:

#             sum                                    P(X = x ^ C = c)
# IG(X,C) = x in [0, 1]    P(X = x ^ C = c) . log2( ------------------- )
#           c in [Ch, Cs]                           P(X = x) . P(C = c)

This formula may be useful for a general categorization problem, but not for 
SpamAssassin. The reasons are:

1. In a general categorization problem, we are interested in words that can 
predict either class, ham or spam. In SpamAssassin, however, we are only 
interested in words that can predict the spam class. (I assume that ham rules 
are not a good choice.)

2. In a general categorization problem, we are interested in the presence as 
well as the absence of a word in an e-mail. In SpamAssassin, however, we are 
only interested in the presence of a word in an e-mail.

In my opinion, the measure should be:

$rank =

                            P(X = 1 ^ C = Cs)
(P(X = 1 ^ C = Cs) . log2( ---------------------- )) /
                            P(X = 1) . P(C = Cs)

                            P(X = 1 ^ C = Ch)
(P(X = 1 ^ C = Ch) . log2( ---------------------- ))
                            P(X = 1) . P(C = Ch)

Just a suggestion.

Best Regards,
Quang-Anh Tran



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4100





------- Additional Comments From jm@jmason.org  2007-04-19 02:13 -------
(just found this old bug)

we should try Quang-Anh's suggestion... the IG ranking measure has never worked
very well compared to the simpler one, but maybe his variant would be more
useful.  I'll try it.




[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4100





------- Additional Comments From jm@jmason.org  2007-04-19 04:22 -------
well, that *really* didn't improve the output I'm afraid ;)
Given these logs:

ham.log:
Y  1 /file2 B,D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes

spam.log:
Y  1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B time=1157091898,scantime=6,format=m,reuse=yes
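
For anyone who wants to check the numbers by hand, here is a minimal Python sketch that tallies per-rule hits from the two sample logs and derives OVERALL, SPAM%, HAM% and S/O. It is not taken from hit-frequencies itself. The field layout (flag, score, path, comma-separated rule names, metadata) and the S/O definition spam% / (spam% + ham%) are assumptions, and the function names are made up for illustration.

```python
from collections import Counter

# Sample logs copied from this comment.
HAM_LOG = """\
Y  1 /file2 B,D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
"""

SPAM_LOG = """\
Y  1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B time=1157091898,scantime=6,format=m,reuse=yes
Y  1 /file1 A,B time=1157091898,scantime=6,format=m,reuse=yes
"""


def count_hits(log_text):
    """Return (message count, Counter mapping rule -> messages hit).

    Assumes whitespace-separated fields: flag, score, path, rules, metadata,
    and that every line lists at least one rule.
    """
    total = 0
    hits = Counter()
    for line in log_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 4)
        if len(fields) < 4:
            continue
        total += 1
        hits.update(set(fields[3].split(",")))  # count each rule once per message
    return total, hits


def frequencies(spam_log, ham_log):
    """Return rule -> (overall %, spam %, ham %, S/O)."""
    n_spam, spam_hits = count_hits(spam_log)
    n_ham, ham_hits = count_hits(ham_log)
    rows = {}
    for rule in sorted(set(spam_hits) | set(ham_hits)):
        spam_pct = 100.0 * spam_hits[rule] / n_spam
        ham_pct = 100.0 * ham_hits[rule] / n_ham
        overall = 100.0 * (spam_hits[rule] + ham_hits[rule]) / (n_spam + n_ham)
        s_o = spam_pct / (spam_pct + ham_pct)  # assumed definition of S/O
        rows[rule] = (overall, spam_pct, ham_pct, s_o)
    return rows


if __name__ == "__main__":
    for rule, row in frequencies(SPAM_LOG, HAM_LOG).items():
        print("%7.3f %8.4f %8.4f %7.3f  %s" % (row + (rule,)))
```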




Here's what the old (non-IG) rank measure output (./hit-frequencies -x -p 
-c=/dev/null):

OVERALL    SPAM%     HAM%     S/O    RANK   SCORE  NAME
      0        5        5    0.500   0.00    0.00  (all messages)
0.00000  50.0000  50.0000    0.500   0.00    0.00  (all messages as %)
 50.000  100.0000   0.0000    1.000   1.00    0.00  A
 30.000  60.0000   0.0000    1.000   0.75    0.00  C
 60.000  100.0000  20.0000    0.833   0.75    0.00  B
 50.000   0.0000  100.0000    0.000   0.00    0.00  D

as you can see, A is the best rule so should be at the top.  B should
probably be higher than C.  D should be last (since it's a very good ham
rule).

old IG output (./hit-frequencies -x -p -i -c=/dev/null):

OVERALL    SPAM%     HAM%     S/O      IG   SCORE  NAME
      0        5        5    0.500   0.00    0.00  (all messages)
0.00000  50.0000  50.0000    0.500   0.00    0.00  (all messages as %)
 50.000  100.0000   0.0000    1.000   1.00    0.00  A
 50.000   0.0000  100.0000    0.000   1.00    0.00  D
 60.000  100.0000  20.0000    0.833   0.35    0.00  B
 30.000  60.0000   0.0000    1.000   0.00    0.00  C

not bad -- there's a bug in that D is treated as equally good as A;
really, it should be at the end.  but B is listed higher than C.
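
The IG column above can be checked with a small Python sketch of the IG(X,C) formula from the original report, applied to the hit counts in the sample logs. The min-max rescaling to [0, 1] at the end is an assumption on my part; it is used here only because it gives the same 1.00 / 1.00 / 0.35 / 0.00 values as the output above. The function names are illustrative, not taken from hit-frequencies.

```python
from math import log2

# rule -> (spam hits, ham hits), counted from the sample logs (5 spam, 5 ham)
COUNTS = {"A": (5, 0), "B": (5, 1), "C": (3, 0), "D": (0, 5)}
N_SPAM, N_HAM = 5, 5


def information_gain(spam_hits, ham_hits, n_spam, n_ham):
    """Sum over x in {0, 1} and c in {ham, spam} of
    P(x ^ c) * log2(P(x ^ c) / (P(x) * P(c)))."""
    total = n_spam + n_ham
    p_class = {"spam": n_spam / total, "ham": n_ham / total}
    joint = {
        (1, "spam"): spam_hits / total,
        (1, "ham"): ham_hits / total,
        (0, "spam"): (n_spam - spam_hits) / total,
        (0, "ham"): (n_ham - ham_hits) / total,
    }
    p_x = {x: joint[(x, "spam")] + joint[(x, "ham")] for x in (0, 1)}
    ig = 0.0
    for (x, c), p in joint.items():
        if p > 0:  # a zero-probability cell contributes nothing
            ig += p * log2(p / (p_x[x] * p_class[c]))
    return ig


def min_max(values):
    """Rescale a dict of scores to [0, 1] (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    if hi == lo:
        return {k: 0.0 for k in values}
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}


raw = {r: information_gain(s, h, N_SPAM, N_HAM) for r, (s, h) in COUNTS.items()}
scaled = min_max(raw)

if __name__ == "__main__":
    for rule in sorted(scaled, key=scaled.get, reverse=True):
        print("%s  raw=%.3f  scaled=%.2f" % (rule, raw[rule], scaled[rule]))
```

Since IG is symmetric in the two classes, a perfect ham rule like D carries exactly as much information as a perfect spam rule like A, which is why the two tie.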

new algorithm (same cmdline):

OVERALL    SPAM%     HAM%     S/O      IG   SCORE  NAME
      0        5        5    0.500   0.00    0.00  (all messages)
0.00000  50.0000  50.0000    0.500   0.00    0.00  (all messages as %)
 50.000  100.0000   0.0000    1.000   1.00    0.00  A
 30.000  60.0000   0.0000    1.000   0.60    0.00  C
 50.000   0.0000  100.0000    0.000   0.00    0.00  D
 60.000  100.0000  20.0000    0.833   0.00    0.00  B

there's a problem here in that B is listed last.  not sure why...
I'll attach the patch to hit-frequencies if anyone wants to have 
a look.
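
As a starting point for looking into this, here is a Python sketch that evaluates the two terms of Quang-Anh's proposed measure directly on the sample counts. This is not the attached patch, and I am not claiming it matches what the patch does; the function names are hypothetical. With these counts, the ham term for B comes out negative (B hits ham less often than independence would predict), so the ratio for B is negative, and the ham term for A and C is zero, so the ratio is undefined there. That may be related to B sorting last, but nothing in this thread confirms it.

```python
from math import log2

# rule -> (spam hits, ham hits), counted from the sample logs (5 spam, 5 ham)
COUNTS = {"A": (5, 0), "B": (5, 1), "C": (3, 0), "D": (0, 5)}
N_SPAM, N_HAM = 5, 5


def term(hits_in_class, hits_total, n_class, n_total):
    """P(X=1 ^ C=c) * log2(P(X=1 ^ C=c) / (P(X=1) * P(C=c)))."""
    joint = hits_in_class / n_total
    if joint == 0:
        return 0.0  # 0 * log2(0) taken as 0
    p_hit = hits_total / n_total
    p_class = n_class / n_total
    return joint * log2(joint / (p_hit * p_class))


def proposed_rank(spam_hits, ham_hits, n_spam, n_ham):
    """Return (spam term, ham term, ratio); ratio is None when undefined."""
    n_total = n_spam + n_ham
    hits_total = spam_hits + ham_hits
    spam_term = term(spam_hits, hits_total, n_spam, n_total)
    ham_term = term(ham_hits, hits_total, n_ham, n_total)
    ratio = spam_term / ham_term if ham_term != 0 else None
    return spam_term, ham_term, ratio


if __name__ == "__main__":
    for rule, (s, h) in COUNTS.items():
        print(rule, proposed_rank(s, h, N_SPAM, N_HAM))
```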






[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100





------- Additional Comments From quinlan@pathname.com  2005-01-24 02:36 -------
Note: the IG function is not used by default.  The rank used is simply:

   RANK(rule) = (percentile(wanted) + percentile(unwanted))/2

It's important that the rank function factor in both the wanted hits as well
as the unwanted ones, whatever it is.
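
To make the formula concrete, here is a minimal Python sketch under stated assumptions. It reads percentile() as a dense rank over the distinct values, scaled to [0, 1], with "wanted" being the spam hit rate (higher is better) and "unwanted" the ham hit rate (lower is better). That reading is a guess, not the hit-frequencies code; it is chosen because it gives 1.00 / 0.75 / 0.75 / 0.00 for the rules A / B / C / D in the sample RANK column posted in this thread. The function names are illustrative.

```python
# rule -> (spam %, ham %) from the sample output posted in this thread
RATES = {"A": (100.0, 0.0), "B": (100.0, 20.0), "C": (60.0, 0.0), "D": (0.0, 100.0)}


def dense_percentile(values, lower_is_better=False):
    """Map each key to the rank of its value among the distinct values,
    scaled to [0, 1]; the best value gets 1.0 (assumed meaning of percentile)."""
    distinct = sorted(set(values.values()), reverse=lower_is_better)
    if len(distinct) == 1:
        return {k: 1.0 for k in values}
    top = len(distinct) - 1
    return {k: distinct.index(v) / top for k, v in values.items()}


def simple_rank(rates):
    """RANK(rule) = (percentile(wanted) + percentile(unwanted)) / 2."""
    wanted = dense_percentile({r: spam for r, (spam, ham) in rates.items()})
    unwanted = dense_percentile(
        {r: ham for r, (spam, ham) in rates.items()}, lower_is_better=True
    )
    return {r: (wanted[r] + unwanted[r]) / 2 for r in rates}


if __name__ == "__main__":
    ranks = simple_rank(RATES)
    for rule in sorted(ranks, key=ranks.get, reverse=True):
        print("%s  %.2f" % (rule, ranks[rule]))
```

Under this reading both halves already reward spam hits and penalize ham hits, so a rule that hits all spam but some ham (B) and a rule that hits less spam but no ham (C) end up tied.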





[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100





------- Additional Comments From qa@ccert.edu.cn  2005-01-25 22:20 -------
Since ham-rules (rules with score < 0) should be avoided, the formula should be:

RANK(rule) = (percentile(wanted) - percentile(unwanted))/2




[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100


Bob@Menschel.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|triage                      |
   Target Milestone|Undefined                   |Future







[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100


Bob@Menschel.net changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |triage




------- Additional Comments From Bob@Menschel.net  2005-04-07 23:25 -------
> Ham rules are okay if they cannot be used by spammers.  We only have a few of
> those.

Actually, "we" have lots of those.  I use several dozen myself.  As long as
they're private, mostly domain- or system-specific, and not readily forgeable,
then they work fine. 

And in those cases, flagged with tflags nice, the hit frequencies ranking
algorithms work just fine for me. 

Questions to the devs: If the simpler ranking algorithm is used, should the IG
algorithm be removed from the program? 

If the IG algorithm is used in specific cases, should it be replaced with
Quang-Anh's suggestion for those cases? 




[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4100





------- Additional Comments From jm@jmason.org  2007-04-19 04:23 -------
Created an attachment (id=3914)
 --> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3914&action=view)
patch





[Bug 4100] The ranking measure in hit-frequencies is suspicious

Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100





------- Additional Comments From quinlan@pathname.com  2005-01-26 00:54 -------
Subject: Re:  The ranking measure in hit-frequencies is suspicious

> Since ham-rules (rules with score < 0) should be avoided, the formula
> should be:

Ham rules are okay if they cannot be used by spammers.  We only have a
few of those.




