Posted to dev@spamassassin.apache.org by bu...@bugzilla.spamassassin.org on 2005/01/24 09:47:41 UTC
[Bug 4100] New: The ranking measure in hit-frequencies is suspicious
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
Summary: The ranking measure in hit-frequencies is suspicious
Product: Spamassassin
Version: unspecified
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P5
Component: Masses
AssignedTo: dev@spamassassin.apache.org
ReportedBy: qa@ccert.edu.cn
Dear colleagues,
In the source code of masses/hit-frequencies, the ranking measure is based on
the following formula:
\[
IG(X,C) = \sum_{x \in \{0,1\}} \sum_{c \in \{C_h, C_s\}} P(X = x \wedge C = c) \cdot \log_2 \frac{P(X = x \wedge C = c)}{P(X = x) \cdot P(C = c)}
\]
This formula may be useful for a general categorization problem, but not for
SpamAssassin. The reasons are:
1. For a general categorization problem, we are interested in words that can
indicate either class, ham or spam. In SpamAssassin, however, we are only
interested in words that indicate spam. (I assume that ham rules are not a good
choice.)
2. For a general categorization problem, we are interested in the presence as
well as the absence of a word in an e-mail. In SpamAssassin, however, we are
only interested in the presence of a word in an e-mail.
In my opinion, the measure should be:
\[
\mathrm{rank} = \frac{P(X = 1 \wedge C = C_s) \cdot \log_2 \frac{P(X = 1 \wedge C = C_s)}{P(X = 1) \cdot P(C = C_s)}}{P(X = 1 \wedge C = C_h) \cdot \log_2 \frac{P(X = 1 \wedge C = C_h)}{P(X = 1) \cdot P(C = C_h)}}
\]
Just a suggestion.
Best Regards,
Quang-Anh Tran
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4100
------- Additional Comments From jm@jmason.org 2007-04-19 02:13 -------
(just found this old bug)
we should try Quang-Anh's suggestion... the IG ranking measure has never worked
very well compared to the simpler one, but maybe his variant would be more
useful. I'll try it.
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4100
------- Additional Comments From jm@jmason.org 2007-04-19 04:22 -------
well, that *really* didn't improve the output I'm afraid ;)
Given these logs:
ham.log:
Y 1 /file2 B,D time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file2 D time=1157091898,scantime=6,format=m,reuse=yes
spam.log:
Y 1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file1 A,B,C time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file1 A,B time=1157091898,scantime=6,format=m,reuse=yes
Y 1 /file1 A,B time=1157091898,scantime=6,format=m,reuse=yes
Here's what the old (non-IG) rank measure output (./hit-frequencies -x -p
-c=/dev/null):
OVERALL   SPAM%     HAM%      S/O    RANK  SCORE  NAME
0         5         5         0.500  0.00  0.00   (all messages)
0.00000   50.0000   50.0000   0.500  0.00  0.00   (all messages as %)
50.000    100.0000  0.0000    1.000  1.00  0.00   A
30.000    60.0000   0.0000    1.000  0.75  0.00   C
60.000    100.0000  20.0000   0.833  0.75  0.00   B
50.000    0.0000    100.0000  0.000  0.00  0.00   D
as you can see, A is the best rule so should be at the top. B should
probably be higher than C. D should be last (since it's a very good ham
rule).
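For anyone who wants to check the SPAM%, HAM%, and S/O columns by hand, here is a rough Python sketch that recomputes them from the sample logs above. It is not the hit-frequencies code. The helper names (`count_hits`, `frequencies`) are made up for illustration, the parser assumes the fourth whitespace-separated field is the comma-separated rule list and that every message hits at least one rule, and S/O is assumed to be spam% / (spam% + ham%), which is consistent with the 0.833 shown for B.

```python
from collections import Counter

META = "time=1157091898,scantime=6,format=m,reuse=yes"
# Sample logs from this comment, rebuilt line by line.
HAM_LOG = "\n".join(f"Y 1 /file2 {r} {META}" for r in ["B,D", "D", "D", "D", "D"])
SPAM_LOG = "\n".join(
    f"Y 1 /file1 {r} {META}" for r in ["A,B,C", "A,B,C", "A,B,C", "A,B", "A,B"]
)

def count_hits(log_text):
    """Return (message_count, Counter of rule hits) for one log."""
    hits, total = Counter(), 0
    for line in log_text.splitlines():
        fields = line.split()
        if line.startswith("#") or len(fields) < 4:
            continue  # skip comment lines and anything that is not a result line
        total += 1
        hits.update(rule for rule in fields[3].split(",") if rule)
    return total, hits

def frequencies(spam_log, ham_log):
    """Map rule name -> (overall%, spam%, ham%, S/O)."""
    n_spam, spam_hits = count_hits(spam_log)
    n_ham, ham_hits = count_hits(ham_log)
    table = {}
    for rule in sorted(set(spam_hits) | set(ham_hits)):
        spam_pct = 100.0 * spam_hits[rule] / n_spam
        ham_pct = 100.0 * ham_hits[rule] / n_ham
        overall = 100.0 * (spam_hits[rule] + ham_hits[rule]) / (n_spam + n_ham)
        # Assumed definition of S/O; a listed rule hits at least once,
        # so the denominator is never zero.
        so = spam_pct / (spam_pct + ham_pct)
        table[rule] = (overall, spam_pct, ham_pct, so)
    return table

for rule, (overall, spam_pct, ham_pct, so) in frequencies(SPAM_LOG, HAM_LOG).items():
    print(f"{overall:8.3f} {spam_pct:9.4f} {ham_pct:9.4f} {so:6.3f}  {rule}")
```

The sketch prints rules in alphabetical order rather than by rank, since ranking is the part under discussion here.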
old IG output (./hit-frequencies -x -p -i -c=/dev/null):
OVERALL   SPAM%     HAM%      S/O    IG    SCORE  NAME
0         5         5         0.500  0.00  0.00   (all messages)
0.00000   50.0000   50.0000   0.500  0.00  0.00   (all messages as %)
50.000    100.0000  0.0000    1.000  1.00  0.00   A
50.000    0.0000    100.0000  0.000  1.00  0.00   D
60.000    100.0000  20.0000   0.833  0.35  0.00   B
30.000    60.0000   0.0000    1.000  0.00  0.00   C
not bad -- there's a bug in that D is treated as equally good as A;
really, it should be at the end. but B is listed higher than C.
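The tie between A and D follows from the IG formula itself, which sums over both classes and over both hit and miss. The Python sketch below evaluates that formula on the hit counts from the sample logs. It is an illustration only, not the hit-frequencies implementation: the function name `information_gain` is hypothetical, and the final min-max rescaling to the range 0..1 is an assumption that happens to reproduce the IG column above.

```python
from math import log2

# (spam_hits, ham_hits) per rule, counted from the sample logs in this comment.
HITS = {"A": (5, 0), "B": (5, 1), "C": (3, 0), "D": (0, 5)}
N_SPAM, N_HAM = 5, 5

def information_gain(spam_hits, ham_hits, n_spam=N_SPAM, n_ham=N_HAM):
    """Sum of P(x,c) * log2(P(x,c) / (P(x) * P(c))) over hit/miss and spam/ham."""
    total = n_spam + n_ham
    # Joint counts: x = 1 means the rule hit, x = 0 means it did not.
    joint = {
        (1, "spam"): spam_hits, (0, "spam"): n_spam - spam_hits,
        (1, "ham"): ham_hits, (0, "ham"): n_ham - ham_hits,
    }
    p_hit = (spam_hits + ham_hits) / total
    p_x = {1: p_hit, 0: 1.0 - p_hit}
    p_c = {"spam": n_spam / total, "ham": n_ham / total}
    ig = 0.0
    for (x, c), count in joint.items():
        p_xc = count / total
        if p_xc > 0:  # a zero joint probability contributes nothing
            ig += p_xc * log2(p_xc / (p_x[x] * p_c[c]))
    return ig

raw = {rule: information_gain(*hits) for rule, hits in HITS.items()}
# Assumed rescaling step: map the smallest value to 0 and the largest to 1.
lo, hi = min(raw.values()), max(raw.values())
scaled = {rule: (value - lo) / (hi - lo) for rule, value in raw.items()}

for rule in sorted(raw, key=raw.get, reverse=True):
    print(f"{rule}  raw={raw[rule]:.3f}  scaled={scaled[rule]:.2f}")
# A and D both have raw IG 1.0: a perfect spam indicator and a perfect ham
# indicator carry the same information about the class.
```

Under these assumptions B comes out at about 0.61 raw (0.35 scaled) and C at about 0.40 raw (0.00 scaled).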
new algorithm (same cmdline):
OVERALL   SPAM%     HAM%      S/O    IG    SCORE  NAME
0         5         5         0.500  0.00  0.00   (all messages)
0.00000   50.0000   50.0000   0.500  0.00  0.00   (all messages as %)
50.000    100.0000  0.0000    1.000  1.00  0.00   A
30.000    60.0000   0.0000    1.000  0.60  0.00   C
50.000    0.0000    100.0000  0.000  0.00  0.00   D
60.000    100.0000  20.0000   0.833  0.00  0.00   B
there's a problem here in that B is listed last. not sure why...
I'll attach the patch to hit-frequencies if anyone wants to have
a look.
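One possible explanation, not checked against the attached patch: evaluating the proposed ratio directly on the sample counts gives a negative value for B and an undefined value (zero denominator) for A and C. The Python sketch below is only that direct evaluation. The names `class_term` and `proposed_rank` are hypothetical, and treating a class with no hits as contributing 0 is an assumption.

```python
from math import log2

# (spam_hits, ham_hits) per rule, counted from the sample logs in this comment.
HITS = {"A": (5, 0), "B": (5, 1), "C": (3, 0), "D": (0, 5)}
N_SPAM, N_HAM = 5, 5

def class_term(class_hits, n_class, all_hits, total):
    """P(X=1 ^ C=c) * log2(P(X=1 ^ C=c) / (P(X=1) * P(C=c))) for one class."""
    if class_hits == 0:
        return 0.0  # assumption: no hits in this class contributes nothing
    p_joint = class_hits / total
    return p_joint * log2(p_joint / ((all_hits / total) * (n_class / total)))

def proposed_rank(spam_hits, ham_hits, n_spam=N_SPAM, n_ham=N_HAM):
    """Return (spam_term, ham_term, ratio); ratio is None if ham_term is 0."""
    total = n_spam + n_ham
    all_hits = spam_hits + ham_hits
    spam_term = class_term(spam_hits, n_spam, all_hits, total)
    ham_term = class_term(ham_hits, n_ham, all_hits, total)
    ratio = None if ham_term == 0.0 else spam_term / ham_term
    return spam_term, ham_term, ratio

for rule, hits in HITS.items():
    spam_term, ham_term, ratio = proposed_rank(*hits)
    shown = "undefined" if ratio is None else f"{ratio:.2f}"
    print(f"{rule}  spam_term={spam_term:.3f}  ham_term={ham_term:.3f}  ratio={shown}")
```

B is the only rule here that hits both classes. It hits ham less often than independence would predict, so the log in its ham term is negative and the ratio comes out at roughly -2.32. If negative values end up clamped or sorted to the bottom, that would put B last. The spam terms for A and C (0.5 and 0.3) are in the same proportion as the 1.00 and 0.60 shown above, which may mean only the numerator survives when the denominator is zero, but that is a guess.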
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
------- Additional Comments From quinlan@pathname.com 2005-01-24 02:36 -------
Note: the IG function is not used by default. The rank used is simply:
RANK(rule) = (percentile(wanted) + percentile(unwanted))/2
It's important that the rank function factor in both the wanted hits as well
as the unwanted ones, whatever it is.
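As a rough illustration of that rank, here is a Python sketch of one reading of "percentile": the position of a rule's hit rate among the distinct hit rates of all rules, scaled to 0..1, where more wanted hits and fewer unwanted hits are both better. This reading is an assumption, as are the function names; it was chosen because it reproduces the RANK column of the sample output posted elsewhere in this bug (A 1.00, C 0.75, B 0.75, D 0.00). All four rules are treated as spam rules, so wanted means spam hits and unwanted means ham hits.

```python
# (spam%, ham%) per rule, from the sample hit-frequencies output in this bug.
FREQS = {"A": (100.0, 0.0), "B": (100.0, 20.0), "C": (60.0, 0.0), "D": (0.0, 100.0)}

def percentile(value, values, higher_is_better):
    """Position of value among the distinct values, scaled so 1.0 is best."""
    ordered = sorted(set(values), reverse=not higher_is_better)
    if len(ordered) == 1:
        return 1.0  # every rule ties; treat the tie as best
    return ordered.index(value) / (len(ordered) - 1)

def rank(freqs):
    """RANK(rule) = (percentile(wanted) + percentile(unwanted)) / 2."""
    wanted = {rule: spam for rule, (spam, ham) in freqs.items()}
    unwanted = {rule: ham for rule, (spam, ham) in freqs.items()}
    return {
        rule: (
            percentile(wanted[rule], wanted.values(), higher_is_better=True)
            + percentile(unwanted[rule], unwanted.values(), higher_is_better=False)
        ) / 2
        for rule in freqs
    }

ranks = rank(FREQS)
for rule in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{ranks[rule]:.2f}  {rule}")
```

Because the unwanted percentile is already inverted in this reading (fewer unwanted hits gives a higher value), adding the two terms rewards a rule for both properties at once.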
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
------- Additional Comments From qa@ccert.edu.cn 2005-01-25 22:20 -------
Since ham rules (rules with score < 0) should be avoided, the formula should be:
RANK(rule) = (percentile(wanted) - percentile(unwanted))/2
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
Bob@Menschel.net changed:
What             |Removed    |Added
----------------------------------------------------------------------------
Keywords         |triage     |
Target Milestone |Undefined  |Future
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
Bob@Menschel.net changed:
What             |Removed    |Added
----------------------------------------------------------------------------
Keywords         |           |triage
------- Additional Comments From Bob@Menschel.net 2005-04-07 23:25 -------
> Ham rules are okay if they cannot be used by spammers. We only have a few of
> those.
Actually, "we" have lots of those. I use several dozen myself. As long as
they're private, mostly domain- or system-specific, and not readily forgeable,
then they work fine.
And in those cases, flagged with tflags nice, the hit frequencies ranking
algorithms work just fine for me.
Questions to the devs: If the simpler ranking algorithm is used, should the IG
algorithm be removed from the program?
If the IG algorithm is used in specific cases, should it be replaced with
Quang-Anh's suggestion for those cases?
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://issues.apache.org/SpamAssassin/show_bug.cgi?id=4100
------- Additional Comments From jm@jmason.org 2007-04-19 04:23 -------
Created an attachment (id=3914)
--> (http://issues.apache.org/SpamAssassin/attachment.cgi?id=3914&action=view)
patch
[Bug 4100] The ranking measure in hit-frequencies is suspicious
Posted by bu...@bugzilla.spamassassin.org.
http://bugzilla.spamassassin.org/show_bug.cgi?id=4100
------- Additional Comments From quinlan@pathname.com 2005-01-26 00:54 -------
Subject: Re: The ranking measure in hit-frequencies is suspicious
> Since ham rules (rules with score < 0) should be avoided, the formula
> should be:
Ham rules are okay if they cannot be used by spammers. We only have a
few of those.