You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@spamassassin.apache.org by Gary Funck <ga...@intrepid.com> on 2004/01/14 18:35:26 UTC

Yet Another Scoring Method (YASM) - Linear Programming

I don't have time to develop this idea further, but thought I'd offer it
for consideration. Let me begin by saying that I think that Henry Stern's
proposal likely has much more promise,
http://bugzilla.spamassassin.org/show_bug.cgi?id=2910
and that I'm describing this idea just in case someone wants to
pursue it further.

The motivation for an alternative scoring technique is to make it more
tractable for users who write their own custom rules and want to
rescore the entire collection of rules, or who want to rescore the rules
based upon their own collection of ham and spam. The current GA framework
as I understand it, isn't set up to allow a quick, effective score
calculation.

A linear programming model generally consists of a set of linear equations
with
inequalities (constraints), and an objective function that is to be
minimized
(or maximized).

I'll sketch the idea out here, but it definitely needs some work to make
this into something useable.

1) constraint: no false positive. For each ham message, find the rules
that hit on that message. Call them R1, R2 ... RN. Generate the following
constraint:
   R1*S1 + R2*S2 ... RN*SN <= 0.8 * C
where S1, S2 ... SN are the respective scores for those rules and C is our
cutoff value (In SA, by default, C = 5.0). The reason for using only 80% as
the cutoff (i.e., 4.0) is to leave some room for error when the rules and
scores are deployed.

2) constraint: no single rule can tag a spam. For all rules, establish the
following constraint: Ri <= 0.8 * C (i.e., Ri <= 4.0), for all i.

3) constraint:  no false negatives. For each spam message which triggers
rules R1, R2 ... RN, establish the following constraint:
   R1*S1 + R2*S2 ... RN*SN >= 1.2 * C
(in SA, C defaults to 5.0, so 1.2 * C is 6.0).

Note that if a set of rules {R1,R2 ... RN} appeared together in both a spam
and ham context, then this constraint and constraint (1) are mutually
exclusive,
and the respective scores couldn't be determined, and those rules must be
excluded entirely. One way to deal with this might be to leave out the spam
constraint entirely and accept false negatives in this case.

4) constraint: no spam feature can be scored less than 0. This would be true
for all
rules which are intended to trigger on spam and not ham. For rules like the
Habeas rule, the constraint can be changed to allow/encourage a negative
score.
for all i in the set of spam detection rules, Ri >= 0.

The objective function:

Maximize Sum(Si*Pi) for all i, where Pi is the probability that rule Ri
is not a false positive. Said differently, if 95% of the time, Rule Ri
appears in spam messages, and thus in 5% of the messages it appears in
ham, then Pi = 0.95. There may be better objective functions, but this
seemed
like a start.

There would be some fall out from the rules above. Because of the rather
severe constraints intended to eliminate false positives, this approach
might have a tendency to set more scores to 0, or a very small number than
the present GA, and other proposed "fuzzy" and non-linear methods of
scoring.

A possible advantage is that LP solvers already exist and should be easy to
plug into the existing framework. The LP solvers should also be fast, esp.
compared to the existing GA framework.

RE: Yet Another Scoring Method (YASM) - Linear Programming

Posted by Henry Stern <he...@stern.ca>.

I did some experiments with linear programming last August.  It doesn't work
very well when you have 900 variables and 500000 constraints, not all of
whom can be satisfied (even one misclassified ham or spam can cause that,
not to mention the overlap).  The LP solvers were also horrendously slow.

Since last July, I have experimented with most of the other basic learning
algorithms without success.  I have done other experiments with linear
programming, ID3 and C4.5 decision trees (too easy to fool), support vector
machines (doesn't learn the cost-sensitive margin in the way we'd want) and
naïve Bayesian networks.  Scott Long was going to look at simulated
annealing but I'm not sure what became of that.

As my schedule permits*, I'll be looking at using domain decomposition to
develop multiple score sets for SpamAssassin.  The justification behind this
is that a message coming from North America and "complying with CAN-SPAM"
can be looked at differently than a message coming from an offshore server
with forged headers and other tricks up the wazoo.

Henry

* The research project that I'm working on finishes up in 3 months.  If some
nice company out there would like to fund my research, I could dedicate more
time to this (up to 16 hours per week).  It's probably tax deductible too!
;)

> -----Original Message-----
> From: Gary Funck [mailto:gary@intrepid.com]
> Sent: January 14, 2004 1:35 PM
> To: Spam Assassin Dev
> Cc: Henry Stern
> Subject: Yet Another Scoring Method (YASM) - Linear Programming
> 
> 
> I don't have time to develop this idea further, but thought I'd offer it
> for consideration. Let me begin by saying that I think that Henry Stern's
> proposal likely has much more promise,
> http://bugzilla.spamassassin.org/show_bug.cgi?id=2910
> and that I'm describing this idea just in case someone wants to
> pursue it further.
> 
> The motivation for an alternative scoring technique is to make it more
> tractable for users who write their own custom rules and want to
> rescore the entire collection of rules, or who want to rescore the rules
> based upon their own collection of ham and spam. The current GA framework
> as I understand it, isn't set up to allow a quick, effective score
> calculation.
> 
> A linear programming model generally consists of a set of linear equations
> with
> inequalities (constraints), and an objective function that is to be
> minimized
> (or maximized).
> 
> I'll sketch the idea out here, but it definitely needs some work to make
> this into something useable.
> 
> 1) constraint: no false positive. For each ham message, find the rules
> that hit on that message. Call them R1, R2 ... RN. Generate the following
> constraint:
>    R1*S1 + R2*S2 ... RN*SN <= 0.8 * C
> where S1, S2 ... SN are the respective scores for those rules and C is our
> cutoff value (In SA, by default, C = 5.0). The reason for using only 80%
> as
> the cutoff (i.e., 4.0) is to leave some room for error when the rules and
> scores are deployed.
> 
> 2) constraint: no single rule can tag a spam. For all rules, establish the
> following constraint: Ri <= 0.8 * C (i.e., Ri <= 4.0), for all i.
> 
> 3) constraint:  no false negatives. For each spam message which triggers
> rules R1, R2 ... RN, establish the following constraint:
>    R1*S1 + R2*S2 ... RN*SN >= 1.2 * C
> (in SA, C defaults to 5.0, so 1.2 * C is 6.0).
> 
> Note that if a set of rules {R1,R2 ... RN} appeared together in both a
> spam
> and ham context, then this constraint and constraint (1) are mutually
> exclusive,
> and the respective scores couldn't be determined, and those rules must be
> excluded entirely. One way to deal with this might be to leave out the
> spam
> constraint entirely and accept false negatives in this case.
> 
> 4) constraint: no spam feature can be scored less than 0. This would be
> true
> for all
> rules which are intended to trigger on spam and not ham. For rules like
> the
> Habeas rule, the constraint can be changed to allow/encourage a negative
> score.
> for all i in the set of spam detection rules, Ri >= 0.
> 
> The objective function:
> 
> Maximize Sum(Si*Pi) for all i, where Pi is the probability that rule Ri
> is not a false positive. Said differently, if 95% of the time, Rule Ri
> appears in spam messages, and thus in 5% of the messages it appears in
> ham, then Pi = 0.95. There may be better objective functions, but this
> seemed
> like a start.
> 
> There would be some fall out from the rules above. Because of the rather
> severe constraints intended to eliminate false positives, this approach
> might have a tendency to set more scores to 0, or a very small number than
> the present GA, and other proposed "fuzzy" and non-linear methods of
> scoring.
> 
> A possible advantage is that LP solvers already exist and should be easy
> to
> plug into the existing framework. The LP solvers should also be fast, esp.
> compared to the existing GA framework.