Posted to dev@spamassassin.apache.org by Daniel Quinlan <qu...@pathname.com> on 2005/07/22 21:08:50 UTC

Stanford logistic regression results

I'm not sure if I forwarded these before, but while here at CEAS at
Stanford we've been talking a bit with Mike Brzozowski and Andrew
Y. Ng about these results from last year on 3.0.

I think the (very preliminary) logistic regression results warrant
further testing.

Mike is going to dig up his code for us.

------- start of cut text --------------
From: Mike Brzozowski
To: Henry Stern
Cc: Daniel Quinlan, Mike Brzozowski, Andrew Y. Ng, Rajat Raina, David Willcox,
    Thomas Trappenberg
Subject: Re: initial findings

Hi everyone,

After rerunning our logistic regression trial with 90-10 splits I was
able to obtain similar values:

(threshold = 0: min confidence of spam = 50%)
false positive: mean=0.149% std=0.157%
false negative: mean=0.559% std=0.211%

(threshold = -1: min confidence of spam = 27%)
false positive: mean=0.238% std=0.193%
false negative: mean=0.330% std=0.210%
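(The "min confidence" figures above appear to be the score threshold
pushed through the logistic (sigmoid) function; a quick sketch, with the
function name mine rather than from Mike's code:)

```python
import math

def sigmoid(score: float) -> float:
    """Map a raw log-odds score to a spam probability."""
    return 1.0 / (1.0 + math.exp(-score))

# Threshold 0 corresponds to 50% confidence; threshold -1 to ~27%.
print(round(sigmoid(0.0) * 100))   # -> 50
print(round(sigmoid(-1.0) * 100))  # -> 27
```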

LR seems to be more resilient to the spam-heavy bias in the feature set
when negative weights are allowed; in test runs, an average of 26% of
the features received negative weights and 73% received positive
weights.

We also tried applying a "cost" to make misclassifying ham more
expensive. This takes the form of modifying the objective function to
scale the certainty of ham classifications by a constant factor
/alpha/. With alpha = 0.25 we obtained similar results:

(threshold = 0: min confidence of spam = 50%)
false positive: mean=0% std=0%
false negative: mean=0.619% std=0.311%
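(For concreteness, one plausible form of the alpha-scaled objective is
a class-weighted negative log-likelihood; the exact placement of alpha
here is my guess, and Mike's code may differ:)

```python
import numpy as np

def weighted_log_loss(y, p, alpha=0.25):
    """Class-weighted negative log-likelihood for logistic regression.

    y: labels (1 = spam, 0 = ham); p: predicted spam probabilities.
    The ham term is scaled by alpha, shifting the trade-off between
    false positives and false negatives. (Assumed form, not Mike's.)
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    spam_term = y * np.log(p)
    ham_term = alpha * (1.0 - y) * np.log(1.0 - p)
    return -np.mean(spam_term + ham_term)
```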

I've attached an FP-FN plot comparing logistic regression for various
alphas to the numbers in CVS for SA 3.0's feature training. We're still
looking into why alpha's effect is non-monotonic.

Do you have any older data sets? We were hoping to do a longitudinal
study on adversarial spam techniques.

Thanks,
Mike
------- end ----------------------------

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Stanford logistic regression results

Posted by Sidney Markowitz <si...@sidney.com>.
Daniel,

If you get a chance to ask Andrew for his thoughts on Support Vector
Machines (SVMs), I would be interested in hearing his answer. I would
like to compare the results of logistic regression with SVM. Perhaps we
can try both on the same data used for the 3.1 perceptron scoring and
compare the results of all three.
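For what it's worth, a minimal sketch of that comparison -- assuming
scikit-learn (any LR/SVM implementation would do) and a synthetic
stand-in for the per-rule hit matrix, since I don't have the 3.1
perceptron data at hand:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_msgs, n_rules = 1000, 50

# Synthetic stand-in: binary rule-hit matrix and noisy linear labels.
X = rng.integers(0, 2, size=(n_msgs, n_rules)).astype(float)
w_true = rng.normal(size=n_rules)
y = (X @ w_true + rng.normal(scale=0.5, size=n_msgs) > 0).astype(int)

for name, clf in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("linear SVM", LinearSVC())]:
    # 10-fold cross-validation, i.e. 90-10 splits as in Mike's trials.
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print(f"{name}: mean accuracy {acc:.3f}")
```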

 -- sidney