Posted to users@spamassassin.apache.org by Duncan Findlay <du...@debian.org> on 2007/04/23 23:18:39 UTC

Score Generation for Apache SpamAssassin

Hi everybody,

As you may already know, Steven Birk and I have been working on our
4th year undergraduate project in Math and Engineering at Queen's
University.

The goal of our project was to examine the use of logistic regression
as a potential replacement for the Perceptron/GA currently used by the
SpamAssassin project.
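
For readers unfamiliar with the idea, here is a minimal sketch of
logistic regression applied to rule scoring. It is not the method from
the thesis, only a toy illustration: the three rules, the six
hand-made messages, the learning rate, and the epoch count are all
assumptions, and it ignores the score restrictions SpamAssassin
imposes. Each message is a list of 0/1 rule hits, and the fitted
weight for a rule plays the role of its score.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_scores(messages, labels, n_rules, lr=0.5, epochs=2000):
    """Fit one weight per rule plus a bias by batch gradient descent on log loss."""
    weights = [0.0] * n_rules
    bias = 0.0
    n = len(messages)
    for _ in range(epochs):
        grad_w = [0.0] * n_rules
        grad_b = 0.0
        for hits, y in zip(messages, labels):
            p = sigmoid(bias + sum(w * h for w, h in zip(weights, hits)))
            err = p - y  # gradient of log loss with respect to the linear term
            for j, h in enumerate(hits):
                grad_w[j] += err * h
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def spam_probability(hits, weights, bias):
    return sigmoid(bias + sum(w * h for w, h in zip(weights, hits)))


# Hypothetical toy corpus: columns are hits for three made-up rules
# (two spam indicators, one ham indicator); label 1 = spam, 0 = ham.
messages = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 1],
]
labels = [1, 1, 1, 0, 0, 0]

weights, bias = fit_scores(messages, labels, n_rules=3)
print("per-rule weights:", weights, "bias:", bias)
```

On this toy data the two spam-indicating rules end up with positive
weights and the ham-indicating rule with a negative one. A real score
generator would also have to respect the allowed score ranges, which
this sketch does not attempt.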

It's now done, and it's available here:
http://people.apache.org/~duncf/FindlayBirkThesis.pdf

Basically, we've found a technique that shows promise as a possible
replacement, but requires some modifications in order to handle some
of the restrictions the SpamAssassin project puts on scores.

I hope to try to make those modifications in the next month or so, but
I have no idea how well it will turn out, or how easy it will be.

The paper may be an interesting read for people not too familiar with
the way the scoring process works now, as it discusses many of the
issues that differentiate the scoring process from most other machine
learning problems. (Then again, it might just be boring.)

Enjoy!

-- 
Duncan Findlay

Re: Score Generation for Apache SpamAssassin

Posted by Giampaolo Tomassoni <g....@libero.it>.
> -----Original Message-----
> From: Duncan Findlay [mailto:duncf@debian.org]
>
> ...omitted...
> 
> (Then again, it might just be boring.)

It didn't seem boring to me. It "smells" like good work, and I'm interested
in seeing the actual results against the 1M-message corpus.

I'm just an SA user, but let me thank you and Birk for your efforts in
improving both ham and spam detection rates.

Giampaolo


> Enjoy!
> 
> --
> Duncan Findlay


Re: Score Generation for Apache SpamAssassin

Posted by Duncan Findlay <du...@debian.org>.
On Thu, Apr 26, 2007 at 12:15:52PM +0100, Justin Mason wrote:
> thanks Duncan -- a great read, and looks promising!

> Would it help btw if we came up with a spec for what a score-generation
> tool needs to generate, in terms of score ranges and so on?
> This would also be useful for the future (I'm sure there'll be
> more... ;)

Probably not to me, but it might be useful to others. (I think I
already know what needs to be done.) Also, it might limit creativity
in possible solutions. We need a score ranges mechanism; we don't need
the specific one we have now.


-- 
Duncan Findlay
