Posted to users@spamassassin.apache.org by "Gabriel M. Wachman" <ga...@tufts.edu> on 2006/03/08 01:21:34 UTC

location of neural net file

It says in the SpamAssassin FAQ that version 3.x uses a neural network
to learn scores of messages. Where is the state of this neural network
saved? In other words, how does SpamAssassin keep track of the neural
network from one invocation to another?


Re: location of neural net file

Posted by Gabriel Wachman <ga...@tufts.edu>.
Theo Van Dinter wrote:
> On Tue, Mar 07, 2006 at 09:41:53PM -0500, Gabriel Wachman wrote:
>   
>> The motivation for this is that I'm comparing a filter a colleague wrote 
>> to various other filters (including SpamAssassin) and I want to make 
>> sure that the summary I give of SpamAssassin in my paper is accurate. 
>> Neural net vs. perceptron is a large distinction in our community, so I 
>> wouldn't want to be wrong about it.
>>     
>
> Unfortunately, most of us aren't qualified to go into the details of
> machine learning techniques, myself included.  There's more information
> on the wiki and in the source tree about how the perceptron works for
> us in some more depth:
>
> http://wiki.apache.org/spamassassin/Perceptron
> http://spamassassin.apache.org/full/3.0.x/dist/masses/README.perceptron
>
> Hope this helps. :)
>
>   
Thank you. That second link is exactly what I was looking for.

Thanks again,
Gabriel

Re: location of neural net file

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Mar 07, 2006 at 09:41:53PM -0500, Gabriel Wachman wrote:
> The motivation for this is that I'm comparing a filter a colleague wrote 
> to various other filters (including SpamAssassin) and I want to make 
> sure that the summary I give of SpamAssassin in my paper is accurate. 
> Neural net vs. perceptron is a large distinction in our community, so I 
> wouldn't want to be wrong about it.

Unfortunately, most of us aren't qualified to go into the details of
machine learning techniques, myself included.  There's more information
on the wiki and in the source tree about how the perceptron works for
us in some more depth:

http://wiki.apache.org/spamassassin/Perceptron
http://spamassassin.apache.org/full/3.0.x/dist/masses/README.perceptron

Hope this helps. :)

Re: location of neural net file

Posted by Gabriel Wachman <ga...@tufts.edu>.
Theo Van Dinter wrote:
> On Tue, Mar 07, 2006 at 08:44:59PM -0500, Gabriel M. Wachman wrote:
>   
>>> The perceptron (form of neural net used in SA 3.0.0 and higher) is used by the
>>> developers to generate the scores prior to release. 99.9% of end-users do not
>>> ever use the perceptron.
>>>
>>>       
>> By "do not use" do you mean that it is completely ignored during
>> classification, or that only the fixed pre-trained neural net is used
>>     
>
> The output from the perceptron is a set of scores (weights) which are used
> during classification.  As Matt said, users tend not to generate their own
> scores, and so don't run the perceptron; they just use the output from when
> it's run pre-release.
>   
OK, I think I see where the confusion is: is it a perceptron or a neural 
net? For anyone who doesn't know, a perceptron is a single-element 
neural net, if one wanted to call it that, but really it's just a linear 
classifier. There are two reasons why it seems highly unlikely that 
SpamAssassin's scores were trained with a multi-layer neural net. 
1) Back-propagation is an algorithm for training multi-layer neural nets, 
so it does not really make sense in the context of training a perceptron 
(there's no hidden layer to back-propagate to). 2) You can't save "scores" 
from a multi-layer neural net as "if feature X is 1, add Y to the score." 
Neural nets compute complex functions that aren't simple weighted sums of 
features (and if they are simple weighted sums of features, just use a 
perceptron). That may be the crux of my confusion, since if there is a 
neural net somewhere, it needs to be running inside SpamAssassin during 
classification (even if it does not update itself). If it's just a 
perceptron, then I see how this works.
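
To make the distinction concrete, here is a minimal sketch of the textbook 
perceptron learning rule on binary features. It is not SpamAssassin's 
actual training code (that lives in the masses directory of the source 
tree); the feature layout, training data, and function names are all 
hypothetical. The point is only that the trained model reduces to one 
weight per feature, which can be saved as "if feature X is 1, add Y to 
the score."

```python
# Textbook perceptron on 0/1 features. Illustrative assumption only:
# this is NOT how SpamAssassin's masses tools are implemented.

def train_perceptron(examples, n_features, epochs=20, lr=1.0):
    """examples: list of (features, label); features are 0/1, label is +1 or -1."""
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for x, y in examples:
            activation = bias + sum(w * xi for w, xi in zip(weights, x))
            predicted = 1 if activation >= 0 else -1
            if predicted != y:
                # Mistake-driven update: nudge the weights toward the true label.
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def score(weights, features):
    """Sum the weights of the features that fired: 'if X is 1, add Y'."""
    return sum(w for w, xi in zip(weights, features) if xi)


def predict(weights, bias, features):
    """Classify with the same linear rule used during training."""
    return 1 if bias + score(weights, features) >= 0 else -1


# Hypothetical data: features 0 and 1 are "spammy" tests, feature 2 is "hammy".
EXAMPLES = [
    ([1, 1, 0], 1),
    ([1, 0, 0], 1),
    ([0, 0, 1], -1),
    ([0, 1, 1], -1),
    ([0, 0, 0], -1),
]

if __name__ == "__main__":
    w, b = train_perceptron(EXAMPLES, n_features=3)
    print("per-feature weights:", w, "bias:", b)
```

Once training is done, classification needs only the weight list and a 
threshold; nothing resembling a network has to run at classification time.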

The motivation for this is that I'm comparing a filter a colleague wrote 
to various other filters (including SpamAssassin) and I want to make 
sure that the summary I give of SpamAssassin in my paper is accurate. 
Neural net vs. perceptron is a large distinction in our community, so I 
wouldn't want to be wrong about it.

Thanks again,
Gabriel

Re: location of neural net file

Posted by Theo Van Dinter <fe...@apache.org>.
On Tue, Mar 07, 2006 at 08:44:59PM -0500, Gabriel M. Wachman wrote:
> > The perceptron (form of neural net used in SA 3.0.0 and higher) is used by the
> > developers to generate the scores prior to release. 99.9% of end-users do not
> > ever use the perceptron.
> > 
> By "do not use" do you mean that it is completely ignored during
> classification, or that only the fixed pre-trained neural net is used

The output from the perceptron is a set of scores (weights) which are used
during classification.  As Matt said, users tend not to generate their own
scores, and so don't run the perceptron; they just use the output from when
it's run pre-release.

> and the end-user does not change it? If it's not used at all, why does
> the FAQ state, "In SpamAssassin 3.x, the scores are assigned using a
> neural network trained with error back propagation"?

Because the scores are assigned by the perceptron.  Your confusion seems to be
related to how SpamAssassin works in general:

A mail is sent to SpamAssassin through some means (spamassassin, spamc/spamd,
third-party tool, etc.).  SpamAssassin reads in all of the config files,
including the scores (as generated by the perceptron), and runs all of the
rules over the message.  At the end, the scores for all rules that matched are
summed and the result is used to determine ham vs. spam (by default, if the
total is >= 5, the message is considered spam).
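
The classification step described above can be sketched roughly as follows.
This is only an illustration: the rule names and score values are made up,
and real SpamAssassin reads its scores from its config files rather than
from a Python dictionary.

```python
# Hypothetical rule names and scores, chosen for illustration only.
SCORES = {
    "RULE_A": 2.5,
    "RULE_B": 2.0,
    "RULE_C": 1.5,
    "RULE_D": -1.0,  # a rule that counts in favour of ham
}

SPAM_THRESHOLD = 5.0  # the default threshold mentioned above


def classify(matched_rules, scores=SCORES, threshold=SPAM_THRESHOLD):
    """Sum the scores of the rules that matched and compare to the threshold."""
    total = sum(scores.get(rule, 0.0) for rule in matched_rules)
    return total, total >= threshold


if __name__ == "__main__":
    total, is_spam = classify(["RULE_A", "RULE_B", "RULE_C"])
    print(total, "spam" if is_spam else "ham")
```

The perceptron never appears here; only the scores it produced before
release do.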

Hopefully this helps.

> Let me make sure I understand: the Bayes database is the primary form of
> customization done by default, although the underlying pre-trained
> neural net is the primary method for weighting scores? If you just run
> spamassassin with the default settings, is it not using the neural net
> to weight the various test scores? Or is the neural net itself its own
> separate test?

The neural net (perceptron) generates the default scores.  The Bayes database
is used by the BAYES_* rules to determine statistically if the message content
is spam or not.

Re: location of neural net file

Posted by "Gabriel M. Wachman" <ga...@tufts.edu>.
Matt Kettler wrote:
> Gabriel M. Wachman wrote:
>> It says in the SpamAssassin FAQ that version 3.x uses a neural network
>> to learn scores of messages. Where is the state of this neural network
>> saved? In other words, how does SpamAssassin keep track of the neural
>> network from one invocation to another?
> 
> It doesn't keep a separate neural network for each site. Period.
> 
OK
> The perceptron (form of neural net used in SA 3.0.0 and higher) is used by the
> developers to generate the scores prior to release. 99.9% of end-users do not
> ever use the perceptron.
> 
By "do not use" do you mean that it is completely ignored during
classification, or that only the fixed pre-trained neural net is used
and the end-user does not change it? If it's not used at all, why does
the FAQ state, "In SpamAssassin 3.x, the scores are assigned using a
neural network trained with error back propagation"?

> Most users stick to just training a Bayes database. This generally offers all
> the custom per-site training most places need. By default the Bayes database
> lives in the home directory of the user invoking SA (~/.spamassassin/bayes_*),
> but it can be configured to use a single site-wide file or a SQL server.
Let me make sure I understand: the Bayes database is the primary form of
customization done by default, although the underlying pre-trained
neural net is the primary method for weighting scores? If you just run
spamassassin with the default settings, is it not using the neural net
to weight the various test scores? Or is the neural net itself its own
separate test?


Thank you for the information,
Gabriel

Re: location of neural net file

Posted by Matt Kettler <mk...@evi-inc.com>.
Gabriel M. Wachman wrote:
> It says in the SpamAssassin FAQ that version 3.x uses a neural network
> to learn scores of messages. Where is the state of this neural network
> saved? In other words, how does SpamAssassin keep track of the neural
> network from one invocation to another?

It doesn't keep a separate neural network for each site. Period.

The perceptron (the form of neural net used in SA 3.0.0 and higher) is used by
the developers to generate the scores prior to release. 99.9% of end-users
never use the perceptron.

If for some reason you want to generate your own scoreset, you can do so
using the tools located in the /masses subdirectory of the tarball. However,
this requires a large corpus of spam and nonspam to work against. I don't
recommend it unless your site's email is extremely unusual.

Most users stick to just training a Bayes database. This generally offers all
the custom per-site training most places need. By default the Bayes database
lives in the home directory of the user invoking SA (~/.spamassassin/bayes_*),
but it can be configured to use a single site-wide file or a SQL server.
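
As a rough illustration, a site-wide setup might look something like the
local.cf fragment below. The option names are the usual SpamAssassin Bayes
settings, but the path, file mode, DSN, and credentials are placeholder
assumptions; check the Mail::SpamAssassin::Conf documentation for your
version before relying on any of it.

```
# Option A: one shared file-based database (path and mode are placeholders)
bayes_path          /var/spamassassin/bayes/bayes
bayes_file_mode     0770

# Option B: SQL-backed storage (DSN and credentials are placeholders)
bayes_store_module  Mail::SpamAssassin::BayesStore::SQL
bayes_sql_dsn       DBI:mysql:spamassassin:localhost
bayes_sql_username  sa_user
bayes_sql_password  sa_password
```

Use one option or the other, not both.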