Posted to users@spamassassin.apache.org by Bookworm <qm...@bkwm.com> on 2005/05/01 18:37:41 UTC

Question about Bayes training - mozilla specifically

I've read through the archives several times, and hoped that over the 
last year or so someone would build the functionality, or at least 
mention it one way or another - I haven't seen it.

Is there any way to take an already trained Mozilla bayes structure and 
hand it directly off to SpamAssassin?  For me, at least, that would 
eliminate almost all of the spam my server is receiving - Mozilla spots 
it instantly, but SpamAssassin is missing at least half.

Troy Belding
Bookworm Computing

Re: Question about Bayes training - mozilla specifically

Posted by Bookworm <qm...@bkwm.com>.
Jo wrote:

> Bookworm wrote:
>
>> I've read through the archives several times, and hoped that over the 
>> last year or so someone would build the functionality, or at least 
>> mention it one way or another - I haven't seen it.
>>
>> Is there any way to take an already trained Mozilla bayes structure 
>> and hand it directly off to SpamAssassin?  For me, at least, that 
>> would eliminate almost all of the spam my server is receiving - 
>> Mozilla spots it instantly, but SpamAssassin is missing at least half.
>>
>> Troy Belding
>> Bookworm Computing
>
>
> Mozilla stores its mail in mbox format, so you can simply use your 
> good folders (one mbox each) for training HAM and your Junk folders 
> for training SPAM. Just go and have a look in the file system, where 
> Mozilla stores its files. mbox-files typically don't have an extension.
>
> Jo
>
>
The issue is not so much that - I've dumped all my ham/spam through
spamassassin - it's still not as good.  The only thing I can see that's
different is that Mozilla MUST have its own bayes database that isn't
dependent upon the actual email folders themselves. (I stopped storing
all the junk mail when I reached about 15,000).  I have no clue where
that is, but I thought maybe someone here did, and knew how to convert
it to something that spamassassin could use.

Oh well - I'll try the mbox deal later.  I only have about 80,000 emails
I could process through..

Thanks!
Troy



Re: Question about Bayes training - mozilla specifically

Posted by Jo <ml...@winfix.IT>.
Bookworm wrote:

> I've read through the archives several times, and hoped that over the 
> last year or so someone would build the functionality, or at least 
> mention it one way or another - I haven't seen it.
>
> Is there any way to take an already trained Mozilla bayes structure 
> and hand it directly off to SpamAssassin?  For me, at least, that 
> would eliminate almost all of the spam my server is receiving - 
> Mozilla spots it instantly, but SpamAssassin is missing at least half.
>
> Troy Belding
> Bookworm Computing

Mozilla stores its mail in mbox format, so you can simply use your good 
folders (one mbox each) for training HAM and your Junk folders for 
training SPAM. Just go and have a look in the file system, where Mozilla 
stores its files. mbox-files typically don't have an extension.
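
A minimal sketch of that approach in Python, assuming the mail directory
holds one extension-less mbox file per folder (with .msf index files and
.sbd subfolder directories next to them) and that the junk folder is
named "Junk". The function names, the junk_names parameter and the
dry_run switch are made up for illustration. sa-learn's --ham, --spam
and --mbox options are real, but check sa-learn --help on your install
before running anything.

```python
import subprocess
from pathlib import Path


def find_mbox_files(mail_dir):
    """Return regular files without an extension under mail_dir.

    Assumption: this is how Mozilla mbox folders look on disk. Index
    files (*.msf) and other *.dat files are skipped by the suffix test.
    """
    root = Path(mail_dir)
    return sorted(p for p in root.rglob("*") if p.is_file() and p.suffix == "")


def sa_learn_command(mbox_path, is_spam):
    """Build the sa-learn argument list for a single mbox file."""
    kind = "--spam" if is_spam else "--ham"
    return ["sa-learn", kind, "--mbox", str(mbox_path)]


def train(mail_dir, junk_names=("Junk",), dry_run=True):
    """Build one sa-learn command per folder; run them unless dry_run.

    Every folder not listed in junk_names is treated as ham, so move
    Trash, Drafts and similar folders out of the way first.
    """
    commands = [
        sa_learn_command(p, is_spam=p.name in junk_names)
        for p in find_mbox_files(mail_dir)
    ]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)
    return commands
```

Calling train("/path/to/Mail/account") returns the commands without
running them; pass dry_run=False once the list looks right.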

Jo

Re: Question about Bayes training - mozilla specifically

Posted by Stuart Johnston <st...@ebby.com>.
Michael Parker wrote:
> On Mon, May 02, 2005 at 03:44:25PM -0500, Stuart Johnston wrote:
> 
>>Bookworm wrote:
>>
>>>I've read through the archives several times, and hoped that over the 
>>>last year or so someone would build the functionality, or at least 
>>>mention it one way or another - I haven't seen it.
>>>
>>>Is there any way to take an already trained Mozilla bayes structure and 
>>>hand it directly off to SpamAssassin?  For me, at least, that would 
>>>eliminate almost all of the spam my server is receiving - Mozilla spots 
>>>it instantly, but SpamAssassin is missing at least half.
>>
>>Here is a project that will export the Mozilla Bayes tokens which would 
>>at least be the first step.  I'm not sure how hard it would be to then 
>>import them into SA.
>>
>>http://bayesjunktool.mozdev.org/
>>
> 
> 
> The bayes backup/restore format is fairly stable and it is pretty easy
> to create a restore file from alternate sources (that is one of the
> reasons it was written).  It's possibly not documented as well as it
> should be, but no one has ever asked before so....
> 
> You will need the following bits of information:
> 
> 1) The Raw Token (which needs to be turned into an SHA1 and then into
> a hex representation, which is probably too simple of an explanation
> for what is actually going on, so probably needs some more detail and
> maybe a helper function in the SA code for those that might want to
> attempt such a thing, not to mention a period in this sentence
> somewhere.)
> 
> 2) The atime value for that token - SA bayes works off access times
>    for tokens, so you need to know the last time it was useful. In a
>    pinch you can use the current time, but that is not optimal.
> 
> 3) The ham count for the token
> 
> 4) The spam count for the token
> 
> 5) Number of spam msgs learned
> 
> 6) Number of ham msgs learned
> 
> 7) List of msg ids and whether they were learned as ham or spam (this
>    is optional, but leaving it out is not optimal since it would allow
>    re-learning of msgs, which could throw off your spam/ham counts)
> 
> Once you have all that, you throw it into a formatted restore file and
> then run sa-learn --restore and you are all set.
> 
> If someone has a dump of one of these files, and it's got all the
> required information I'd be happy to take a look to see how feasible
> it would be.

There are some examples in XML format here:

http://bayesjunktool.mozdev.org/installation.html

Here's a sample:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tokenfile SYSTEM "trainer_xml.dtd"><tokenfile>
	<good_msgs>38</good_msgs>
	<bad_msgs>320</bad_msgs>
	<token>
		<name>$</name>
		<good>4</good>
		<bad>18</bad>

	</token>
...


atimes and msgids are not included.
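
For anyone who wants to pull the counts out of such a file, here is a
rough Python sketch using only the standard library. It assumes the
export always has the shape of the sample above (good_msgs, bad_msgs,
then token elements with name/good/bad). The closing tags in SAMPLE are
added by me so the snippet parses, and parse_token_file is a made-up
name, not part of the bayesjunktool project.

```python
import xml.etree.ElementTree as ET

# The sample from this message, closed off so that it is well-formed.
SAMPLE = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE tokenfile SYSTEM "trainer_xml.dtd"><tokenfile>
	<good_msgs>38</good_msgs>
	<bad_msgs>320</bad_msgs>
	<token>
		<name>$</name>
		<good>4</good>
		<bad>18</bad>
	</token>
</tokenfile>
"""


def parse_token_file(xml_bytes):
    """Return (ham_msgs, spam_msgs, [(token, ham_count, spam_count), ...])."""
    root = ET.fromstring(xml_bytes)
    ham_msgs = int(root.findtext("good_msgs"))
    spam_msgs = int(root.findtext("bad_msgs"))
    tokens = [
        (tok.findtext("name"), int(tok.findtext("good")), int(tok.findtext("bad")))
        for tok in root.iter("token")
    ]
    return ham_msgs, spam_msgs, tokens


ham_msgs, spam_msgs, tokens = parse_token_file(SAMPLE)
print(ham_msgs, spam_msgs, tokens)
```

That covers the token counts and message totals; the atime and msgid
fields would still have to come from somewhere else.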

Re: Question about Bayes training - mozilla specifically

Posted by Michael Parker <pa...@pobox.com>.
On Mon, May 02, 2005 at 03:44:25PM -0500, Stuart Johnston wrote:
> Bookworm wrote:
> >I've read through the archives several times, and hoped that over the 
> >last year or so someone would build the functionality, or at least 
> >mention it one way or another - I haven't seen it.
> >
> >Is there any way to take an already trained Mozilla bayes structure and 
> >hand it directly off to SpamAssassin?  For me, at least, that would 
> >eliminate almost all of the spam my server is receiving - Mozilla spots 
> >it instantly, but SpamAssassin is missing at least half.
> 
> Here is a project that will export the Mozilla Bayes tokens which would 
> at least be the first step.  I'm not sure how hard it would be to then 
> import them into SA.
> 
> http://bayesjunktool.mozdev.org/
> 

The bayes backup/restore format is fairly stable and it is pretty easy
to create a restore file from alternate sources (that is one of the
reasons it was written).  It's possibly not documented as well as it
should be, but no one has ever asked before so....

You will need the following bits of information:

1) The Raw Token (which needs to be turned into an SHA1 and then into
a hex representation, which is probably too simple of an explanation
for what is actually going on, so probably needs some more detail and
maybe a helper function in the SA code for those that might want to
attempt such a thing, not to mention a period in this sentence
somewhere.)

2) The atime value for that token - SA bayes works off access times
   for tokens, so you need to know the last time it was useful. In a
   pinch you can use the current time, but that is not optimal.

3) The ham count for the token

4) The spam count for the token

5) Number of spam msgs learned

6) Number of ham msgs learned

7) List of msg ids and whether they were learned as ham or spam (this
   is optional, but leaving it out is not optimal since it would allow
   re-learning of msgs, which could throw off your spam/ham counts)

Once you have all that, you throw it into a formatted restore file and
then run sa-learn --restore and you are all set.

If someone has a dump of one of these files, and it's got all the
required information I'd be happy to take a look to see how feasible
it would be.
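
The steps above can be sketched in Python roughly as follows. Treat
every detail of the output as an assumption: the tab-separated layout
(a db_version line, num_spam and num_nonspam lines, then one "t" line
per token) and the choice to keep only the last 5 bytes of the SHA1
digest reflect what sa-learn --backup appears to write in SA 3.x, and
neither is confirmed anywhere in this thread. Run sa-learn --backup on a
working install and match its output before restoring anything. The
names sa_token_hex and restore_lines are made up. Note also that tokens
produced by another filter's tokenizer may not line up with the tokens
SA generates from the same mail.

```python
import hashlib
import time


def sa_token_hex(raw_token, keep_bytes=5):
    """SHA1 the raw token and hex-encode the tail of the digest.

    keep_bytes=5 and the UTF-8 encoding are assumptions; compare the
    result with a token line from a real sa-learn --backup dump.
    """
    digest = hashlib.sha1(raw_token.encode("utf-8")).digest()
    return digest[-keep_bytes:].hex()


def restore_lines(ham_msgs, spam_msgs, tokens, atime=None):
    """Yield restore-file lines for (token, ham_count, spam_count) tuples.

    With no atime available, the current time is used for every token,
    which works in a pinch but is not optimal. Msg ids are left out.
    """
    atime = int(time.time()) if atime is None else atime
    # Assumed header layout; verify against your own backup file.
    yield "v\t3\tdb_version # this must be the first line!!!"
    yield f"v\t{spam_msgs}\tnum_spam"
    yield f"v\t{ham_msgs}\tnum_nonspam"
    for name, ham_count, spam_count in tokens:
        yield f"t\t{spam_count}\t{ham_count}\t{atime}\t{sa_token_hex(name)}"


if __name__ == "__main__":
    # Values taken from the sample token file posted in this thread.
    for line in restore_lines(38, 320, [("$", 4, 18)]):
        print(line)
```

Writing those lines to a file and passing it to sa-learn --restore would
be the last step, ideally against a scratch database first.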

Michael

Re: Question about Bayes training - mozilla specifically

Posted by Bookworm <qm...@bkwm.com>.
Stuart Johnston wrote:

> Here is a project that will export the Mozilla Bayes tokens which 
> would at least be the first step.  I'm not sure how hard it would be 
> to then import them into SA.
>
> http://bayesjunktool.mozdev.org/
>
>
Wonderful!  Thank you very much.  I'll see about exporting my Mozilla 
bayes database into all of the various formats, and upload to one of my 
web sites.  Anyone that wants a copy, let me know and I'll give them the 
link(s). 

Michael - I'll directly email you the links if you want to look them 
over.  You obviously have more knowledge about it than I do.  (I'm 
surprised this hasn't come up before, with both being Bayesian filters - 
and there might be a lot of folks with several years' worth of filter 
built up that they could use to prime the pump of their SA install.)

Thanks!

Troy

Re: Question about Bayes training - mozilla specifically

Posted by Stuart Johnston <st...@ebby.com>.
Bookworm wrote:
> I've read through the archives several times, and hoped that over the 
> last year or so someone would build the functionality, or at least 
> mention it one way or another - I haven't seen it.
> 
> Is there any way to take an already trained Mozilla bayes structure and 
> hand it directly off to SpamAssassin?  For me, at least, that would 
> eliminate almost all of the spam my server is receiving - Mozilla spots 
> it instantly, but SpamAssassin is missing at least half.

Here is a project that will export the Mozilla Bayes tokens which would 
at least be the first step.  I'm not sure how hard it would be to then 
import them into SA.

http://bayesjunktool.mozdev.org/