Posted to users@spamassassin.apache.org by Steve <li...@webvivant.com> on 2005/05/30 18:28:21 UTC

Learning from spam - system-wide

I've been using SpamAssassin successfully for some time, but I have recently 
reconfigured my home network and could do with some help getting to grips 
with how best to use SA.

I have a Linux box (scoop) running Fetchmail, Postfix & Qpopper acting as our 
mail server. Mail is delivered into users' mail directories where .forward 
and Procmail run it through SA (2.64).

Now, scoop used to be my main desktop machine, too, and most of the spam that 
was coming in was targeted at me. So I ran sa-learn on that machine each day 
to learn from messages that I'd dropped into a 'new spam' folder.

Now things are a little different. Scoop is just a server and we have three 
workstations all receiving mail (and spam!).

What I'm having trouble understanding is this: when SA learns from new 
messages, are the benefits of this learning applied system-wide? Eg, if 
'steve' on scoop runs sa-learn, are the bayesian filters thus produced 
applied also when user 'trish' runs messages through SA? Or does only 'steve' 
benefit? (My reading on the subject suggests the latter).

What I'd like to do is have all users dropping 'new spam' messages into a 
common folder (easily done by having a symbolic link in each machine's Kmail 
folder pointing to a folder on a share on the server). I'd then set up a cron 
job to use these messages with sa-learn. But I'd want the filters created to 
benefit every user. Is this possible?

(BTW, I'm also running Amavisd-new for anti-virus on scoop).

@+
Steve

-- 
@+

Re: Learning from spam - system-wide

Posted by Bruno Delbono <br...@mail.ac>.

On Mon, 30 May 2005 13:40:44 -0700, Steve <li...@webvivant.com> wrote:

> On Monday 30 May 2005 19:25, mouss wrote:
>> run SA from amavisd, and run sa-learn with the same uid as amavisd.
>
> Okay, ignore my previous message. I'm working on getting amavisd to run SA.
> Currently, amavisd seems to be running as user 'vscan' (UID 65). How do I run
> sa-learn as this user and where would it put the bayesian DB?
>
> As you can see, I'm new to this stuff, so help is appreciated.

Make sure that the user vscan has a proper home directory set up.

finger vscan
cd ~vscan/.spamassassin

and make your changes there. You can set up a shell for vscan and simply use
"su - vscan" to do most of the admin work you might need to do.
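
The check described above can be sketched as a small shell function. This is only an illustration: the name check_sa_home and its optional second argument are made up here, and reading /etc/passwd directly assumes vscan is a local account (not one served from LDAP or NIS).

```shell
#!/bin/sh
# check_sa_home USER [PASSWD_FILE]
# Hypothetical helper: looks up the home directory recorded for USER and
# reports whether it and its .spamassassin subdirectory exist.
# Returns non-zero if anything is missing.
check_sa_home() {
    user=$1
    passwd_file=${2:-/etc/passwd}

    # Field 6 of a passwd entry is the home directory.
    home=$(awk -F: -v u="$user" '$1 == u { print $6; exit }' "$passwd_file")

    if [ -z "$home" ]; then
        echo "no passwd entry (or empty home) for $user"
        return 1
    fi
    if [ ! -d "$home" ]; then
        echo "home $home does not exist"
        return 1
    fi
    if [ ! -d "$home/.spamassassin" ]; then
        echo "home $home exists, but $home/.spamassassin is missing"
        return 1
    fi
    echo "ok: $home/.spamassassin"
}
```

Calling "check_sa_home vscan" as root should then say whether there is a place for the Bayes files to live before any learning is attempted.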

Re: Learning from spam - system-wide

Posted by mouss <us...@free.fr>.

Steve wrote:
> On Monday 30 May 2005 19:25, mouss wrote:
> 
>>run SA from amavisd, and run sa-learn with the same uid as amavisd.
> 
> 
> Okay, ignore my previous message. I'm working on getting amavisd to run SA. 
> Currently, amavisd seems to be running as user 'vscan' (UID 65). How do I run 
> sa-learn as this user and where would it put the bayesian DB?

as root, run
	su vscan -c "sa-learn ...."
make sure the message file or dir is read by root, not by vscan, since in 
a correct setup vscan doesn't have read permission here. so use 
something like (I'm typing over my nose here. check before use):
	# root opens each file; sa-learn (run as $amavisuser) reads it on stdin
	for f in `find "$spamfolder" -type f`; do
		su "$amavisuser" -c "sa-learn --spam ..." < "$f"
		mv "$f" "$killfolder"
	done
this assumes a maildir setup. mbox requires more work...

> 
> As you can see, I'm new to this stuff, so help is appreciated.
> 

A simple setup is to use imap and maildir format (courier-imap or 
dovecot). then tell your users to create some folders for sa-learn. for 
instance, they create Junk/Miss to move the missed messages and 
Junk/Innocent to copy legit messages classified as spam. feel free to 
create other folders for other things.

then have a script that runs sa-learn as vscan but again, the mail file 
isn't readable by vscan, so you'll need to read the maildir file by 
file and pass the output to sa-learn. while there is no problem 
chmod-ing the spam folder, this is less obvious for the ham folder.
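
A script along those lines could look roughly like the sketch below. It rests on several assumptions: the names learn_folder, SA_USER and DRY_RUN are invented for this example, the folder is a maildir with cur/ and new/ subdirectories, and sa-learn reads a single message from stdin as in the loop earlier in this message. With DRY_RUN=1 (the default here) it only prints what it would run; a real run has to be started as root so that root opens each file.

```shell
#!/bin/sh
# Sketch only: feed every message in a maildir folder to sa-learn running
# as the amavisd user. SA_USER, DRY_RUN and learn_folder are illustrative
# names, not part of SpamAssassin or amavisd.
SA_USER=${SA_USER:-vscan}   # uid that amavisd (and so SA) runs as
DRY_RUN=${DRY_RUN:-1}       # 1 = print commands, 0 = really learn

# learn_folder spam|ham MAILDIR_FOLDER DONE_DIR
learn_folder() {
    kind=$1
    folder=$2
    done_dir=$3

    case $kind in
        spam|ham) ;;
        *) echo "kind must be spam or ham" >&2; return 2 ;;
    esac

    # A maildir folder keeps its messages in cur/ and new/, one per file.
    for f in "$folder"/cur/* "$folder"/new/*; do
        [ -f "$f" ] || continue     # unmatched glob, or not a regular file
        if [ "$DRY_RUN" = 1 ]; then
            echo "su $SA_USER -c 'sa-learn --$kind' < $f"
        else
            # root opens the file; sa-learn reads the message on stdin
            su "$SA_USER" -c "sa-learn --$kind" < "$f" || continue
            mv "$f" "$done_dir"/
        fi
    done
}
```

It would be called once per folder, for example "learn_folder spam" on a user's Junk/Miss and "learn_folder ham" on Junk/Innocent. Where those folders sit on disk depends on the IMAP server and is not assumed here.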

of course, all this stuff assumes you want to use a site-wide bayes db. 
you need to be careful when using the classification of your users 
(unless you trust them to do the right classification). on the other 
hand, site-wide has the advantages of simplicity (only one db to care 
for), less storage, less cpu/ram (multi-rcpt mail gets parsed once), 
disposition coherence (in the case of multi-rcpt mail, the message is 
either spam or ham, it is not spam for a group and ham for others. the 
latter may cause problems like "but you've got that mail like I 
did..."), faster learning (gets more messages), and "spam-experience" 
sharing between the users. Now, a lot of people here (and google) will 
tell you the benefits of per-user db, so I'll stop here. It really 
depends on your situation.

Re: Learning from spam - system-wide

Posted by Steve <li...@webvivant.com>.

On Monday 30 May 2005 19:25, mouss wrote:
> run SA from amavisd, and run sa-learn with the same uid as amavisd.

Okay, ignore my previous message. I'm working on getting amavisd to run SA. 
Currently, amavisd seems to be running as user 'vscan' (UID 65). How do I run 
sa-learn as this user and where would it put the bayesian DB?

As you can see, I'm new to this stuff, so help is appreciated.

-- 
@+

Re: Learning from spam - system-wide

Posted by mouss <us...@free.fr>.

Steve wrote:
> I've been using SpamAssassin successfully for some time, but having recently 
> reconfigured my home network and could do with some help getting to grips 
> with how best to use SA.
> 
> I have a Linux box (scoop) running Fetchmail, Postfix & Qpopper acting as our 
> mail server. Mail is delivered into users' mail directories where .forward 
> and Procmail run it through SA (2.64).
> 
> Now, scoop used to be my main desktop machine, too, and most of the spam that 
> was coming in was targeted at me. So I ran sa-learn on that machine each day 
> to learn from messages that I'd dropped into a 'new spam' folder.
> 
> Now things are a little different. Scoop is just a server and we have three 
> workstations all receiving mail (and spam!).
> 
> What I'm having trouble understanding is this: when SA learns from new 
> messages, are the benefits of this learning applied system-wide? Eg, if 
> 'steve' on scoop runs sa-learn, are the bayesian filters thus produced 
> applied also when user 'trish' runs messages through SA? Or does only 'steve' 
> benefit? (My reading on the subject suggests the latter).

SA uses the bayes db for the uid that runs it. you probably have a uid 
per mailbox, which then means procmail/.forward run as different uids.
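
For the procmail setup in the question, where each mailbox runs SA under its own uid, another way to get those uids onto one database is SpamAssassin's bayes_path and bayes_file_mode settings. This is a different route from the amavisd one suggested in this reply and is sketched only for comparison. The directory /var/spool/spamassassin and the helper name are assumptions; check both options against the Mail::SpamAssassin::Conf documentation for your version.

```shell
#!/bin/sh
# print_shared_bayes_conf [DIR]
# Hypothetical helper: emits a local.cf fragment that points every uid at
# one Bayes database under DIR. DIR must be writable by all uids running SA.
print_shared_bayes_conf() {
    dir=${1:-/var/spool/spamassassin}
    cat <<EOF
# bayes_path is a directory plus filename prefix; files such as
# bayes_toks and bayes_seen are created side by side under $dir.
bayes_path $dir/bayes
# Mode for newly created database files, so other uids can update them.
bayes_file_mode 0777
EOF
}
```

The output would be appended to the site-wide local.cf, after which sa-learn run by any user updates the same files.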


> 
> What I'd like to do is have all users dropping 'new spam' messages into a 
> common folder (easily done by having a symbolic link in each machine's Kmail 
> folder pointing to a folder on a share on the server). I'd then set up a cron 
> job to use these messages with sa-learn. But I'd want the filters created to 
> benefit every user. Is this possible?
> 
> (BTW, I'm also running Amavisd-new for anti-virus on scoop).
>

run SA from amavisd, and run sa-learn with the same uid as amavisd.

Re: Learning from spam - system-wide

Posted by jdow <jd...@earthlink.net>.

From: "Steve" <li...@webvivant.com>

> On Tuesday 31 May 2005 05:24, jdow wrote:
> > Trish and Steve may have quite different concepts of "spam". Many of
> > the complaints about Bayes being ineffective seem to come from people
> > trying to use one master Bayes database.
>
> Ah! I'll confess that it hadn't occurred to me that using a centralised Bayes
> database might be a *bad* idea. I'm simply trying to simplify the whole SA
> setup as much as possible.
>
> Okay, here's another idea, and feel free to point out any stupidities...
> (caveat: this is a low-volume system, so performance is not an issue):
>
> The mail server scoop already has a user, with home dir, matching each user on
> the other machines. I create 'new-spam' and 'ham' dirs in each home dir on
> the server and place symbolic links to these from the Kmail directories from
> the client machines, so that when someone drops a new spam message into a
> directory they see as 'spam-new' in their Kmail dirs, it actually gets
> dropped into the dir on the server. Then I can run sa-learn as each
> individual user on the server. It's a bit more configuration, but maybe less
> effort overall (I just tried modifying my postfix master.cf and main.cf
> settings, and amavis.conf to try to get amavis working with SA but it screwed
> up - I *could* spend more time sorting this, but maybe the scheme above is
> simpler and, ultimately, a little more flexible). How does that sound?

My setup is probably more complex than is required. (A recent message
here moots some of the gyrations I take.) In essence I use fetchmail to
get to my server from my ISP. That is when SpamAssassin is run. For your
setup it sounds like this step is a simple sendmail receive through
SA and other tools.

I pull mail from my server via pop3. It happens I am using Outlook
Express on a Windows machine. That tool's file formats are indecipherable
so it is utterly impractical to try to use them for training. So I
loaded up a simple old IMAP (not the fancied up Cyrus things) for the
training tools. Of course, that means one per user. For each user I
create "ham", "spam", "old ham", and "old spam" directories. (I also
created "older ham" and "older spam" as a speedup trick. When the "old"
directories get kinda full I move them over to the "older" stack.)

Then I had to find a way to read the IMAP "mbox" format files with the
IMAP's initial mbox entry in the way. I played about 20 minutes with C
and built a little futility to drag off the messages from the raw IMAP
mbox file, tack them on to "old spam", and also save them to a "temp" spam
file. I feed the temp file to sa-learn. Repeat for ham and we're done.
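
The same steps can be approximated in shell rather than C. The sketch below is not the utility described above; it assumes the server's internal entry is always the first message in the mbox file and that messages are delimited by lines starting with "From ", as in the common mbox variants. The function names are invented for this example.

```shell
#!/bin/sh
# strip_first_message MBOX
# Prints MBOX without its first message. Each message starts at a line
# beginning with "From " (body lines that look like that are stored
# escaped as ">From "), so counting those lines finds message 2.
strip_first_message() {
    awk '/^From / { n++ } n >= 2 { print }' "$1"
}

# harvest MBOX ARCHIVE TEMP
# Hypothetical wrapper: write the real messages to TEMP for training and
# append the same messages to ARCHIVE for later reference.
harvest() {
    strip_first_message "$1" > "$3" || return 1
    cat "$3" >> "$2"
}
```

The temp file can then be handed to sa-learn with its --mbox option. If a file has no internal entry at the top, this sketch would drop a real message, so it is worth checking the first message before relying on it.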

Someone posted a utility recently to simply download from the IMAP
server and sa-learn on the fly. A little tweak on that would put the
processed messages into the "old spam" file so that you have them for
future reference if ever needed.

The only spam I place into the IMAP spam directory is the missed spam
that needs to be used for training, of course. And periodically I toss
bunches of ham into the IMAP ham directory for roughly balanced training.

{^_^}

Re: Learning from spam - system-wide

Posted by Steve <li...@webvivant.com>.

On Tuesday 31 May 2005 05:24, jdow wrote:
> Trish and Steve may have quite different concepts of "spam". Many of
> the complaints about Bayes being ineffective seem to come from people
> trying to use one master Bayes database.

Ah! I'll confess that it hadn't occurred to me that using a centralised Bayes 
database might be a *bad* idea. I'm simply trying to simplify the whole SA 
setup as much as possible.

Okay, here's another idea, and feel free to point out any stupidities... 
(caveat: this is a low-volume system, so performance is not an issue):

The mail server scoop already has a user, with home dir, matching each user on 
the other machines. I create 'new-spam' and 'ham' dirs in each home dir on 
the server and place symbolic links to these from the Kmail directories from 
the client machines, so that when someone drops a new spam message into a 
directory they see as 'spam-new' in their Kmail dirs, it actually gets 
dropped into the dir on the server. Then I can run sa-learn as each 
individual user on the server. It's a bit more configuration, but maybe less 
effort overall (I just tried modifying my postfix master.cf and main.cf 
settings, and amavis.conf to try to get amavis working with SA but it screwed 
up - I *could* spend more time sorting this, but maybe the scheme above is 
simpler and, ultimately, a little more flexible). How does that sound?
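
The per-user scheme above could be driven from cron with something like the following sketch. HOME_BASE, DRY_RUN and learn_user are names invented for this example, and it assumes sa-learn accepts a directory of messages as its argument. How the folder is laid out on disk depends on the Kmail folder format, which is not stated here. With DRY_RUN=1 (the default) it only prints the commands.

```shell
#!/bin/sh
# Sketch only: run sa-learn once per user over ~/new-spam and ~/ham.
# HOME_BASE, DRY_RUN and learn_user are illustrative names.
HOME_BASE=${HOME_BASE:-/home}
DRY_RUN=${DRY_RUN:-1}      # 1 = print commands, 0 = really learn

# learn_user USER
learn_user() {
    user=$1
    home=$HOME_BASE/$user

    for pair in spam:new-spam ham:ham; do
        kind=${pair%%:*}       # sa-learn mode: spam or ham
        dir=$home/${pair#*:}   # folder the user drops messages into
        [ -d "$dir" ] || continue
        if [ "$DRY_RUN" = 1 ]; then
            echo "su $user -c 'sa-learn --$kind $dir'"
        else
            # Runs as the user, so the tokens land in that user's own
            # ~/.spamassassin database.
            su "$user" -c "sa-learn --$kind '$dir'"
        fi
    done
}
```

A root cron job would then loop over the accounts, for example calling learn_user for steve and for trish, and each user's database is trained only on that user's own messages.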

-- 
@+

Re: Learning from spam - system-wide

Posted by jdow <jd...@earthlink.net>.

From: "Steve" <li...@webvivant.com>

> What I'm having trouble understanding is this: when SA learns from new
> messages, are the benefits of this learning applied system-wide? Eg, if
> 'steve' on scoop runs sa-learn, are the bayesian filters thus produced
> applied also when user 'trish' runs messages through SA? Or does only 'steve'
> benefit? (My reading on the subject suggests the latter).

Trish and Steve may have quite different concepts of "spam". Many of
the complaints about Bayes being ineffective seem to come from people
trying to use one master Bayes database.

{^_^}