Posted to users@spamassassin.apache.org by Peter Marshall <pe...@caris.com> on 2005/04/07 14:29:38 UTC

WHich is better

I am looking for opinions.

I have been building a new mail server to replace my old one.
The new one runs Postfix, Cyrus IMAP, Anomy, and SpamAssassin.  I am trying
to set up the Bayes auto-learn stuff.  Each user has a home directory on
the server (they cannot log on to the server).  I am using the Maildir
format.

Is it better to have a cron job run by a single user (say root) to do
the ham / spam learning for everyone, or should I run a cron job for each
individual user?  All users belong to the same company.

A problem I have thought of with the latter:
1.  There would be approximately 130 cron jobs running sa-learn at the
same time ... or it would run constantly if I staggered it for every
user.  What kind of load would that put on my 850 with 756 MB of RAM?

Problems I have with both:
1.  What is the best method of obtaining the spam / ham?  I have the
server create a spam folder for each user when the user is created.
SpamAssassin will automatically put all mail marked as spam in this
folder.  Obviously I will use this folder to run sa-learn on for spam.  I
will also instruct users to move spam that was not marked as spam to
this folder.  My problem is, where do I run sa-learn for ham?
If I run it on the INBOX, then I could potentially be running it on spam
that has not yet been moved to the spam directory.

2. How often should I run sa-learn?  Users here for the most part get
mail in their inbox and then, after reading it, move it to some other
subfolder (everyone's set of folders is different, and some have over 100).


Are there any downsides to running a site-wide one?  What is the best
method of doing this, if it is the better approach?  Currently I plan to
use the following to learn the spam.  Does anyone see any problems?
(Note: this assumes it is being run as a particular user.)

/usr/bin/sa-learn --spam --dir ~/Maildir/.Spam/new
/usr/bin/sa-learn --spam --dir ~/Maildir/.Spam/cur
mv ~/Maildir/.Spam/new/* ~/Maildir/.Trash
mv ~/Maildir/.Spam/cur/* ~/Maildir/.Trash

Thanks for the input,

Peter

-- 
Peter Marshall, BCS
System Administrator, CARIS

CARIS 2005 - Mapping a Seamless Society
10th International User Group Conference and Educational Sessions
Halifax, NS, Canada
E-mail caris2005@caris.com for more.

Re: WHich is better

Posted by Edward Shornock <ed...@crazeecanuck.homelinux.net>.
Peter Marshall wrote:

> I am looking for opinions.
>
> I have been building a new mail server to replace my old one.
> The new one runs Postfix, Cyrus IMAP, Anomy, and SpamAssassin.  I am trying
> to set up the Bayes auto-learn stuff.  Each user has a home directory
> on the server (they cannot log on to the server).  I am using the
> Maildir format.

Doesn't Cyrus IMAP use its own spool format, or am I mistaken?  From
all of the documentation that I've read, you are NOT supposed to access
the Cyrus mail spool directly....

Re[2]: WHich is better

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Peter,

Friday, April 8, 2005, 4:24:50 AM, you wrote:

PM> Hi Robert,

PM> Thank you very much for your detailed reply.  It was very helpful.  I
PM> just have one question.  Why can you not run sa-learn on spam already
PM> flagged as spam?  ...

You can.  I do.

My email system captures almost all emails that pass through it, and I
store those in "confirmed ham", "confirmed spam", "likely ham",
"likely spam", and "undetermined" buckets. **ALL** emails in the two
confirmed buckets are manually fed to sa-learn, regardless of whether
they were auto-learned.

PM> I thought SpamAssassin would rip out any headers it
PM> already added.  If that is the case, then what is the harm in re-learning
PM> the spam as spam ...

You are right about that.  There's no harm, and indeed, usually no
re-learning (emails already known as spam will not be re-learned as
spam -- they'll be ignored rather than processed again).

Your question wasn't about re-learning. My caution was to make sure
that everything that went into sa-learn was manually determined to be
either spam or not-spam by some human. Do not automatically sa-learn
anything -- have a human make that determination.

If you automatically sa-learn emails other than the conservative
auto-learn used by SA, you very likely /will/ garbage up your Bayes
database, causing it to mis-classify emails.

Bob Menschel

Re: WHich is better

Posted by Peter Marshall <pe...@caris.com>.
Hi Robert,

Thank you very much for your detailed reply.  It was very helpful.  I
just have one question.  Why can you not run sa-learn on spam already
flagged as spam?  I thought SpamAssassin would rip out any headers it
already added.  If that is the case, then what is the harm in re-learning
the spam as spam?  (I am just asking, not trying to argue; just
curious.)

Thank you again for your help,

Peter



Re: WHich is better

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Peter,

Thursday, April 7, 2005, 5:29:38 AM, you wrote:

PM> I have been building a new mail server to replace my old one.
PM> The new one runs Postfix, Cyrus IMAP, Anomy, and SpamAssassin.  I am trying
PM> to set up the Bayes auto-learn stuff.  Each user has a home directory on
PM> the server (they cannot log on to the server).  I am using the Maildir
PM> format.

PM> Is it better to have a cron job run by a single user (say root) to do
PM> the ham / spam learning for everyone, or should I run a cron job for each
PM> individual user?  All users belong to the same company.

Best, if you have the disk space for the multitude of Bayes databases,
is to run ham/spam learning as each user. I'd recommend the "run
constantly, staggered across users" approach, something like:
- run as cron
- get cycle start time
- identify list of active users
- for each active user
  - determine if anything to learn; skip to next user if not
  - su to that user's id
  - sa-learn
- if not yet 30 min since start of this cycle, sleep 15 min
- loop to next cycle.
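
The cycle outlined above could be sketched in shell roughly as follows.
This is only a sketch under assumptions that are not stated in the thread:
home directories live under a hypothetical MAIL_ROOT, the per-user folders
are named .missed-spam and .not-spam, paths contain no whitespace, su
accepts -s (the users have no login shell), and date supports +%s. The
helper names (has_mail, run_as, learn_cycle, pause_after, run_forever) are
made up for illustration.

```shell
#!/bin/sh
# Sketch of a staggered, one-user-at-a-time learning cycle (run as root).
# MAIL_ROOT, the folder names and all helper names are assumptions.

MAIL_ROOT="${MAIL_ROOT:-/home}"
SA_LEARN="${SA_LEARN:-sa-learn}"

# Succeeds if the given Maildir folder holds at least one message file.
has_mail() {
    for f in "$1"/new/* "$1"/cur/*; do
        if [ -f "$f" ]; then return 0; fi
    done
    return 1
}

# Run a command as another user. Assumes an su that supports -s and
# arguments without whitespace.
run_as() {
    _user="$1"; shift
    su -s /bin/sh "$_user" -c "$*"
}

# One pass over all users; users with nothing to learn are skipped.
learn_cycle() {
    for home in "$MAIL_ROOT"/*; do
        [ -d "$home/Maildir" ] || continue
        user=$(basename "$home")
        spam="$home/Maildir/.missed-spam"
        ham="$home/Maildir/.not-spam"
        if has_mail "$spam"; then
            run_as "$user" "$SA_LEARN" --spam "$spam/cur" "$spam/new"
        fi
        if has_mail "$ham"; then
            run_as "$user" "$SA_LEARN" --ham "$ham/cur" "$ham/new"
        fi
    done
}

# pause_after START NOW: print 900 (15 min) if the cycle took less than
# 30 minutes, otherwise 0 so the next cycle starts immediately.
pause_after() {
    if [ $(( $2 - $1 )) -lt 1800 ]; then echo 900; else echo 0; fi
}

# Loop forever: learn, then sleep as decided by pause_after.
run_forever() {
    while :; do
        start=$(date +%s)
        learn_cycle
        sleep "$(pause_after "$start" "$(date +%s)")"
    done
}
```

Starting run_forever from a single root-owned job keeps the load to one
sa-learn process at a time, which is the point of staggering on a small
machine.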

PM> A problem I have thought of with the latter:
PM> 1.  There would be approximately 130 cron jobs running sa-learn at the
PM> same time ... or it would run constantly if I staggered it for every
PM> user.  What kind of load would that put on my 850 with 756 MB of RAM?

Running constantly, staggered, will work better on that system (IMO)
than allowing multiple executions at the same time.

PM> Problems I have with both:
PM> 1.  What is the best method of obtaining the spam / ham?  I have the
PM> server create a spam folder for each user when the user is created.
PM> SpamAssassin will automatically put all mail marked as spam in this
PM> folder.  Obviously I will use this folder to run sa-learn on for spam.

NO. NO. NO. NO.

Do not run sa-learn on automatically flagged emails. SA does this
itself somewhat conservatively (though not conservatively enough --
I suggest lowering the ham auto-learn threshold).
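
For reference, the auto-learn thresholds are ordinary SpamAssassin
configuration settings. The fragment below is a sketch only: the file it
goes in (for example local.cf) and the -1.0 value are assumptions chosen
for illustration, not values recommended in this thread.

```
# Keep SpamAssassin's own conservative auto-learning enabled.
bayes_auto_learn 1
# Only auto-learn as ham when the score is below this value.
# The stock default is 0.1; -1.0 is an illustrative, stricter choice.
bayes_auto_learn_threshold_nonspam -1.0
# Only auto-learn as spam when the score is above this value
# (12.0 is the stock default).
bayes_auto_learn_threshold_spam 12.0
```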

Provide instead a "missed-spam" folder and a "not-spam" folder. Have
your people copy/move miscategorized emails into those, and learn from
those folders.
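
A minimal shell sketch of that setup for a single user follows. The
folder names come from the paragraph above; the MAILDIR location, the
leading dot on the folder names, and the function names are assumptions
for illustration.

```shell
#!/bin/sh
# Sketch: learn only from mail that a person sorted by hand.
# MAILDIR and the function names are assumptions, not from the thread.

MAILDIR="${MAILDIR:-$HOME/Maildir}"
SA_LEARN="${SA_LEARN:-sa-learn}"

# Create the two correction folders with the usual Maildir layout.
make_training_folders() {
    for name in .missed-spam .not-spam; do
        mkdir -p "$MAILDIR/$name/cur" \
                 "$MAILDIR/$name/new" \
                 "$MAILDIR/$name/tmp" || return 1
    done
}

# Feed the hand-sorted folders to sa-learn. The automatically filled
# Spam folder is deliberately left out.
learn_corrections() {
    "$SA_LEARN" --spam "$MAILDIR/.missed-spam/cur" "$MAILDIR/.missed-spam/new" &&
    "$SA_LEARN" --ham "$MAILDIR/.not-spam/cur" "$MAILDIR/.not-spam/new"
}
```

Nothing is deleted here, so mail that users copied into not-spam stays
where they put it; sa-learn keeps track of messages it has already
learned and skips them on later runs.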

PM> 2. How often should I run sa-learn?  Users here for the most part get
PM> mail in their inbox and then, after reading it, move it to some other
PM> subfolder (everyone's set of folders is different, and some have over 100).

On single-domain systems I normally run it hourly.

PM> Are there any downsides to running a site-wide one?  What is the best
PM> method of doing this, if it is the better approach?  Currently I plan to
PM> use the following to learn the spam.  Does anyone see any problems?
PM> (Note: this assumes it is being run as a particular user.)

Some people prefer system-wide, others domain-wide, others
user-specific.  YMMV. Feasibility might be the more important
criterion, since all three can work.

Bob Menschel