Posted to users@spamassassin.apache.org by Chavdar Videff <ch...@mr-bricolage.bg> on 2005/07/11 12:40:14 UTC

simultaneous sa-learn processes

Hi List,

Our mail server serves about 100 users. Our config: 
Sendmail+Procmail+SpamAssassin.
The question is:
If I got it right, we should run sa-learn for each user in order to benefit 
from bayes. We intend to run a cron job for each user and do it at night by 
supplying a daily snapshot of our spam and ham collections to sa-learn.
Can our mail server handle it (256 MB RAM, Celeron 400 MHz)?
A weekly collection run for one user usually drives the CPU to 100%. My 
concern is whether the system is going to crash or just do the job more 
slowly, and how many sa-learn tasks we could run simultaneously with our 
setup.
All hints will be appreciated, since we have scheduled an initial load for 
16 users of the big collection of spam received so far.

Thanks guys

Chavdar Videff

Re: simultaneous sa-learn processes

Posted by jdow <jd...@earthlink.net>.
From: "Chavdar Videff" <ch...@mr-bricolage.bg>

> On Monday 11 July 2005 14:50, JamesDR wrote:
> > Chavdar Videff wrote:
> > > Hi List,
> > >
> > > Our mailserver server serves about 100 users. Our config:
> > > Sendmail+Procmail+SpamAssassin.
> > > The question is:
> > > If I got it right, we should run sa-learn for each user in order to
> > > benefit from bayes. We intend to run a cron job for each user and do it
> > > at night by supplying a daily snapshot of our spam and ham collections to
> > > sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
> > > A weekly collection run for 1 user usually eats 100% of CPU load. My
> > > concern is whether the system is going to crash or just do the job slower
> > > and if you can point out how many sa-learn tasks could we run
> > > simultaneously with our setup.
> > > All hints will be appreciated, for we scheduled an initial load for 16
> > > users of the big collection of spam received so far.
> > >
> > > Thanks guys
> > >
> > > Chavdar Videff
> >
> > What kind of Bayes db are you using? We use MySQL here and haven't seen
> > SA-Learn use up that much cpu... I've run it manually up to 10 processes
> > at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)
>
> I guess it is BerkeleyDB, the default installation on Debian. The ineteresting
> part is that while testing cron on one user the cpu fall was not noticeable.

If you are feeding individual per-user Bayes databases, feed each one with
the ham samples and spam samples submitted by that particular user for HER
Bayes. If you have them all working off the same Bayes corpus, then there
is little or no gain from using per-user Bayes.

{^_^}



Re: simultaneous sa-learn processes

Posted by Chavdar Videff <ch...@mr-bricolage.bg>.
On Monday 11 July 2005 14:50, JamesDR wrote:
> Chavdar Videff wrote:
> > Hi List,
> >
> > Our mailserver server serves about 100 users. Our config:
> > Sendmail+Procmail+SpamAssassin.
> > The question is:
> > If I got it right, we should run sa-learn for each user in order to
> > benefit from bayes. We intend to run a cron job for each user and do it
> > at night by supplying a daily snapshot of our spam and ham collections to
> > sa-learn. Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
> > A weekly collection run for 1 user usually eats 100% of CPU load. My
> > concern is whether the system is going to crash or just do the job slower
> > and if you can point out how many sa-learn tasks could we run
> > simultaneously with our setup.
> > All hints will be appreciated, for we scheduled an initial load for 16
> > users of the big collection of spam received so far.
> >
> > Thanks guys
> >
> > Chavdar Videff
>
> What kind of Bayes db are you using? We use MySQL here and haven't seen
> SA-Learn use up that much cpu... I've run it manually up to 10 processes
> at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)

I guess it is BerkeleyDB, the default installation on Debian. The interesting 
part is that while testing cron on one user, the impact on the CPU was not 
noticeable. 

Chavdar Videff

RE: simultaneous sa-learn processes

Posted by Sander Holthaus - Orange XL <in...@orangexl.com>.
JamesDR wrote:
> Chavdar Videff wrote:
>> Hi List,
>> 
>> Our mailserver server serves about 100 users. Our config:
>> Sendmail+Procmail+SpamAssassin.
>> The question is:
>> If I got it right, we should run sa-learn for each user in order to
>> benefit from bayes. We intend to run a cron job for each user and do
>> it at night by supplying a daily snapshot of our spam and ham
>> collections to sa-learn. Can our mailserver handle it (256 MB RAM,
>> Celeron 400 Mhz)?

Why would you want to set up Bayes on a per-user basis if you are going to
feed it system-wide ham and spam? Feeding it system-wide ham is especially
odd.
 
>> A weekly collection run for 1 user usually eats 100% of CPU load. My
>> concern is whether the system is going to crash or just do the job
>> slower and if you can point out how many sa-learn tasks could we run
>> simultaneously with our setup.

Systems shouldn't crash under high load, so that's not a real concern. If it
does happen, you have a more serious problem elsewhere. What would be more
of a concern is how it is going to affect other processes running on your
system. Slower is not a problem, but if you really put the load on your box
from a lot of processes, you might start seeing timeouts.

>> All hints will be appreciated, for we scheduled an initial load for
>> 16 users of the big collection of spam received so far.

If you are going to simultaneously learn spam and ham for 16 users, and
want to keep running your mail server/SpamAssassin too (I take it you also
have a virus scanner running somewhere), I would consider at least running
the sa-learn processes under nice to keep them from stalling more essential
services. But, depending on your system setup (OS, DB, etc.), you might want
to cut down a little on the number of processes run simultaneously. 
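
A minimal sketch of the two suggestions above (run sa-learn under nice, and
cap how many run at once), written in Python. The helper names
niced_sa_learn and run_capped, the niceness of 19, and the cap of two
parallel jobs are assumptions for illustration only; nothing in this thread
prescribes those values. The only external commands assumed are nice and
sa-learn with its --spam / --ham switches.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def niced_sa_learn(kind, path, niceness=19):
    """Build an sa-learn command line wrapped in nice(1).

    kind is "spam" or "ham"; path is a message file or directory.
    The default niceness of 19 (lowest priority) is an assumption.
    """
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["nice", "-n", str(niceness), "sa-learn", "--" + kind, path]


def run_capped(commands, max_parallel=2, runner=subprocess.call):
    """Run commands with at most max_parallel of them active at once.

    Returns the exit codes in the same order as the commands.
    runner is injectable so the scheduling can be tried without sa-learn.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(runner, commands))
```

With max_parallel=1 this degenerates to a plain sequential run, which may
well be the right starting point on a small box; raise the cap only if mail
delivery stays responsive.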

>> 
>> Thanks guys
>> 
>> Chavdar Videff
>> 
>> 
> What kind of Bayes db are you using? We use MySQL here and
> haven't seen SA-Learn use up that much cpu... I've run it
> manually up to 10 processes at once without any noticeable
> slowing of the machine. (p2 450mhz, 256mb)



Re: simultaneous sa-learn processes

Posted by JamesDR <ja...@trusswood.net>.
Chavdar Videff wrote:
> Hi List,
> 
> Our mailserver server serves about 100 users. Our config: 
> Sendmail+Procmail+SpamAssassin.
> The question is:
> If I got it right, we should run sa-learn for each user in order to benefit 
> from bayes. We intend to run a cron job for each user and do it at night by 
> supplying a daily snapshot of our spam and ham collections to sa-learn.
> Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
> A weekly collection run for 1 user usually eats 100% of CPU load. My concern 
> is whether the system is going to crash or just do the job slower and if you 
> can point out how many sa-learn tasks could we run simultaneously with our 
> setup.
> All hints will be appreciated, for we scheduled an initial load for 16 users 
> of the big collection of spam received so far.
> 
> Thanks guys
> 
> Chavdar Videff
> 
> 
What kind of Bayes db are you using? We use MySQL here and haven't seen 
SA-Learn use up that much cpu... I've run it manually up to 10 processes 
at once without any noticeable slowing of the machine. (p2 450mhz, 256mb)

-- 
Thanks,
James


Re: simultaneous sa-learn processes

Posted by Kai Schaetzl <ma...@conactive.com>.
Chavdar Videff wrote on Mon, 11 Jul 2005 16:13:44 +0300:

> If there is a way to set up a single bayes database I would prefer that

There is one, just look in the SA documentation. (documentation for 
local.cf should do.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org




Re: simultaneous sa-learn processes

Posted by Chavdar Videff <ch...@mr-bricolage.bg>.
On Monday 11 July 2005 15:31, Kai Schaetzl wrote:
> Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300:
> > If I got it right, we should run sa-learn for each user in order to
> > benefit from bayes. We intend to run a cron job for each user and do it
> > at night by supplying a daily snapshot of our spam and ham collections to
> > sa-learn.
>
> Do I understand you correctly? You use Bayes for each user, but you want to
> sa-learn each of them the same daily corpus? This means the only difference
> in the user's Bayes db's will be auto-learned mail or mail learned by those
> users (if anything of that is possible/allowed with your setup). Doesn't
> look too useful to me. If most of the db content is the same then you could
> just use a site-wide db. Also, Bayes gets better the more mail it gets. If
> your users don't get many mail their individual Bayes db's won't be very
> effective. I'm all for using site-wide Bayes unless you users get really a
> lot of mail (I'd say at least 100 mails per user per day).
>
> Kai
I thought it was installed site-wide; however, the only Bayes DBs I find on 
the system are in each user's ~/.spamassassin folder. And indeed, the only 
way I can make Bayes learn is by teaching it on a per-user basis. For quite a 
few months I collected spam and fed it to sa-learn, and finally, reading this 
list, realized that all I did was teach root's database. Everybody else did 
not benefit from Bayes, which was screwed up because it had auto-learned a 
lot of spam as ham. 
If there is a way to set up a single Bayes database I would prefer that, for 
the scenario I am posting about does not make me happy (running 100 sa-learns 
at night).
Thanks
Chavdar


Re: simultaneous sa-learn processes

Posted by Kai Schaetzl <ma...@conactive.com>.
Chavdar Videff wrote on Mon, 11 Jul 2005 13:40:14 +0300:

> If I got it right, we should run sa-learn for each user in order to benefit 
> from bayes. We intend to run a cron job for each user and do it at night by 
> supplying a daily snapshot of our spam and ham collections to sa-learn.

Do I understand you correctly? You use Bayes for each user, but you want to 
sa-learn each of them the same daily corpus? This means the only difference in 
the users' Bayes DBs will be auto-learned mail or mail learned by those users 
(if any of that is possible/allowed with your setup). Doesn't look too 
useful to me. If most of the DB content is the same then you could just use a 
site-wide DB. Also, Bayes gets better the more mail it gets. If your users 
don't get much mail, their individual Bayes DBs won't be very effective. I'm 
all for using site-wide Bayes unless your users really get a lot of mail (I'd 
say at least 100 mails per user per day).

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org




Re: simultaneous sa-learn processes

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Chavdar,

Monday, July 11, 2005, 3:40:14 AM, you wrote:

CV> Hi List,

CV> Our mailserver server serves about 100 users. Our config: 
CV> Sendmail+Procmail+SpamAssassin.
CV> The question is:
CV> If I got it right, we should run sa-learn for each user in order to benefit
CV> from bayes. We intend to run a cron job for each user and do it at night by
CV> supplying a daily snapshot of our spam and ham collections to sa-learn.
CV> Can our mailserver handle it (256 MB RAM, Celeron 400 Mhz)?
CV> A weekly collection run for 1 user usually eats 100% of CPU load. My concern
CV> is whether the system is going to crash or just do the job slower and if you
CV> can point out how many sa-learn tasks could we run simultaneously with our
CV> setup.
CV> All hints will be appreciated, for we scheduled an initial load for 16 users
CV> of the big collection of spam received so far.

As indicated in another email, doing a user-level learn of system-wide
collected ham/spam doesn't make much sense.  And if you take your
current system-wide collection and sa-learn it 100 times, you'll use
100 times more resources than learning it once.

On the other hand, if you meant that you'd sa-learn each individual
user's ham/spam for that user only, then move to the next, then
provided you do these one after the other sequentially (not all 100 at
once), you should not increase your system load at all.  (You will
increase your disk storage, since each user's database will take up
some disk space.)
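
The sequential, per-user approach described above could look roughly like
this in Python. Everything about the layout is an assumption made for the
sketch: a hypothetical corpus directory with one ham and one spam
subdirectory per user, and sudo -u <user> -H as the way to run sa-learn as
the owning user so that it writes to that user's own ~/.spamassassin
database. A per-user crontab would serve the same purpose.

```python
import subprocess


def per_user_commands(user, corpus_root):
    """Return the two sa-learn commands for one user's own ham and spam.

    Assumed (hypothetical) layout: <corpus_root>/<user>/ham and
    <corpus_root>/<user>/spam. sudo -u USER -H runs sa-learn as that
    user with HOME set, so the user's own Bayes database is updated.
    """
    return [
        ["sudo", "-u", user, "-H", "sa-learn", "--" + kind,
         "{}/{}/{}".format(corpus_root, user, kind)]
        for kind in ("ham", "spam")
    ]


def learn_sequentially(users, corpus_root, runner=subprocess.call):
    """Learn one user after the other, never two at the same time.

    Returns {user: [ham_exit_code, spam_exit_code]}.
    """
    results = {}
    for user in users:
        results[user] = [runner(cmd)
                         for cmd in per_user_commands(user, corpus_root)]
    return results
```

Because only one sa-learn runs at any moment, the peak load is that of a
single run; the job simply takes longer as users are added.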

As discussed in a couple of Bugzilla entries, you should probably
limit the size of your sa-learn runs -- limit them to a few hundred
emails at a time, or maybe a few meg combined size. A massive sa-learn
run of thousands of emails, dozens of meg in one run, can bring a
resource-limited system to its knees.
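
One way to keep each run small, sketched in Python: split the collection
into batches capped by message count and by combined size, then hand each
batch to a separate sa-learn run. The limits used here (300 messages, 4 MB)
are assumptions chosen to match "a few hundred emails" and "a few meg"; the
function name batches is hypothetical.

```python
def batches(messages, max_count=300, max_bytes=4 * 1024 * 1024):
    """Group messages into batches bounded by count and total size.

    messages is an iterable of (path, size_in_bytes) pairs.
    Yields lists of paths. A single message larger than max_bytes
    still gets a batch of its own rather than being dropped.
    """
    batch, total = [], 0
    for path, size in messages:
        # Close the current batch if adding this message would exceed a limit.
        if batch and (len(batch) >= max_count or total + size > max_bytes):
            yield batch
            batch, total = [], 0
        batch.append(path)
        total += size
    if batch:
        yield batch
```

The (path, size) pairs can be collected with os.path.getsize over the spam
or ham directory; each yielded list is then the argument list for one
sa-learn --spam or --ham invocation.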

Bob Menschel