Posted to users@spamassassin.apache.org by Linda Walsh <sa...@tlinx.org> on 2005/10/29 06:33:55 UTC

3.1 vs. 2.6x & 3.0x: Good; when to SQL; RFE's (to dev?)

Finally got the kinks worked out in my SA-3.1 setup last week. Filtered
out over 420 spams -- maybe 1 false positive, and it was borderline.

The speed of sa-learn has dropped, but that may be unavoidable.  But
I'm finally getting spam recognition at least as good as I had in 2.63.

I have no online tests enabled, as the online test databases are going
the way of "cddb"...becoming privatized. Sorta sad...maybe time to
start a "freezor" or some similar service.  I mean the spam services
collect data about what is spam from users who use the database.  Without
the users, they wouldn't be nearly as effective.  Yet the users then are
encouraged to pay to access the body of data that was previously donated
for free.

I suppose one could look at the cost of "aggregation" and intelligent
processing of 1000's of user-spam inputs into a usable output format,
and while it might be manageable for a small community of users, it's
not so manageable if the database starts being used by a much larger
user-base than the original system was designed to run on.

Still -- I have yet to look at what is needed to convert my "db"s into
SQL form -- been sorta busy: car got crashed into last week and
was told this week it's totalled, that and was informed Tuesday
of a need for a root canal, on Wednesday, informed of need for 2nd
root canal & oral surgery.  *smile*  Life is just so _*!%fun!*%)_.

So am a bit behind in being on top of my ->SQL based conversion (I'm
assuming I'm in an older format; I just ran the convert tool to convert
from 2.x format to 3.x).

Assuming it is some sort of berkeley db format, what is a good
cut-over size as a "rule-of-thumb"...or is there?  What should I
expect in speeds  for "sa-learn" or spamc?  I.e. -- is there a
rough guideline for when it becomes more effective to use SQL
vs. the Berkeley DB?  Or rephrased, when is it worth the effort to
convert to SQL and ensure all the SQL software is set up and running?
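
Lacking a rule of thumb, one way to decide is to measure on your own
mail. Below is a small sketch in Python that times an arbitrary command,
for example an sa-learn run against a folder of spam, so the same run can
be compared before and after switching backends. time_command is a
hypothetical helper made up for this illustration, not a SpamAssassin
tool, and the folder path in the note below is a placeholder.

```python
import subprocess
import time


def time_command(argv, repeats=1):
    """Run argv `repeats` times; return the wall-clock seconds of each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a broken run is never silently counted as a fast one.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return timings
```

Usage would look like time_command(["sa-learn", "--spam",
"/path/to/spam-folder"]). As far as I know sa-learn skips messages it has
already learned, so a second run over the same folder will look much
faster than the first; compare first runs against fresh databases.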

Thanks...and thanks for the help/patience

BTW -- maybe this should go to the "sa-dev" list, but an RFE:

"spamassassin --lint":

   1) would be nice to mention if daemon is _RUNNING_ and  ready
to process messages; (user error: forgetting to restart daemon and
seeing no "--lint" message hinting that the daemon isn't running and
ready to process incoming mail--*duh*)
   2) Would be nice, especially in "--lint" to check for bogus
lock files left around in spam DB dir.  I don't know when these files
are used, but their presence really slows down sa-learn by about a
factor of 4-6x.

"sa-learn":
   1) RFE: have sa-learn issue warning about pre-existing lock-files,
or, better,  auto-remove bogus locks for processes that no longer exist.
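
A minimal sketch of the stale-lock check asked for above, in Python. It
assumes (nothing in this thread confirms it) that per-process lock files
are named "<db>.lock.<hostname>.<pid>"; adjust the pattern if the files in
your spam DB dir look different. find_stale_locks and pid_is_alive are
hypothetical helpers, not part of SpamAssassin. The sketch only reports
candidates and deletes nothing.

```python
import os
import re
from pathlib import Path

# ASSUMPTION: per-process lock files are named "<db>.lock.<hostname>.<pid>".
LOCK_NAME = re.compile(r"^(?P<db>.+)\.lock\.(?P<host>.+)\.(?P<pid>\d+)$")


def pid_is_alive(pid):
    """Return True if a process with this PID exists on the local host (POSIX)."""
    try:
        os.kill(pid, 0)  # signal 0 only checks existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def find_stale_locks(directory, hostname, is_alive=pid_is_alive):
    """List lock files in `directory` whose owning local process is gone."""
    stale = []
    for path in sorted(Path(directory).iterdir()):
        match = LOCK_NAME.match(path.name)
        if not match or match.group("host") != hostname:
            continue  # not a lock file, or held from a host we cannot check
        if not is_alive(int(match.group("pid"))):
            stale.append(path)
    return stale
```

Locks that name another host are skipped on purpose, since a PID can only
be checked on the machine that created it.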

Re: when to SQL; RFE's (to dev?)

Posted by Michael Monnerie <m....@zmi.at>.

On Monday, 31 October 2005 03:15 Linda Walsh wrote:
> Still am not sure what size system (or user) db's should trigger
> usage of "SQL".  Any reason why user DB's would hurt performance
> over a system DB using Berkeley format?  Supposing I have no system
> DB and am only using user DB's?  What if it is a small group 3-4
> people? Is it an issue of having to read in the DB with each email /
> user and the system DB might hang around in memory?  Does the system
> DB get some preferential treatment?  I.e. if one user gets 80% of the
> email, will SA operate as though it is using a system DB?

There are so many variables that you can't really tell. The "DB" format
is easy to use, as it creates the files itself etc. Using SQL takes
some effort, but afterwards it scales better (I mean 100+ users). For a
handful of people I guess the difference is peanuts; use what you like.
As long as your hardware can handle it, it doesn't matter.

I have used DB until now, with lots of domains and users on several hosts.
The reason I will switch to SQL soon is not performance but
administration: I want to offer a webpage for each user to
configure their own settings. Also, if you use Cyrus IMAPd, you don't
need Linux users anymore, so you need SQL to store personal
preferences, as the user doesn't exist and therefore has no homedir to
store the DB.

>     Still not so sure about why "sa-learn" would process emails so
> much more slowly than 2.6x, since for an individual user, it wouldn't
> be accessing a system DB, no?

I guess it is slower for other reasons - but does it hurt? For me,
sa-learn is an automated job, and I don't really care if it takes 1 or 
100 seconds, as long as the machine still runs smoothly. But of course, 
the quicker the better :-)

Kind regards, zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at           Tel: 0660/4156531          Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net                 Key-ID: 0x70545879

when to SQL; RFE's (to dev?)

Posted by Linda Walsh <sa...@tlinx.org>.


Michael Monnerie wrote:

>On Saturday, 29 October 2005 06:33 Linda Walsh wrote:
>
>>Assuming it is some sort of berkeley db format, what is a good
>>cut-over size as a "rule-of-thumb"...or is there?  What should I
>>expect in speeds  for "sa-learn" or spamc?  I.e. -- is there a
>>rough guideline for when it becomes more effective to use SQL
>>vs. the Berkeley DB?  Or rephrased, when is it worth the effort to
>>convert to SQL and ensure all the SQL software is set up and running?
>
>I don't know whether this really is a performance question, but I 
>believe it's more of a "do I need it" question. For example, if you use 
>a system wide bayes db, you probably won't need SQL. I do this for now.
---
    Still am not sure what size system (or user) db's should trigger
usage of "SQL".  Any reason why user DB's would hurt performance
over a system DB using Berkeley format?  Supposing I have no system
DB and am only using user DB's?  What if it is a small group 3-4 people?
Is it an issue of having to read in the DB with each email / user and
the system DB might hang around in memory?  Does the system DB get some
preferential treatment?  I.e. if one user gets 80% of the email, will
SA operate as though it is using a system DB?

    Still not so sure about why "sa-learn" would process emails so much
more slowly than 2.6x, since for an individual user, it wouldn't be
accessing a system DB, no?

>But if some users want/need their own bayes, or own settings, it starts
>becoming easier to use SQL for all those things - it quickly becomes
>easier to manage once 5 users or so need their special config. That's
>why I'm thinking of switching to SQL.
>
>Does anybody know whether MySQL or PostgreSQL is better suited for the 
>job? I prefer PostgreSQL, but many times MySQL is better supported...
>
>Kind regards, zmi

Re: 3.1 vs. 2.6x & 3.0x: Good; when to SQL; RFE's (to dev?)

Posted by Michael Monnerie <m....@zmi.at>.

On Saturday, 29 October 2005 06:33 Linda Walsh wrote:
> Assuming it is some sort of berkeley db format, what is a good
> cut-over size as a "rule-of-thumb"...or is there?  What should I
> expect in speeds  for "sa-learn" or spamc?  I.e. -- is there a
> rough guideline for when it becomes more effective to use SQL
> vs. the Berkeley DB?  Or rephrased, when is it worth the effort to
> convert to SQL and ensure all the SQL software is set up and running?

I don't know whether this really is a performance question, but I 
believe it's more of a "do I need it" question. For example, if you use 
a system wide bayes db, you probably won't need SQL. I do this for now.

But if some users want/need their own bayes, or own settings, it starts
becoming easier to use SQL for all those things - it quickly becomes
easier to manage once 5 users or so need their special config. That's
why I'm thinking of switching to SQL.

Does anybody know whether MySQL or PostgreSQL is better suited for the 
job? I prefer PostgreSQL, but many times MySQL is better supported...

Kind regards, zmi