You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by "Steven W. Orr" <st...@syslang.net> on 2009/09/24 20:00:49 UTC

Two more SA/MySQL questions.

I got the timestanp field added to the bayee_seen table.

I don't know why but

ALTER TABLE bayes_seen
    ADD lastupdate timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
CURRENT_TIMESTAMP;

did not work as it set all of the lastupdate fields to '0000-00-00 00:00:00'
which apparently is *not* NULL. Thanks to Jari for this:

UPDATE bayes_seen
SET lastupdate = CURRENT_TIMESTAMP
WHERE lastupdate = '0000-00-00 00:00:00' ;

which fixed me up. Maybe it would have worked with InnoDB and not with NyISAM.
All my tables are MyISAM.

I have two questions:

I. I am running MySQL version 5.067 which apparently is not enough to be
allowed to run InnoDB. The questions is this: I thought that InnoDB was going
to consume *more* resources because the purpose of it was to support
transactions. Am I wrong? If I convert to a higher rev of MySQL and get InnoDB
 will I get *better* performance?

II. The other question I asked earlier and didn't get an answer to, is...
Is there a consensus on the optimal time that the bayes_seen and awl tables
should be expired? Is 3 months good? Will 6 months not buy me anything but
hurt performance? Will two months make my bayes tables less accurate?

-- 
Time flies like the wind. Fruit flies like a banana. Stranger things have  .0.
happened but none stranger than this. Does your driver's license say Organ ..0
Donor?Black holes are where God divided by zero. Listen to me! We are all- 000
individuals! What if this weren't a hypothetical question?
steveo at syslang.net


Re: Two more SA/MySQL questions.

Posted by Benny Pedersen <me...@junc.org>.
On tor 24 sep 2009 20:00:49 CEST, "Steven W. Orr" wrote
> I. I am running MySQL version 5.067 which apparently is not enough
> to be allowed to run InnoDB.

see if you have skipinnodb in my.cnf

> The questions is this: I thought that InnoDB was going
> to consume *more* resources because the purpose of it was to
> support transactions. Am I wrong? If I convert to a higher
> rev of MySQL and get InnoDB will I get *better* performance?

myisam get more locking with prevents more multi updates to this high
update table and this will be slow to wait for lock to get each
updated this way, with innodb you just dont wait

> II. The other question I asked earlier and didn't get an answer to,
> is... Is there a consensus on the optimal time that the bayes_seen
> and awl tables should be expired? Is 3 months good? Will 6 months
> not buy me anything but hurt performance? Will two months make my
> bayes tables less accurate?

only rule is that you can have as long expire as your harddisk space
permit, and the longer the better, but if you get to much data to
expire it will also take longer to expire it

i have 6 months for a low msg mail server here, both on awl and bayes
seen

-- 
xpoint


Re: Two more SA/MySQL questions.

Posted by Benny Pedersen <me...@junc.org>.
On tor 24 sep 2009 22:57:13 CEST, LuKreme wrote
> Is there a write-up/how-to anyone's put together about setting up  
> bayes with MySQL?

please read some docs first

> Is it possible to migrate existing bayes to MySQL,

this is well explained in docs how to use --backup --restore in  
sa-learn --help

it just would be nice it also could do awl

> or do you simply start over? Does using MySQL bayes allow you to
> fake per-user bayes with MySQL-based users?

sa-learn --help see the --username

just dump as regulary user (NOT root) and restore into sql with  
--username should do it

-- 
xpoint


Re: Two more SA/MySQL questions.

Posted by Benny Pedersen <me...@junc.org>.
On tor 24 sep 2009 23:06:56 CEST, Jari Fredriksson wrote

> Bayes tables do not have user id or user name,so I guess they are  
> meant for global: no per user bayes no.


CREATE TABLE `bayes_token` (
   `id` int(11) NOT NULL default '0',
   `token` char(5) NOT NULL default '',
   `spam_count` int(11) NOT NULL default '0',
   `ham_count` int(11) NOT NULL default '0',
   `atime` int(11) NOT NULL default '0',
   PRIMARY KEY  (`id`,`token`),
   KEY `bayes_token_idx1` (`token`),
   KEY `bayes_token_idx2` (`id`,`atime`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;


CREATE TABLE `bayes_vars` (
   `id` int(11) NOT NULL auto_increment,
   `username` varchar(200) NOT NULL default '',
   `spam_count` int(11) NOT NULL default '0',
   `ham_count` int(11) NOT NULL default '0',
   `token_count` int(11) NOT NULL default '0',
   `last_expire` int(11) NOT NULL default '0',
   `last_atime_delta` int(11) NOT NULL default '0',
   `last_expire_reduce` int(11) NOT NULL default '0',
   `oldest_token_age` int(11) NOT NULL default '2147483647',
   `newest_token_age` int(11) NOT NULL default '0',
   PRIMARY KEY  (`id`),
   UNIQUE KEY `bayes_vars_idx1` (`username`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8 AUTO_INCREMENT=2 ;


id match in both tables gives the username in bayes vars

but yes some places there is only system wide bayes, its so in amavisd :(

but sa it self have pr user bayes awl userprefs

-- 
xpoint


Re: Two more SA/MySQL questions.

Posted by Jari Fredriksson <ja...@iki.fi>.

24.9.2009 23:57, LuKreme kirjoitti:
> On 24-Sep-2009, at 13:46, Jorn Argelo wrote:
>> Using InnoDB is absolutely vital and you cannot use MyISAM at all 
>> from my experience. My bayes_token table is 12 million rows and 
>> increasing every day, and performance is still just fine.
>
> Is there a write-up/how-to anyone's put together about setting up 
> bayes with MySQL? Is it possible to migrate existing bayes to MySQL, 
> or do you simply start over? Does using MySQL bayes allow you to fake 
> per-user bayes with MySQL-based users?
>

Sa-learn has export/import functionality, I guess that can be used when 
migrating.

Bayes tables do not have user id or user name,so I guess they are meant 
for global: no per user bayes no.


Re: Two more SA/MySQL questions.

Posted by Benny Pedersen <me...@junc.org>.
On fre 25 sep 2009 13:50:45 CEST, LuKreme wrote

> How does a shared UID separate the individual accounts in the MySQL
> map into individual IDs in the MySQL bayes?

it can only be done with unix users so far, eg have 2 unix users share one
id in spamassassin set same username in user_prefs

and this is imho why amavisd also can only do site wide bayes, not pr user

i dont use vpopmail so it might not be how it works in your setup

a dream is to have for me amavisd with spamassassin sql based user prefs

-- 
xpoint


Re: Two more SA/MySQL questions.

Posted by Kris Deugau <kd...@vianet.ca>.
LuKreme wrote:
> But that ID would simply be the vpopmail user, not the individual email 
> addresses that are in the MySQL map, right?

It maps to whatever username spamc is passed, or the *nix UID 
spamassassin/spamc finds automatically.  (I would *hope* you're using 
spamd/spamc in an environment where you're using vpopmail...  <g>)

If you don't specify a calling user, all of your spamc calls are to the 
same logical username (looked up via *nix UID in the system password 
file), and SA has no way to know which user you're filtering for.

If your spamc calls are at delivery and *do* refer to the "real" user 
(email address), you should already have per-user Bayes running by 
default.  (At least, as far as I can see in the docs, and from a mild 
panic here trying to figure out why the Bayes tables were growing wildly 
but didn't seem to be doing anything.)

> How does a shared UID separate the individual accounts in the MySQL map 
> into individual IDs in the MySQL bayes?

It only relies on *nix UID users if you don't explicitly pass a username 
to spamc.  spamd uses whatever username spamc passes in, and uses the 
listing in bayes_vars to restrict which bayes_tokens entries it looks at.

We call spamc with -u <recipient>, and run spamd with -x -q to enable 
SQL.  Historically, we haven't used userprefs on this system (and the 
preprocessed white/black entries handle most of that) but with some of 
the legacy systems we've imported I've started to add a few sets of 
userprefs in SQL.

IMO spamd has something of an awkward limitation for some setups where 
you can't have <X> as on-disk files, and <Y> as SQL, or you can't do 
certain types of virtual users with some types of userpref/Bayes/AWL/etc 
storage.  I thought of a specific edge case a while ago but I can't 
recall it at the moment.

-kgd

Re: Two more SA/MySQL questions.

Posted by LuKreme <kr...@kreme.com>.
On 25-Sep-2009, at 03:14, Benny Pedersen wrote:

> On fre 25 sep 2009 00:49:36 CEST, LuKreme wrote
>> Where that bayes_user let me store the email address for the MySQL/ 
>> postfixadmin users individually.
>
> id is a map to bayes_vars where you find username for the id

But that ID would simply be the vpopmail user, not the individual  
email addresses that are in the MySQL map, right?

> that way more then one email user can share one id user in bayes

Which is exactly what I want to avoid.

>> So, if I have user1@example.com and user1@example.org their bayes  
>> would be saved and checked versus only their own data.
>>
>> Make sense?
>
> in that case you need to make shared user id, not just sitewide id,  
> it have nothing to do with how postfixadmin see and manage things  
> for you

How does a shared UID separate the individual accounts in the MySQL  
map into individual IDs in the MySQL bayes?




-- 
Love seeketh not itself to please Nor for itself hath any care But
	for another gives its ease And builds a heaven in Hell's
	despair


Re: Two more SA/MySQL questions.

Posted by Benny Pedersen <me...@junc.org>.
On fre 25 sep 2009 00:49:36 CEST, LuKreme wrote
> Where that bayes_user let me store the email address for the  
> MySQL/postfixadmin users individually.

id is a map to bayes_vars where you find username for the id

that way more then one email user can share one id user in bayes

> So, if I have user1@example.com and user1@example.org their bayes  
> would be saved and checked versus only their own data.
>
> Make sense?

in that case you need to make shared user id, not just sitewide id, it  
have nothing to do with how postfixadmin see and manage things for you

-- 
xpoint


Re: Two more SA/MySQL questions.

Posted by LuKreme <kr...@kreme.com>.
On 24-Sep-2009, at 15:39, Kris Deugau wrote:

> LuKreme wrote:
>
>> Does using MySQL bayes allow you to fake per-user bayes with MySQL- 
>> based users?
>
> Mmm, after rereading this a few times I'm not sure what you're  
> asking.  By default it's per-user;  each calling user passed by  
> spamc or spamassassin is a "user" in the context of SpamAssassin.   
> There's an "extra" table in the SQL defs (bayes_vars) that maps the  
> calling user to an "id" field in the token table.

OK, currently all of my MySQL managed users are in /usr/local/virtual  
and all their mail belongs to the system users vpopmail. When I run  
bayes for those users, it all is added to the 'sitewide' bayes. I put  
sitewide in quotes because it is not really sitewide, it is vpopmail  
user wide, but since the majority of users are MySQL, they all share  
one database.

> We've overridden this with bayes_sql_override_username to make it  
> sitewide, since per-user Bayes tuning is....  not gonna happen in a  
> general ISP environment.

> If you're looking to actually use different SQL users for each SA  
> user...  don't think that's possible at all with stock SA.

No, what I was hoping was possible was something like this:

CREATE TABLE `bayes_seen` )
  `id` int(11) NOT NULL DEFAULT '0',
  `msgid` varchar(200) CHARACTER SET latin1 COLLATE latin1_bin NOT  
NULL DEFAULT '',
  `flag` char(1) NOT NULL DEFAULT '',
  `lastupdate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE  
CURRENT_TIMESTAMP,
  `bayes_user` varchar(256) CHARACTER SET latin1 COLLATE latin1_bin NO  
NULL DEFAULT 'vpopmail@example.com',
  PRIMARY KEY (`id`,`msgid`)
)

Where that bayes_user let me store the email address for the MySQL/ 
postfixadmin users individually.

So, if I have user1@example.com and user1@example.org their bayes  
would be saved and checked versus only their own data.

Make sense?

-- 
Commander: "Seems odd you'd name your ship after a battle
	you were on the wrong side of."
Mal: "May have been the losing side. Still not convinced
	it was the wrong one."


Re: Two more SA/MySQL questions.

Posted by Kris Deugau <kd...@vianet.ca>.
LuKreme wrote:
> Is there a write-up/how-to anyone's put together about setting up bayes 
> with MySQL?

Aside from the sparse docs that ship with SA, none I've seen.  I'm still 
fiddling the sitewide database size (number of tokens) a year and a half 
after the general configuration reached its current state.

> Is it possible to migrate existing bayes to MySQL, or do you 
> simply start over?

As mentioned, sa-learn's backup/restore functionality should work - may 
be troublesome to script though.

> Does using MySQL bayes allow you to fake per-user 
> bayes with MySQL-based users?

Mmm, after rereading this a few times I'm not sure what you're asking. 
  By default it's per-user;  each calling user passed by spamc or 
spamassassin is a "user" in the context of SpamAssassin.  There's an 
"extra" table in the SQL defs (bayes_vars) that maps the calling user to 
an "id" field in the token table.

We've overridden this with bayes_sql_override_username to make it 
sitewide, since per-user Bayes tuning is....  not gonna happen in a 
general ISP environment.

If you're looking to actually use different SQL users for each SA 
user...  don't think that's possible at all with stock SA.

FWIW, I don't think any of these are MySQL-specific, either.

-kgd

Re: Two more SA/MySQL questions.

Posted by LuKreme <kr...@kreme.com>.
On 24-Sep-2009, at 13:46, Jorn Argelo wrote:
> Using InnoDB is absolutely vital and you cannot use MyISAM at all  
> from my experience. My bayes_token table is 12 million rows and  
> increasing every day, and performance is still just fine.

Is there a write-up/how-to anyone's put together about setting up  
bayes with MySQL? Is it possible to migrate existing bayes to MySQL,  
or do you simply start over? Does using MySQL bayes allow you to fake  
per-user bayes with MySQL-based users?

-- 
Some books are undeservedly forgotten; none are undeservedly
	remembered


Re: Two more SA/MySQL questions.

Posted by Kris Deugau <kd...@vianet.ca>.
Jorn Argelo wrote:
> I can tell from experience that MyISAM is useless when it comes to 
> Bayes. As pointed out by Benny Pedersen, MySQL will do nothing more than 
> waiting on table locks. A single UPDATE query will take 30-90 seconds, 
> and even more when you are on a busy site, not to mention the load of 
> your MySQL server is going skyhigh. Your scantimes will increase 
> dramatically to over 2-3 minutes of a single e-mail. If you, like me, 
> get 25.000 emails a day to process you can't afford this.

Just goes to show what you can do when you throw hardware at the 
problem.  ;)

Our SA database bits are on one of four two-socket, quad-core Opteron 
machines each with 8G of RAM.  The directory for the SA database is a 
1.9G ramdisk;  the database is dumped to a real disk daily.  Table type 
*is* MyISAM;  I've just confirmed this with mysqldump -d.

The other three machines all run SpamAssassin itself - although except 
for occasional peaks, any one could probably handle the load OK.

> Using InnoDB is absolutely vital and you cannot use MyISAM at all from 
> my experience. My bayes_token table is 12 million rows and increasing 
> every day, and performance is still just fine.

What do you have for bayes_expiry_max_db_size?  Is this a sitewide Bayes 
setup?  I've got it set to 2M for sitewide use;  so far that seems to be 
close to optimal between daily churn and accuracy.

-kgd

Re: Two more SA/MySQL questions.

Posted by Jorn Argelo <jo...@wcborstel.com>.
Kris Deugau wrote:
> Steven W. Orr wrote:
>> I. I am running MySQL version 5.067 which apparently is not enough to be
>> allowed to run InnoDB.
>
> Strange;  IIRC InnoDB has been available since some late 3.something 
> versions, and most 4.x releases.  The MySQL docs should be the 
> authoritative reference for this though.
Yes, up to my knowledge MySQL 5.0.xx supports InnoDB just fine.
>
>> The questions is this: I thought that InnoDB was going
>> to consume *more* resources because the purpose of it was to support
>> transactions. Am I wrong? If I convert to a higher rev of MySQL and 
>> get InnoDB
>>  will I get *better* performance?
>
> Likely not;  MySQL gets a great deal of its speed from...  not doing 
> transactions.
>
> Not having done any detailed performance tuning or testing, I can't 
> say for sure - but from what I recall from my reading on this subject 
> that's one of the points formally documented by the MySQL devs 
> themselves.
I can tell from experience that MyISAM is useless when it comes to 
Bayes. As pointed out by Benny Pedersen, MySQL will do nothing more than 
waiting on table locks. A single UPDATE query will take 30-90 seconds, 
and even more when you are on a busy site, not to mention the load of 
your MySQL server is going skyhigh. Your scantimes will increase 
dramatically to over 2-3 minutes of a single e-mail. If you, like me, 
get 25.000 emails a day to process you can't afford this.

Using InnoDB is absolutely vital and you cannot use MyISAM at all from 
my experience. My bayes_token table is 12 million rows and increasing 
every day, and performance is still just fine.

Hope this helps some.

Jorn

Re: Two more SA/MySQL questions.

Posted by Kris Deugau <kd...@vianet.ca>.
Jari Fredriksson wrote:
> MyISAM locks whole table when it needs a lock, InnoDB has row locking, 
> thus operations on tables like awl and bayes_token will be much better 
> if there are multiple spamd connected to the database.

I stand corrected.  I've fought with MySQL in one way or another nearly 
every time I've had to deal with it on more than a passing "need to see 
what's in this table" basis and it's been a while since I've looked at 
specifics like this.

> One can find only new files for learning.

*nod*  That depends a lot on your processes;  I can't (easily) run 
sa-learn on a system that has access to any of the mailspools, so I snag 
messages to learn via IMAP.  I have a process on a system that *does* 
touch the mailspool that archives the (hand-sorted) reported spam.

I prefer to keep a certain amount of mail nominally already learned in 
the folder anyway, and I like to keep the 
reported-as-spam-but-is-really-ham around indefinitely because it's 
nearly impossible to get any ham for manual learning any other way.

-kgd

Re: Two more SA/MySQL questions.

Posted by Jari Fredriksson <ja...@iki.fi>.
24.9.2009 21:55, Kris Deugau wrote:
> Steven W. Orr wrote:
>
>> The questions is this: I thought that InnoDB was going
>> to consume *more* resources because the purpose of it was to support
>> transactions. Am I wrong? If I convert to a higher rev of MySQL and 
>> get InnoDB
>>  will I get *better* performance?
>
> Likely not;  MySQL gets a great deal of its speed from...  not doing 
> transactions.
>
> Not having done any detailed performance tuning or testing, I can't 
> say for sure - but from what I recall from my reading on this subject 
> that's one of the points formally documented by the MySQL devs 
> themselves.
>

MyISAM locks whole table when it needs a lock, InnoDB has row locking, 
thus operations on tables like awl and bayes_token will be much better 
if there are multiple spamd connected to the database.

> Cron'ed Bayes-feeding tasks that are known to read the same set of 
> messages for some amount of time are one reason to not simply "delete 
> from bayes_seen;".  (Unless you really *want* to relearn the same 
> messages over and over again - FWIW, I *have* found that cases of 
> "this spam was sent to half my userbase" make that something you might 
> want to do deliberately now and then.)
>
> We keep two weeks' worth for both bayes_seen and AWL;  the on-disk 
> files for MySQL run about 300-350M for AWL and ~150-200M for 
> bayes_seen.  This is a modest cluster that idles along filtering ~400K 
> of ~3M messages daily.  Autolearn is enabled - without it, bayes_seen 
> would likely not grow quite so fast.
>

One can find only new files for learning. This my cron job for learning 
new ham from my personal Maildir:

----------------------------------------------------------------------------------------------------------------------------------------------

#/bin/bash

if [ -f ~/.learnham.running ]; then exit; fi
touch ~/.learnham.running
trap "rm -f ~/.learnham.running ~/.sa-learn.ham.tmp" EXIT

cd ~
find Maildir/ -newer .sa-learn.ham -name \*`hostname`\* | grep -v -i 
spam | grep -v Trash>.sa-learn.ham.tmp
if test -s .sa-learn.ham.tmp; then
         mv .sa-learn.ham.tmp .sa-learn.ham
         /usr/bin/sa-learn -u spam --showdots --ham --folders=.sa-learn.ham
fi

----------------------------------------------------------------------------------------------------------------------------------------------

It uses 'find' to gets a list of new files (newer that the last run).



Re: Two more SA/MySQL questions.

Posted by Kris Deugau <kd...@vianet.ca>.
Steven W. Orr wrote:
> I. I am running MySQL version 5.067 which apparently is not enough to be
> allowed to run InnoDB.

Strange;  IIRC InnoDB has been available since some late 3.something 
versions, and most 4.x releases.  The MySQL docs should be the 
authoritative reference for this though.

> The questions is this: I thought that InnoDB was going
> to consume *more* resources because the purpose of it was to support
> transactions. Am I wrong? If I convert to a higher rev of MySQL and get InnoDB
>  will I get *better* performance?

Likely not;  MySQL gets a great deal of its speed from...  not doing 
transactions.

Not having done any detailed performance tuning or testing, I can't say 
for sure - but from what I recall from my reading on this subject that's 
one of the points formally documented by the MySQL devs themselves.

> II. The other question I asked earlier and didn't get an answer to, is...
> Is there a consensus on the optimal time that the bayes_seen and awl tables
> should be expired? Is 3 months good? Will 6 months not buy me anything but
> hurt performance? Will two months make my bayes tables less accurate?

Well, expiring from bayes_seen is essentially deleting the message-id's 
of messages that have been learned.  Those are *only* used to determine 
if a new message to be learned has been seen before, and can, generally, 
be deleted at will based on whatever criteria you please. ("Third 
character is a vowel, and today's Tuesday...  Delete that one!")

Cron'ed Bayes-feeding tasks that are known to read the same set of 
messages for some amount of time are one reason to not simply "delete 
from bayes_seen;".  (Unless you really *want* to relearn the same 
messages over and over again - FWIW, I *have* found that cases of "this 
spam was sent to half my userbase" make that something you might want to 
do deliberately now and then.)

We keep two weeks' worth for both bayes_seen and AWL;  the on-disk files 
for MySQL run about 300-350M for AWL and ~150-200M for bayes_seen.  This 
is a modest cluster that idles along filtering ~400K of ~3M messages 
daily.  Autolearn is enabled - without it, bayes_seen would likely not 
grow quite so fast.

We also have a disk space restriction in that we've put the SA database 
on a ramdisk to increase performance;  it gets dumped to a file on a 
real disk daily as a backup.

-kgd