You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@spamassassin.apache.org by Andrew Donkin <ar...@waikato.ac.nz> on 2006/03/23 23:10:58 UTC

Bayes rules taking minutes - solved by moving to innodb?

Hi, people.  This started as a plea for help but ended as a report of
an investigation, so hopefully it will be a useful addition to the
archives.

About 1% of my scans were taking more than 300 seconds.  Extra
debugging in spamd showed me that the Bayes checks were the culprit:

13:38:05 spamd[16852]: slow: run_eval_tests BAYES_40 took 773 seconds 
13:45:18 spamd[16852]: slow: run_eval_tests BAYES_80 took 427 seconds 
13:45:20 spamd[16852]: slow: do_body_eval_tests(0) took 1212 seconds 

I am using per-user Bayes (on the recommendation of half this list,
and against the recommendation of the other half :-), and perform
about 100,000 scans per day.  Bayes_seen was ~ 150M, and bayes_token ~
1.5G.  The bayes_token index was 4.7G.

MySQL's slow query log showed that the queries did not take long to
execute after they achieved a lock, but I suspected they were not
getting their locks in reasonable time:

 mysql> SHOW STATUS LIKE 'Table%';
 +-----------------------+--------+
 | Variable_name         | Value  |
 +-----------------------+--------+
 | Table_locks_immediate | 171036 |
 | Table_locks_waited    | 220999 |
 +-----------------------+--------+

In a healthy database, table_locks_waited is a small fraction of
table_locks_immediate.

I turned off bayes_auto_expire in case it was the expiry which caused
the contention, but no change.  I need bayes_auto_expire turned on
because as we've discussed before, there is no way to perform
expiration for every user in an SQL Bayes database.

Well, I started this email a week ago and now I've found that at peak
times, "SHOW PROCESSLIST" shows many threads -- like 100 -- locked on
"SELECT FROM bayes_token" and "INSERT INTO bayes_token".

So I tried to convert bayes_token to InnoDB to take advantage of its
row-level locking (this is advised by the developers but not reflected
in bayes_mysql.sql).  After MySQL worked on that for a few days I
stopped it, dropped the database (innodb was very confused), and
recreated the database and all tables using innodb and two-byte IDs.

It's early days, with only 7.6M tokens seen and few accounts over the
activation mark of 200 ham.  But I'm hoping my timeout problems are
over.

So my advice is:

 SHOW STATUS LIKE 'Table%';
 SHOW PROCESSLIST;
 Change to innodb
 ALTER TABLE bayes_token MODIFY id SMALLINT UNSIGNED NOT NULL,
                         MODIFY spam_count SMALLINT UNSIGNED NOT NULL,
                         MODIFY ham_count SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_expire MODIFY id SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_seen MODIFY id SMALLINT UNSIGNED NOT NULL;
 ALTER TABLE bayes_vars MODIFY id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT;

-- 
_________________________________________________________________________
Andrew Donkin                  Waikato University, Hamilton,  New Zealand

Re: Bayes rules taking minutes - solved by moving to innodb?

Posted by Bill Taroli <bi...@billsden.org>.
Andrew Donkin wrote:
> So I tried to convert bayes_token to InnoDB to take advantage of its
> row-level locking (this is advised by the developers but not reflected
> in bayes_mysql.sql).  After MySQL worked on that for a few days I
> stopped it, dropped the database (innodb was very confused), and
> recreated the database and all tables using innodb and two-byte IDs.
>   

I wonder if you could have saved yourself the effort of having to 
retrain in this case by wrapping the DB recreation in a sa-learn 
--backup / sa-learn --restore pair?

Bill