Posted to users@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2007/12/03 18:14:46 UTC

Re: Mondo bayes_toks - millions of entries

Wes writes:
> The PostgreSQL experiment turned out to not be as stellar as I had hoped.
> With our volume, the disk write load for bayes auto-learn is extremely high,
> even with fsync disabled, shared memory increased, etc.  I also ran into some
> severe concurrency issues - lots of waiting on locks, even with only one
> system doing updates and the others reading.  Auto-vacuum set to 60 seconds
> (every 3 minutes for the SpamAssassin table) appears to help tremendously.
> I think we'd need a solid state disk, or SAN with a large buffer, to safely
> handle it with a larger number of tokens.  I'm also getting failed expires
> due to 'deadlock detected'.

Have you considered turning off autolearn to reduce the number of writes?
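
For reference, autolearn is controlled by a single SpamAssassin configuration
option. A minimal fragment, assuming a site-wide local.cf (the file location
varies by installation):

```
# local.cf: stop scanners from writing to the Bayes store during scanning;
# manual training with sa-learn still works with this setting.
bayes_auto_learn 0
```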

> Regrouping, I was looking at benchmarks for QDBM and see it is on the "we
> need volunteers" list.  Is this more than just changing the "tie" in the
> Bayes DBM store module?

It's that, and lots of testing ;)

--j.

Re: Mondo bayes_toks - millions of entries

Posted by "John D. Hardin" <jh...@impsec.org>.

On Thu, 6 Dec 2007, Wes wrote:

> We're going to switch to all-manual learning and hopefully
> convince enough users to send in spam and false positives to train
> it well.  Sufficient participation is a big question, but appears
> to be the only viable option at this point.

That could be automated somewhat. Hook into your delivery process for
selected users and bcc messages that fall outside your desired
thresholds to spam and ham boxes, then train from the boxes in bulk at
night and clear them. Sort of a middle ground between regular
autolearn and totally manual training. Batched autolearn?
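
The batching idea above can be sketched roughly as follows. This is only an
illustration, not anything from the thread: the score thresholds, the function
names and the mbox paths are all hypothetical, and it assumes the delivery
hook has already copied the selected messages into two mbox files. The only
external pieces it relies on are the sa-learn options --spam, --ham, --mbox,
--no-sync and --sync.

```python
import subprocess
from typing import List, Optional

# Hypothetical thresholds; tune them to your own score distribution.
SPAM_MIN_SCORE = 10.0   # at or above this score: copy to the spam box
HAM_MAX_SCORE = 0.0     # at or below this score: copy to the ham box


def training_box(score: float) -> Optional[str]:
    """Return 'spam', 'ham', or None for a message with this score."""
    if score >= SPAM_MIN_SCORE:
        return "spam"
    if score <= HAM_MAX_SCORE:
        return "ham"
    return None  # mid-range scores are not used for training


def sa_learn_command(kind: str, mbox_path: str) -> List[str]:
    """Build the sa-learn invocation for one mbox file."""
    if kind not in ("spam", "ham"):
        raise ValueError(f"unknown training kind: {kind!r}")
    # --no-sync defers the journal merge so it happens once per batch.
    return ["sa-learn", f"--{kind}", "--no-sync", "--mbox", mbox_path]


def nightly_train(spam_mbox: str, ham_mbox: str, run=subprocess.run) -> None:
    """Train from both boxes, sync once, then empty the boxes."""
    for kind, path in (("spam", spam_mbox), ("ham", ham_mbox)):
        run(sa_learn_command(kind, path), check=True)
    run(["sa-learn", "--sync"], check=True)
    for path in (spam_mbox, ham_mbox):
        open(path, "w").close()  # truncate the box after a successful run
```

Run from cron on a single host, nightly_train("/var/spool/train/spam.mbox",
"/var/spool/train/ham.mbox") would turn the day's writes into one bulk update.
The truncation step is not safe against a delivery that arrives mid-run, so a
real job would need to lock or rotate the boxes first.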

--
 John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
 jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
 key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  Perfect Security is unattainable; beware those who would try to sell
  it to you, regardless of the cost, for they are trying to sell you
  your own slavery.
-----------------------------------------------------------------------
 9 days until Bill of Rights day

Re: Mondo bayes_toks - millions of entries

Posted by Wes <we...@msg.bt.com>.

On 12/3/07 11:14 AM, "Justin Mason" <jm...@jmason.org> wrote:

> Have you considered turning off autolearn to reduce the number of writes?

That is where I am at now.  Whether with a database or DBM, I have scaling
and concurrency problems.  I am also having problems with expire failing in
both - deadlock detected for the DB, or failure to acquire lock in DBM.  It
doesn't appear auto-learn is really buying us anything anyway, especially
considering the short token retention period necessary without adding very
serious disk hardware.

We're going to switch to all-manual learning and hopefully convince enough
users to send in spam and false positives to train it well.  Sufficient
participation is a big question, but appears to be the only viable option at
this point.

Wes