Posted to users@spamassassin.apache.org by Jesse Norell <je...@kci.net> on 2017/07/12 15:59:14 UTC

txrep training performance

Hello,

  I have txrep data in a MySQL database and am working on a training
script that runs sa-learn. With bayes also in MySQL and a corpus of
5279 spam and 849 ham messages, sa-learn takes a full 2 hours to run
with txrep enabled (use_txrep 1), but only 13 minutes with txrep
disabled (use_txrep 0).  One of my main gripes with the old AWL was
that it didn't learn/correct when training messages, so I love that
txrep does that, but does anyone have any tips to improve txrep
training performance?  Either tweaks/improvements on my end, or even a
little thought on logic redesign in that area?

Thanks,

-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107  -  www.kci.net

Re: txrep training performance

Posted by Jesse Norell <je...@kci.net>.

For anyone interested, I largely resolved the performance issues with
sa-learn training when using txrep with a little MySQL server tuning.
As a reference point, training on ~6400 messages (most of which had
already been learned) took about 14 minutes with both txrep and bayes
enabled, and about 3.5 minutes less with txrep disabled.  (You could do
much better with better hardware.)


For those interested in improving txrep training performance, I wonder
if it couldn't be improved tremendously. I'm a little unclear on what it
does/doesn't track, and am puzzled by this statement in TxRep.pm:

        https://apache.googlesource.com/spamassassin/+/trunk/lib/Mail/SpamAssassin/Plugin/TxRep.pm#1805

        The TxRep plugin currently does track each message individually,
        hence it does not detect when you learn the message repeatedly.
        It will add/subtract the penalty/bonus score each time the
        message is fed to the spam learner.

Is that a typo?  If it does track individual messages, it seems obvious
that it *should* detect learning a message repeatedly, and do nothing
when you try to re-learn a message as the same type.  (I have some
queries saved and such, but e.g. learning a message the first time
issued 19 queries; relearning the same message as the same type issued
41 queries.)
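
To make the expected behaviour concrete, here is a minimal sketch of what
"do nothing when re-learning as the same type" could look like. It is a
toy in-memory model, not TxRep's code: the class name, the dictionaries,
and the penalty/bonus values are all made up for illustration.

```python
# Hypothetical model of idempotent training; NOT TxRep's actual code or schema.
from dataclasses import dataclass, field

# Illustrative adjustments only; these are not TxRep defaults.
SPAM_PENALTY = 1.0
HAM_BONUS = -1.0


def _delta(label):
    """Score adjustment applied when a message is learned as `label`."""
    return SPAM_PENALTY if label == "spam" else HAM_BONUS


@dataclass
class ReputationStore:
    learned: dict = field(default_factory=dict)     # msg_id -> "spam" or "ham"
    reputation: dict = field(default_factory=dict)  # sender -> running total

    def _adjust(self, sender, delta):
        self.reputation[sender] = self.reputation.get(sender, 0.0) + delta

    def learn(self, msg_id, sender, label):
        """Apply the adjustment at most once; return True if anything changed."""
        previous = self.learned.get(msg_id)
        if previous == label:
            return False  # same message, same type: nothing to do
        if previous is not None:
            # Previously learned as the other type: undo that adjustment first.
            self._adjust(sender, -_delta(previous))
        self._adjust(sender, _delta(label))
        self.learned[msg_id] = label
        return True


store = ReputationStore()
first = store.learn("msg-1", "sender@example.com", "spam")   # applies penalty
again = store.learn("msg-1", "sender@example.com", "spam")   # no-op
flip = store.learn("msg-1", "sender@example.com", "ham")     # undo, then bonus
```

In this model the second call returns without touching the reputation at
all, which is the behaviour I would have expected from a per-message
tracking table.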

My guess is that the current state of things is: could be improved,
maybe file an RFE?


Thanks...



On Wed, 2017-07-12 at 17:40 -0600, Jesse Norell wrote:
> One thing pointing to maybe a need for reworking the training logic is
> that I have txrep_track_messages at the default (1), and almost every
> message in my corpus has already been trained; each run brings in only a
> handful of new messages (usually 10-20, but often 0, and always < 100).
> It sure seems like a quick check to find out if it has already learned
> this message as the same type (ham/spam) would take a single query, then
> move on to the next message for those already seen; but I see sa-learn
> doing many INSERTs (usually failing with 'Duplicate entry') and UPDATEs
> of the txrep table.


-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107  -  www.kci.net

Re: txrep training performance

Posted by Jesse Norell <je...@kci.net>.

One thing pointing to a possible need to rework the training logic is
that I have txrep_track_messages at the default (1), and almost every
message in my corpus has already been trained; each run brings in only a
handful of new messages (usually 10-20, but often 0, and always < 100).
It sure seems like checking whether a message has already been learned
as the same type (ham/spam) would take a single query, after which
sa-learn could move on to the next message; but I see sa-learn
doing many INSERTs (usually failing with 'Duplicate entry') and UPDATEs
of the txrep table.
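
The "single query, then move on" idea could be sketched roughly as
follows. This is only an illustration under stated assumptions: it uses
Python's sqlite3 as a stand-in for MySQL, and the seen_messages table and
the function names are hypothetical, not the real txrep schema or
anything sa-learn actually does.

```python
# Hypothetical check-before-write loop; table and function names are invented
# for illustration and do not reflect the real txrep schema.
import sqlite3


def open_store():
    """Create an in-memory database with a per-message tracking table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE seen_messages ("
        " msg_id TEXT PRIMARY KEY,"
        " label  TEXT NOT NULL)"
    )
    return conn


def already_learned(conn, msg_id, label):
    """One SELECT: was this message already learned as this type?"""
    row = conn.execute(
        "SELECT 1 FROM seen_messages WHERE msg_id = ? AND label = ?",
        (msg_id, label),
    ).fetchone()
    return row is not None


def train(conn, corpus):
    """corpus is an iterable of (msg_id, label); returns (skipped, written)."""
    skipped = written = 0
    for msg_id, label in corpus:
        if already_learned(conn, msg_id, label):
            skipped += 1  # nothing else to do for this message
            continue
        # New message, or learned before as the other type: record it.
        # A real trainer would also update sender reputations here.
        conn.execute(
            "INSERT OR REPLACE INTO seen_messages (msg_id, label) VALUES (?, ?)",
            (msg_id, label),
        )
        written += 1
    conn.commit()
    return skipped, written


conn = open_store()
corpus = [("msg-1", "spam"), ("msg-2", "ham")]
first_run = train(conn, corpus)    # both messages are new
second_run = train(conn, corpus)   # both are skipped after one SELECT each
```

With a corpus where nearly everything has already been trained, a loop
like this would issue one read per message and write only for the
handful of new ones, instead of attempting an INSERT that fails with a
duplicate-key error.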

-- 
Jesse Norell
Kentec Communications, Inc.
970-522-8107  -  www.kci.net