Posted to users@spamassassin.apache.org by Daniel Staal <DS...@usa.net> on 2006/10/01 19:33:13 UTC
Re: sa-learn and "Caught" spams

--As of September 28, 2006 11:05:35 AM -0700, Kelson is alleged to have 
said:

> Daniel Staal wrote:
>> Depends on the setup.  For instance, given the explanations above, I'll
>> start a system to automatically learn from my 'checkspam' folder, but
>> not my 'highspam' folder.  I have procmail automatically sort my spam by
>> score, so I can pay extra attention to low-scoring spam.  (Which is more
>> likely to be ham which was misplaced than the high-scoring spam.)
>>
>> So, since I *already* have them separated out, I can avoid the
>> double-check.  ;)
>
> But the final score alone doesn't determine whether something gets
> autolearned.
>
> As Matt pointed out, there are a number of different factors, including
> the mix of head/body tests and the current Bayes score -- and it acts on
> what the score would have been if Bayes had been disabled.
>
> So unless you've filtered on the "autolearn=(ham|spam|no)" tag in the
> X-Spam-Status header, you could be missing some high-scoring spam that
> hasn't already been learned.
>
> You could probably filter your training folder to remove any messages
> where X-Spam-Status contains "autolearn=spam" (assuming, of course, that
> your server takes full control of that header).  That should be
> relatively fast and cut down on the resources used to identify duplicates.

--As for the rest, it is mine.

Just as an update, since I'm seeing something interesting...

As an experiment, I set procmail to copy all the 'highspam' that I get that 
*doesn't* get autolearned to a separate folder, and have been attempting to 
train on that folder daily.
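
For anyone who would rather do the selection outside procmail, here is a rough 
sketch of the same test in Python.  It is not my actual recipe: the function 
names are made up for illustration, and it assumes the X-Spam-Status header 
has the usual "Yes, score=... autolearn=..." layout (older versions wrote 
"hits=" instead of "score=") and that your server controls that header.

```python
import re
from email import message_from_string

# Assumption: 10 is the high-spam cut-off used in this message; adjust to taste.
HIGH_SPAM_CUTOFF = 10.0


def parse_spam_status(header_value):
    """Return (is_spam, score, autolearn) parsed from an X-Spam-Status value."""
    # Folded headers keep their newlines and tabs; flatten them to single spaces.
    flat = " ".join(header_value.split())
    is_spam = flat.lower().startswith("yes")
    score_match = re.search(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)", flat)
    learn_match = re.search(r"\bautolearn=(\w+)", flat)
    score = float(score_match.group(1)) if score_match else None
    autolearn = learn_match.group(1) if learn_match else None
    return is_spam, score, autolearn


def needs_manual_training(raw_message, cutoff=HIGH_SPAM_CUTOFF):
    """True for high-scoring spam that was tagged autolearn=no."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status")
    if status is None:
        return False
    is_spam, score, autolearn = parse_spam_status(status)
    return bool(
        is_spam and score is not None and score > cutoff and autolearn == "no"
    )
```

Messages for which needs_manual_training() returns True are the ones that 
would land in the separate training folder.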

I say 'attempting' because, even though these are *only* the emails that had 
'autolearn=no' and were definitely spam, in three days sa-learn has yet to 
find any useful tokens in any of these messages.  Generally, upon 
examination, these messages are already receiving Bayes scores of 99% or 
better, so it appears that the tokens found are already fully scored. 
(Though not all of them have had such high Bayes scores.)
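
If you want to track this over more than a few days, the summary line that 
sa-learn prints can be logged and parsed.  A small sketch, assuming the 
summary has the form "Learned tokens from N message(s) (M message(s) 
examined)"; the function name is hypothetical, and a count of zero does not 
by itself say *why* nothing was learned.

```python
import re

# Assumption: sa-learn's summary line looks like
#   Learned tokens from 0 message(s) (4 message(s) examined)
SUMMARY_RE = re.compile(
    r"Learned tokens from (\d+) message\(s\) \((\d+) message\(s\) examined\)"
)


def parse_sa_learn_summary(output):
    """Return (learned, examined) from sa-learn output, or None if no summary."""
    match = SUMMARY_RE.search(output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))
```

A run where the first number stays at zero while the second is positive is 
the "nothing useful learned" case described above.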

I'll be keeping it up for a while; three days isn't much of a test, after 
all.  But at this point it appears extra training on messages with scores 
over 10 (my high-spam cut-off) doesn't actually do anything.  All relevant 
tokens are already learned, at least in a fully-trained and well-tuned 
system.

Among spam scoring less than 10, a number of messages each day do have 
useful tokens, on my system.  Which is to be expected, after all.

Just thought this might be of interest.

Daniel T. Staal

---------------------------------------------------------------
This email copyright the author.  Unless otherwise noted, you
are expressly allowed to retransmit, quote, or otherwise use
the contents for non-commercial purposes.  This copyright will
expire 5 years after the author's death, or in 30 years,
whichever is longer, unless such a period is in excess of
local copyright law.
---------------------------------------------------------------