You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@spamassassin.apache.org by Steve Bergman <sb...@gmail.com> on 2014/07/02 03:36:13 UTC

Re: Bayes, Manual and Auto Learning Strategies

On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
>
> That's pretty bad practice. Fundamentally, you are implementing a custom
> auto-learn flavor, overruling the SA configurable auto-learn behavior

SA's autolearn behavior doesn't make much sense. I have no confidence in it.

This method shields the user from the worst of the spam, while giving 
them full control of what gets relearned as spam.

> and ignoring all safety concepts implemented by SA.

What safety concepts? autolearn is a complete joke. Even the docs 
explain that it's only there as a last resort method of kinda sorta 
training the spam filter.

> So if a user in a hurry simply deletes some spam, it will remain ham, as
> far as Bayes is concerned.

Same as with Thunderbird, I think. And it's working very well for them. 
If they act irresponsibly, they'll get more spam. It takes no longer to 
highlight the spam and click "Junk" than it does to highlight the spam 
and click "Delete".

I've pretty much decided at this point that if the users don't do what I 
tell them to do, repeatedly, then what results is not my responsibility.

And it's not.

The alternative is to not mark incoming mail as ham, and allow the SA 
Bayesian filter to remain inactive forever.

I opted to give the users the choice of being responsible for sorting, 
and reaping the benefits of that if they do. And yes, I know that some 
are not going to.

I'd be interested if you have a better solution in mind.

-Steve

Re: Bayes, Manual and Auto Learning Strategies

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Tue, 2014-07-01 at 22:18 -0500, Steve Bergman wrote:
> On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote:
> 
> > Frankly, it appears you don't understand what auto-learning is.
> 
> So please specify, explicitly, what it is. I asked some specific 
> questions about it. And I'm very interested in the answers.

If you want my opinion, please re-phrase your questions. I locally
deleted most of this previous (originally unrelated) thread.

> Is auto-learn still system-wide? I'd need it to apply to individual 
> users. Is it in-memory only? Or can I have it update the users' filedb 
> token databases?

SA itself never was system-wide, neither user-specific. It is both, can
be either. It depends on the context of calling SA.

> If it's now per user and uses the user databases, then I am more than 
> ready to reconsider my opinion. But I've not been able to get a clear 
> answer to this. I haven't had an opportunity to test. And I'd want 
> confirmation from someone in the know anyway, before I changed strategies.

It does not depend on SA, but on how you invoke SA. We cannot give you a
clear answer. It depends on your system, your SMTP, glue, system wide
calling of SA, and possibly per-user invocations even after system-wide.

To be clear: SA is a filter. It does nothing itself, other than
classification. Being called, and at which point, is outside the scope
of SA. Rejecting, deleting, delivering or any other kind of action is
outside the scope of SA. That's actions performed by the calling layer,
based on the result of SA evaluation.

> >> This method shields the user from the worst of the spam, while giving
> >> them full control of what gets relearned as spam.
> >
> > Wrong. It is not "this" (your) method, that shields the user from the
> > worst of the spam. That's SA. Not your style of auto-training.
> 
> Mine is not autotraining at all. it's giving the user a way of 
> explicitly training the backend spam filter.

Quoting your previous post, you "have a line in the users' default
.forward file to train incoming mail as ham". That is auto-training.

> > (Besides, you *are* doing auto-learning, which you just claimed to be a
> > complete joke.)
> 
> No. The messages are assumed ham until the user classifies it as spam. 
> It is explicit learning. Under user control,

Being "assumed" is not the same as being "treated and automatically
reinforced". The latter is what you do. (And btw, Yes. You are
auto-learning.)

> > At this point I won't get into details. It should suffice to highlight
> > that a default ham auto-learning threshold of 0.1 is part of the safety
> > concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)
> 
> I really don't think you understand what it is I'm doing. Anything below 
> a score of 5.0 goes into their mailbox and learned as ham. If it's ham, 
> that's great. If it's spam, they move it to Junk and it gets learned as 
> spam. auto-learn is as brain dead as the defunct AWL.

I perfectly understood what you are doing.

You didn't understand why that is bad. Failing to explain might be my
bad, though I'll leave re-explaining for tomorrow my timezone. Or you
carefully re-reading my posts.

> > I never checked the TB internal Bayes implementation and auto-learn
> > strategy, but I'd be surprised if they do train on black/white, without
> > any gray area in between.
> 
> Optimally, I would have an "incoming folder" and then the user could 
> manually move the messages from there to spam or ham. But considering 

Which is basically what you came from, using Dovecot antispam plugin
with SA, and dedicated folders "where the user could manually move the
messages" to. Why didn't you just set that up?

(Hint: That's your set-up without auto-learning ham Inbox deliveries.)

> that this was not even remotely necessary with our old email provider, I 
> don't feel that I can put my users to that level of extra trouble that 
> they never even thought about having to deal with before, just because 
> SA is not performing as well as the spam filter they are used to. The 

Do initial manual training. Then get back to us.

> mail needs to go into the inbox directly. And for SA's bayesian tp work, 
> it needs to be assumed as ham initially.

No.

It seems your previous "email provider", whatever that might be, had
some sort of spam filtering service. Now you're on your own.

Which you are, unless you decide to ask for free (as in beer) support by
the community providing the software for free (as in speech) to help you
weed out the spam. You did ask, which is just fine, but your assumptions
are kind of hostile. Like your previous "email provider" would not use
SA internally. He most likely does.

> The only thing I see which might change my view would be explicit 
> details about where autolearn stores its data and how it is used on a 
> per user basis.

So the only thing that might change your view would be reading the docs.
Go read them.

Auto-learn stores its data exactly where Bayes generally stores its
data. In fact, it is the same. Just being triggered to learn
automatically (sic), rather than manually by invoking sa-learn.

Does that change your view?

Ah, no, changing your view also depends on how per-user learning is
done. Well, again, that depends on how you call SA. Or in other terms,
how you view SA depends on how you call SA. True...

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: Bayes, Manual and Auto Learning Strategies

Posted by Steve Bergman <sb...@gmail.com>.

On 07/01/2014 09:53 PM, Karsten Bräckelmann wrote:

> Frankly, it appears you don't understand what auto-learning is.

So please specify, explicitly, what it is. I asked some specific 
questions about it. And I'm very interested in the answers.

Is auto-learn still system-wide? I'd need it to apply to individual 
users. Is it in-memory only? Or can I have it update the users' filedb 
token databases?

If it's now per user and uses the user databases, then I am more than 
ready to reconsider my opinion. But I've not been able to get a clear 
answer to this. I haven't had an opportunity to test. And I'd want 
confirmation from someone in the know anyway, before I changed strategies.

>
>> This method shields the user from the worst of the spam, while giving
>> them full control of what gets relearned as spam.
>
> Wrong. It is not "this" (your) method, that shields the user from the
> worst of the spam. That's SA. Not your style of auto-training.
>

Mine is not autotraining at all. it's giving the user a way of 
explicitly training the backend spam filter.

> And unless you disabled Bayes auto-learning in SA (dunno, might have
> been mentioned deep in the thread), the user does not have full control
> of what gets relearned as spam.
>

I have disabled autolearning. I thought I mentioned that to you.

> (Besides, you *are* doing auto-learning, which you just claimed to be a
> complete joke.)

No. The messages are assumed ham until the user classifies it as spam. 
It is explicit learning. Under user control,

>
> At this point I won't get into details. It should suffice to highlight
> that a default ham auto-learning threshold of 0.1 is part of the safety
> concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)
>

I really don't think you understand what it is I'm doing. Anything below 
a score of 5.0 goes into their mailbox and learned as ham. If it's ham, 
that's great. If it's spam, they move it to Junk and it gets learned as 
spam. auto-learn is as brain dead as the defunct AWL.

> I never checked the TB internal Bayes implementation and auto-learn
> strategy, but I'd be surprised if they do train on black/white, without
> any gray area in between.

Optimally, I would have an "incoming folder" and then the user could 
manually move the messages from there to spam or ham. But considering 
that this was not even remotely necessary with our old email provider, I 
don't feel that I can put my users to that level of extra trouble that 
they never even thought about having to deal with before, just because 
SA is not performing as well as the spam filter they are used to. The 
mail needs to go into the inbox directly. And for SA's bayesian tp work, 
it needs to be assumed as ham initially.

The only thing I see which might change my view would be explicit 
details about where autolearn stores its data and how it is used on a 
per user basis.

-Steve

Re: Bayes, Manual and Auto Learning Strategies

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Tue, 2014-07-01 at 20:36 -0500, Steve Bergman wrote:
> On 07/01/2014 07:32 PM, Karsten Bräckelmann wrote:
> >
> > That's pretty bad practice. Fundamentally, you are implementing a custom
> > auto-learn flavor, overruling the SA configurable auto-learn behavior
> 
> SA's autolearn behavior doesn't make much sense. I have no confidence in it.

The auto-learning feature is NOT meant to be a fully automated training
system. It's an aid for the user to eliminate the need to care about the
extremes, while focusing on the close-calls. There are options to tweak
to your specific needs, and there even is no single "SA autolearn
behavior" as you stated, but different flavors. And an option to turn it
off.

Frankly, it appears you don't understand what auto-learning is.

> This method shields the user from the worst of the spam, while giving 
> them full control of what gets relearned as spam.

Wrong. It is not "this" (your) method, that shields the user from the
worst of the spam. That's SA. Not your style of auto-training.

And unless you disabled Bayes auto-learning in SA (dunno, might have
been mentioned deep in the thread), the user does not have full control
of what gets relearned as spam.

> > and ignoring all safety concepts implemented by SA.
> 
> What safety concepts? autolearn is a complete joke. Even the docs 
> explain that it's only there as a last resort method of kinda sorta 
> training the spam filter.

You are doing (custom) auto-learning as ham of any message with a score
less than required_score of 5.0. *That* is a joke.

(Besides, you *are* doing auto-learning, which you just claimed to be a
complete joke.)

At this point I won't get into details. It should suffice to highlight
that a default ham auto-learning threshold of 0.1 is part of the safety
concepts. (See the M::SA::Plugin::AutoLearnThreshold man-page for more.)

> > So if a user in a hurry simply deletes some spam, it will remain ham, as
> > far as Bayes is concerned.
> 
> Same as with Thunderbird, I think.

I never checked the TB internal Bayes implementation and auto-learn
strategy, but I'd be surprised if they do train on black/white, without
any gray area in between.

You stated it. Please back up your claim.

> And it's working very well for them. 
> If they act irresponsibly, they'll get more spam. It takes no longer to 
> highlight the spam and click "Junk" than it does to highlight the spam 
> and click "Delete".

While I am aware I'm not the average user -- there's a "delete" action
key on my keyboard. There's no "junk" equivalent. Yes, I avoid using the
mouse if keyboard interaction is more productive...

> I've pretty much decided at this point that if the users don't do what I 
> tell them to do, repeatedly, then what results is not my responsibility.
> 
> And it's not.

Do you hate your users or your job? (Sorry, snide-remark I couldn't
resist. Feel free to ignore.)

> The alternative is to not mark incoming mail as ham, and allow the SA 
> Bayesian filter to remain inactive forever.

No. I can only guess, but it appears there are some mis-interpretations
in that conclusion.

The SA Bayesian classifier to "remain inactive forever" can only refer
to insufficient initial training. Manual training. Of at least 200 ham
and spam each (by default, you can lower that to 0). You will easily get
that by manual training of existing messages. And even default auto-
learning would eventually cross the ham number. Less than forever.

More importantly, SA still marks (classifies) incoming mail as ham. Just
because its overall score is less than 5.0. It just does not *learn* all
of them as ham. Because there's a chance it might not actually be ham,
but a FN.

That area, between (default) auto-learning as ham and classifying as
spam is the gray area, where actual user input is of much value. For
both, learning spam AND ham, for that matter. In particular, because
generally (and as SA principle), a FP is *much* worse than a FN.

Your approach of force learning those as ham, is biasing your Bayes DB.
At the very least temporarily (unless a fresh spam campaign has been
re-trained by your users on Monday). At worst, until you clear it.

Btw, is that per-user, or are you gambling a site-wide Bayes DB?

> I opted to give the users the choice of being responsible for sorting, 
> and reaping the benefits of that if they do. And yes, I know that some 
> are not going to.
> 
> I'd be interested if you have a better solution in mind.

Do not auto-learn ham every message that scores below required_score.

Introduce train-on-error for your users, with an extended manual
training option. Specific ham and spam folders, where moving or copying
mail into trains the Bayes classifier. Kind of optional for the user,
unless they feel there's too much mis-classification.

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}