Posted to users@spamassassin.apache.org by Matt Corallo <sa...@mattcorallo.com> on 2021/09/21 20:11:25 UTC

Disabling autolearn on given rule

Hi!

I recently noticed my bayes was rarely matching any spam, and it turns out this was due to 
autolearn=ham'ing occurring on lots of list traffic that I only occasionally read, some of which was 
blatant spam. Sadly, list traffic can be pretty hard to categorize and ends up getting through due 
to good sending IP and domain reputation.

While correcting the filter through sa-learn solves this issue temporarily, I don't want to have to 
always read lists that I previously only occasionally read just to re-classify spam. Thus, I'd like 
to disable autolearn entirely for mails that match a given rule (eg MAILING_LIST_MULTI).

"tflags MAILING_LIST_MULTI noautolearn" doesn't seem like quite what I want, it just reduces the 
score used to decide whether to learn. There's some old bugzilla mentions asking for this feature, 
but it seems the response was "write a plugin". Is there a plugin available for this or how would 
one go about writing one?

Thanks,
Matt

Re: Disabling autolearn on given rule

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 21.09.21 13:11, Matt Corallo wrote:
>I recently noticed my bayes was rarely matching any spam, and it turns 
>out this was due to autolearn=ham'ing occurring on lots of list 
>traffic that I only occasionally read, some of which was blatant spam. 
>Sadly, list traffic can be pretty hard to categorize and ends up 
>getting through due to good sending IP and domain reputation.
>
>While correcting the filter through sa-learn solves this issue 
>temporarily, I don't want to have to always read lists that I 
>previously only occasionally read just to re-classify spam. Thus, I'd 
>like to disable autolearn entirely for mails that match a given rule 
>(eg MAILING_LIST_MULTI).

unfortunately there are no common rules designed to autolearn ham (wonder
why? :-), thus ham autolearning depends on a few negative scores, of which
most are DNS allowlists.

I usually mark them all as noautolearn because many business notifications
that were too close to spam got autolearned.

>"tflags MAILING_LIST_MULTI noautolearn" doesn't seem like quite what I 
>want, it just reduces the score used to decide whether to learn. 

"tflags MAILING_LIST_MULTI noautolearn" means that the score of
"MAILING_LIST_MULTI" won't be used for the autolearn decision.
It does not mean that mail hitting MAILING_LIST_MULTI won't be used for
autolearn.
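
To make the distinction concrete, here is a small annotated sketch. The
rule scores in the comments are made-up numbers for illustration only, and
the comparison assumes the default ham threshold of 0.1 from the
AutoLearnThreshold plugin.

```
# local.cf sketch (scores in the comments are hypothetical)
tflags MAILING_LIST_MULTI noautolearn

# Suppose a list message hits:
#   MAILING_LIST_MULTI   -1.0   (now ignored for the autolearn decision)
#   some DNS allowlist   -0.5
#   everything else       0.0
#
# Autolearn score without the flag:  -1.5
# Autolearn score with the flag:     -0.5
#
# -0.5 is still below the default ham threshold of 0.1, so the message
# can still be autolearned as ham even though MAILING_LIST_MULTI hit.
```

In other words, the flag only helps when the rule's own score is what
pushes a message under the threshold.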

>There's some old bugzilla mentions asking for this feature, but it 
>seems the response was "write a plugin". Is there a plugin available 
>for this or how would one go about writing one?



-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Despite the cost of living, have you noticed how popular it remains?

Re: Disabling autolearn on given rule

Posted by Bill Cole <sa...@billmail.scconsult.com>.

On 2021-09-22 at 05:19:48 UTC-0400 (Wed, 22 Sep 2021 11:19:48 +0200)
Bert Van de Poel <be...@ulyssis.org>
is rumored to have said:

> I for one have no idea how I would submit a fix to SA once I've 
> written it, to give a concrete example. I'm guessing I just paste the 
> patch to a Bugzilla comment and hope someone merges it?

Actual attachments of patch files to a bug report is vastly preferred 
over pasting into a BZ comment. There is some ASF overhead, as anything 
significant requires a contributor to submit a standard "Individual 
Contributor License Agreement" to the ASF Secretary, which takes all of 
about 15 minutes. Contributions of larger functional enhancements that 
are not just to address specific bugs can also be discussed and 
submitted via the dev list, which is entirely open to the public just 
like this list. Anyone making ongoing contributions (code or otherwise) 
is likely to be invited to become a committer. We work on a 'commit then 
review' model except when in the last stage of release prep, so if you 
don't watch the commit stream you won't see much of the activity that 
isn't discussed actively.

In my opinion, the low pace of activity in the SA project is organic. SA 
is mature software whose core code has been "good enough" for widespread 
use for a long time. As a result there is not a lot of quick-hit 
development work to be done on it. There's not a lot of places for 
people to get started working on the SA code where one can see 
meaningful improvement in a short time, outside of rule development. 
Henrik is by far the most active member of the project as far as 
non-rule code contributions in the recent past, but he is not alone. 
John is doing rule commits daily. There are about a half-dozen 
committers who have made commits in 2021. SA is (and always has been) a 
*community* project without a major corporate backer. As such, it is 
fully dependent on the capacity of *the community* to maintain it. 
Everyone reading this is potentially part of that capacity. Anyone who 
wants something fixed in or added to SA needs to be involved in getting 
it done, even if all that means is poking us here or opening a bug 
report and bumping it as needed if everyone ignores it. MAYBE it also 
means designing and implementing a fix. That's the nature of the 
project. No one (as far as I know) is funded specifically to work on it 
on an ongoing basis.


-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire

Re: Disabling autolearn on given rule

Posted by Bert Van de Poel <be...@ulyssis.org>.

This is complete news to me! Based on the activity on the dev list, I 
had assumed there were still 10-20 people devoting some of their time to 
developing SA. If you are the only one, that of course changes my view 
very much, and would be something worth communicating in some spot. When 
I asked about my Bayes bugs in this list a long time ago, I also got 
very mixed responses on whether my suggested solutions to the bugs I 
found through discussion on the list were actually the right ones, so I 
filed those bugs specifically to get feedback on whether my solutions 
were deemed acceptable by SA developers (assuming there was a whole team 
working on SA either in the evenings or as part of their job at a 
company that heavily uses SA). If the idea is that bugs will most 
probably never get resolved except if you write and submit patches to 
solve them, that's completely understandable if there are barely any 
developers or maintainers, but then people have to be told of course.

Maybe it would then also be a good idea to start some kind of bug review 
project, similar to how projects like Inkscape have been asking their 
community to retest *all* bugs, where members from the mailing list and 
other SA users are encouraged to go through a few bugs at a time, 
starting with the very oldest ones, to check whether they're still valid 
and otherwise close them. There are currently 373 unresolved bugs on 
Bugzilla (if that counter can be trusted; it's the same number of bugs I 
get under "my bugs", which seems suspicious), I wouldn't be surprised if 
over half of those were questions or about things that have long been 
resolved or become irrelevant. For example, I'm guessing 
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=5679 can be closed 
since if this problem had persisted, there would be a ton of reports of 
those still ongoing.

What do you think?

I would also like to point out, as a sort of PS, that while I do 
understand that Perl isn't rocket science, there is quite a barrier due 
to Perl's reputation and the decreasing number of people with experience 
in Perl. If I'm brutally honest, I would have probably already fixed 
those 4 bugs I reported myself if SA was on GitHub and written in 
Python, since I could most probably read the code more easily, and 
especially submit my changes more easily. I do understand that SA is 
like that for historic reasons, and I don't think a rewrite would be 
sensible at all, but I wouldn't underestimate how much of a deterrent 
the combination of Perl, Bugzilla, SVN and email patch submission is for 
new FOSS developers used to the newer languages and GitHub. I for one 
have no idea how I would submit a fix to SA once I've written it, to 
give a concrete example. I'm guessing I just paste the patch to a 
Bugzilla comment and hope someone merges it?

Anyway, this is way offtopic for Matt's initial issue, but probably 
still relevant since he's hoping to fix it himself.

On 22/09/2021 10:54, Henrik K wrote:
> On Wed, Sep 22, 2021 at 10:45:43AM +0200, Bert Van de Poel wrote:
>> I hope I'm not passing on too much of a negative message. It would be great
>> if someone had a look at the Bayes autolearn code. I think it would be a
>> great service to the community!
> The fact is that there really aren't any active developers around these
> days.  We are no different from any other semi-active open source project.
> I can only give so much of personal free time to "service the community".
> The community is supposed to try to take care of itself, so where are all
> the volunteers?  :-) Doing Perl is not rocket science, but getting familiar
> with SA internals can be daunting.  I can help with that, but someone needs
> to step up with decent effort.
>

Re: Disabling autolearn on given rule

Posted by Henrik K <he...@hege.li>.

On Wed, Sep 22, 2021 at 10:45:43AM +0200, Bert Van de Poel wrote:
>
> I hope I'm not passing on too much of a negative message. It would be great
> if someone had a look at the Bayes autolearn code. I think it would be a
> great service to the community!

The fact is that there really aren't any active developers around these
days.  We are no different from any other semi-active open source project. 
I can only give so much of personal free time to "service the community". 
The community is supposed to try to take care of itself, so where are all
the volunteers?  :-) Doing Perl is not rocket science, but getting familiar
with SA internals can be daunting.  I can help with that, but someone needs
to step up with decent effort.

Re: Disabling autolearn on given rule

Posted by Bert Van de Poel <be...@ulyssis.org>.

I think having a look at the code itself is a good idea. I'm not sure if 
it's up-to-date but you can find some information on 
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/DevelopmentStuff

I've found that just reporting issues on SA's bugzilla is completely 
useless since it's just used as a fancy interface to display email 
conversations of the development list. Newly reported bugs or issues 
often go ignored by email and their status is never changed since no one 
uses the interface to manage bugs. This means that Bugzilla is filled to 
the brim with hundreds of bugs marked as new, of which some are actual 
bugs and large parts are just questions or fixed problems that were 
never closed. Bugzilla is also very buggy, for example when I press "my 
bugs", I get a list of 373 bugs, some predating the existence of my 
account, and obviously I didn't take part in the discussion of almost 
all of them. So keep in mind that Bugzilla can be untrustworthy and that 
the dev mailing list mentioned on 
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/mailinglists is 
connected to that.

If you're planning to work on the Bayes plugin, I can tell you there are 
several problems with it I've reported in the past that have gone ignored:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7904
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7905
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7906
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7907
I assume many others have also reported valid bugs, but they can be hard 
to find between the many questions that have been asked on 
https://bz.apache.org/SpamAssassin/buglist.cgi?quicksearch=bayes&list_id=34478 
and I'm also not too sure we can trust the search functionality.

I hope I'm not passing on too much of a negative message. It would be 
great if someone had a look at the Bayes autolearn code. I think it 
would be a great service to the community!

Bert

On 22/09/2021 03:29, Matt Corallo wrote:
>
>
> On 9/21/21 18:01, Loren Wilton wrote:
>>> None of these seem to accomplish disabling learning for a specific rule
>>
>> I think the problem is that I believe Bayes works off of the total 
>> score, and probably only sees rule names as more tokens, if it sees 
>> them at all. If it indeed works off the total score, about all you 
>> can do is somehow tweak that score for a given rule or rule combination.
>
> Right, I expected roughly as much from the docs I could find. Two 
> things, then:
>
> (1) maybe time to revisit the old discussions of providing this as a 
> default feature?,
> (2) where would I go to look at building a plugin for this? Ideally 
> something that ends up upstream, but though I can write code, I know 
> no perl :).
>
> Matt

Re: Disabling autolearn on given rule

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

> On 9/22/2021 8:11 AM, Kevin A. McGrail wrote:
>>So I'd recommend a different take.  Autolearn is an abomination we 
>>never should have published.  It is, in effect, a switch to allow an 
>>inherent bias in the modelling to grow and continue.

On 22.09.21 10:39, Jared Hall wrote:
>Agreed, predictable Garbage Out (FP) becomes Cascading Garbage Out.

>>Disable autolearn, wipe your Bayes store, and manually train from 
>>hand classified ham and spam.

>1000% Correct, IMO.  If you must run Bayes, train it once and leave it 
>be.  Repeat as needed.

I noticed a few times that repeated spam finally gets trained and hits BAYES_99.

the main problem is lack of safe rules with negative scores.

of course, nothing defeats manual training.
-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
(R)etry, (A)bort, (C)ancer

Re: Disabling autolearn on given rule

Posted by Jared Hall <ja...@jaredsec.com>.

On 9/22/2021 8:11 AM, Kevin A. McGrail wrote:
> Morning all,
>
> So I'd recommend a different take.  Autolearn is an abomination we 
> never should have published.  It is, in effect, a switch to allow an 
> inherent bias in the modelling to grow and continue.
>

Agreed, predictable Garbage Out (FP) becomes Cascading Garbage Out.

> Disable autolearn, wipe your Bayes store, and manually train from hand 
> classified ham and spam.

1000% Correct, IMO.  If you must run Bayes, train it once and leave it 
be.  Repeat as needed.

> Regards, KAM
>
-- Jared Hall


Re: Disabling autolearn on given rule

Posted by Benny Pedersen <me...@junc.eu>.

On 2021-09-22 14:11, Kevin A. McGrail wrote:
> Morning all,
> 
> So I'd recommend a different take.  Autolearn is an abomination we
> never should have published.  It is, in effect, a switch to allow an
> inherent bias in the modelling to grow and continue.
> 
> Disable autolearn, wipe your Bayes store, and manually train from hand
> classified ham and spam. Oh, and use Redis for the backend store. The
> difference is usually night and day.

tflag nice

should not be used on negative scores

if that rule is part of the problem with too much autolearn :/

we all have to live with badness sometimes, but i have posted how to 
reduce madness, users can listen or come up with another solution other 
than disabling autolearn ?

it's 2021 and we still need things solved in spamassassin, or was it 
mimedefang ? :=)

i don't agree that redis is better than postgresql btw

remember spamassassin is open source and that means anyone can do what 
they like with it, perfect bummer :)

Re: Disabling autolearn on given rule

Posted by "Kevin A. McGrail" <km...@apache.org>.

Morning all,

So I'd recommend a different take.  Autolearn is an abomination we never
should have published.  It is, in effect, a switch to allow an inherent bias
in the modelling to grow and continue.

Disable autolearn, wipe your Bayes store, and manually train from hand
classified ham and spam. Oh, and use Redis for the backend store. The
difference is usually night and day.
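
A sketch of what that could look like in local.cf, assuming the stock
Bayes options and the Redis store module that ships with SpamAssassin 3.4
and later. The server address, database number and sa-learn paths are
placeholders, not values from this thread.

```
# local.cf sketch (Redis address, database number and paths are placeholders)

# stop feeding Bayes automatically
bayes_auto_learn 0

# keep Bayes tokens in Redis instead of the default DBM store
bayes_store_module Mail::SpamAssassin::BayesStore::Redis
bayes_sql_dsn      server=127.0.0.1:6379;database=2

# afterwards, wipe and retrain by hand, for example:
#   sa-learn --clear
#   sa-learn --ham  /path/to/hand-sorted/ham
#   sa-learn --spam /path/to/hand-sorted/spam
```

The retraining step only works as well as the hand-sorted corpus behind
it, so it is worth keeping that corpus around for later retraining.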

Regards, KAM

On Wed, Sep 22, 2021, 06:18 Martin Gregorie <ma...@gregorie.org> wrote:

> On Tue, 2021-09-21 at 18:57 -0700, Loren Wilton wrote:
> >
> > Well, from the few I've seen, they all seem to have a relatively
> > constant structure. Someone pointed you to a plugin that is at least
> > dealing in this general area, that might be a good starting point,
> > barring anyone else having a better suggestion.
> >
> > While I wrote a little Perl a decade ago I've forgotten many of the
> > peculiarities, but there are some good web sites out there, and there
> > is one of the animal books on the subject. Perl is a bit peculiar in
> > syntax and function compared to the C/C++ I did much of my career, but
> > I didn't have much trouble picking up enough to make some local SA
> > hacks long ago, so if you can program in most anything it probably
> > won't be too much trouble.
> >
> What Loren said. The book you need is "The Camel Book": it's an O'Reilly
> publication, "Programming Perl" by Larry Wall, Tom Christiansen & Jon
> Orwant - my copy is the 3rd edition, dated 2000, so there are probably
> more recent editions. It's well written and organised and, equally
> important, has a whole chapter on Perl regular expressions, which are
> not the same as, e.g., C or Java regexes.
>
> I also know very little Perl, but this book, together with an example SA
> plugin, was enough to let me write an SA plugin for doing lookups on a
> PostgreSQL database containing my mail archive (I use this plugin to
> whitelist mail from anywhere I've previously sent mail to).
>
> Martin

Re: Disabling autolearn on given rule

Posted by Martin Gregorie <ma...@gregorie.org>.

On Tue, 2021-09-21 at 18:57 -0700, Loren Wilton wrote:
> 
> Well, from the few I've seen, they all seem to have a relatively
> constant structure. Someone pointed you to a plugin that is at least
> dealing in this general area, that might be a good starting point,
> barring anyone else having a better suggestion.
> 
> While I wrote a little Perl a decade ago I've forgotten many of the 
> peculiarities, but there are some good web sites out there, and there
> is one of the animal books on the subject. Perl is a bit peculiar in
> syntax and function compared to the C/C++ I did much of my career, but
> I didn't have much trouble picking up enough to make some local SA
> hacks long ago, so if you can program in most anything it probably
> won't be too much trouble.
> 
What Loren said. The book you need is "The Camel Book": it's an O'Reilly
publication, "Programming Perl" by Larry Wall, Tom Christiansen & Jon
Orwant - my copy is the 3rd edition, dated 2000, so there are probably
more recent editions. It's well written and organised and, equally
important, has a whole chapter on Perl regular expressions, which are
not the same as, e.g., C or Java regexes.

I also know very little Perl, but this book, together with an example SA
plugin, was enough to let me write an SA plugin for doing lookups on a
PostgreSQL database containing my mail archive (I use this plugin to
whitelist mail from anywhere I've previously sent mail to).

Martin

Re: Disabling autolearn on given rule

Posted by Henrik K <he...@hege.li>.

On Tue, Sep 21, 2021 at 06:57:22PM -0700, Loren Wilton wrote:
>
> I guess one thing you might be able to do is implement a tflags flag of
> absolutely_no_autolearn or some such that would force-disable the autolearn
> decision if the rule had hit, but that might be something that would have to
> be put into the main SA code itself. Maybe Henrik will chime in here. This
> may be really trivial if you know where to look.

There is only "autolearn_force", though even it's not an absolute force..

https://spamassassin.apache.org/full/3.4.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

I guess one could force shortcircuiting, so no learning will be done.
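
A sketch of the shortcircuit idea, untested. It assumes the Shortcircuit
plugin is loaded and takes at face value the guess above that a
shortcircuited message is not autolearned. The priority value is an
arbitrary illustration.

```
# local.cf sketch (untested; priority value is an arbitrary example)
# assumes a .pre file already has:
#   loadplugin Mail::SpamAssassin::Plugin::Shortcircuit

# stop scanning as soon as MAILING_LIST_MULTI hits, keeping its normal score
shortcircuit MAILING_LIST_MULTI on

# evaluate the rule early so the shortcircuit can actually skip other rules
priority MAILING_LIST_MULTI -500
```

The cost is that list mail then skips most other rules, so spam arriving
via a list gets even less scrutiny. Checking the autolearn= field in
X-Spam-Status on a few list messages would show whether learning was
really suppressed.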

Re: Disabling autolearn on given rule

Posted by Loren Wilton <lw...@earthlink.net>.

> (2) where would I go to look at building a plugin for this? Ideally 
> something that ends up upstream, but though I can write code, I know no 
> perl :).

Well, from the few I've seen, they all seem to have a relatively constant 
structure. Someone pointed you to a plugin that is at least dealing in this 
general area, that might be a good starting point, barring anyone else 
having a better suggestion.

While I wrote a little Perl a decade ago I've forgotten many of the 
peculiarities, but there are some good web sites out there, and there is one 
of the animal books on the subject. Perl is a bit peculiar in syntax and 
function compared to the C/C++ I did much of my career, but I didn't have 
much trouble picking up enough to make some local SA hacks long ago, so if 
you can program in most anything it probably won't be too much trouble.

I don't recall if Bayes itself is called from a plugin or from the main SA 
code, but I'm pretty sure it is only called if an internal 'autolearn' token 
is true for the message. If you make a plugin that runs late in the rule 
evaluation it should be able to look at the score and rule hits and items in 
the message header and body and decide if it wants to turn off the autolearn 
flag for the message. Hopefully there isn't something in main SA code that 
determines the value of this flag after all of the rules have run.

I guess one thing you might be able to do is implement a tflags flag of 
absolutely_no_autolearn or some such that would force-disable the autolearn 
decision if the rule had hit, but that might be something that would have to 
be put into the main SA code itself. Maybe Henrik will chime in here. This 
may be really trivial if you know where to look.

        Loren


---
This email has been checked for viruses by AVG.
https://www.avg.com

Re: Disabling autolearn on given rule

Posted by Matt Corallo <sa...@mattcorallo.com>.

On 9/21/21 18:01, Loren Wilton wrote:
>> None of these seem to accomplish disabling learning for a specific rule
> 
> I think the problem is that I believe Bayes works off of the total score, and probably only sees 
> rule names as more tokens, if it sees them at all. If it indeed works off the total score, about all 
> you can do is somehow tweak that score for a given rule or rule combination.

Right, I expected roughly as much from the docs I could find. Two things, then:

(1) maybe time to revisit the old discussions of providing this as a default feature?,
(2) where would I go to look at building a plugin for this? Ideally something that ends up upstream, 
but though I can write code, I know no perl :).

Matt

Re: Disabling autolearn on given rule

Posted by Loren Wilton <lw...@earthlink.net>.

> None of these seem to accomplish disabling learning for a specific rule

I think the problem is that I believe Bayes works off of the total score, 
and probably only sees rule names as more tokens, if it sees them at all. If 
it indeed works off the total score, about all you can do is somehow tweak 
that score for a given rule or rule combination.

        Loren



Re: Disabling autolearn on given rule

Posted by Matt Corallo <sa...@mattcorallo.com>.

On 9/21/21 15:53, Benny Pedersen wrote:
> On 2021-09-21 22:11, Matt Corallo wrote:
> 
>> "tflags MAILING_LIST_MULTI noautolearn" doesn't seem like quite what I
>> want, it just reduces the score used to decide whether to learn.
>> There's some old bugzilla mentions asking for this feature, but it
>> seems the response was "write a plugin". Is there a plugin available
>> for this or how would one go about writing one?
> 
> https://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

None of these seem to accomplish disabling learning for a specific rule - I don't particularly want 
to change the bayes learn thresholds, as I think they seem to work quite well for non-list mail. For 
list mail, I'd prefer to disable the bayes learning entirely (though I suppose somehow magically 
forcing it in between the bayes thresholds would work too, if there were a way to do that without 
impacting non-bayes scoring).

Matt

Re: Disabling autolearn on given rule

Posted by Benny Pedersen <me...@junc.eu>.

On 2021-09-21 22:11, Matt Corallo wrote:

> "tflags MAILING_LIST_MULTI noautolearn" doesn't seem like quite what I
> want, it just reduces the score used to decide whether to learn.
> There's some old bugzilla mentions asking for this feature, but it
> seems the response was "write a plugin". Is there a plugin available
> for this or how would one go about writing one?

https://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Plugin_AutoLearnThreshold.html

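set the ham learn threshold lower than the default with that plugin

# bayes_auto_learn_threshold_nonspam n.nn (default: 0.1)
# only mail whose autolearn score is below -10 gets learned as ham
bayes_auto_learn_threshold_nonspam -10

# bayes_auto_learn_threshold_spam n.nn (default: 12.0)
# mail whose autolearn score is above 7.5 can get learned as spam
bayes_auto_learn_threshold_spam 7.5

# bayes_auto_learn_on_error (0 | 1) (default: 0)
# only autolearn when bayes disagreed with the autolearn decision
bayes_auto_learn_on_error 1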