You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@spamassassin.apache.org by bu...@spamassassin.apache.org on 2022/05/07 17:23:40 UTC

[Bug 7987] New: DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

            Bug ID: 7987
           Summary: DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of
                    rule_pending and rule_ready
           Product: Spamassassin
           Version: 4.0.0
          Hardware: All
                OS: All
            Status: NEW
          Severity: minor
          Priority: P2
         Component: Plugins
          Assignee: dev@spamassassin.apache.org
          Reporter: sa-mst@lrz.de
  Target Milestone: Undefined

Rules that perform asynchronous DNS lookups via bgsend_and_start_lookup do not
need to be manually flagged via rule_pending and rule_read, as everything is
done automatically via additions and deletions to pending_rules. The end game,
the call to finish_meta_tests, is prepared by the call to
harvest_dnsbl_queries, which in turn calls abort_remaining_lookups. Therefore
the calls of rule_pending and rule_ready can and should be deleted.

If rule_pending/rule_ready would be needed, then all registered eval functions
of DNSEval.pm would need these calls, which is at the moment not the case.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Attachment #5779|0                           |1
        is obsolete|                            |

--- Comment #10 from Henrik Krohns <ap...@hege.li> ---
Created attachment 5784
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5784&action=edit
Eval functions returning undef for async

As discussed on list, here's a patch to change all plugins async eval functions
to return undef.

Care needs to be taken to check all bgsend* return values, so there is no
unnecessary async marking.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #12 from Henrik Krohns <ap...@hege.li> ---
Of course we can vote if got_miss() etc would be better name for rule_ready().
And documentation needs some more polishing.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #15 from Henrik Krohns <ap...@hege.li> ---
(In reply to Henrik Krohns from comment #12)
> Of course we can vote if got_miss() etc would be better name for
> rule_ready(). And documentation needs some more polishing.

What I don't like about got_miss(), is that it doesn't necessarily mark a
"miss". It's possible to call got_hit() after it, especially in
multiple-lookups-for-one-rule scenario, where it would be complex for the
plugin itself to track all lookups. Much simpler to call rule_ready on every
callback. Though xurrent logic is not perfect either.. if the last lookup
timeouts, it ends up as unrun rule.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
Bug 7987 depends on bug 7735, which changed state.

Bug 7735 Summary: Meta rules need to handle missing/unrun dependencies
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #3 from Michael Storz <sa...@lrz.de> ---
Well, it took me more than a day of carefully reviewing the call diagram and
implementation of bgsend_and_start_lookup to come to this conclusion. My first
thought was that maybe some tests are not working properly and therefore
overlooked that some calls to rule_pending are missing. Only afterwards did I
realize that it was the other way around.

We both know how complicated all the rule processing is. Therefore, from my
point of view, it is extremely important to remove everything superfluous so
that we have a chance to understand how the different algorithms work. This has
nothing to do with cosmetic details, but is essential.

Next, I'm going to look at asynchronous tag processing. After your second
change to FromNameSpoof.pm, I'm pretty sure there's a general processing error
here. But I'll have to check that out in more detail.

The use of the rule_pending and rule_ready routines you introduced should be
clearly described and also implemented accordingly. From the current
description I understand that the routines are to be used in principle with all
asynchronous eval functions, so that the evaluation of the meta routines works.
I am of the opinion that this description is wrong and does not correspond to
your implementation. 

If the processing of the asynchronous calls, i.e. the queueing, polling and
dequeueing is done centrally, then the handling must be done automatically. The
end handling should then also be done centrally. This applies to both the
central mechanism via bgsend_and_start_lookup and via action_depends_on_tags.
For bgsend_and_start_lookup, I'm pretty sure that's how you implemented it. And
that's a big step forward.

rule_pending/rule_ready should only become necessary if the processing is
decentralized and the periodic queries are each triggered via check_tick. This
is the case with DCC.pm, Pyzor.pm and Razor2.pm. Here it also makes sense to
have the end handling done by the respective plugin and triggered by
check_cleanup. And also here, I think, you have implemented it exactly like
that.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #13 from Henrik Krohns <ap...@hege.li> ---
What comes to: "a better name for tests_already_hit would be
tests_already_finished or tests_finished"

.. I would say renaming ancient stuff for cosmetic purposes only is always a
bit sus. But it is more or less internal variable, atleast Google doesn't find
any third part plugins using it ..

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #8 from Henrik Krohns <ap...@hege.li> ---
Should be a bit better now.

- fix body rules considered unrun when using sa-compile
- fix check_rbl_sub rules considered unrun and other DNSEval cleanups
- improve rule_pending/rule_ready/got_hit() logic
- rename $pms->get_pending_lookups to get_async_pending_rules
- other minor async cleanups
- test and documentation improvements

Committed revision 1900849.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #18 from Henrik Krohns <ap...@hege.li> ---
(In reply to Michael Storz from comment #17)
> If that
> could happen the logic of the eval rule is too complex and should be changed.

A DNSBL can do this:
- It parses for example 4 IP addresses from Received
- For each of the IP it starts a separate bgsend_and_start_lookup, expecting 4
separate callbacks

Obviously if got_hit() is called, we know rule is ready regardless of any
pending lookups.

What things should the callback do if there is no hit?

Current logic:

sub callback {
  rule_ready(thisrule)
  foreach (answer) {
    if (match) got_hit(thisrule)
  }
}

Using got_miss() would probably be:

sub callback {
  foreach (answer) {
    if (match) got_hit(thisrule)
  }
  if (!tests_already_hit(thisrule) && !get_async_pending_rules(thisrule))
    got_miss(thisrule)
}

I guess we could brind back the deprecated Dns/is_rule_complete function, which
can do the checks. So would this be the most elegant flow?

sub callback {
  foreach (answer) {
    if (match) got_hit(thisrule)
  }
  if (is_rule_complete(thisrule)) got_miss(thisrule)
}

-- 
You are receiving this mail because:
You are the assignee for the bug.

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-15 21:19, schrieb Henrik K:
> On Sun, May 15, 2022 at 06:58:27PM +0200, Michael Storz wrote:
>> 
>>    THIS IS WRONG.
>> 
> 
> LOL.
> 
> You could have just summarized it in one or two lines without all the
> "error" hyperbole, and I would have understood it much faster.
> 
> Summary:
> 
> - Modify eval functions to return undef if they are async, after that
> rule_pending() and {tests_pending} are redundant and can be removed,
> tests_already_hit is enough to check status, as rule_ready() does not 
> mark
> it until last DNS lookup is finished.
> 
> Quite elegant solution.  Will post patch to Bug 7987 soon for 
> verifying.

Ok, ok, I will try to be more brief in the future ;-)

Michael

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Henrik K <he...@hege.li>.
On Sun, May 15, 2022 at 06:58:27PM +0200, Michael Storz wrote:
> 
>    THIS IS WRONG.
> 

LOL.

You could have just summarized it in one or two lines without all the
"error" hyperbole, and I would have understood it much faster.

Summary:

- Modify eval functions to return undef if they are async, after that
rule_pending() and {tests_pending} are redundant and can be removed,
tests_already_hit is enough to check status, as rule_ready() does not mark
it until last DNS lookup is finished.

Quite elegant solution.  Will post patch to Bug 7987 soon for verifying.


Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-14 20:01, schrieb Henrik K:
> On Sat, May 14, 2022 at 07:15:45PM +0200, Michael Storz wrote:
>> 
>> Despite all these changes, however, I still see room for improvement. 
>> At the
>> moment I'm not very happy about how the do_meta_tests subroutine is
>> implemented. It seems to work now, but the query for the dependency of 
>> the
>> meta rules looks too complicated to me:
>> 
>> foreach my $r (@{$md->{$rulename}|[]}) {
>>   next RULE unless exists $h->{$r} && !$tp->{$r} && !$pl{$r};
>> }
>> 
>> Instead of three dependencies, there should really only be one query 
>> at this
>> point for "is the rule ready or not". If we knew exactly the state of 
>> a rule
>> at each point in time, then it would also be possible to move from a 
>> brute
>> force algorithm to a deterministic algorithm, where a rule that 
>> becomes
>> ready automatically sets all meta rules to ready if that rule was the 
>> last
>> dependency analogous to the algorithm for tags. do_meta_tests would 
>> then
>> simply process a queue of ready meta rules. If a new meta rule is 
>> ready it
>> will be added to the end of the queue. If the queue is empty, all 
>> rules of a
>> priority class are processed.
> 
> I heartily agree with everything.  The problem is that I don't have any
> formal programming training or knowledge of fancy algorithms or
> deterministic stuff.  Frankly I'm not even interested in most of that 
> stuff,
> I just script and try to make things do stuff correctly with some 
> common
> sense and logic, would be happy if someone showed how to make it 
> simpler.
> 
> Glad I was able to make it work with maybe 10% performance loss, but 
> it's
> balanced back by dumb metas not needing to wait for async stuff for no
> reason etc.

I'm glad you agree with my analysis. Then let's have one last try. I 
think I found the logical error in the evaluation that makes everything 
so complicated. The error already exists before 3.4.6.

Let's recap:

At the beginning of the evaluation, ALL rules are in the pending state 
by definition. When rules are run, they change from state pending to 
state finished. How is this state change accomplished?

- The normal case is the indirect change for the synchronous standard 
rules. After they are run, they return either 0, which means it was a 
miss, or a positive integer, which means it was a hit.

- Then there is the case of synchronous rules with explicit state 
change. These rules call got_hit for a hit and should call got_miss if 
it was a miss. got_miss is similar to rule_ready, it is just the 
statement

   $self->{tests_already_hit}->{$rule} ||= 0;

   BTW, a better name for tests_already_hit would be 
tests_already_finished or tests_finished.

- And then we have the asynchronous rules, which must be divided into 
two parts: the synchronous part and the asynchronous part. The 
synchronous part is the call of the eval function, which in turn submits 
the asynchronous part to a queue function. At the end, the synchronous 
part is supposed to return a result. And here comes the error: it 
returns the value 0, which means that the rule is finished and the 
result is a miss.

    THIS IS WRONG.

The rule is not yet finished and the result is not yet known. Therefore, 
it must NOT return 0, it must return undef. This means that it is still 
pending (no state change) and the asynchronous part will decide if it is 
a hit or a miss.

Now, let's test a correction:

- First, add a debug statement to the foreach loop of sub do_meta_tests 
to see the values of the three conditions. The idea is to find out how 
many and which rules depend not only on $h->{$deprule}, but also on 
$tp->{$deprule} and $pl{$deprule}. If there are none left, we can 
eliminate the two conditions.

   The states of the three conditions are

   result $h $tp $pl
     OK   0   0   0
     OK   0   1   0
     OK   0   0   1
     OK   0   1   1
     OK   1   0   0
     NO   1   1   0
     NO   1   0   1
     NO   1   1   1

   NO is the state we do not want to have: there is a hit/the rule is 
finished, but $tp and/or $pl are true which means the rule has not yet 
finished.

     # Meta is not ready if some dependency has not run yet
     if (exists $h->{$deprule} && ($tp->{$deprule} || $pl{$deprule})) {
       dbg("rules: NO: \$h exists, \tp=" . ($tp->{$deprule} ? 'true' : 
'false') . "\$pl=" . ($pl->{$deprule} ? 'true' : 'false'));
     } else {
       dbg("rules: OK: \$h" . (exists $h->{$deprule} ? '' : ' not') . " 
exists, \tp=" .
         ($tp->{$deprule} ? 'true' : 'false') . "\$pl=" . 
($pl->{$deprule} ? 'true' : 'false'));
     }
     foreach my $deprule (@{$md->{$rulename}||[]}) {
       if (!exists $h->{$deprule} || $tp->{$deprule} || $pl{$deprule}) {
         next RULE;
       }
     }

   Then run a test and see how many NO we get.

- Second, change sub run_eval_tests

in
     # Make sure rule is marked ready for meta rules using $hitsptr
     $evalstr .= '
     if ($scoresptr->{q{'.$rulename.'}}) {
       $hitsptr->{q{'.$rulename.'}} ||= 0;
       $rulename = q#'.$rulename.'#;

delete $hitsptr->{q{'.$rulename.'}} ||= 0;

change

     $evalstr .= '
       if ($result) {
         $self->got_hit($rulename, $prepend2desc, ruletype => "eval", 
value => $result);
         '.$dbgstr.'
       }
     }

to

     $evalstr .= '
       if (defined $result) {
         if ($result) {
           $self->got_hit($rulename, $prepend2desc, ruletype => "eval", 
value => $result);
           '.$dbgstr.'
         } else { # got a miss, if not set by an explizit got_hit
           $hitsptr->{q{'.$rulename.'}} ||= 0;
         }
       }


- Third, change all registered eval functions in DNSEval.pm to return 
undef instead of 0

- Then make a new test and check if the eval functions have disappeared 
from the debug statements with NO. If this is the case, the error is 
correctly identified.

Regards,
Michael

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Henrik K <he...@hege.li>.
On Sat, May 14, 2022 at 07:15:45PM +0200, Michael Storz wrote:
>
> Despite all these changes, however, I still see room for improvement. At the
> moment I'm not very happy about how the do_meta_tests subroutine is
> implemented. It seems to work now, but the query for the dependency of the
> meta rules looks too complicated to me:
> 
> foreach my $r (@{$md->{$rulename}|[]}) {
>   next RULE unless exists $h->{$r} && !$tp->{$r} && !$pl{$r};
> }
> 
> Instead of three dependencies, there should really only be one query at this
> point for "is the rule ready or not". If we knew exactly the state of a rule
> at each point in time, then it would also be possible to move from a brute
> force algorithm to a deterministic algorithm, where a rule that becomes
> ready automatically sets all meta rules to ready if that rule was the last
> dependency analogous to the algorithm for tags. do_meta_tests would then
> simply process a queue of ready meta rules. If a new meta rule is ready it
> will be added to the end of the queue. If the queue is empty, all rules of a
> priority class are processed.

I heartily agree with everything.  The problem is that I don't have any
formal programming training or knowledge of fancy algorithms or
deterministic stuff.  Frankly I'm not even interested in most of that stuff,
I just script and try to make things do stuff correctly with some common
sense and logic, would be happy if someone showed how to make it simpler.

Glad I was able to make it work with maybe 10% performance loss, but it's
balanced back by dumb metas not needing to wait for async stuff for no
reason etc.


Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-13 08:37, schrieb bugzilla-daemon@spamassassin.apache.org:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
> 
> Henrik Krohns <ap...@hege.li> changed:
> 
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>              Status|NEW                         |RESOLVED
>          Resolution|---                         |FIXED
> 
> --- Comment #9 from Henrik Krohns <ap...@hege.li> ---
> Feel free to discuss here or on dev-list if something is still unclear.

First of all, I would like to thank Henrik for all the work he has done 
on SpamAssassin. The further I get with my code review, the more I see 
all the places where he has made minor or major changes that have made 
the code faster, more readable and thus more maintainable.

Despite all these changes, however, I still see room for improvement. At 
the moment I'm not very happy about how the do_meta_tests subroutine is 
implemented. It seems to work now, but the query for the dependency of 
the meta rules looks too complicated to me:

foreach my $r (@{$md->{$rulename}|[]}) {
   next RULE unless exists $h->{$r} && !$tp->{$r} && !$pl{$r};
}

Instead of three dependencies, there should really only be one query at 
this point for "is the rule ready or not". If we knew exactly the state 
of a rule at each point in time, then it would also be possible to move 
from a brute force algorithm to a deterministic algorithm, where a rule 
that becomes ready automatically sets all meta rules to ready if that 
rule was the last dependency analogous to the algorithm for tags. 
do_meta_tests would then simply process a queue of ready meta rules. If 
a new meta rule is ready it will be added to the end of the queue. If 
the queue is empty, all rules of a priority class are processed.

The next possible step would be the implementation of short-circuit 
behavior of && and || analogous to Perl itself. E.g. the rule

meta BITCOIN_SPAM_10 __BITCOIN_ID && ( HTML_IMAGE_ONLY_04 || 
HTML_IMAGE_ONLY_08 )

would immediately evaluate to false if __BITCOIN_ID is false. 
HTML_IMAGE_ONLY_04 and HTML_IMAGE_ONLY_08 would then no longer need to 
be evaluated.

Recently the question was asked why Check.pm is a plugin if it is not 
optional. Check.pm is a plugin so that you can implement more than one 
check plugin. Currently SpamAssassin still assumes that filtering 
happens postqueue, where you can respond to the different wishes of 
individual users. If you use SpamAssassin in a prequeue filtering, then 
only the decision rejection or acceptance is possible for ALL 
recipients. A consideration of different recipient wishes is not 
possible.

Due to the possibility to switch between admin and user rules per 
evaluation of an email, one accepts that the evaluation of the rules 
must be rebuilt for each email. What we need instead is the possibility 
to build the rules once at the start of the daemon and then use it to 
evaluate any number of emails.

You can go one step further and first run a SpamAssassin instance with a 
lightweight ruleset in prequeue mode to be able to decide as quickly as 
possible whether to reject an email and then run another instance with a 
heavyweight ruleset that makes a more precise distinction between 
marking the email as spam or ham. The second instance can then also 
evaluate user rules. There should be a feedback loop between the 
instances so that the first instance can use the results of the second 
instance, e.g. to quickly reject emails via a local blocklist using the 
HashBL.pm plugin. Currently we use a feedback loop between the 
SpamAssassin instance in prequeue mode and Postfix to (temporarily) 
reject emails for 24 hours.

In any case, SpamAssassin should arrive in 2022 and offer a highly 
optimized version of the analysis for MTAs that process millions of 
emails per day.

These are my general remarks about the evaluation of rules. An email 
with some minor cosmetic changes will follow.

Michael


[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |FIXED

--- Comment #9 from Henrik Krohns <ap...@hege.li> ---
Feel free to discuss here or on dev-list if something is still unclear.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #17 from Michael Storz <sa...@lrz.de> ---
(In reply to Henrik Krohns from comment #15)
> (In reply to Henrik Krohns from comment #12)
> > Of course we can vote if got_miss() etc would be better name for
> > rule_ready(). And documentation needs some more polishing.
> 
> What I don't like about got_miss(), is that it doesn't necessarily mark a
> "miss". It's possible to call got_hit() after it, especially in
> multiple-lookups-for-one-rule scenario, where it would be complex for the
> plugin itself to track all lookups. Much simpler to call rule_ready on every
> callback. Though xurrent logic is not perfect either.. if the last lookup
> timeouts, it ends up as unrun rule.

I think it is the responsibility of the plugin to decide if a rule is a hit or
a miss. It would be really bad, if it called got_miss first, which means I'm
done and it is a miss, and then later to decide to call got_hit. If that could
happen the logic of the eval rule is too complex and should be changed.

However, if you think rule_ready is a tool which could support the decision
making of the rule, then it is ok. I haven't thought much about that scenario.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #19 from Henrik Krohns <ap...@hege.li> ---
(In reply to Henrik Krohns from comment #18)
>
> I guess we could brind back the deprecated Dns/is_rule_complete function,
> which can do the checks. So would this be the most elegant flow?
> 
> sub callback {
>   foreach (answer) {
>     if (match) got_hit(thisrule)
>   }
>   if (is_rule_complete(thisrule)) got_miss(thisrule)
> }

Even still, that is confusing naming to me, since is_rule_complete() or
got_miss() must ignore any previous got_hit(). You can't be sure that all
developers make 100% correct logic in plugins. In that sense rule_ready()
atleast doesn't convey anything about rule hitting or missing. Rule missing
should be assumed as a default.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Michael Storz <sa...@lrz.de> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |sa-mst@lrz.de

--- Comment #1 from Michael Storz <sa...@lrz.de> ---
Created attachment 5779
  --> https://bz.apache.org/SpamAssassin/attachment.cgi?id=5779&action=edit
deletion of rule_pending, rule_ready

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
Bug 7987 depends on bug 7735, which changed state.

Bug 7735 Summary: Meta rules need to handle missing/unrun dependencies
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #6 from Henrik Krohns <ap...@hege.li> ---
Added mention of bgsend to rule_pending() docs, Revision 1900680

And then I found some bugs in the logic..

For example when URIDNSBL is launching and finishing a rule fast, and only
after that eval check_uridnsbl is called, it marks the rule again as
rule_pending(), and the rule will end up considered unrun.. duh..

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #25 from Henrik Krohns <ap...@hege.li> ---
Looking at my old plugin and popular things like SH.pm, it's clear the latest
change will heavily break backwards compatibility.

So here's a small fix that mostly mitigates things:

- Use rule_ready() in run_eval_tests to allow async even for "return 0"
- Bring back async pending check in do_meta_tests

Committed revision 1900974.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Depends on|                            |7735

--- Comment #30 from Henrik Krohns <ap...@hege.li> ---
- Fix eval functions returning unintended "undef"

Committed revision 1901403.


Referenced Bugs:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735
[Bug 7735] Meta rules need to handle missing/unrun dependencies
-- 
You are receiving this mail because:
You are the assignee for the bug.

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-08 18:13, schrieb Henrik K:
> On Sun, May 08, 2022 at 05:33:24PM +0200, Michael Storz wrote:
>> Am 2022-05-08 06:43, schrieb bugzilla-daemon@spamassassin.apache.org:
>> > https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
>> >
>> > --- Comment #4 from Henrik Krohns <ap...@hege.li> ---
>> > I mostly agree with everything, great to have extra eyeballs. Can you
>> > comment
>> > if my previous list comment and Revision 1900667 additions cleared any
>> > things
>> > up for you and changes anything?
>> >
>> 
>> Unfortunately not, I'm still struggling to understand the various 
>> aspects of
>> processing. However, I understand, that a call of rule_ready for a 
>> sync rule
>> makes sense. It does not need a call to rule_pending in this case.
>> 
>> Example for a problem: a rule_pending must be followed by a rule_ready 
>> or a
>> got_hit. But how a got_hit should work instead of a rule_ready is a 
>> mystery
>> to me. In sub do_meta_tests the query whether a rule dependency exists 
>> is
>> defined as:
> 
> I noticed that got_hit doesn't do everything that it can.  But I also 
> found
> many other cases that need fixing and better testing.  Spent 8 hours 
> today
> trying different things and planning how to do things, trying to make 
> bgsend
> automatically mark things ready is hard, as it's so complex in itself..
> this is going to take few days..

Wow, maybe it would help when you try to explain what is going wrong? I 
have made the experience that I have often already found the solution 
when I tried to describe a problem to someone else, even though they 
didn't have much of an idea about the problem.

The only other problem I found with bgsend_and_start_lookup is the 
handling of end cases. When abort_remaining_lookups is called, it 
triggers a callback for the lookups that are still waiting. They in turn 
can register lookups with bgsend_and_start_lookup again. Therefore, 
bgsend_and_start_lookup must check whether it is in the aborting state, 
and instead of queuing the lookup, it should immediately abort the 
lookup in the same way as abort_remaining_lookups. However, it should be 
prepared to break a loop if the plugin misbehaves and schedules lookup 
after lookup.

Michael


Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Henrik K <he...@hege.li>.
On Sun, May 08, 2022 at 05:33:24PM +0200, Michael Storz wrote:
> Am 2022-05-08 06:43, schrieb bugzilla-daemon@spamassassin.apache.org:
> > https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
> > 
> > --- Comment #4 from Henrik Krohns <ap...@hege.li> ---
> > I mostly agree with everything, great to have extra eyeballs. Can you
> > comment
> > if my previous list comment and Revision 1900667 additions cleared any
> > things
> > up for you and changes anything?
> > 
> 
> Unfortunately not, I'm still struggling to understand the various aspects of
> processing. However, I understand, that a call of rule_ready for a sync rule
> makes sense. It does not need a call to rule_pending in this case.
> 
> Example for a problem: a rule_pending must be followed by a rule_ready or a
> got_hit. But how a got_hit should work instead of a rule_ready is a mystery
> to me. In sub do_meta_tests the query whether a rule dependency exists is
> defined as:

I noticed that got_hit doesn't do everything that it can.  But I also found
many other cases that need fixing and better testing.  Spent 8 hours today
trying different things and planning how to do things, trying to make bgsend
automatically mark things ready is hard, as it's so complex in itself.. 
this is going to take few days..


Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-08 06:43, schrieb bugzilla-daemon@spamassassin.apache.org:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
> 
> --- Comment #4 from Henrik Krohns <ap...@hege.li> ---
> I mostly agree with everything, great to have extra eyeballs. Can you 
> comment
> if my previous list comment and Revision 1900667 additions cleared any 
> things
> up for you and changes anything?
> 

Unfortunately not, I'm still struggling to understand the various 
aspects of processing. However, I understand, that a call of rule_ready 
for a sync rule makes sense. It does not need a call to rule_pending in 
this case.

Example for a problem: a rule_pending must be followed by a rule_ready 
or a got_hit. But how a got_hit should work instead of a rule_ready is a 
mystery to me. In sub do_meta_tests the query whether a rule dependency 
exists is defined as:

next RULE unless exists $h->{$deprule} && !$tp->{$deprule} && 
!$pl{$deprule};

(condition rearranged by me to make it easier to understand). If we have 
an asynchronous rule $deprule other than a DNS lookup, a call to 
rule_pending makes an entry in $tp. If the rule is true, the call to 
got_hit makes an entry in $h, but it does not delete the entry in $tp 
unlike rule_ready. Thus, the $deprule dependency is still present and 
the rule cannot be evaluated. Or did I miss something?

Regards,
Michael

PS: Unfortunately, I do not have a running 4.0 system. Therefore I 
cannot test it myself.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #4 from Henrik Krohns <ap...@hege.li> ---
I mostly agree with everything, great to have extra eyeballs. Can you comment
if my previous list comment and Revision 1900667 additions cleared any things
up for you and changes anything?

As a developer I can have hard time figuring out what to document about things,
so they are understandable for any third party plugin developers. I even think
SA originally went too far to "pluginize" every little detail, with it creating
it's own problems.. there's just way too many public hooks and tidbits to
understand..

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #22 from Henrik Krohns <ap...@hege.li> ---
(In reply to Henrik Krohns from comment #10)
> Created attachment 5784 [details]
> Eval functions returning undef for async
> 
> As discussed on list, here's a patch to change all plugins async eval
> functions to return undef.
> 
> Care needs to be taken to check all bgsend* return values, so there is no
> unnecessary async marking.

Committed revision 1900961.

Need to clean ups some docs later. There's also some suspicious unrun
rules/metas popping up randomly, still investigating if there's something to
fix.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Kevin A. McGrail <km...@apache.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |kmcgrail@apache.org

--- Comment #5 from Kevin A. McGrail <km...@apache.org> ---
(In reply to Henrik Krohns from comment #4)
> I
> even think SA originally went too far to "pluginize" every little detail,
> with it creating it's own problems.. there's just way too many public hooks
> and tidbits to understand..

Very fair comment.  The idea was good though because it gave us different
standards for code quality: Core code > Plugin Enabled by Default > Plugin not
Enabled by Default

And it gave the ability to create proprietary plugins which was good because an
important part of the ASF is the pro-business ASLv2 with no copyleft licensing
or common clause concepts.

Plus, I think sometimes when you are writing code, you think, this will only
get used for a year.  Not 20 and we're getting close to that.

And there are definitely things I've wanted to fix for eons. I remember after
Vixie created RBLs and we implemented them, I made a statement like DNS is a
brilliant hack but we need to replace using DNS in a year, lol.  DNS was
distributed and resilient but you've seen all the blocking code in trying to
handle lookups on DNS and timeouts.  Spa.ghet.ti. And don't get me started on
Net::DNS.

Anyway, just thought it might be good to have some of the thought process from
the olden days.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |apache@hege.li

--- Comment #2 from Henrik Krohns <ap...@hege.li> ---
I'll look into this again, but note that I spent considerable effort and
testing to make sure things worked when I implemented these meta changes. Not
everything is always as it seems at first glance and reading the sparse
documentation I made might not reveal all the intricacies. So great care should
be taken to remove something, especially if it just for "cosmetic reasons", if
they really are redundant calls, they shouldn't have ill effects..

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|FIXED                       |---
             Status|RESOLVED                    |REOPENED

--- Comment #11 from Henrik Krohns <ap...@hege.li> ---
Reopening for review

-- 
You are receiving this mail because:
You are the assignee for the bug.

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-17 17:09, schrieb bugzilla-daemon@spamassassin.apache.org:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
> 
> --- Comment #27 from Michael Storz <sa...@lrz.de> ---
> (In reply to Henrik Krohns from comment #24)
>> (In reply to Michael Storz from comment #23)
>> > Great work! Are you now ready to switch from the brute force algorithm to a
>> > deterministic algorithm? I am not sure which of the two algorithms would be
>> > faster. That needs to be tested.
>> 
>> It should not matter if I'm ready or not. If you have something, then 
>> share
>> it for everyone. :-)
> 
> Ok, let's see if the change leads to better performance or not. The 
> idea is
> really simple, no big surprise, no fancy algorithm :-)
> 
> Each meta rule needs a counter with the number of dependent rules. For 
> each
> rule, an array is created with all meta rules that depend on that rule. 
> When a
> rule has finished, got_hit/got_mis/rule_ready, it decrements the 
> counter of
> each dependent meta rule. If the counter is 0, then the meta rule is 
> moved from
> meta_pending to meta_ready queue. The meta_ready queue is then 
> processed by
> do_meta_tests.
> 
> Alternative: Instead of moving the meta rule, it is executed 
> immediately. The
> do_meta_tests subroutine then becomes redundant (what to do with the 
> reuse part
> then, I have no idea).
> 
> In the end, the leftover meta-rules must then be handled with
> finish_meta_tests.
> 
> A few code fragments for clarification:
> 
> sub compile_meta_rules
> 
>   foreach my $name (keys %{$conf->{tests}}) {
>     ...
>     $conf->{meta_dep_count}->{$name} = scalar @{$rule_deps{$name}};
>     foreach my $rulename (@{$rule_deps{$name}}) {
>       push @{$conf->{rule_dependencies}->{$rulename}}, $name;
>     }
>   }
> 
> ---------------------------------------
> 
> got_hit/got_miss/rule_ready
> 
> foreach my $meta_rule (@{$conf->{rule_dependencies}->{$rulename}) {
>   move_meta_p2r($meta_rule) unless 
> --$conf->{meta_dep_count}->{$meta_rule};
> }
> 
> foreach my $meta_rule (@{$conf->{rule_dependencies}->{$rulename}) {
>   run_meta_rule($meta_rule) unless 
> --$conf->{meta_dep_count}->{$meta_rule};
> }
> 
> run_meta_rule like do_meta_tests + delete 
> $pms->{meta_pending}->{$meta_rule}
> 
> sub do_meta_tests {
>   my ($self, $pms, $priority) = @_;
>   ...
>   return if $self->{am_compiling}; # nothing to compile here
> 
>   my $mr = $pms->{meta_ready};
>   my $mt = $pms->{conf}->{meta_tests};
>   my $h = $pms->{tests_already_hit};
> 
>   while (my $rulename = shift @$mr) {
>     # Metasubs look like ($_[1]->{$rulename}||($_[2]->{$rulename}?1:0)) 
> ...
>     my $result = $mt->{$rulename}->($pms, $h, {});
>     if ($result) {
>       dbg("rules: ran meta rule $rulename ======> got hit ($result)");
>       $pms->got_hit($rulename, '', ruletype => 'meta', value => 
> $result);
>     } else {
>       dbg("rules-all: ran meta rule $rulename, no hit") if
> $would_log_rules_all;
>       $pms->got_miss($rulename); # mark meta done
>     }
>   }
> }

Performance: The standard ruleset of SpamAssassin includes about 3,000 
rules. One third of these are meta rules. The performance of the 
evaluation of the meta rules therefore matters.

My code fragments with move_meta_p2r and run_meta_rule would therefore 
degrade performance if these were actually implemented as subroutine 
calls, since about 1,000 additional subroutine calls would then be made.

The immediate execution of meta rules by run_meta_rule makes little 
sense, since we do not short-circuit the boolean expressions of the meta 
rules. Short-circuit would only bring something, if then in consequence 
large parts of the rules would not be evaluated any more.

Therefore only move_meta_p2r comes into question. move_meta_p2r can be 
implemented as the two statements add ready + delete pending.

In do_meta_tests we then have the set of ready meta rules. Instead of 
processing them individually, which would mean executing 1,000 
subroutine calls, we can now generate subroutines that execute many meta 
rules at once, analogous to the procedure for the header tests. This 
could reduce the number of calls to a handful and would boost 
performance.

Michael

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #27 from Michael Storz <sa...@lrz.de> ---
(In reply to Henrik Krohns from comment #24)
> (In reply to Michael Storz from comment #23)
> > Great work! Are you now ready to switch from the brute force algorithm to a
> > deterministic algorithm? I am not sure which of the two algorithms would be
> > faster. That needs to be tested.
> 
> It should not matter if I'm ready or not. If you have something, then share
> it for everyone. :-)

Ok, let's see if the change leads to better performance or not. The idea is
really simple, no big surprise, no fancy algorithm :-)

Each meta rule needs a counter with the number of dependent rules. For each
rule, an array is created with all meta rules that depend on that rule. When a
rule has finished, got_hit/got_mis/rule_ready, it decrements the counter of
each dependent meta rule. If the counter is 0, then the meta rule is moved from
meta_pending to meta_ready queue. The meta_ready queue is then processed by
do_meta_tests.

Alternative: Instead of moving the meta rule, it is executed immediately. The
do_meta_tests subroutine then becomes redundant (what to do with the reuse part
then, I have no idea).

In the end, the leftover meta-rules must then be handled with
finish_meta_tests.

A few code fragments for clarification:

sub compile_meta_rules

  foreach my $name (keys %{$conf->{tests}}) {
    ...
    $conf->{meta_dep_count}->{$name} = scalar @{$rule_deps{$name}};
    foreach my $rulename (@{$rule_deps{$name}}) {
      push @{$conf->{rule_dependencies}->{$rulename}}, $name;
    }
  }

---------------------------------------

got_hit/got_miss/rule_ready

foreach my $meta_rule (@{$conf->{rule_dependencies}->{$rulename}) {
  move_meta_p2r($meta_rule) unless --$conf->{meta_dep_count}->{$meta_rule};
}

foreach my $meta_rule (@{$conf->{rule_dependencies}->{$rulename}) {
  run_meta_rule($meta_rule) unless --$conf->{meta_dep_count}->{$meta_rule};
}

run_meta_rule like do_meta_tests + delete $pms->{meta_pending}->{$meta_rule}

sub do_meta_tests {
  my ($self, $pms, $priority) = @_;
  ...
  return if $self->{am_compiling}; # nothing to compile here

  my $mr = $pms->{meta_ready};
  my $mt = $pms->{conf}->{meta_tests};
  my $h = $pms->{tests_already_hit};

  while (my $rulename = shift @$mr) {
    # Metasubs look like ($_[1]->{$rulename}||($_[2]->{$rulename}?1:0)) ...
    my $result = $mt->{$rulename}->($pms, $h, {});
    if ($result) {
      dbg("rules: ran meta rule $rulename ======> got hit ($result)");
      $pms->got_hit($rulename, '', ruletype => 'meta', value => $result);
    } else {
      dbg("rules-all: ran meta rule $rulename, no hit") if
$would_log_rules_all;
      $pms->got_miss($rulename); # mark meta done
    }
  }
}

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #26 from Henrik Krohns <ap...@hege.li> ---
(In reply to Henrik Krohns from comment #25)
> - Bring back async pending check in do_meta_tests

My brain is starting to be exhausted. It's not really needed.

Committed revision 1900984.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #14 from Michael Storz <sa...@lrz.de> ---
I remember someone renamed get_pending_lookups to get_async_pending_rules,
which is the name I would have chosen too ;-)

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #23 from Michael Storz <sa...@lrz.de> ---
Great work! Are you now ready to switch from the brute force algorithm to a
deterministic algorithm? I am not sure which of the two algorithms would be
faster. That needs to be tested.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
Bug 7987 depends on bug 7735, which changed state.

Bug 7735 Summary: Meta rules need to handle missing/unrun dependencies
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #16 from Henrik Krohns <ap...@hege.li> ---
(In reply to Michael Storz from comment #14)
> I remember someone renamed get_pending_lookups to get_async_pending_rules,
> which is the name I would have chosen too ;-)

Well that function never even existed outside trunk.. but I forgot alias for it
just in case..

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
Bug 7987 depends on bug 7735, which changed state.

Bug 7735 Summary: Meta rules need to handle missing/unrun dependencies
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
Bug 7987 depends on bug 7735, which changed state.

Bug 7735 Summary: Meta rules need to handle missing/unrun dependencies
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|Undefined                   |4.0.0
           Severity|minor                       |blocker

--- Comment #7 from Henrik Krohns <ap...@hege.li> ---
Setting as blocker for 4.0.0.

There's some remaining bugs, especially the way check_rbl_sub is handled, which
makes some rules considered unrun occasionally.

Already made some progress, but I really need a small break and a breather,
will work on in the coming days..

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

Henrik Krohns <ap...@hege.li> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|REOPENED                    |RESOLVED

--- Comment #29 from Henrik Krohns <ap...@hege.li> ---
No objections to the combo of "return undef" for eval and $pms->rule_ready()
usage have been seen.

Any bugs or theoretical performance improvements can be discussed here or in a
new bug, but the current state "works as intended" and should not hold up
4.0.0, so I'm closing this.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
Bug 7987 depends on bug 7735, which changed state.

Bug 7735 Summary: Meta rules need to handle missing/unrun dependencies
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7735

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #20 from Michael Storz <sa...@lrz.de> ---
(In reply to Henrik Krohns from comment #18)
> (In reply to Michael Storz from comment #17)
> > If that
> > could happen the logic of the eval rule is too complex and should be changed.
> 
> A DNSBL can do this:
> - It parses for example 4 IP addresses from Received
> - For each of the IP it starts a separate bgsend_and_start_lookup, expecting
> 4 separate callbacks
> 
> Obviously if got_hit() is called, we know rule is ready regardless of any
> pending lookups.
> 
> What things should the callback do if there is no hit?
> 
> Current logic:
> 
> sub callback {
>   rule_ready(thisrule)
>   foreach (answer) {
>     if (match) got_hit(thisrule)
>   }
> }
> 
> Using got_miss() would probably be:
> 
> sub callback {
>   foreach (answer) {
>     if (match) got_hit(thisrule)
>   }
>   if (!tests_already_hit(thisrule) && !get_async_pending_rules(thisrule))
>     got_miss(thisrule)
> }
> 
> I guess we could brind back the deprecated Dns/is_rule_complete function,
> which can do the checks. So would this be the most elegant flow?
> 
> sub callback {
>   foreach (answer) {
>     if (match) got_hit(thisrule)
>   }
>   if (is_rule_complete(thisrule)) got_miss(thisrule)
> }

short executive answer for Henrik: 

Since the decision logic of how to process the different asynchronous calls is
completely contained in the plugin and therefore cannot be detected from the
outside, I don't know how a support system could look like. For this we
probably need first several examples from which we could take an idea for the
realization.

long version:

If we take the standard case of one rule one result, then we can represent the
decision logic for your example of a DNSBL with 4 lookups l1 to l4 like this:

hit = l1 || l2 || l3 || l4
miss = !hit = !(l1 || l2 || l3 || l4) = !l1 && !l2 && !l3 && !l4

With short-circuit evaluation we can evaluate a hit already with the result of
one lookup hit. But for a miss we need the evaluation of all lookups.

The logic could be the other way around in another case:

hit = l1 && l2 && l3 && l4

Now we would need all lookup results for the evaluation of a hit and only one
for a miss.

In principle, the logic can take any form of a boolean expression:

hit = (l1 && l2) || (l3 && l4)

Here, too, there are combinations where we can only decide whether it is a hit
or a miss after evaluating all lookups.

I.e. logically we always need three levels

- eval function
- asynchronous lookups
- decision function

i.e. the callbacks of the lookups must always call a decision function, which
keeps track of the status of all lookups and makes a decision about hit or miss
based on this data.

Besides the standard case, there are also the cases with one rule many results.
In the extreme case, there could also be four hit/miss results associated with
the four lookups, i.e. the lookups are completely independent of each other.

Which logic is implemented in the plugin cannot be determined from the outside.
At the moment I don't know how to support the implementation of the logic. For
this we need first of all examples from which we can then perhaps derive
something.

I would therefore implement nothing for the support of the decision logic for
the time being.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #21 from Henrik Krohns <ap...@hege.li> ---
(In reply to Michael Storz from comment #20)
>
> I would therefore implement nothing for the support of the decision logic
> for the time being.

Then I'll just make a decision and go with rule_ready() in it's current state.
You'll see it in use for the whole 4.0 lifetime, which can be long. All fine to
me.

-- 
You are receiving this mail because:
You are the assignee for the bug.

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #24 from Henrik Krohns <ap...@hege.li> ---
(In reply to Michael Storz from comment #23)
> Great work! Are you now ready to switch from the brute force algorithm to a
> deterministic algorithm? I am not sure which of the two algorithms would be
> faster. That needs to be tested.

It should not matter if I'm ready or not. If you have something, then share it
for everyone. :-)

-- 
You are receiving this mail because:
You are the assignee for the bug.

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-19 17:01, schrieb Henrik K:
> On Thu, May 19, 2022 at 04:11:53PM +0200, Michael Storz wrote:
>> Am 2022-05-19 11:53, schrieb bugzilla-daemon@spamassassin.apache.org:
>> > https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
>> >
>> > --- Comment #28 from Henrik Krohns <ap...@hege.li> ---
>> > A quick cleanup to tidy things:
>> >
>> > - Use rule_ready() everywhere instead of direct tests_already_hit modify
>> 
>> Is this really necessary? It introduces thousands of additional 
>> subroutine
>> calls.
> 
> Not sure why you even ask.  Why do functions exist?  Do you now prefer 
> to
> duplicate same code blocks everywhere?
> 
>> The algorithm you have now implemented looks too complicated and does 
>> not
>> convince me.
> 
> Maybe for once you could post an actual tested and benchmarked patch,
> instead of some random code snippets and comments?  :-) There's only so 
> much
> time I can spare myself.

Ok, I will see what I can do. However, it will take until next week, as 
I am busy with something else at the moment.

Michael

Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Henrik K <he...@hege.li>.
On Thu, May 19, 2022 at 04:11:53PM +0200, Michael Storz wrote:
> Am 2022-05-19 11:53, schrieb bugzilla-daemon@spamassassin.apache.org:
> > https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
> > 
> > --- Comment #28 from Henrik Krohns <ap...@hege.li> ---
> > A quick cleanup to tidy things:
> > 
> > - Use rule_ready() everywhere instead of direct tests_already_hit modify
> 
> Is this really necessary? It introduces thousands of additional subroutine
> calls.

Not sure why you even ask.  Why do functions exist?  Do you now prefer to
duplicate same code blocks everywhere?

> The algorithm you have now implemented looks too complicated and does not
> convince me.

Maybe for once you could post an actual tested and benchmarked patch,
instead of some random code snippets and comments?  :-) There's only so much
time I can spare myself.


Re: [Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by Michael Storz <Mi...@lrz.de>.
Am 2022-05-19 11:53, schrieb bugzilla-daemon@spamassassin.apache.org:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987
> 
> --- Comment #28 from Henrik Krohns <ap...@hege.li> ---
> A quick cleanup to tidy things:
> 
> - Use rule_ready() everywhere instead of direct tests_already_hit 
> modify

Is this really necessary? It introduces thousands of additional 
subroutine calls.

> - Simple tracking of meta dependency hits, run do_meta_tests only when 
> needed
> - Do not run do_meta_tests on last priority, as finish_meta_tests will 
> run
> anyway
> 
> Committed revision 1901060.
> 
> It already reduces unnecessary do_meta_tests calls a lot.
> 
> Before:
> 
> do_meta_tests -1000 ready 2
> do_meta_tests -950 ready 0
> do_meta_tests -900 ready 0
> do_meta_tests -100 ready 5
> do_meta_tests -90 ready 0
> do_meta_tests 0 ready 1234
> do_meta_tests 500 ready 0
> 
> After:
> 
> do_meta_tests -950 ready 2
> do_meta_tests -100 ready 5
> do_meta_tests 0 ready 1234
> 
> Though in the grand total runtimes, it makes very little difference. 
> It's hard
> to optimize things, as most metas will end up ready at priority 0 
> anyway.
> 
> Will look if there's further to improve, but I'm quite sceptical about 
> the
> --$conf->{meta_dep_count} stuff, as it's really hard to guarantee that 
> someone
> doesn't call rule_ready() multiple times, obviously then the counts 
> will end up
> wrong and metas could be run prematurely. It's possible to track 
> readiness
> perfectly with a hash from which you delete dependencies as they occur. 
> I
> already tried something like that, but it gets so complex that it 
> actually
> increases runtimes.

No, this is absolutely not a problem. After going through the array with 
the dependencies the array should be deleted. Since it is only used once 
at the first time the rule is declared finished either by a call to 
got_hit or rule_ready, it can be deleted. Any further call of got_hit or 
rule_ready makes no change of the counters of the meta rules.

The algorithm you have now implemented looks too complicated and does 
not convince me.

Try my algorithm and do it even for the last priority. Then look which 
rules must be handled by finish_meta_tests and if your algorithm is 
needed for a better evaluation.

Michael

[Bug 7987] DNSEval.pm,HashBL.pm,URILocalBL.pm: unnecessary use of rule_pending and rule_ready

Posted by bu...@spamassassin.apache.org.
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7987

--- Comment #28 from Henrik Krohns <ap...@hege.li> ---
A quick cleanup to tidy things:

- Use rule_ready() everywhere instead of direct tests_already_hit modify
- Simple tracking of meta dependency hits, run do_meta_tests only when needed
- Do not run do_meta_tests on last priority, as finish_meta_tests will run
anyway

Committed revision 1901060.

It already reduces unnecessary do_meta_tests calls a lot.

Before:

do_meta_tests -1000 ready 2
do_meta_tests -950 ready 0
do_meta_tests -900 ready 0
do_meta_tests -100 ready 5
do_meta_tests -90 ready 0
do_meta_tests 0 ready 1234
do_meta_tests 500 ready 0

After:

do_meta_tests -950 ready 2
do_meta_tests -100 ready 5
do_meta_tests 0 ready 1234

Though in the grand total runtimes, it makes very little difference. It's hard
to optimize things, as most metas will end up ready at priority 0 anyway.

Will look if there's further to improve, but I'm quite sceptical about the
--$conf->{meta_dep_count} stuff, as it's really hard to guarantee that someone
doesn't call rule_ready() multiple times, obviously then the counts will end up
wrong and metas could be run prematurely. It's possible to track readiness
perfectly with a hash from which you delete dependencies as they occur. I
already tried something like that, but it gets so complex that it actually
increases runtimes.

-- 
You are receiving this mail because:
You are the assignee for the bug.