Posted to dev@spamassassin.apache.org by Kevin Golding <kp...@caomhin.org> on 2016/08/01 08:22:02 UTC

Re: masscheck process timing

On Sun, 31 Jul 2016 22:00:11 +0100, John Hardin <jh...@impsec.org> wrote:

> Folks:
>
> It looks like we failed to get a successful weekly masscheck again,
> even though if you check the counts today they are above the thresholds.
>
> I suspect this is happening due to some results being submitted "late".
>
> I think we might want to look into making a change to the masscheck  
> timing rules, specifically: the cutoff for having enough corpora to run  
> the scoring and produce a rules update is not a specific time, but is  
> instead related to the following masscheck run.

Change is good. I must admit it's getting a bit frustrating seeing these  
runs result in nothing at all.

> In other words:
>
> There is still a cutoff time for the masscheck run, but it only means  
> "the scoring won't start prior to this time."
>
> If the corpora are above the thresholds when this time is reached, the  
> scoring and update process commences immediately.
>
> If not, that doesn't mean we've missed an update, at least not yet.
>
> If another result set comes in for that pass, and that result set pushes  
> it over the thresholds, then we can start the scoring and rule  
> generation process.
>
> The actual hard cutoff for pass X would be sometime after pass X+1  
> starts. Perhaps if the cutoff time for pass X+1 is reached and pass X is  
> still waiting, then we give up on pass X.
>
> This way, a late result set that satisfies the thresholds will just
> delay the rule generation, not prevent it.

I like that there's an effort to still push the updates out as early in  
the day as possible with this system. The simplest option is no doubt to  
just delay the score generation, even to the point of giving a whole 24  
hours if need be, as it would at least result in something fairly  
reliable. This seems a good hybrid approach though.
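
As a rough sketch of the proposal quoted above (not the actual masscheck
scripts), the decision for pass X could look something like this. The
function name, the Action enum and all parameter names are hypothetical,
and the caller supplies the thresholds and times:

```python
from datetime import datetime
from enum import Enum


class Action(Enum):
    WAIT = "wait"            # keep waiting for more result sets
    START_SCORING = "start"  # thresholds met, soft cutoff has passed
    GIVE_UP = "give_up"      # pass X+1 reached its own cutoff first


def decide(now, soft_cutoff, next_pass_cutoff, ham, spam, ham_min, spam_min):
    """Decide what to do with a pass X that has not started scoring yet."""
    if now < soft_cutoff:
        # The cutoff only means "scoring won't start prior to this time".
        return Action.WAIT
    if now >= next_pass_cutoff:
        # Pass X is still waiting when pass X+1 hits its cutoff: drop it.
        return Action.GIVE_UP
    if ham >= ham_min and spam >= spam_min:
        # Either on time, or a late result set pushed us over.
        return Action.START_SCORING
    return Action.WAIT
```

Calling decide() each time a result set arrives (and once at each cutoff)
would mean a late upload delays the rule generation rather than preventing
it.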

> This can use some refinement:

Some good thoughts, but ones that I fear may prove an obstacle to getting  
a change in place. Perhaps things for a wishlist instead?

> If we've started scoring and another result set for that pass comes in,  
> do we incorporate that into the score generation? We probably should;  
> the decision could be based on when the delayed results come in (we  
> don't want to keep resetting the scoring process and collide with the  
> following pass) and how large the new results are (we might want to  
> ignore a late small result set, but incorporate a late large result set).

As it stands I'm inclined to take the route that anything submitted after  
the run has started gets lost - this is no different to the current  
situation (as I understand it anyway) so it's not penalising anyone, but  
it also doesn't grant further concessions. Adding in new results just  
seems a way to potentially further delay an already delayed process.

Much as the additional data is beneficial it seems added complexity for no  
gain. Given how tight the ham threshold is most days (there are a lot of  
days in the 140k-150k region) a large result set is unlikely to arrive  
after the threshold has been met anyway; it's far more likely to be the
trigger. If we start dividing large and small we need to pick a point and  
draw a line and potentially discourage submissions from people who feel  
they aren't important enough.

I'd also note that when you look at the uploads you have people like axb  
who submit multiple times in small groups - that is always an option to  
people if they feel something is important enough to beat the threshold.

> If we're still running a score generation for pass X and pass X+1 has  
> reached its cutoff and has enough corpora to satisfy the thresholds and  
> immediately start the scoring process, do we give up on processing pass  
> X? I would think yes.

I don't know how long the process takes, but if we never start a pass by  
the time the next day's start point comes I would assume it would never  
overlap. I could be wrong, but it seems likely that a hard cut off that  
shouldn't overlap the next day's start may be simpler. At some point we  
need to give up hope on a day's results anyway, so that may be the  
guideline for when that time is.

Re: masscheck process timing

Posted by Kevin Golding <kp...@caomhin.org>.

On Mon, 01 Aug 2016 16:49:15 +0100, John Hardin <jh...@impsec.org> wrote:

> My fear is that we start the scoring when we receive a small (20k ham)
> corpus that just barely meets the threshold and then ignore a large
> (100k ham) corpus that is received shortly thereafter and that would
> greatly improve the results.
>
> Perhaps: if we receive a delayed corpus that crosses the threshold, we  
> don't *immediately* start scoring, instead we start in half an hour -  
> this gives a chance for another corpus to come in. This would continue  
> up to some maximum (1h?).
>
> Or perhaps I'm overthinking it. :)

Looking at the latest net run there were just over 160k ham. Only three  
corpora go over 20k ham and the largest is just under 80k ham. The largest  
single corpus that could come after the threshold is reached is just over  
6k. I think your ideas of big and small are optimistic, and if that really  
happened we may not have anything to worry about.

The odd thing is a lot of the smaller corpora probably add some of the  
most useful variety. I seem to recall someone from Norway was recently  
looking to get involved for example? A couple of thousand ham from there  
may make more impact than an extra couple of thousand from the same  
sources. For that net run we had 8 people uploading 11 sets of data - a  
bit more breadth wouldn't hurt. At the moment that's the bigger problem  
really, and I have even less idea how to help there.

Something that Jari touches on is that there's not really any info on when  
we need to submit by. I remember seeing something that said to start as  
soon after 9am GMT as possible but I don't recall a deadline for when it  
had to be in by. I know mine is usually uploaded by 1pm GMT and have  
always assumed that was early enough but tbh nobody has ever said anything  
about it. The only feedback I ever got was when I started too early. If  
I'm late I can look at faster options, but until someone tells me I assume  
I'm getting mine in on time.

If I knew when the deadline was, or why that was chosen, I may have an  
opinion on factoring in extra delays. Capping that additional window makes  
sense since at some point the bullet needs biting to get something out the  
door, but I'd still be disinclined. It potentially waits an hour even if  
it was the last upload that could happen. What if that then pushes it past  
the daily cutoff? Or should we only allow that extension before a certain  
point in the day to avoid that problem? It's not that we don't want that  
additional data, because the more the merrier, just that it seems to  
require a lot of extra factors to work as well as it should. Now I'm  
overthinking it!
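
To make the extra factors concrete, here is a minimal sketch of one reading
of the grace-window idea: each qualifying late upload restarts a half-hour
wait, the total extension is capped at one hour, and an optional daily
cutoff overrides both. The names scoring_start, GRACE, MAX_EXTENSION and
daily_cutoff are mine, and whether the wait should restart on every upload
is an assumption rather than something anyone has settled:

```python
from datetime import datetime, timedelta

GRACE = timedelta(minutes=30)       # wait after each qualifying late upload
MAX_EXTENSION = timedelta(hours=1)  # cap on the total added delay


def scoring_start(threshold_crossed_at, later_uploads, daily_cutoff=None):
    """Return the time at which scoring should start.

    threshold_crossed_at: arrival time of the late corpus that crossed
        the thresholds.
    later_uploads: arrival times of any corpora received after that.
    daily_cutoff: optional hard limit that the extension may not pass.
    """
    latest_allowed = threshold_crossed_at + MAX_EXTENSION
    if daily_cutoff is not None:
        latest_allowed = min(latest_allowed, daily_cutoff)
    # Never schedule the start before the threshold was actually crossed.
    latest_allowed = max(latest_allowed, threshold_crossed_at)

    start = min(threshold_crossed_at + GRACE, latest_allowed)
    for arrival in sorted(later_uploads):
        if arrival >= start:
            break  # the window had already closed; this upload is too late
        start = min(arrival + GRACE, latest_allowed)
    return start
```

With no further uploads this waits the full half hour even if nothing else
was ever going to arrive, which is the cost I mention above.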

> I dislike the idea of trying to calculate a hard start cutoff based on  
> how long the scoring run takes. Do we really want to maintain statistics  
> on that?

Probably not. Again, maybe if/when things get busier it may prove more  
worthwhile but at the moment it's likely poor reward for the effort.

> OK, so the hard starting cutoff could be the time the following pass  
> does its SVN get. If the scoring is underway at that point, we let it  
> run to completion? I am making an assumption here, that the time the
> scoring and rule generation takes is less than the get -> minimum  
> scoring start delay, so that the scoring+rulegen passes won't overlap.

It seems simple and reasonable to me.
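
For what it's worth, the assumption reduces to a single comparison. If pass
X may start scoring no later than pass X+1's SVN get, and pass X+1 may start
scoring no earlier than its get plus the minimum delay, then the two runs
cannot overlap as long as one run is shorter than that delay. A tiny sketch,
with a hypothetical function name and made-up durations in the example:

```python
from datetime import timedelta


def passes_cannot_overlap(scoring_duration, min_start_delay):
    """True if a scoring+rulegen run for pass X, started at the latest
    allowed moment (pass X+1's SVN get), finishes before pass X+1 is
    allowed to start scoring (its get plus min_start_delay)."""
    return scoring_duration < min_start_delay


# Made-up numbers, purely for illustration: a 2h run against a 4h delay.
safe = passes_cannot_overlap(timedelta(hours=2), timedelta(hours=4))
```

That only needs a rough upper bound on the run time, not ongoing statistics.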

Re: masscheck process timing

Posted by John Hardin <jh...@impsec.org>.

On Mon, 1 Aug 2016, Kevin Golding wrote:

> On Sun, 31 Jul 2016 22:00:11 +0100, John Hardin <jh...@impsec.org> wrote:
>
>> This can use some refinement:
>
> Some good thoughts, but ones that I fear may prove an obstacle to getting a 
> change in place. Perhaps things for a wishlist instead?

Maybe.

>> If we've started scoring and another result set for that pass comes in, do 
>> we incorporate that into the score generation? We probably should; the 
>> decision could be based on when the delayed results come in (we don't want 
>> to keep resetting the scoring process and collide with the following pass) 
>> and how large the new results are (we might want to ignore a late small 
>> result set, but incorporate a late large result set).
>
> As it stands I'm inclined to take the route that anything submitted after the 
> run has started gets lost - this is no different to the current situation (as 
> I understand it anyway) so it's not penalising anyone, but it also doesn't 
> grant further concessions. Adding in new results just seems a way to 
> potentially further delay an already delayed process.

I'm hoping to balance delay and quality of results.

> Much as the additional data is beneficial it seems added complexity for no 
> gain. Given how tight the ham threshold is most days (there are a lot of days 
> in the 140k-150k region) a large result set is unlikely to arrive after the 
> threshold has been met anyway; it's far more likely to be the trigger. If we 
> start dividing large and small we need to pick a point and draw a line and 
> potentially discourage submissions from people who feel they aren't important 
> enough.
>
> I'd also note that when you look at the uploads you have people like axb who 
> submit multiple times in small groups - that is always an option to people if 
> they feel something is important enough to beat the threshold.

My fear is that we start the scoring when we receive a small (20k ham)
corpus that just barely meets the threshold and then ignore a large (100k
ham) corpus that is received shortly thereafter and that would greatly
improve the results.

Perhaps: if we receive a delayed corpus that crosses the threshold, we 
don't *immediately* start scoring, instead we start in half an hour - this 
gives a chance for another corpus to come in. This would continue up to 
some maximum (1h?).

Or perhaps I'm overthinking it. :)

>> If we're still running a score generation for pass X and pass X+1 has 
>> reached its cutoff and has enough corpora to satisfy the thresholds and 
>> immediately start the scoring process, do we give up on processing pass X? 
>> I would think yes.
>
> I don't know how long the process takes, but if we never start a pass by the 
> time the next day's start point comes I would assume it would never overlap.

I dislike the idea of trying to calculate a hard start cutoff based on how 
long the scoring run takes. Do we really want to maintain statistics on 
that?

> I could be wrong, but it seems likely that a hard cut off that shouldn't 
> overlap the next day's start may be simpler. At some point we need to give up 
> hope on a day's results anyway, so that may be the guideline for when that 
> time is.

OK, so the hard starting cutoff could be the time the following pass does 
its SVN get. If the scoring is underway at that point, we let it run to 
completion? I am making an assumption here, that the time the scoring and 
rule generation takes is less than the get -> minimum scoring start delay, 
so that the scoring+rulegen passes won't overlap.


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   It is not the place of government to make right every tragedy and
   woe that befalls every resident of the nation.
-----------------------------------------------------------------------
  4 days until the 281st anniversary of John Peter Zenger's acquittal