Posted to ruleqa@spamassassin.apache.org by Tom Hendrikx <to...@whyscream.net> on 2023/05/21 11:02:39 UTC

Masschecks behind upstream filtering service

Hi,

For the last few years I have been contributing nightly masscheck data 
from my personal MTA setup. This has resulted in a rather small dataset 
compared to other contributors, but it seemed useful to me, mainly 
because of the non-English ham corpus (I have no idea if that is a valid 
assumption though).

A few weeks ago I moved to a new MTA setup, where I no longer 
perform spam/virus scanning myself; an upstream provider does this, 
and all mail (including all spam and virus content) is delivered with 
appropriate headers.

I have no problem spending some time on setting up a new masscheck job 
that uses the new corpus and tuning it to ignore the upstream filter 
result headers etc., but I'd rather not invest the time if you think that 
such a feed is not beneficial to the ruleqa process.
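
As a rough illustration of the "ignore the upstream filter result headers" 
step, a minimal Python sketch could strip those headers from a copy of each 
message before the masscheck runs. The header names and the function name 
below are hypothetical placeholders, not the headers any particular provider 
adds, and this is not part of the SpamAssassin mass-check tooling.

```python
from email import message_from_string

# Hypothetical header names; replace them with whatever the upstream
# filtering service actually adds to delivered mail.
UPSTREAM_HEADERS = ("X-Upstream-Spam-Status", "X-Upstream-Spam-Score")


def strip_upstream_headers(raw_message, header_names=UPSTREAM_HEADERS):
    """Return the message text with the given headers removed."""
    msg = message_from_string(raw_message)
    for name in header_names:
        # Deleting by name removes every occurrence (case-insensitive)
        # and does nothing if the header is absent.
        del msg[name]
    return msg.as_string()
```

Running this over a copy of the corpus, rather than the original files, keeps 
the upstream verdicts available for later corpus sorting.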

I'd be happy to hear your thoughts.

PS 1: There are also some spamtraps that don't use the upstream service, 
but the corpus contributed from those is quite small.

PS 2: Contributed masscheck data from the last few weeks is not based on 
messages delivered through this upstream provider; only the existing corpus 
from the old setup was used.

Kind regards,
   Tom

Re: Masschecks behind upstream filtering service

Posted by Tom Hendrikx <to...@whyscream.net>.
On 25-05-2023 12:11, Tom Hendrikx wrote:
> On 22-05-2023 15:53, Bill Cole wrote:
>>
>> I think we need as large and as diverse a collection of masscheck 
>> contributors as we can get. I am reluctant to ask you to add work to 
>> what I presume is a project to reduce your email efforts, but I hope 
>> you will continue to submit your results.
>>
> 
> I'm still self-hosting though (on a new server), so I have full control 
> over what to do with all messages. I have no problem with setting up 
> mass-checks again with that dataset: corpus sorting is part of my daily 
> routine, and running mass-checks is pretty effortless once correctly set up.
> 
> I'll make an attempt at a new mass-checks routine shortly.

I set up a new masscheck process today, following the latest 
instructions at 
https://cwiki.apache.org/confluence/display/SPAMASSASSIN/NightlyMassCheck

It seems to work, and I just uploaded new results in log files labeled 
with the username 'whyscream' (my regular moniker). Would somebody be so 
kind as to check whether the contents make sense, so I can enable the 
cronjob again?

Kind regards,

   Tom

Re: Masschecks behind upstream filtering service

Posted by Tom Hendrikx <to...@whyscream.net>.
On 22-05-2023 15:53, Bill Cole wrote:
> On 2023-05-21 at 07:02:39 UTC-0400 (Sun, 21 May 2023 13:02:39 +0200)
> Tom Hendrikx <ru...@spamassassin.apache.org>
> is rumored to have said:
> 
>> Hi,
>>
>> For the last few years I have been contributing nightly masscheck data 
>> from my personal MTA setup. This has resulted in a rather small 
>> dataset compared to other contributors, but it seemed useful to me, 
>> mainly because of the non-English ham corpus (I have no idea if that 
>> is a valid assumption though).
> 
> THANK YOU!
> 
> One of the things that worries me most about SA is that we don't have a 
> robust and diverse community of masscheck contributors. I don't have 
> great ideas to fix that, but I am always grateful for the people who 
> have put in the effort for the community.

I always feel a bit icky about the big userbase that depends on 
(projects like) SpamAssassin, compared with the small number of 
contributions to, or peer reviews of, code and rules. This is what I can 
contribute to improve that situation, and I'm happy to actually do so.

> 
>> A few weeks ago I moved to a new MTA setup, where I no longer 
>> perform spam/virus scanning myself; an upstream provider does 
>> this, and all mail (including all spam and virus content) is delivered 
>> with appropriate headers.
>>
>> I have no problem spending some time on setting up a new masscheck job 
>> that uses the new corpus and tuning it to ignore the upstream filter 
>> result headers etc., but I'd rather not invest the time if you think that 
>> such a feed is not beneficial to the ruleqa process.
>>
>> I'd be happy to hear your thoughts.
> 
> I think we need as large and as diverse a collection of masscheck 
> contributors as we can get. I am reluctant to ask you to add work to 
> what I presume is a project to reduce your email efforts, but I hope you 
> will continue to submit your results.
> 

My professional interest in email security diminished years ago after 
switching jobs, but my personal interest has always stayed alive. The 
amount of time it takes to properly maintain a fully self-managed 
email setup is the reason that I'm switching to an external 
provider (e.g. the old setup was running Ubuntu 16.04 with the 
distro-provided SpamAssassin, version 3.4.2).

I'm still self-hosting though (on a new server), so I have full control 
over what to do with all messages. I have no problem with setting up 
mass-checks again with that dataset: corpus sorting is part of my daily 
routine, and running mass-checks is pretty effortless once correctly set up.

I'll make an attempt at a new mass-checks routine shortly.

Kind regards,
Tom

Re: Masschecks behind upstream filtering service

Posted by Bill Cole <sa...@billmail.scconsult.com>.
On 2023-05-21 at 07:02:39 UTC-0400 (Sun, 21 May 2023 13:02:39 +0200)
Tom Hendrikx <ru...@spamassassin.apache.org>
is rumored to have said:

> Hi,
>
> For the last few years I have been contributing nightly masscheck data 
> from my personal MTA setup. This has resulted in a rather small 
> dataset compared to other contributors, but it seemed useful to me, 
> mainly because of the non-English ham corpus (I have no idea if that 
> is a valid assumption though).

THANK YOU!

One of the things that worries me most about SA is that we don't have a 
robust and diverse community of masscheck contributors. I don't have 
great ideas to fix that, but I am always grateful for the people who 
have put in the effort for the community.

> A few weeks ago I moved to a new MTA setup, where I no longer 
> perform spam/virus scanning myself; an upstream provider does 
> this, and all mail (including all spam and virus content) is delivered 
> with appropriate headers.
>
> I have no problem spending some time on setting up a new masscheck job 
> that uses the new corpus and tuning it to ignore the upstream filter 
> result headers etc., but I'd rather not invest the time if you think that 
> such a feed is not beneficial to the ruleqa process.
>
> I'd be happy to hear your thoughts.

I think we need as large and as diverse a collection of masscheck 
contributors as we can get. I am reluctant to ask you to add work to 
what I presume is a project to reduce your email efforts, but I hope you 
will continue to submit your results.



-- 
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire