Posted to sysadmins@spamassassin.apache.org by "Kevin A. McGrail" <km...@apache.org> on 2018/10/04 14:48:04 UTC

Re: Masscheck reuse

Might want to put this on the wiki too!  Adding SASA group too for their
input.
--
Kevin A. McGrail
VP Fundraising, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


On Thu, Oct 4, 2018 at 10:28 AM Henrik K <he...@hege.li> wrote:

>
> Still hoping to get some conversation going on about reuse.
>
> Personally I create my corpus like this:
>
> - hacked amavisd-milter to save unmodified message copy to "pristine"
>   directory
>
> - run a separate clean install of trunk SA/spamd that has default rules,
>   razor/pyzor/dcc etc, and only runs all "reuse" flagged rules
>   (my recent trunk commit)
>     --pre "loadplugin Mail::SpamAssassin::Plugin::Reuse"
>     --pre "run_reuse_tests_only 1"
>
> - cron every minute: run messages from "pristine" directory through
>   above spamd to add X-Spam-Status header and move to "corpus"
>
> - a bit later get mailids and resulting ham/spam status from my main
>   amavis, and sort out "corpus" to "corpus_ham/spam" (of course with some
>   manual vetting, dspam crosscheck etc)
>
> Since my main setup uses extreme whitelisting and shortcircuiting, this is
> the only way to get 100% legit corpus.  It takes very little resources
> anyway, since that spamd just runs network lookups (which are mostly cached
> already).
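
The "pristine" to "corpus" cron step described above could be sketched roughly as follows. This is only an illustration, not Henrik's actual script: the function names, the host and port, and the directory layout are all assumptions. The scanner is passed in as a callable so the loop can be tried without a running spamd. A message is moved only if the scan actually produced an X-Spam-Status header.

```python
import email
import subprocess
from pathlib import Path
from typing import Callable


def scan_with_spamc(raw: bytes) -> bytes:
    """Pipe one message through spamc and return the rewritten message.

    Assumes a spamc binary on PATH; host and port are placeholders for
    the dedicated reuse-only spamd instance.
    """
    result = subprocess.run(
        ["spamc", "-d", "127.0.0.1", "-p", "1783"],
        input=raw, stdout=subprocess.PIPE, check=True,
    )
    return result.stdout


def process_pristine(pristine: Path, corpus: Path,
                     scan: Callable[[bytes], bytes] = scan_with_spamc) -> int:
    """Scan every file in `pristine` and move the result to `corpus`.

    Returns the number of messages moved. Messages that come back
    without an X-Spam-Status header stay in place for the next run.
    """
    corpus.mkdir(parents=True, exist_ok=True)
    moved = 0
    for src in sorted(p for p in pristine.iterdir() if p.is_file()):
        scanned = scan(src.read_bytes())
        if email.message_from_bytes(scanned)["X-Spam-Status"] is None:
            continue  # scan did not add the header; retry later
        tmp = corpus / (src.name + ".tmp")
        tmp.write_bytes(scanned)
        tmp.replace(corpus / src.name)  # rename so readers never see a partial file
        src.unlink()
        moved += 1
    return moved
```

Run from cron, this would be called as process_pristine(Path("pristine"), Path("corpus")); sorting into corpus_ham/corpus_spam remains a separate, later step.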
>
> Basically I'd like to see masscheckers do something similar.  It doesn't
> matter where you source your corpus; it is possible to clean the messages
> up to "pristine status" and run them ASAP through a spamd setup like the
> one above.  That way they have a legit X-Spam-Status header that can be
> reused even years later.
>
> Of course if your corpus already has X-Spam-Status from mail receive time
> (and all possible plugins and checks are enabled), then it's simply the
> case of enabling reuse.  But shortcircuited messages should be skipped.
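
A rough way to check whether a stored message qualifies for reuse under the rule above (it has a scan-time X-Spam-Status header and was not shortcircuited). This sketch assumes the header carries a "shortcircuit=..." field, as in the header template suggested by the Shortcircuit plugin documentation, and treats a missing field as "not shortcircuited". The function name is hypothetical.

```python
import email
import re


def reuse_eligible(raw: bytes) -> bool:
    """Return True if the message looks safe to use with --reuse."""
    status = email.message_from_bytes(raw)["X-Spam-Status"]
    if status is None:
        return False  # no scan-time results to reuse
    # The header is normally folded over several lines; flatten it first.
    flat = " ".join(str(status).split())
    match = re.search(r"\bshortcircuit=(\S+)", flat)
    # Assumption: any value other than "no" means scanning stopped early,
    # so the list of hit tests is incomplete.
    if match and match.group(1).lower() != "no":
        return False
    return True
```

If your X-Spam-Status template does not include a shortcircuit field, some other marker (for example the names of your shortcircuit rules in tests=) would have to be checked instead.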
>
> I also recently added REUSE config here:
>
> http://svn.apache.org/viewvc/spamassassin/trunk/masses/contrib/automasscheck-minimal/
>
>
>
>
> On Mon, Sep 03, 2018 at 05:55:05PM +0300, Henrik Krohns wrote:
> >
> > If you look at the ancient mass-check code before Reuse.pm was split from
> > it, it shows the original intention:
> >
> >
> > http://svn.apache.org/viewvc/spamassassin/trunk/masses/mass-check?revision=721962&view=markup
> >
> > # --reuse without --net means we need to just zero ALL net rules; skip net
> > # lookups entirely except for the reused ones.
> > (then it proceeds to zero scores for all "tflags net" rules)
> >
> > Ok I'm not even sure why it's talking about --reuse withOUT --net, since
> > the point here is to do separate scoresets with and without network
> > checks?  One would simply run local checks only, or --reuse --net.
> >
> > If everyone used reuse, would there even be a need for "weekly"
> > masschecks, since every day's run would simply include the network
> > checks!?  If you ask me, without --reuse one would only be allowed to
> > submit "nightly" masschecks (no --net).
> >
> > Current Reuse.pm simply reads "reuse XXX" config clauses, and zeroes
> > scores for those.  So it is important to remember to use "reuse XXX" for
> > any net rules, since it doesn't automatically iterate through them
> > anymore!  Which in my mind is silly: why not simply iterate again through
> > "tflags net" and forget the "reuse" stanza completely?
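
Since Reuse.pm only acts on explicit "reuse XXX" lines, a small lint pass can flag net rules that lack one. The sketch below is an illustration with a hypothetical function name. It only looks at plain "tflags NAME ... net ..." and "reuse NAME" lines, and ignores ifplugin blocks, includes and the old-name arguments that "reuse" accepts.

```python
from typing import Iterable, Set


def net_rules_missing_reuse(lines: Iterable[str]) -> Set[str]:
    """Return names of rules flagged 'tflags ... net' without a 'reuse' line."""
    net: Set[str] = set()
    reused: Set[str] = set()
    for line in lines:
        # Strip trailing comments; only tflags/reuse lines are inspected,
        # so a '#' inside some other rule's regex does not matter here.
        parts = line.split("#", 1)[0].split()
        if len(parts) >= 3 and parts[0] == "tflags" and "net" in parts[2:]:
            net.add(parts[1])
        elif len(parts) >= 2 and parts[0] == "reuse":
            reused.add(parts[1])
    return net - reused
```

Feeding it the concatenated lines of a rules directory lists the candidates for a missing "reuse" flag; the rule names in any such run are whatever your own .cf files define.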
> >
> > Cheers,
> > Henrik
> >
> >
> >
> >
> > On Mon, Sep 03, 2018 at 05:29:20PM +0300, Henrik K wrote:
> > >
> > > Hey guys,
> > >
> > > I'm wondering why pretty much no masscheck submitter is using --reuse?
> > >
> > > I just committed fixes for lots of missing reuse flags, and now I can
> > > actually do a ./mass-check --reuse --net run without ANY dns lookups
> > > launching.  So it's super fast too.
> > >
> > > What reason would there be to prefer running without reuse?  Is this
> > > simply a case of missing guidance/documentation?  Looking at some corpus
> > > logs, judging by Maildir file timestamps there are even messages a few
> > > years old being run through.  How can that make any sense?  I wouldn't
> > > run anything older than an hour through DNSBLs.
> > >
> > > Of course I understand if someone's messages don't have a scantime
> > > X-Spam-Status header for some reason, but even that could be easily
> > > fixable by simply running the messages through a dedicated spamd as soon
> > > as possible to add the headers.
> > >
> > > Cheers,
> > > Henrik
>