Posted to ruleqa@spamassassin.apache.org by ja...@gmx.ch on 2021/05/05 07:24:37 UTC

new to masscheck, some questions

Hello

I'm new to masscheck, nothing uploaded yet, and have two questions

Since my spam corpus comes from my traps and my ham "just" from my
personal addresses, there is quite an imbalance between my spam and ham
corpora (300 ham and several thousand spam). Is such an imbalance a
problem for a reliable masscheck?

Second: I tested the masscheck script with my config, but I get a
warning and I'm not sure whether it can be ignored:

archive-iterator: invalid (undef) format in target list, run_masscheck
at
/root/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/ArchiveIterator.pm
line 545.
archive-iterator: invalid (undef) format in target list, ham-corpus at
/root/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/ArchiveIterator.pm
line 545.

But according to my config, the ham-corpus is defined:

run_all_masschecks() {
  ### sample: single corpus ###
  run_masscheck spam-corpus --all \
          --after=-4838400 spam:dir:/data/archive/spam/ \
  run_masscheck ham-corpus --all \
          --after=-174182400 ham:dir:/data/archive/ham/
}

Can this be ignored, or is there something that needs a fix? The final
stats tell me that all mail from ham-corpus has been processed.

Cheers and have a good one

--

tobi




Re: new to masscheck, some questions

Posted by John Hardin <jh...@impsec.org>.
On Wed, 5 May 2021, jahlives@gmx.ch wrote:

> John,
>
> On 5/5/21 4:55 PM, John Hardin wrote:
>>
>> That said, what we really need is ham in non-English languages. If
>> there's any way you can get more good (accurately classified)
>> non-English ham, that would be the greatest benefit.
>
> my ham is mostly, would say about 90%+, German.

Yay!

> Spam is mostly English but also quite some Italian, Spanish and French. 
> From time to time Russian or Chinese :-)
>
>> Do you know anyone (perhaps family members) who would trust you with a
>> copy of their ham emails to add to your corpus?
>
> sure there are but I'm not so sure that their judgement related to
> spam/phish can be trusted without massive manual intervention ;-)

That is certainly part of it if anyone other than you is contributing to 
the corpora. You need to verify the correct classification of the messages 
they provide. It's just like vetting Bayes training messages (FPs and FNs) 
provided by users if you're an admin.

>> Is your ham corpus limited to what you've used to train Bayes? Or do
>> you really get that little email? Put more in. About the only
>> properly-classified ham I *wouldn't* put into masscheck corpora would
>> be emails discussing spam (e.g. the SA users list is a big no-no).
>
> my ham is what ended in my inbox and has not been sorted out as spam.

Ah. It's a BAD idea to train Bayes from or run masschecks directly against 
your inbox, because if you happen to fall behind for any reason then spams 
may be learned/scanned as ham.

It's better to set up separate email folders for messages that you have 
actually seen and confirmed as ham, then train/masscheck those folders.

> Most of the mails I got daily are from mailinglists but those get
> automoved (thanks to sieve) into subfolders which do not end in my ham
> corpus. My inbox contains 1:1 mail and quite a bunch of newsletters
> (which I registered for). Also all bounces and stuff like that goes
> directly into subfolders and is therefore **not** in my corpus.

If you know it's ham, it should be in your corpus. (Except, again, for 
something like the SA users list where we discuss spam signs and post 
examples, and things like non-delivery notices if you get backscatter.)

> I could put much more ham in if I dig deeper into my archive folders

Good!

> but I thought too old mail is not good for masschecks. For spam corpus I 
> delete everything older than 30day from corpus before running masscheck.

No. Ham a couple of years old is still useful, as the character of ham 
changes much more slowly than it does for spam.

The masscheck process has inherent - and different - age limits for ham 
and spam corpora, coded into the distributed script. Let those limits take 
care of it and feed it whatever you can get. I wouldn't *manually* filter 
by date until it's five years old, and that's only to reduce the amount 
of stuff the script needs to discard.
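
For reference, the --after values in the sample config quoted in this
thread appear to be offsets in seconds, assuming mass-check treats a
negative --after as an offset from the current time. A quick sketch of
the conversion to days (seconds_to_days is a made-up helper name, not
part of the masscheck scripts):

```shell
# Convert an age limit in seconds to whole days (86400 seconds per day).
# seconds_to_days is a hypothetical helper used only for illustration.
seconds_to_days() {
  echo $(( $1 / 86400 ))
}

echo "spam limit: $(seconds_to_days 4838400) days"    # 56 days, i.e. 8 weeks
echo "ham limit:  $(seconds_to_days 174182400) days"  # 2016 days, roughly 5.5 years
```

So the sample limits already differ by a wide margin between spam and
ham, which fits the advice not to prune ham by hand.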

> Cheers and have a good one

Likewise!

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
  3 days until the 76th anniversary of VE day

Re: new to masscheck, some questions

Posted by ja...@gmx.ch.
John,

On 5/5/21 4:55 PM, John Hardin wrote:
>
>
> That said, what we really need is ham in non-English languages. If
> there's any way you can get more good (accurately classified)
> non-English ham, that would be the greatest benefit.

My ham is mostly German, I would say 90%+. Spam is mostly English, but
there is also quite a bit of Italian, Spanish and French, and from time
to time Russian or Chinese :-)


>
> Do you know anyone (perhaps family members) who would trust you with a
> copy of their ham emails to add to your corpus?

Sure there are, but I'm not so sure that their judgement regarding
spam/phish can be trusted without massive manual intervention ;-)


>
> Is your ham corpus limited to what you've used to train Bayes? Or do
> you really get that little email? Put more in. About the only
> properly-classified ham I *wouldn't* put into masscheck corpora would
> be emails discussing spam (e.g. the SA users list is a big no-no).

My ham is what ended up in my inbox and has not been sorted out as spam.
Most of the mail I get daily is from mailing lists, but those messages
get moved automatically (thanks to sieve) into subfolders which do not
end up in my ham corpus. My inbox contains 1:1 mail and quite a bunch of
newsletters (which I registered for). Also, all bounces and similar
messages go directly into subfolders and are therefore **not** in my
corpus.

I could put much more ham in if I dig deeper into my archive folders, but
I thought mail that is too old is not good for masschecks. For the spam
corpus I delete everything older than 30 days before running masscheck.

Cheers and have a good one


tobi





Re: new to masscheck, some questions

Posted by John Hardin <jh...@impsec.org>.
On Wed, 5 May 2021, jahlives@gmx.ch wrote:

> Hello
>
> I'm new to masscheck, nothing uploaded yet, and have two questions

Welcome aboard!

> As my spam corpus comes from my traps and my ham "just" from my personal
> addresses there is quite an imbalance between my spam- and ham corpus
> (300 ham and several k's of spam). Is such an imbalance a problem for
> reliable masscheck?

"Reliable"? No, the balance doesn't affect reliability. What affects 
reliability is the accuracy of the classification of the messages in your 
corpora - ham really needs to be *ham*. Misclassification has a greater 
impact than a poor ratio. Spend some time making sure it's correctly 
classified.

That said, what we really need is ham in non-English languages. If there's 
any way you can get more good (accurately classified) non-English ham, 
that would be the greatest benefit.

Your masscheck corpora don't leave your machine, only the rule hit stats 
get uploaded, so it's not a potential privacy violation (or not much of 
one). Do you know anyone (perhaps family members) who would trust you with 
a copy of their ham emails to add to your corpus?

Is your ham corpus limited to what you've used to train Bayes? Or do you 
really get that little email? Put more in. About the only 
properly-classified ham I *wouldn't* put into masscheck corpora would be 
emails discussing spam (e.g. the SA users list is a big no-no).


-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org                         pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Are you a mildly tech-literate politico horrified by the level of
   ignorance demonstrated by lawmakers gearing up to regulate online
   technology they don't even begin to grasp? Cool. Now you have a
   tiny glimpse into a day in the life of a gun owner.   -- Sean Davis
-----------------------------------------------------------------------
  3 days until the 76th anniversary of VE day

Re: new to masscheck, some questions

Posted by ja...@gmx.ch.
The warning disappears if I use just one run_masscheck call instead of
splitting it in two.


Cheers


tobi

On 5/5/21 9:47 AM, Henrik K wrote:
> On Wed, May 05, 2021 at 10:44:17AM +0300, Henrik K wrote:
>> On Wed, May 05, 2021 at 10:42:24AM +0300, Henrik K wrote:
>>> As per automasscheck-minimal.cf.dist it should look like this:
>>>
>>> run_all_masschecks() {
>>>   ### sample: single corpus ###
>>>   run_masscheck spam-corpus --all \
>>>           --after=-4838400 spam:dir:/data/archive/spam/ \
>>>           --after=-174182400 ham:dir:/data/archive/ham/
>>> }
>>>
>> PS.  Never mind the "spam-corpus"..  which is "single-corpus" in example.. 
>> the name doesn't affect anything..
> Actually it does..  unless you use "single-corpus" the name will be added to
> the physical lognames..
>
>   if [ "$CORPUSNAME" = "single-corpus" ]; then
>     # Use this if you have only a single corpus
>     LOGSUFFIX=
>   else
>     LOGSUFFIX="-${CORPUSNAME}"
>   fi
>
> So I recommend using "single-corpus".  Normally there isn't any need to call
> run_masscheck multiple times.
>
> Sorry for rambling, busy working right now..
>
>


Re: new to masscheck, some questions

Posted by Henrik K <he...@hege.li>.
On Wed, May 05, 2021 at 10:44:17AM +0300, Henrik K wrote:
> On Wed, May 05, 2021 at 10:42:24AM +0300, Henrik K wrote:
> > 
> > As per automasscheck-minimal.cf.dist it should look like this:
> > 
> > run_all_masschecks() {
> >   ### sample: single corpus ###
> >   run_masscheck spam-corpus --all \
> >           --after=-4838400 spam:dir:/data/archive/spam/ \
> >           --after=-174182400 ham:dir:/data/archive/ham/
> > }
> > 
> 
> PS.  Never mind the "spam-corpus"..  which is "single-corpus" in example.. 
> the name doesn't affect anything..

Actually it does..  unless you use "single-corpus" the name will be added to
the physical lognames..

  if [ "$CORPUSNAME" = "single-corpus" ]; then
    # Use this if you have only a single corpus
    LOGSUFFIX=
  else
    LOGSUFFIX="-${CORPUSNAME}"
  fi

So I recommend using "single-corpus".  Normally there isn't any need to call
run_masscheck multiple times.
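
To make the effect of that conditional concrete, here is a minimal
sketch. The log_suffix function and the "ham...log" file name pattern
are assumptions for illustration only; the real script may build its
log names differently.

```shell
# Mirrors the conditional above: "single-corpus" yields an empty suffix,
# any other corpus name is appended with a leading dash.
# log_suffix and the "ham<suffix>.log" pattern are hypothetical.
log_suffix() {
  if [ "$1" = "single-corpus" ]; then
    printf '%s' ""
  else
    printf '%s' "-$1"
  fi
}

for name in single-corpus spam-corpus; do
  echo "$name -> ham$(log_suffix "$name").log"
done
```

With "single-corpus" the name stays out of the log file name; with
"spam-corpus" it is appended.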

Sorry for rambling, busy working right now..


Re: new to masscheck, some questions

Posted by Henrik K <he...@hege.li>.
On Wed, May 05, 2021 at 10:42:24AM +0300, Henrik K wrote:
> 
> As per automasscheck-minimal.cf.dist it should look like this:
> 
> run_all_masschecks() {
>   ### sample: single corpus ###
>   run_masscheck spam-corpus --all \
>           --after=-4838400 spam:dir:/data/archive/spam/ \
>           --after=-174182400 ham:dir:/data/archive/ham/
> }
> 

PS.  Never mind the "spam-corpus"..  which is "single-corpus" in example.. 
the name doesn't affect anything..

And thanks for your contribution.  :-)


Re: new to masscheck, some questions

Posted by Henrik K <he...@hege.li>.
On Wed, May 05, 2021 at 09:24:37AM +0200, jahlives@gmx.ch wrote:
> Hello
> 
> I'm new to masscheck, nothing uploaded yet, and have two questions
> 
> As my spam corpus comes from my traps and my ham "just" from my personal
> addresses there is quite an imbalance between my spam- and ham corpus
> (300 ham and several k's of spam). Is such an imbalance a problem for
> reliable masscheck?

Personal ham/spam counts are irrelevant, as masscheck processes all the
corpora together.  You can see at https://ruleqa.spamassassin.org/ that
there are many "spam-only" corpora etc.


> Second: I tested masscheck script with my config but I get a warning
> which I'm not sure it can be ignored or not:
> 
> archive-iterator: invalid (undef) format in target list, run_masscheck
> at
> /root/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/ArchiveIterator.pm
> line 545.
> archive-iterator: invalid (undef) format in target list, ham-corpus at
> /root/masscheckwork/nightly_mass_check/masses/../lib/Mail/SpamAssassin/ArchiveIterator.pm
> line 545.
> 
> but according to my config the ham-corpus is defined
> 
> run_all_masschecks() {
>   ### sample: single corpus ###
>   run_masscheck spam-corpus --all \
>           --after=-4838400 spam:dir:/data/archive/spam/ \
>   run_masscheck ham-corpus --all \
>           --after=-174182400 ham:dir:/data/archive/ham/
> }

Why did you split run_masscheck in two?  I think mass-check requires
defining both spam/ham always.

As per automasscheck-minimal.cf.dist it should look like this:

run_all_masschecks() {
  ### sample: single corpus ###
  run_masscheck spam-corpus --all \
          --after=-4838400 spam:dir:/data/archive/spam/ \
          --after=-174182400 ham:dir:/data/archive/ham/
}
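
A likely explanation for the warning itself: in the split version, the
line with the spam target ends in a backslash, so the shell continues
the command and the second run_masscheck line is passed as extra
arguments to the first call. The words run_masscheck and ham-corpus then
land in the target list without a class:format prefix, which matches the
two names in the warning. A minimal sketch of the effect, using a
made-up show_args function in place of run_masscheck:

```shell
# show_args is a hypothetical stand-in for run_masscheck: it prints each
# argument it receives on its own line, wrapped in brackets.
show_args() {
  for arg in "$@"; do
    printf '[%s]\n' "$arg"
  done
}

# The backslash after the spam target joins the next line into this same
# command, so "show_args" and "ham-corpus" become plain arguments.
broken_output=$(show_args spam-corpus --all \
        spam:dir:/data/archive/spam/ \
  show_args ham-corpus --all \
        ham:dir:/data/archive/ham/)

printf '%s\n' "$broken_output"
```

The function is called only once and receives seven arguments, including
the literal word show_args. The ham target is still among them, which
would explain why the ham mail was processed despite the warning.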