
Posted to ruleqa@spamassassin.apache.org by David Jones <dj...@ena.com.INVALID> on 2017/06/01 02:52:42 UTC

Ruleqa masscheck so close.

I am working pretty hard to get the ruleqa processing going again on our new server.  We are so close to having enough contributors and ham/spam to get some new rules generated.  This is from the run a few minutes ago:

HAM CONTRIBUTORS FOUND: 9 (required 10)
SPAM CONTRIBUTORS FOUND: 9 (required 10)

We need to recruit some more masscheck'ers to get over the hump so I can do some final testing of the rules updates and start the DNS updates again for sa-update.

P.S. After spending the past month learning how this works, I have some ideas on how to make the nightly masschecks become hourly fairly easily so we can test and promote rule changes faster.

Dave

Re: Ruleqa masscheck so close.

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

Re: I was under the mistaken impression that we were supposed to be manually
sorting individual spam/ham to prevent duplicates


The biggest issue is it must be 100% spam and 100% ham.

You cannot trust automated or user submissions.  

See https://wiki.apache.org/spamassassin/CorpusCleaning
Regards,
KAM

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

From: John Hardin <jh...@impsec.org>
On Fri, 2 Jun 2017, David Jones wrote:

>> >  Wow!  I just started collecting ham/spam for masscheck back in January
>> >  and my (apparently) tiny corpus only takes under a minute to run on a
>> >  low end 2 core VM.  I didn't realize that there would be some that take
>> >  a long time to run.  Still pretty new to all of this backend processing.
>>

I was under the mistaken impression that we were supposed to be manually
sorting individual spam/ham to prevent duplicates which was very time
consuming.  Now that I know how others are doing this, I have opened up
the spam/ham floodgates to sort them into staging folders for a quick review
and move into the final masscheck dirs.
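
As a rough illustration of the duplicate concern, here is a minimal Python sketch that drops messages whose bodies are identical after whitespace is collapsed. The function names and the choice of a SHA-256 body hash are assumptions made for this example; nothing here is part of masscheck, and near-duplicates with small body changes would not be caught.

```python
import hashlib


def body_fingerprint(raw_message: str) -> str:
    """Hash the body of a raw message, ignoring headers and whitespace runs."""
    # Headers end at the first blank line; everything after it is the body.
    normalized = raw_message.replace("\r\n", "\n")
    _, _, body = normalized.partition("\n\n")
    collapsed = " ".join(body.split())
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def drop_duplicates(messages: list[str]) -> list[str]:
    """Keep only the first message seen for each body fingerprint."""
    seen = set()
    unique = []
    for raw in messages:
        digest = body_fingerprint(raw)
        if digest not in seen:
            seen.add(digest)
            unique.append(raw)
    return unique
```

A pass like this only reduces the review workload in the staging folders; it does not replace the quick manual review.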

>It shouldn't be *too* difficult to incorporate multi-core into the simple 
>masscheck script. Take a look for a "-j" parameter in the docs, IIRC all I 
>did was add "-j4" to the command line.

I watched my masscheck run a few minutes ago and it is definitely
threading out to take advantage of multiple cores.  Now that I have a large
corpus to masscheck, I am going to bump up the RAM and the cores on
my VM.  My run on about 8K messages took about 30 minutes on 2 cores.

Dave

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

On 06/02/2017 03:23 PM, John Hardin wrote:
> On Fri, 2 Jun 2017, David Jones wrote:
> 
>> On 06/01/2017 07:52 PM, John Hardin wrote:
>>>  On Thu, 1 Jun 2017, David Jones wrote:
>>
>>> >  I have not found enough details yet on the central masscheck so I have
>>> >  started with getting the remote masscheck processing working first so
>>> >  we can get sa-update going again.
>>>
>>>  OK, I will focus on that instead.
>>>
>>> > >  P.S. After spending the past month learning how this works, I have
>>> > >  some ideas on how to make the nightly masschecks become hourly
>>> > >  fairly easily so we can test and promote rule changes faster.
>>> >
>>> > >  How do you guarantee all the contributors can perform the checks
>>> > >  within an hour?
>>> >
>>> >  Wow!  I just started collecting ham/spam for masscheck back in January
>>> >  and my (apparently) tiny corpus only takes under a minute to run on a
>>> >  low end 2 core VM.  I didn't realize that there would be some that take
>>> >  a long time to run.  Still pretty new to all of this backend processing.
>>>
>>>  There are some contributors that run large honeypot networks.
>>>
>>>  My fairly small corpus takes (IIRC, it's been a while since I ran a
>>>  local masscheck) over an hour on a dedicated 4-core box...
>>>
>>> >  Dave
>>
>> John,
>>
>> Can you watch your 4-core box when your local masscheck is running? 
>> Unless you have written something extra or there is another masscheck 
>> script that I haven't found, the masscheck processing is single threaded.
> 
> I haven't been using the simple masscheck script because I was just 
> doing it locally for my own consumption. It's a custom script.
> 
>> It doesn't matter if you have 24 cores, it's still going to take over 
>> an hour.  We may need to look at writing a newer masscheck script that 
>> will use async processing to get that time way down.
> 
> There is a parameter you can set somewhere to say how many parallel 
> scanning processes to run, and I was running 4. That box is powered down 
> at the moment, I will need to reboot it to look at my scripting. I'll do 
> that and post the details this evening or tomorrow.
> 
> It shouldn't be *too* difficult to incorporate multi-core into the 
> simple masscheck script. Take a look for a "-j" parameter in the docs, 
> IIRC all I did was add "-j4" to the command line.
> 

I recall that now.  My memory is not what it used to be.  I have to 
triple check things these days and I didn't in this case.  :)
-- 
Dave

Re: Ruleqa masscheck so close.

Posted by John Hardin <jh...@impsec.org>.

On Fri, 2 Jun 2017, David Jones wrote:

> On 06/01/2017 07:52 PM, John Hardin wrote:
>>  On Thu, 1 Jun 2017, David Jones wrote:
>
>> >  I have not found enough details yet on the central masscheck so I have
>> >  started with getting the remote masscheck processing working first so
>> >  we can get sa-update going again.
>>
>>  OK, I will focus on that instead.
>> 
>> > >  P.S. After spending the past month learning how this works, I have 
>> > >  some ideas on how to make the nightly masschecks become hourly 
>> > >  fairly easily so we can test and promote rule changes faster.
>> > 
>> > >  How do you guarantee all the contributors can perform the checks 
>> > >  within an hour?
>> > 
>> >  Wow!  I just started collecting ham/spam for masscheck back in January
>> >  and my (apparently) tiny corpus only takes under a minute to run on a
>> >  low end 2 core VM.  I didn't realize that there would be some that take
>> >  a long time to run.  Still pretty new to all of this backend processing.
>>
>>  There are some contributors that run large honeypot networks.
>>
>>  My fairly small corpus takes (IIRC, it's been a while since I ran a local
>>  masscheck) over an hour on a dedicated 4-core box...
>> 
>> >  Dave
> 
> John,
>
> Can you watch your 4-core box when your local masscheck is running? Unless 
> you have written something extra or there is another masscheck script that I 
> haven't found, the masscheck processing is single threaded.

I haven't been using the simple masscheck script because I was just doing 
it locally for my own consumption. It's a custom script.

> It doesn't matter if you have 24 cores, it's still going to take over an 
> hour.  We may need to look at writing a newer masscheck script that will 
> use async processing to get that time way down.

There is a parameter you can set somewhere to say how many parallel 
scanning processes to run, and I was running 4. That box is powered down 
at the moment, I will need to reboot it to look at my scripting. I'll do 
that and post the details this evening or tomorrow.

It shouldn't be *too* difficult to incorporate multi-core into the simple 
masscheck script. Take a look for a "-j" parameter in the docs, IIRC all I 
did was add "-j4" to the command line.
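
To illustrate the general idea of running N scanning processes in parallel, here is a minimal Python sketch that deals a list of message paths into one batch per job. It is not how the masscheck script implements "-j"; the function name `split_corpus` and the round-robin strategy are assumptions for illustration only.

```python
def split_corpus(paths: list[str], jobs: int) -> list[list[str]]:
    """Deal message paths round-robin into `jobs` roughly equal batches."""
    if jobs < 1:
        raise ValueError("jobs must be at least 1")
    batches = [[] for _ in range(jobs)]
    for index, path in enumerate(paths):
        # Message 0 goes to batch 0, message 1 to batch 1, and so on,
        # wrapping around so no batch is more than one message larger.
        batches[index % jobs].append(path)
    return batches


# Example: five messages spread over four workers.
batches = split_corpus(["m1", "m2", "m3", "m4", "m5"], jobs=4)
```

Each batch could then be handed to its own worker process (for example via `concurrent.futures.ProcessPoolExecutor`), so that a 4-core box scans four messages at a time.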

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Our government should bear in mind the fact that the American
   Revolution was touched off by the then-current government
   attempting to confiscate firearms from the people.
-----------------------------------------------------------------------
  4 days until the 73rd anniversary of D-Day

Re: Ruleqa masscheck so close.

Posted by Dave Jones <da...@apache.org>.

On 06/04/2017 01:18 PM, Jari Fredriksson wrote:

> Cool. I have been doing masschecks somewhere at 1700UTC but now changed
> it to take place at 1200EET. My corpus takes 3-4 hours on 4 core (Core
> i7 920) or 1/2h at Google Compute @32 cores and ramdisk. I'm not looking
> much at the planned hourly submission, but it remains to be seen, what
> would it be.
> 

Daily masschecks are probably enough.  I thought that was the only time 
that rules were able to be updated.  Still learning/discovering what 
used to run on the previous servers a few months ago.

It's very possible that as I keep digging in the old server backups and 
logs, I may find that buildbot was linked to SVN commits so the devs 
could do on-demand rule updates.

> Ideally I would want to do this at night time when the electricity is
> cheap...
> 

If we can get enough interest in running masschecks twice a day to allow 
for some flexibility in the time for electricity costs, then we could 
let people choose to do both or stay with a single run.  I would be 
willing to do a 9:00 UTC run and a 21:00 UTC run if anyone else thinks 
this would be worth it.

For our current setup at 9:00 UTC, if we can get enough masscheck 
submissions by 13:00 UTC, then I can set up a script to detect when we 
have met the minimum requirements and run the score updates without 
having to wait another ~20 hours as the current cron schedule requires.
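
A minimal sketch of what such a check could look like, in Python. The threshold of 10 comes from the "required 10" output earlier in the thread and the 09:00-13:00 UTC window from the paragraph above; the function name, the `already_ran` flag, and how the counts are obtained are all assumptions for illustration.

```python
from datetime import datetime, time, timezone

MIN_CONTRIBUTORS = 10        # "required 10" in the run output quoted in this thread
WINDOW_START = time(9, 0)    # masscheck start time, UTC
WINDOW_END = time(13, 0)     # proposed cut-off for early submissions, UTC


def should_run_score_update(ham: int, spam: int, now: datetime,
                            already_ran: bool) -> bool:
    """True when both thresholds are met inside the UTC window.

    `now` must be a timezone-aware datetime; `already_ran` is a
    hypothetical flag that prevents a second run on the same day.
    """
    if already_ran:
        return False
    now_utc = now.astimezone(timezone.utc).time()
    in_window = WINDOW_START <= now_utc <= WINDOW_END
    enough = ham >= MIN_CONTRIBUTORS and spam >= MIN_CONTRIBUTORS
    return in_window and enough
```

A cron job could call this every few minutes during the window and fall back to the existing daily run if it never returns True.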

Dave

Re: Ruleqa masscheck so close.

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

There is a way to set a ceiling.  My masscheck is still messed up as I am working on foundational issues to launch things at a new data center.

Is there a score set anywhere for URI_WP_HACKED?  I can look later but there are ways to force scores and especially force a ceiling.  I equate it to doing manual square roots.  We have to feed it an educated starting point and let it float from there.

It might be more pinned than it should be!
Regards,
KAM

On June 9, 2017 9:03:16 AM EDT, David Jones <dj...@ena.com.INVALID> wrote:      
>Question about the URI_WP_HACKED rule.  Why is it still at the default 
>of 1.0 since its S/O on http://ruleqa.spamassassin.org has been 1.000 
>for a long time?
>
>What sets the default scores in 50_scores.cf and what determines what goes 
>into the nightly 72_scores.cf?  Is there still something I need to find
>and get running again on the new server?

Re: Ruleqa masscheck so close.

Posted by John Hardin <jh...@impsec.org>.

On Fri, 9 Jun 2017, David Jones wrote:

> Question about the URI_WP_HACKED rule.  Why is it still at the default of 1.0 
> since its S/O on http://ruleqa.spamassassin.org has been 1.000 for a long 
> time?

There's a limit of 3.000, the score generator decides on a score up to 
that.

It takes into account total hits and score already on those messages as 
well as S/O. A rule with an S/O of 1.00 that hits on few messages that 
already score well, won't be scored very high.

> What sets the default scores in 50_scores.cf

That's manual now.

> and what determines what goes into the nightly 72_scores.cf?

The masscheck score generation process. Running the "what hits" 
distributed part of masscheck is only part of it. There's another phase 
that takes that data from everybody and calculates rule scores.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   How can you reason with someone who thinks we're on a glidepath to
   a police state and yet their solution is to grant the government a
   monopoly on force? They are insane.
-----------------------------------------------------------------------
  71 days since the first commercial re-flight of an orbital booster (SpaceX)

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

On 06/04/2017 01:18 PM, Jari Fredriksson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> Cool. I have been doing masschecks somewhere at 1700UTC but now changed
> it to take place at 1200EET. My corpus takes 3-4 hours on 4 core (Core
> i7 920) or 1/2h at Google Compute @32 cores and ramdisk. I'm not looking
> much at the planned hourly submission, but it remains to be seen, what
> would it be.
> 
> Ideally I would want to do this at night time when the electricity is
> cheap...
> 
My 'ena' corpus is now up to 85K (28K spam/57K ham), growing about 10-12K a 
day and scoring consistently on the ham/spam rule hits:

Rule hit frequencies:
    OVERALL        SPAM         HAM  NAME
      85428       28358       57070  (all messages)
       8647        8643           4  URI_WP_HACKED
       3914        3914           0  HELO_MISC_IP
       3138        3136           2  DATE_IN_FUTURE_06_12
       2897        2896           1  T_PDS_TO_EQ_FROM_NAME
       3105        3102           3  T_PDS_FROM_2_EMAILS
       2731        2729           2  DRUGS_ERECTILE
       2415        2415           0  URI_ONLY_MSGID_MALF
       2402        2402           0  DOS_OE_TO_MX
       2509        2507           2  LONGWORDS
       1928        1928           0  DRUGS_ERECTILE_OBFU
       3644        3622          22  MIMEOLE_DIRECT_TO_MX
       1820        1817           3  MISSING_SUBJECT
       1657        1657           0  FUZZY_PHARMACY
       1648        1648           0  DOS_OUTLOOK_TO_MX
       2709        2693          16  T_NAME_EMAIL_DIFF
       1545        1544           1  DATE_IN_FUTURE_03_06
       1514        1514           0  MISSING_MIME_HB_SEP
       1142        1142           0  SUBJECT_DRUG_GAP_L
       1128        1127           1  FUZZY_PRICES
       1064        1064           0  SUBJECT_DRUG_GAP_C

My masscheck processing is taking about 2 hours on my 4 core VM.

Question about the URI_WP_HACKED rule.  Why is it still at the default 
of 1.0 since its S/O on http://ruleqa.spamassassin.org has been 1.000 
for a long time?
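
For reference, a minimal Python sketch of how an S/O figure could be reproduced from the hit frequencies listed above. It assumes S/O is the spam hit rate divided by the sum of the spam and ham hit rates, each normalised by corpus size; that formula and the function name `s_over_o` are assumptions for illustration, not taken from the ruleqa code, and the ruleqa site combines all contributors rather than a single corpus.

```python
def s_over_o(spam_hits: int, spam_total: int,
             ham_hits: int, ham_total: int):
    """Spam hit rate over combined hit rate; None if the rule never hit."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate + ham_rate == 0:
        return None
    return spam_rate / (spam_rate + ham_rate)


# Totals and rows taken from the 'ena' corpus listing above.
SPAM_TOTAL, HAM_TOTAL = 28358, 57070
wp_hacked = s_over_o(8643, SPAM_TOTAL, 4, HAM_TOTAL)
mimeole = s_over_o(3622, SPAM_TOTAL, 22, HAM_TOTAL)
```

Under that assumption the URI_WP_HACKED row rounds to 1.000 at three decimal places, while MIMEOLE_DIRECT_TO_MX, with 22 ham hits, comes out slightly lower.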

What sets the default scores in 50_scores.cf and what determines what goes 
into the nightly 72_scores.cf?  Is there still something I need to find 
and get running again on the new server?

-- 
Dave

Re: Ruleqa masscheck so close.

Posted by Jari Fredriksson <ja...@iki.fi>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

David Jones kirjoitti 3.6.2017 0:09:
> On 06/02/2017 04:01 PM, Kevin Golding wrote:
>> On Fri, 02 Jun 2017 21:06:37 +0100, David Jones <dj...@ena.com.invalid> wrote:
>> 
>>> P.S. we are currently still at only 9 masscheck contributors in the past day so we need one or two more.
>> 
>> Unfortunately with ruleqa down you're the only person who knows the upload status. I received confirmation from my system at 11:56:06 which said my uploads were successful.
>> 
>> Given you reported 9 yesterday I don't know if there's an issue at the server with my upload or if someone else dropped off. If there is an issue with my upload then please tell me and I'll look into it, but I have no way of telling from the data you've provided.
> 
> Woohoo!  We now have 11...
> 
> SVN tagged rev in nightly_mass_check:  1797329
> 
> New masscheck submission listings in the past day:
>    SVN rev (Match) File Name (Date)
>    1797329 (Yes) - spam-darxus.log (Jun 2 02:09)
>    1797329 (Yes) - ham-kgolding.log (Jun 2 05:00)
>    1797329 (Yes) - ham-darxus.log (Jun 2 02:09)
>    1797329 (Yes) - ham-grenier.log (Jun 2 02:02)
>    1797329 (Yes) - ham-ena.log (Jun 2 02:07)
>    1797329 (Yes) - spam-jbrooks.log (Jun 2 13:00)
>    1797329 (Yes) - spam-axb-generic.log (Jun 2 04:41)
>    1797329 (Yes) - spam-axb-ham-misc.log (Jun 2 04:41)
>    1797329 (Yes) - spam-grenier.log (Jun 2 02:02)
>    1797329 (Yes) - ham-axb-ham-misc.log (Jun 2 04:41)
>    1797329 (Yes) - spam-kgolding.log (Jun 2 05:00)
>    1797329 (Yes) - ham-axb-generic.log (Jun 2 04:41)
>    1797329 (Yes) - ham-axb-ninja.log (Jun 2 04:41)
>    1797329 (Yes) - spam-axb-ninja.log (Jun 2 04:41)
>    1797329 (Yes) - ham-jbrooks.log (Jun 2 13:00)
>    1797329 (Yes) - spam-jarif.log (Jun 2 12:50)
>    1797329 (Yes) - ham-thendrikx.log (Jun 2 02:04)
>    1797329 (Yes) - ham-jarif.log (Jun 2 12:50)
>    1797329 (Yes) - spam-axb-coi-bulk.log (Jun 2 04:41)
>    1797329 (Yes) - spam-thendrikx.log (Jun 2 02:04)
>    1797329 (Yes) - ham-axb-coi-bulk.log (Jun 2 04:41)
>    1797329 (Yes) - spam-ena.log (Jun 2 02:07)
> 
> 22/22 matches (11 ham, 11 spam)

Cool. I have been doing masschecks somewhere at 1700UTC but now changed
it to take place at 1200EET. My corpus takes 3-4 hours on 4 core (Core
i7 920) or 1/2h at Google Compute @32 cores and ramdisk. I'm not looking
much at the planned hourly submission, but it remains to be seen what
that would look like.

Ideally I would want to do this at night time when the electricity is
cheap...

- -- 
jarif@iki.fi
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iEYEARECAAYFAlk0TtoACgkQKL4IzOyjSrYrxACfajqt2TqTJIEW7OWW2y4n9wuL
iZMAn0nwgL64OXMVdVaXGIQS5FvEgdDZ
=4BgA
-----END PGP SIGNATURE-----

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

On 06/02/2017 04:01 PM, Kevin Golding wrote:
> On Fri, 02 Jun 2017 21:06:37 +0100, David Jones <dj...@ena.com.invalid> 
> wrote:
> 
>> P.S. we are currently still at only 9 masscheck contributors in the 
>> past day so we need one or two more.
> 
> Unfortunately with ruleqa down you're the only person who knows the 
> upload status. I received confirmation from my system at 11:56:06 which 
> said my uploads were successful.
> 
> Given you reported 9 yesterday I don't know if there's an issue at the 
> server with my upload or if someone else dropped off. If there is an 
> issue with my upload then please tell me and I'll look into it, but I 
> have no way of telling from the data you've provided.

Woohoo!  We now have 11...

SVN tagged rev in nightly_mass_check:  1797329

New masscheck submission listings in the past day:
    SVN rev (Match) File Name (Date)
    1797329 (Yes) - spam-darxus.log (Jun 2 02:09)
    1797329 (Yes) - ham-kgolding.log (Jun 2 05:00)
    1797329 (Yes) - ham-darxus.log (Jun 2 02:09)
    1797329 (Yes) - ham-grenier.log (Jun 2 02:02)
    1797329 (Yes) - ham-ena.log (Jun 2 02:07)
    1797329 (Yes) - spam-jbrooks.log (Jun 2 13:00)
    1797329 (Yes) - spam-axb-generic.log (Jun 2 04:41)
    1797329 (Yes) - spam-axb-ham-misc.log (Jun 2 04:41)
    1797329 (Yes) - spam-grenier.log (Jun 2 02:02)
    1797329 (Yes) - ham-axb-ham-misc.log (Jun 2 04:41)
    1797329 (Yes) - spam-kgolding.log (Jun 2 05:00)
    1797329 (Yes) - ham-axb-generic.log (Jun 2 04:41)
    1797329 (Yes) - ham-axb-ninja.log (Jun 2 04:41)
    1797329 (Yes) - spam-axb-ninja.log (Jun 2 04:41)
    1797329 (Yes) - ham-jbrooks.log (Jun 2 13:00)
    1797329 (Yes) - spam-jarif.log (Jun 2 12:50)
    1797329 (Yes) - ham-thendrikx.log (Jun 2 02:04)
    1797329 (Yes) - ham-jarif.log (Jun 2 12:50)
    1797329 (Yes) - spam-axb-coi-bulk.log (Jun 2 04:41)
    1797329 (Yes) - spam-thendrikx.log (Jun 2 02:04)
    1797329 (Yes) - ham-axb-coi-bulk.log (Jun 2 04:41)
    1797329 (Yes) - spam-ena.log (Jun 2 02:07)

22/22 matches (11 ham, 11 spam)
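
A minimal Python sketch of how the ham/spam counts could be derived from a listing in the format above. The regular expression, the function names, and the choice to count each distinct log name as one contributor (so the four axb corpora count separately, which matches the 11/11 total here) are assumptions for illustration, not the actual server-side script.

```python
import re

# Matches names such as "ham-kgolding.log" or "spam-axb-ham-misc.log".
LOG_NAME = re.compile(r"\b(ham|spam)-([A-Za-z0-9_-]+)\.log\b")


def count_submissions(listing: str) -> dict[str, int]:
    """Count distinct ham and spam log names found in a listing."""
    names = {"ham": set(), "spam": set()}
    for kind, contributor in LOG_NAME.findall(listing):
        names[kind].add(contributor)
    return {kind: len(found) for kind, found in names.items()}


def meets_minimum(counts: dict[str, int], required: int = 10) -> bool:
    """True when both ham and spam reach the required count."""
    return counts["ham"] >= required and counts["spam"] >= required
```

Feeding the 22 lines above into `count_submissions` and passing the result to `meets_minimum` would report whether the "required 10" threshold has been reached.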


-- 
Dave

Re: Ruleqa masscheck so close.

Posted by Kevin Golding <kp...@caomhin.org>.

On Fri, 02 Jun 2017 21:06:37 +0100, David Jones <dj...@ena.com.invalid>  
wrote:

> P.S. we are currently still at only 9 masscheck contributors in the past  
> day so we need one or two more.

Unfortunately with ruleqa down you're the only person who knows the upload  
status. I received confirmation from my system at 11:56:06 which said my  
uploads were successful.

Given you reported 9 yesterday I don't know if there's an issue at the  
server with my upload or if someone else dropped off. If there is an issue  
with my upload then please tell me and I'll look into it, but I have no  
way of telling from the data you've provided.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

On 06/01/2017 07:52 PM, John Hardin wrote:
> On Thu, 1 Jun 2017, David Jones wrote:

>> I have not found enough details yet on the central masscheck so I have
>> started with getting the remote masscheck processing working first so
>> we can get sa-update going again.
> 
> OK, I will focus on that instead.
> 
>>> P.S. After spending the past month learning how this works, I have some
>>> ideas on how to make the nightly masschecks become hourly fairly easily
>>> so we can test and promote rule changes faster.
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>>
>> Wow!  I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM.  I didn't realize that there would be some that take
>> a long time to run.  Still pretty new to all of this backend processing.
> 
> There are some contributors that run large honeypot networks.
> 
> My fairly small corpus takes (IIRC, it's been a while since I ran a 
> local masscheck) over an hour on a dedicated 4-core box...
> 
>> Dave
> 
John,

Can you watch your 4-core box when your local masscheck is running? 
Unless you have written something extra or there is another masscheck 
script that I haven't found, the masscheck processing is single 
threaded.  It doesn't matter if you have 24 cores, it's still going to 
take over an hour.  We may need to look at writing a newer masscheck 
script that will use async processing to get that time way down.

P.S. we are currently still at only 9 masscheck contributors in the past 
day so we need one or two more.

-- 
Dave

Re: Ruleqa masscheck so close.

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

On 6/1/2017 9:42 PM, Dave Jones wrote:
> How are these large honeypot contributors sorting the ham/spam
> at a large scale without spending all day manually doing it? 

With the honeypot that I run, it's 100% spam.  No need to sort.

Regards,

KAM

Re: Ruleqa masscheck so close.

Posted by Dave Jones <da...@apache.org>.

On 06/01/2017 07:52 PM, John Hardin wrote:
> On Thu, 1 Jun 2017, David Jones wrote:
> 
>>> From: John Hardin <jh...@impsec.org>
>>
>>>> On Thu, 1 Jun 2017, David Jones wrote:
>>
>>>> I am working pretty hard to get the ruleqa processing going
>>>> again on our new server.   We are so close to having enough
>>>> contributors and ham/spam to get some new rules generated:
>>>> This is from the run minutes ago:
>>>>
>>>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>>>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>>>
>>>> We need to recruit some more masscheck'ers to get over the hump so I 
>>>> can
>>>> do some final testing of the rules updates and start the DNS updates
>>>> again for sa-update.
>>
>>> I upload my corpora, not my results - I suppose that my corpora didn't
>>> survive the migration, and I haven't yet brought my submission bot and
>>> rsync account up-to-date for the new hardware - my apologies. I may be
>>> able to give that some cycles this weekend.
>>
>> So, about that.  I just started helping as a sysadmin a month or so 
>> ago.  We
>> had some hosting issues that we are trying to recover from with very 
>> little
>> documentation of the infrastructure 3 months ago.  I am having to dig
>> through logs, cron output, and old (outdated) documentation to try to put
>> the puzzle back together again.
>>
>> If anyone has any knowledge or documentation of how things were set up
>> in the past, I would love to talk with you.
>>
>> We do have backups from one of the servers but I think there were two
>> or three servers before based on the fact that I can't find any evidence
>> where buildbot was running that I think was running the centralized
>> masscheck.
>>
>>> However, I don't recall seeing confirmation that the central 
>>> masscheck was
>>> actually working; can you confirm that? Or do I need to change over to
>>> local masscheck and uploading results like most others do?
>>
>> I have not found enough details yet on the central masscheck so I have
>> started with getting the remote masscheck processing working first so
>> we can get sa-update going again.
> 
> OK, I will focus on that instead.
> 
>>> P.S. After spending the past month learning how this works, I have some
>>> ideas on how to make the nightly masschecks become hourly fairly easily
>>> so we can test and promote rule changes faster.
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>>
>> Wow!  I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM.  I didn't realize that there would be some that take
>> a long time to run.  Still pretty new to all of this backend processing.
> 
> There are some contributors that run large honeypot networks.
> 

Cool.  How are these large honeypot contributors sorting the ham/spam
at a large scale without spending all day manually doing it?  I have the 
potential of getting large amounts of ham/spam but I don't have the time 
to manually sort all of it.

> My fairly small corpus takes (IIRC, it's been a while since I ran a 
> local masscheck) over an hour on a dedicated 4-core box...
> 

So if we can get this to run as close to 9:00 UTC as possible, then maybe 
we can get enough contributions by 11:00 or 12:00 UTC to roll out new 
ruleqa results and scores for sa-update sooner than in the past.

-- 
David Jones

Re: Ruleqa masscheck so close.

Posted by John Hardin <jh...@impsec.org>.

On Thu, 1 Jun 2017, David Jones wrote:

>> From: John Hardin <jh...@impsec.org>
>  
>>> On Thu, 1 Jun 2017, David Jones wrote:
>
>>> I am working pretty hard to get the ruleqa processing going
>>> again on our new server.   We are so close to having enough
>>> contributors and ham/spam to get some new rules generated:
>>> This is from the run minutes ago:
>>>
>>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>>
>>> We need to recruit some more masscheck'ers to get over the hump so I can
>>> do some final testing of the rules updates and start the DNS updates
>>> again for sa-update.
>
>> I upload my corpora, not my results - I suppose that my corpora didn't
>> survive the migration, and I haven't yet brought my submission bot and
>> rsync account up-to-date for the new hardware - my apologies. I may be
>> able to give that some cycles this weekend.
>
> So, about that.  I just started helping as a sysadmin a month or so ago.  We
> had some hosting issues that we are trying to recover from with very little
> documentation of the infrastructure 3 months ago.  I am having to dig
> through logs, cron output, and old (outdated) documentation to try to put
> the puzzle back together again.
>
> If anyone has any knowledge or documentation of how things were set up
> in the past, I would love to talk with you.
>
> We do have backups from one of the servers but I think there were two
> or three servers before based on the fact that I can't find any evidence
> where buildbot was running that I think was running the centralized
> masscheck.
>
>> However, I don't recall seeing confirmation that the central masscheck was
>> actually working; can you confirm that? Or do I need to change over to
>> local masscheck and uploading results like most others do?
>
> I have not found enough details yet on the central masscheck so I have
> started with getting the remote masscheck processing working first so
> we can get sa-update going again.

OK, I will focus on that instead.

>> P.S. After spending the past month learning how this works, I have some
>> ideas on how to make the nightly masschecks become hourly fairly easily
>> so we can test and promote rule changes faster.
>
>> How do you guarantee all the contributors can perform the checks within
>> an hour?
>
> Wow!  I just started collecting ham/spam for masscheck back in January
> and my (apparently) tiny corpus only takes under a minute to run on a
> low end 2 core VM.  I didn't realize that there would be some that take
> a long time to run.  Still pretty new to all of this backend processing.

There are some contributors that run large honeypot networks.

My fairly small corpus takes (IIRC, it's been a while since I ran a local 
masscheck) over an hour on a dedicated 4-core box...

> Dave

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   If you trust the government, you obviously failed history class.
                                                        -- Don Freeman
-----------------------------------------------------------------------
  5 days until the 73rd anniversary of D-Day

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

On 06/01/2017 08:29 PM, David Jones wrote:
>> From: John Brooks <jo...@fastquake.com>
>      
>> On 2017-06-01 07:59 PM, David Jones wrote:
>>>> How do you guarantee all the contributors can perform the checks within
>>>> an hour?
>>> Wow!  I just started collecting ham/spam for masscheck back in January
>>> and my (apparently) tiny corpus only takes under a minute to run on a
>>> low end 2 core VM.  I didn't realize that there would be some that take
>>> a long time to run.  Still pretty new to all of this backend processing.
>>>
>>> Dave
>> Today while setting it up, I did a test run, and it took 2 hours on my
>> single core VPS. My corpus is ~40k messages: ~30k spam from the last 3
>> months (spammers love me), the rest is ham from the last 4 years. But
>> that was a full run, clean slate. Maybe it doesn't have to re-scan every
>> message the next time, but I have no clue because I haven't done it yet.
> Impressive.  Are you manually sorting those 40K messages in the past 3
> months?  I have a couple of domains that attract a lot of spam but I
> couldn't sort all of that volume of mail so I use a few RBLs to block
> the "low hanging fruit" and then have rules put ham and spam into
> folders to make my manual sorting doable in 15 to 30 minutes a day.
>
> I could remove my RBLs and set up rules to automatically sort a ton of
> spam and ham into folders and have a pretty good accuracy but I don't
> think that is what we are supposed to be doing.  This would have a lot
> of duplicates and could have some incorrectly categorized ham and
> spam.
>
> Dave

For the spam, almost all of it goes into my Junk folder right off the 
bat (thanks to spamassassin). Then I search my junk folder in 
Thunderbird for a couple of unique strings that are often present in 
spam that I get, visually skim what comes up for any false positives, 
and move the spam to the spam corpus folder. That covers most of the 
spam that I get, and it doesn't take long because most of the work is 
done by the string search giving me enough confidence to not have to 
properly inspect every message. Anything that remains, I classify 
manually in the normal way.

The amount of ham that I get isn't insane, like I said those 10k 
messages are from the past 4 years. I sort them into folders (no 
differently from how any organized email user would), and set those 
folders to be scanned as ham.

Re: Ruleqa masscheck so close.

Posted by Kevin Golding <kp...@caomhin.org>.

On Fri, 02 Jun 2017 01:29:12 +0100, David Jones <dj...@ena.com.invalid>  
wrote:

> I could remove my RBLs and set up rules to automatically sort a ton of
> spam and ham into folders and have a pretty good accuracy but I don't
> think that is what we are supposed to be doing.  This would have a lot
> of duplicates and could have some incorrectly categorized ham and
> spam.

I use automatic sorting to make life easier, but then manually check:

USER_IN_DEF_WHITELIST is probably going to be ham. Put it in a whitelist  
pile and scroll through it quickly to spot anything that doesn't belong.

Things like URIBL_BLACK and the Spamhaus rules tend to be a pretty good  
sign of spam so they get put in the RBL pile for a quick once over.

There are certain accounts which should never get ham. They can go into  
another nice pile to once more whizz through.

Anything misclassified in those filters tends to stand out a mile (and is  
pretty rare too). If I'm short of time I can move anything I question in  
those quick scans into a discard pile, or if I've got a bit longer I can  
look at it more carefully. If I'm not 100% on anything it stays discarded.

I have more rules in place for moving things around into different piles.  
I have different levels of trust for different piles, but probably 80-90%  
of my corpora takes maybe 10 minutes per day using that approach. Granted  
that's come from refining it over time and learning the mail flows I'm  
using, probably when I started it was slower and clumsier. If I have a few  
days off and come back to a mountain that needs sorting... well the easy  
stuff gets dealt with and the tough stuff is moved to discard.
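
The pile routing described above could be sketched roughly as follows. This is only an illustration, not the poster's actual filter: the function name pile_for_tests, the pile names, and the choice of URIBL_SBL and URIBL_DBL_SPAM to stand in for "the Spamhaus rules" are assumptions. The input is the comma-separated list of rule names that hit on a message.

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a review pile from the SpamAssassin rules that hit.
# Every pile still gets a manual skim; nothing here is trusted blindly.
pile_for_tests() {
    local tests=",$1," ham=0 spam=0
    # Probable ham: sender is in the default whitelist.
    case $tests in
        *,USER_IN_DEF_WHITELIST,*) ham=1 ;;
    esac
    # Probable spam: URIBL or Spamhaus URI hits (assumed rule selection).
    case $tests in
        *,URIBL_BLACK,*|*,URIBL_SBL,*|*,URIBL_DBL_SPAM,*) spam=1 ;;
    esac
    if (( ham && spam )); then
        echo review        # conflicting signals, look at it carefully
    elif (( ham )); then
        echo whitelist
    elif (( spam )); then
        echo rbl
    else
        echo unsorted      # no strong signal, sort by hand or discard
    fi
}
```

For example, pile_for_tests "URIBL_BLACK,HTML_MESSAGE" prints rbl. Messages with conflicting signals land in a separate pile so the quick scans of the other piles stay quick.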

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

>From: John Brooks <jo...@fastquake.com>

>On 2017-06-01 07:59 PM, David Jones wrote:
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>> Wow!  I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM.  I didn't realize that there would be some that take
>> a long time to run.  Still pretty new to all of this backend processing.
>>
>> Dave

>Today while setting it up, I did a test run, and it took 2 hours on my 
>single core VPS. My corpus is ~40k messages: ~30k spam from the last 3 
>months (spammers love me), the rest is ham from the last 4 years. But 
>that was a full run, clean slate. Maybe it doesn't have to re-scan every 
>message the next time, but I have no clue because I haven't done it yet.

Impressive.  Are you manually sorting those 40K messages in the past 3
months?  I have a couple of domains that attract a lot of spam but I
couldn't sort all of that volume of mail so I use a few RBLs to block
the "low hanging fruit" and then have rules put ham and spam into
folders to make my manual sorting doable in 15 to 30 minutes a day.

I could remove my RBLs and set up rules to automatically sort a ton of
spam and ham into folders and have a pretty good accuracy but I don't
think that is what we are supposed to be doing.  This would have a lot
of duplicates and could have some incorrectly categorized ham and
spam. 

Dave

Re: Ruleqa masscheck so close.

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

On 6/3/2017 2:46 PM, John Brooks wrote:
> Well, for me, this only matters for masschecks (and only weekly 
> masschecks have the --net option set). My normal mail volume per day 
> is small enough that the number of DNS queries isn't an issue since 
> they're spread out over time. It's just scanning thousands of messages 
> all at once that concerns me.
>
> If I were to disable most of the DNSBLs for my weekly masschecks, I 
> might as well just remove the --net argument. But I don't want to 
> diminish the quality of the masscheck data if I can avoid it. I also 
> don't want to abuse free DNSBLs with heavy load, however. 

Gents, I'm not aware of any RBL blocking anyone from the masschecking.  
If it is a problem, I am sure we have contacts to reach out to and 
request a larger limit, etc.

Regards,

KAM

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

On 2017-06-03 12:14 PM, David Jones wrote:
> On 06/03/2017 10:47 AM, John Brooks wrote:
>> On 2017-06-01 08:14 PM, John Brooks wrote:
>>> On 2017-06-01 07:59 PM, David Jones wrote:
>>>>
>>>>> How do you guarantee all the contributors can perform the checks 
>>>>> within
>>>>> an hour?
>>>> Wow!  I just started collecting ham/spam for masscheck back in January
>>>> and my (apparently) tiny corpus only takes under a minute to run on a
>>>> low end 2 core VM.  I didn't realize that there would be some that 
>>>> take
>>>> a long time to run.  Still pretty new to all of this backend 
>>>> processing.
>>>>
>>>> Dave
>>>
>>> Today while setting it up, I did a test run, and it took 2 hours on 
>>> my single core VPS. My corpus is ~40k messages: ~30k spam from the 
>>> last 3 months (spammers love me), the rest is ham from the last 4 
>>> years. But that was a full run, clean slate. Maybe it doesn't have 
>>> to re-scan every message the next time, but I have no clue because I 
>>> haven't done it yet.
>>
>> And today the weekly one ran and it took about 4 hours because of all 
>> the network tests. I could probably increase the job count to speed 
>> that up. But now that I think about it, scanning 40 000 messages at 
>> once with network tests enabled is a *lot* of DNS requests. Am I 
>> going to get myself banned from the DNSBLs by doing that? Most of 
>> them have rate limits on requests for non-paying users.
>>
>
> That's possible.  How many "main/primary" DNSBLs have this 
> restriction? I am aware of SpamHaus but I have disabled all of the 
> others since they weren't that valuable anyway for my environment:
>
> score URIBL_WS_SURBL 0
> score URIBL_PH_SURBL 0
> score URIBL_MW_SURBL 0
> score URIBL_CR_SURBL 0
> score URIBL_ABUSE_SURBL 0
> score URIBL_RHS_DOB 0
> score URIBL_SBL 0
> score URIBL_SBL_A 0
> score URIBL_DBL_SPAM 0
> score URIBL_DBL_PHISH 0
> score URIBL_DBL_MALWARE 0
> score URIBL_DBL_BOTNETCC 0
> score URIBL_DBL_ABUSE_SPAM 0
> score URIBL_DBL_ABUSE_REDIR 0
> score URIBL_DBL_ABUSE_PHISH 0
> score URIBL_DBL_ABUSE_MALW 0
> score URIBL_DBL_ABUSE_BOTCC 0
> score URIBL_DBL_ERROR 0
> score URIBL_BLACK 0
> score URIBL_GREY 0
> score URIBL_RED 0
>
> Senderscore.org has been very good for the past couple of years and 
> they don't seem to have limits.  I had to setup my own rules:
>
> ifplugin Mail::SpamAssassin::Plugin::DNSEval
>
> header        __RCVD_IN_SENDERSCORE_90_100 
> eval:check_rbl('senderscore90-lastexternal','score.senderscore.com.','^127\.0\.4\.(9[0-9]|100)$')
> meta        RCVD_IN_SENDERSCORE_90_100    SPF_PASS && 
> __RCVD_IN_SENDERSCORE_90_100
> describe    RCVD_IN_SENDERSCORE_90_100    Senderscore.org score of 90 
> to 100
> score        RCVD_IN_SENDERSCORE_90_100    -2.2
> tflags        RCVD_IN_SENDERSCORE_90_100    net
>
> header        __RCVD_IN_SENDERSCORE_80_89 
> eval:check_rbl('senderscorer80-lastexternal','score.senderscore.com.','^127\.0\.4\.(8[0-9])$')
> meta        RCVD_IN_SENDERSCORE_80_89    SPF_PASS && 
> __RCVD_IN_SENDERSCORE_80_89
> describe    RCVD_IN_SENDERSCORE_80_89    Senderscore.org score of 80 
> to 89
> score        RCVD_IN_SENDERSCORE_80_89    -1.2
> tflags        RCVD_IN_SENDERSCORE_80_89    net
>
> header        RCVD_IN_SENDERSCORE_70_79 
> eval:check_rbl('senderscorer70-lastexternal','score.senderscore.com.','^127\.0\.4\.(7[0-9])$')
> describe    RCVD_IN_SENDERSCORE_70_79    Senderscore.org score of 70 
> to 79
> score        RCVD_IN_SENDERSCORE_70_79    1.2
> tflags        RCVD_IN_SENDERSCORE_70_79    net
>
> header        RCVD_IN_SENDERSCORE_60_69 
> eval:check_rbl('senderscorer60-lastexternal','score.senderscore.com.','^127\.0\.4\.(6[0-9])$')
> describe    RCVD_IN_SENDERSCORE_60_69    Senderscore.org score of 60 
> to 69
> score        RCVD_IN_SENDERSCORE_60_69    2.2
> tflags        RCVD_IN_SENDERSCORE_60_69    net
>
> header        RCVD_IN_SENDERSCORE_50_59 
> eval:check_rbl('senderscorer50-lastexternal','score.senderscore.com.','^127\.0\.4\.(5[0-9])$')
> describe    RCVD_IN_SENDERSCORE_50_59    Senderscore.org score of 50 
> to 59
> score        RCVD_IN_SENDERSCORE_50_59    3.2
> tflags        RCVD_IN_SENDERSCORE_50_59    net
>
> header        RCVD_IN_SENDERSCORE_30_49 
> eval:check_rbl('senderscorer30-lastexternal','score.senderscore.com.','^127\.0\.4\.([3-4][0-9])$')
> describe    RCVD_IN_SENDERSCORE_30_49    Senderscore.org score of 30 
> to 49
> score        RCVD_IN_SENDERSCORE_30_49    4.2
> tflags        RCVD_IN_SENDERSCORE_30_49    net
>
> header        RCVD_IN_SENDERSCORE_0_29 
> eval:check_rbl('senderscore0-lastexternal','score.senderscore.com.','^127\.0\.4\.([1-2]?[0-9])$')
> describe    RCVD_IN_SENDERSCORE_0_29    Senderscore.org score of 0 to 29
> score        RCVD_IN_SENDERSCORE_0_29    5.2
> tflags        RCVD_IN_SENDERSCORE_0_29    net
>
> endif
>
> How would we go about putting these into the built-in SA rules? Do we 
> need to get permission from senderscore.org and start testing them 
> with tiny scores?
>
> I have paid rsync feeds for spamhaus-pbl, spamhaus-sbl, spamhaus-xbl, 
> and invaluement.  Uceprotect.net and SORBS provide a free rsync feed.
>
> Setup a local caching DNS server to help reduce the external queries?
> I just did on my server after you mentioned that.
>

Well, for me, this only matters for masschecks (and only weekly 
masschecks have the --net option set). My normal mail volume per day is 
small enough that the number of DNS queries isn't an issue since they're 
spread out over time. It's just scanning thousands of messages all at 
once that concerns me.

If I were to disable most of the DNSBLs for my weekly masschecks, I 
might as well just remove the --net argument. But I don't want to 
diminish the quality of the masscheck data if I can avoid it. I also 
don't want to abuse free DNSBLs with heavy load, however.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

On 06/03/2017 10:47 AM, John Brooks wrote:
> On 2017-06-01 08:14 PM, John Brooks wrote:
>> On 2017-06-01 07:59 PM, David Jones wrote:
>>>
>>>> How do you guarantee all the contributors can perform the checks within
>>>> an hour?
>>> Wow!  I just started collecting ham/spam for masscheck back in January
>>> and my (apparently) tiny corpus only takes under a minute to run on a
>>> low end 2 core VM.  I didn't realize that there would be some that take
>>> a long time to run.  Still pretty new to all of this backend processing.
>>>
>>> Dave
>>
>> Today while setting it up, I did a test run, and it took 2 hours on my 
>> single core VPS. My corpus is ~40k messages: ~30k spam from the last 3 
>> months (spammers love me), the rest is ham from the last 4 years. But 
>> that was a full run, clean slate. Maybe it doesn't have to re-scan 
>> every message the next time, but I have no clue because I haven't done 
>> it yet.
> 
> And today the weekly one ran and it took about 4 hours because of all 
> the network tests. I could probably increase the job count to speed that 
> up. But now that I think about it, scanning 40 000 messages at once with 
> network tests enabled is a *lot* of DNS requests. Am I going to get 
> myself banned from the DNSBLs by doing that? Most of them have rate 
> limits on requests for non-paying users.
> 

That's possible.  How many "main/primary" DNSBLs have this restriction? 
I am aware of Spamhaus but I have disabled all of the others since they 
weren't that valuable anyway for my environment:

score URIBL_WS_SURBL 0
score URIBL_PH_SURBL 0
score URIBL_MW_SURBL 0
score URIBL_CR_SURBL 0
score URIBL_ABUSE_SURBL 0
score URIBL_RHS_DOB 0
score URIBL_SBL 0
score URIBL_SBL_A 0
score URIBL_DBL_SPAM 0
score URIBL_DBL_PHISH 0
score URIBL_DBL_MALWARE 0
score URIBL_DBL_BOTNETCC 0
score URIBL_DBL_ABUSE_SPAM 0
score URIBL_DBL_ABUSE_REDIR 0
score URIBL_DBL_ABUSE_PHISH 0
score URIBL_DBL_ABUSE_MALW 0
score URIBL_DBL_ABUSE_BOTCC 0
score URIBL_DBL_ERROR 0
score URIBL_BLACK 0
score URIBL_GREY 0
score URIBL_RED 0

Senderscore.org has been very good for the past couple of years and they 
don't seem to have limits.  I had to set up my own rules:

ifplugin Mail::SpamAssassin::Plugin::DNSEval

header      __RCVD_IN_SENDERSCORE_90_100  eval:check_rbl('senderscore90-lastexternal','score.senderscore.com.','^127\.0\.4\.(9[0-9]|100)$')
meta        RCVD_IN_SENDERSCORE_90_100    SPF_PASS && __RCVD_IN_SENDERSCORE_90_100
describe    RCVD_IN_SENDERSCORE_90_100    Senderscore.org score of 90 to 100
score       RCVD_IN_SENDERSCORE_90_100    -2.2
tflags      RCVD_IN_SENDERSCORE_90_100    net

header      __RCVD_IN_SENDERSCORE_80_89   eval:check_rbl('senderscorer80-lastexternal','score.senderscore.com.','^127\.0\.4\.(8[0-9])$')
meta        RCVD_IN_SENDERSCORE_80_89     SPF_PASS && __RCVD_IN_SENDERSCORE_80_89
describe    RCVD_IN_SENDERSCORE_80_89     Senderscore.org score of 80 to 89
score       RCVD_IN_SENDERSCORE_80_89     -1.2
tflags      RCVD_IN_SENDERSCORE_80_89     net

header      RCVD_IN_SENDERSCORE_70_79     eval:check_rbl('senderscorer70-lastexternal','score.senderscore.com.','^127\.0\.4\.(7[0-9])$')
describe    RCVD_IN_SENDERSCORE_70_79     Senderscore.org score of 70 to 79
score       RCVD_IN_SENDERSCORE_70_79     1.2
tflags      RCVD_IN_SENDERSCORE_70_79     net

header      RCVD_IN_SENDERSCORE_60_69     eval:check_rbl('senderscorer60-lastexternal','score.senderscore.com.','^127\.0\.4\.(6[0-9])$')
describe    RCVD_IN_SENDERSCORE_60_69     Senderscore.org score of 60 to 69
score       RCVD_IN_SENDERSCORE_60_69     2.2
tflags      RCVD_IN_SENDERSCORE_60_69     net

header      RCVD_IN_SENDERSCORE_50_59     eval:check_rbl('senderscorer50-lastexternal','score.senderscore.com.','^127\.0\.4\.(5[0-9])$')
describe    RCVD_IN_SENDERSCORE_50_59     Senderscore.org score of 50 to 59
score       RCVD_IN_SENDERSCORE_50_59     3.2
tflags      RCVD_IN_SENDERSCORE_50_59     net

header      RCVD_IN_SENDERSCORE_30_49     eval:check_rbl('senderscorer30-lastexternal','score.senderscore.com.','^127\.0\.4\.([3-4][0-9])$')
describe    RCVD_IN_SENDERSCORE_30_49     Senderscore.org score of 30 to 49
score       RCVD_IN_SENDERSCORE_30_49     4.2
tflags      RCVD_IN_SENDERSCORE_30_49     net

header      RCVD_IN_SENDERSCORE_0_29      eval:check_rbl('senderscore0-lastexternal','score.senderscore.com.','^127\.0\.4\.([1-2]?[0-9])$')
describe    RCVD_IN_SENDERSCORE_0_29      Senderscore.org score of 0 to 29
score       RCVD_IN_SENDERSCORE_0_29      5.2
tflags      RCVD_IN_SENDERSCORE_0_29      net

endif
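
To sanity-check the score bands before loading rules like these, a small bash helper can apply the same regular expressions to a reply address. This is a sketch only: the function name senderscore_band is made up for illustration, and it ignores the extra SPF_PASS condition that the 90-100 and 80-89 meta rules add.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a DNS reply such as 127.0.4.85 to the rule name
# whose pattern matches it, using the same regexes as the rules above.
senderscore_band() {
    local reply=$1 i
    local -a names=(90_100 80_89 70_79 60_69 50_59 30_49 0_29)
    local -a patterns=(
        '^127\.0\.4\.(9[0-9]|100)$'
        '^127\.0\.4\.(8[0-9])$'
        '^127\.0\.4\.(7[0-9])$'
        '^127\.0\.4\.(6[0-9])$'
        '^127\.0\.4\.(5[0-9])$'
        '^127\.0\.4\.([3-4][0-9])$'
        '^127\.0\.4\.([1-2]?[0-9])$'
    )
    for i in "${!names[@]}"; do
        # Unquoted right-hand side so bash treats it as a regex.
        if [[ $reply =~ ${patterns[i]} ]]; then
            echo "RCVD_IN_SENDERSCORE_${names[i]}"
            return 0
        fi
    done
    return 1   # reply is outside 127.0.4.0 - 127.0.4.100
}
```

Running it over 127.0.4.0 through 127.0.4.100 is a quick way to confirm that every score falls into exactly one band and that nothing outside that range matches.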

How would we go about putting these into the built-in SA rules?  Do we 
need to get permission from senderscore.org and start testing them with 
tiny scores?

I have paid rsync feeds for spamhaus-pbl, spamhaus-sbl, spamhaus-xbl, 
and invaluement.  Uceprotect.net and SORBS provide a free rsync feed.

Have you set up a local caching DNS server to help reduce the external
queries?  I just did that on my server after you mentioned it.

-- 
Dave

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

On 2017-06-01 08:14 PM, John Brooks wrote:
> On 2017-06-01 07:59 PM, David Jones wrote:
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>> Wow!  I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM.  I didn't realize that there would be some that take
>> a long time to run.  Still pretty new to all of this backend processing.
>>
>> Dave
>
> Today while setting it up, I did a test run, and it took 2 hours on my 
> single core VPS. My corpus is ~40k messages: ~30k spam from the last 3 
> months (spammers love me), the rest is ham from the last 4 years. But 
> that was a full run, clean slate. Maybe it doesn't have to re-scan 
> every message the next time, but I have no clue because I haven't done 
> it yet.

And today the weekly one ran and it took about 4 hours because of all 
the network tests. I could probably increase the job count to speed that 
up. But now that I think about it, scanning 40 000 messages at once with 
network tests enabled is a *lot* of DNS requests. Am I going to get 
myself banned from the DNSBLs by doing that? Most of them have rate 
limits on requests for non-paying users.
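
A back-of-the-envelope estimate of that load, as a sketch: the function name and the figure of 20 DNS lookups per message are assumptions for illustration, not measurements, and a caching resolver would lower the number that actually reaches the DNSBLs.

```shell
#!/usr/bin/env bash
# Rough estimate of DNS load for one --net masscheck run.
# lookups_per_message is an assumed figure, not a measured one.
estimate_dns_load() {
    local messages=$1 lookups_per_message=$2 hours=$3
    local total=$(( messages * lookups_per_message ))
    local seconds=$(( hours * 3600 ))
    # Integer division, so the rate is rounded down.
    echo "total=$total per_second=$(( total / seconds ))"
}

estimate_dns_load 40000 20 4   # prints: total=800000 per_second=55
```

Under those assumptions a 40 000 message run spread over 4 hours works out to roughly 55 queries per second, which is the figure to compare against whatever limits a given list publishes.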

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

On 2017-06-01 07:59 PM, David Jones wrote:
>
>> How do you guarantee all the contributors can perform the checks within
>> an hour?
> Wow!  I just started collecting ham/spam for masscheck back in January
> and my (apparently) tiny corpus only takes under a minute to run on a
> low end 2 core VM.  I didn't realize that there would be some that take
> a long time to run.  Still pretty new to all of this backend processing.
>
> Dave

Today while setting it up, I did a test run, and it took 2 hours on my 
single core VPS. My corpus is ~40k messages: ~30k spam from the last 3 
months (spammers love me), the rest is ham from the last 4 years. But 
that was a full run, clean slate. Maybe it doesn't have to re-scan every 
message the next time, but I have no clue because I haven't done it yet.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

>From: John Hardin <jh...@impsec.org>

>>On Thu, 1 Jun 2017, David Jones wrote:

>> I am working pretty hard to get the ruleqa processing going
>> again on our new server.   We are so close to having enough
>> contributors and ham/spam to get some new rules generated:
>> This is from the run minutes ago:
>>
>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>
>> We need to recruit some more masscheck'ers to get over the hump so I can 
>> do some final testing of the rules updates and start the DNS updates 
>> again for sa-update.

>I upload my corpora, not my results - I suppose that my corpora didn't 
>survive the migration, and I haven't yet brought my submission bot and 
>rsync account up-to-date for the new hardware - my apologies. I may be 
>able to give that some cycles this weekend.

So, about that.  I just started helping as a sysadmin a month or so ago.  We
had some hosting issues 3 months ago that we are still trying to recover
from, with very little documentation of the infrastructure.  I am having to
dig through logs, cron output, and old (outdated) documentation to try to
put the puzzle back together again.

If anyone has any knowledge or documentation of how things were set up
in the past, I would love to talk with you.

We do have backups from one of the servers, but I think there were two
or three servers before: I can't find any evidence of where buildbot was
running, and I think buildbot was what ran the centralized masscheck.

>However, I don't recall seeing confirmation that the central masscheck was 
>actually working; can you confirm that? Or do I need to change over to 
>local masscheck and uploading results like most others do?

I have not found enough details yet on the central masscheck so I have
started with getting the remote masscheck processing working first so
we can get sa-update going again.

> P.S. After spending the past month learning how this works, I have some 
> ideas on how to make the nightly masschecks become hourly fairly easily 
> so we can test and promote rule changes faster.

>How do you guarantee all the contributors can perform the checks within
>an hour?

Wow!  I just started collecting ham/spam for masscheck back in January
and my (apparently) tiny corpus only takes under a minute to run on a
low end 2 core VM.  I didn't realize that there would be some that take
a long time to run.  Still pretty new to all of this backend processing.

Dave

Re: Ruleqa masscheck so close.

Posted by John Hardin <jh...@impsec.org>.

On Thu, 1 Jun 2017, David Jones wrote:

> I am working pretty hard to get the ruleqa processing going again on our new server.   We are so close to having enough contributors and ham/spam to get some new rules generated:  This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can 
> do some final testing of the rules updates and start the DNS updates 
> again for sa-update.

I upload my corpora, not my results - I suppose that my corpora didn't 
survive the migration, and I haven't yet brought my submission bot and 
rsync account up-to-date for the new hardware - my apologies. I may be 
able to give that some cycles this weekend.

However, I don't recall seeing confirmation that the central masscheck was 
actually working; can you confirm that? Or do I need to change over to 
local masscheck and uploading results like most others do?

> P.S. After spending the past month learning how this works, I have some 
> ideas on how to make the nightly masschecks become hourly fairly easily 
> so we can test and promote rule changes faster.

How do you guarantee all the contributors can perform the checks within an 
hour?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Journalism is about covering important stories.
   With a pillow, until they stop moving.               -- David Burge
-----------------------------------------------------------------------
  5 days until the 73rd anniversary of D-Day

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

>From: Kevin Golding <kp...@caomhin.org>

>On Thu, 01 Jun 2017 13:45:26 +0100, David Jones <dj...@ena.com.invalid> 
wrote:

>> Why do you think it was pointless?

>Because I got a daily email telling me that rsync failed while the system  
>was offline. Running a masscheck every day for no purpose seemed a little  
>pointless. I thought I'd suspend it until the system was back online. If  
>that's a problem I apologise, it wasn't made clear that I should keep  
>donating resources during a period when the system was offline.

Sure.  I completely understand when the server was offline.  Sounds like
we need a little enhancement in the automasscheck-minimal.sh script to
detect when the rsync fails and not waste processing resources.
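
A minimal sketch of that enhancement, assuming the sync step and the masscheck step can each be wrapped in a command or shell function. The name masscheck_if_synced and both parameters are hypothetical; this is not code from automasscheck-minimal.sh.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run the expensive step only if the sync step succeeds.
# Both arguments are names of commands or shell functions supplied by the caller.
masscheck_if_synced() {
    local sync_cmd=$1 masscheck_cmd=$2
    if ! "$sync_cmd"; then
        echo "sync failed; skipping masscheck" >&2
        return 1
    fi
    "$masscheck_cmd"
}
```

The caller would pass in, for example, a function that performs the rsync of the tagged rules and another that calls run_all_masschecks, so a server outage costs one failed rsync instead of a full scan.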

>> I have an idea that will allow masscheckers to cron the automasscheck.sh
>> script hourly which would only run the full masscheck when they detect a
>> new tagged ruleset to work with.  Basically it would do a quick rsync of  
>> the latest tagged build dir like it does today but if there are no rsync  
>> changes, it would simply exit.

>Presumably there would also be a 24hr window that meant even if no rules  
>were updated we would rescore after that period to have recent score  
>adjustments? That would seem more effective than maintaining the morning  
>run and throwing in an additional one as needed since we could run checks  
>an hour before the morning run.

Yes.  We would still keep the daily tagged build of rules that we have today
so the existing 24 hour processing would work just like it does today for those
who don't want to go to the hourly.  Keep in mind, this wouldn't mean you
would need to masscheck hourly for nothing.  The script would run hourly,
"phone home", then exit if there was nothing new to masscheck against.  Even
the current nightly masscheck would have benefited from this logic while the
server was down and not wasted resources.
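
The "exit if there is nothing new" check could look something like the sketch below. It assumes the tagged rules have already been rsynced into a local directory, and it compares a fingerprint of that directory with the one saved on the previous run. The function name, the state file, and the fingerprint method are all assumptions for illustration, not part of the existing scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when the synced rules directory differs
# from what was recorded on the previous run.
# The state file must live outside rules_dir, or it would change the fingerprint.
ruleset_changed() {
    local rules_dir=$1 state_file=$2 current previous=""
    # Checksum every file, sort for a repeatable order, then checksum the list.
    current=$(cd "$rules_dir" && find . -type f -exec cksum {} + | sort | cksum)
    [ -f "$state_file" ] && previous=$(cat "$state_file")
    if [ "$current" = "$previous" ]; then
        return 1   # nothing new, the caller can exit early
    fi
    printf '%s\n' "$current" > "$state_file"
    return 0
}
```

An hourly cron wrapper would then do the quick rsync, call ruleset_changed, and start the full masscheck only when it returns success.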

>It would logically also require an amendment to suggest running sa-update  
>hourly instead of daily too.

Sure.  Fair point.  With sa-update running "randomly" all over the Internet
from different locations and time zones, there could be some not getting
updates for up to 48 hours even when everything is running perfectly.  The
average update around the world would go from 24 hours down to 12 hours
for any hourly updates assuming we could get enough masscheckers to go
hourly.

I am still in the planning stages of this after sorting through all of the scripts
so I am definitely open to ideas and suggestions like this.  The idea is that
when we find an issue like the recent one with Yahoo changing their message ID
format (see FORGED_MUA_MOZILLA & FORGED_YAHOO_RCVD thread), the fix
could go out in hours instead of days.

>As a sidenote, not all of us use the main automasscheck.sh script so  
>depending on how the changes are rolled out I can't promise an  
>uninterrupted supply of masscheck data.

Thanks for that feedback.  My goal is to add the hourly functionality
without changing the current directory structure or timing of cron jobs
so this would not impact the existing masscheck submissions.

I am still working on getting the current masscheck processing finished
up and we are probably months away from the hourly stuff so I will take
things slowly and try to fully understand things before adding the hourly
logic.  I will test on my own masscheck processing for a while first.

Re: Ruleqa masscheck so close.

Posted by Kevin Golding <kp...@caomhin.org>.

On Thu, 01 Jun 2017 13:45:26 +0100, David Jones <dj...@ena.com.invalid>  
wrote:

> Why do you think it was pointless?

Because I got a daily email telling me that rsync failed while the system  
was offline. Running a masscheck every day for no purpose seemed a little  
pointless. I thought I'd suspend it until the system was back online. If  
that's a problem I apologise, it wasn't made clear that I should keep  
donating resources during a period when the system was offline.

> I have an idea that will allow masscheckers to cron the automasscheck.sh
> script hourly which would only run the full masscheck when they detect a
> new tagged ruleset to work with.  Basically it would do a quick rsync of  
> the
> latest tagged build dir like it does today but if there are no rsync  
> changes,
> it would simply exit.

Presumably there would also be a 24hr window that meant even if no rules  
were updated we would rescore after that period to have recent score  
adjustments? That would seem more effective than maintaining the morning  
run and throwing in an additional one as needed since we could run checks  
an hour before the morning run.

It would logically also require an amendment to suggest running sa-update  
hourly instead of daily too.

As a sidenote, not all of us use the main automasscheck.sh script so  
depending on how the changes are rolled out I can't promise an  
uninterrupted supply of masscheck data.

Re: problem with setting up masscheck

Posted by Marcin Mirosław <ma...@mejor.pl>.

On 2017-06-02 at 23:23, Dave Jones wrote:
> On 06/02/2017 02:47 AM, marcin@mejor.pl wrote:
>> On 01.06.2017 at 16:48, David Jones wrote:
>>
>> Hi David, hi all!
>>
>>> From: marcin@mejor.pl <ma...@mejor.pl>
>>>> On 01.06.2017 at 14:57, David Jones wrote:
>>>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>>>
>>>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>>>
>>>
>>>> I have some troubles with setting up masscheck. Here is what I set 
>>>> in .automasscheck.cf :
>>>> LOGPREFIX="YOUR-USERNAME"
>>>> RSYNC_USERNAME="YOUR-USERNAME"
>>>> RSYNC_PASSWORD="YOUR-PASSWORD"
>>>
>>> Replace "YOUR-USERNAME" with your own rsync username of "mmiroslaw" 
>>> in both places.
>>> Same for "YOUR-PASSWORD" with your own personal rsync password.  If 
>>> you don't
>>> remember your rsync password, I can send it to you off list.
>>
>> I've got my password but I didn't put it into configuration because I 
>> wanted to test if everything works correctly.
>>
>>>> WORKDIR=~/sa-masscheck/tmp
>>>> JOBS=1
>>>> TRUSTED_NETWORKS=
>>>> INTERNAL_NETWORKS=
>>>> run_all_masschecks() {
>>>>   run_masscheck  ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>>>>                  spam:dir:/dane/spam/20170601/
>>>> }
>>>
>>>> And this is what I got from script bash -x automasscheck-minimal.sh:
>>>> [...]
>>>> open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>>>> failed: No such file or directory at ./mass-check line 617.
>>>> + LOGLIST=' 
>>>> ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>>>> spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>>>> + set +x
>>>
>>>> name of log file is wrong.
>>>
>>> Did you use the updated automasscheck-minimal.sh recently updated in 
>>> the past few days
>>> or is this an older existing version that worked in the past?
>>
>>
>> I'm on r1797310, today i'm getting:
>>
>> + run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ 
>> spam:dir:/dane/spam/20170601/
>> + CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
>> + shift
>> + [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
>> + LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
>> + LOGNAME=mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> + rm -f ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>> spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> + set -x
>> + ./mass-check 
>> --hamlog=ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>> --spamlog=spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 
>> --progress spam:dir:/dane/spam/20170601/
>> status: starting scan stage                              now: 
>> 2017-06-02 09:41:49
>> status: completed scan stage, 23 messages                now: 
>> 2017-06-02 09:41:49
>> status: starting run stage                               now: 
>> 2017-06-02 09:41:49
>> open of ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: 
>> No such file or directory at ./mass-check line 617.
>> + LOGLIST=' ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>> spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>> + set +x
>> rsync -qPcvz  ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>> spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log 
>> mmiroslaw@rsync.spamassassin.org::corpus/
>> The source and destination cannot both be remote.
>> rsync error: syntax or usage error (code 1) at main.c(1274) 
>> [Receiver=3.1.2]
>> ^C
>>
>> Why are the log file names being built with slashes? Slashes are not 
>> allowed in file names.
>>
>> Marcin
>>
> 
> In the .automasscheck.cf around line 53 there should be something like 
> this:
> 
> run_all_masschecks() {
>    ### sample: single corpus ###
>    run_masscheck single-corpus \
> 
> Did you remove the 'single-corpus' from the run_masscheck argument? From 
> what I can tell in your bash -x output (thanks for that by the way), it 
> looks like the first argument to the run_masscheck function is 
> 'ham:dir:/dane/spam/HAM_DOUSUNIECIA/' and it probably should be 
> 'single-corpus'.
> 
> Mine looks like this:
> 
> run_all_masschecks() {
>    ### sample: single corpus ###
>    run_masscheck single-corpus \
>          ham:dir:$MAILDIR/.Ham/ \
>          spam:dir:$MAILDIR/.Spam/
> 
> I have a wrapper script that sets the $MAILDIR then calls the 
> automasscheck-minimal.sh script to do a couple of things before and 
> afterwards for reporting and emailing the output.


Bingo!
I cleaned up the conf file and removed too much. I don't know why I 
removed single-corpus. Now it works! I'll prepare the full configuration 
and send masscheck results after the weekend.
Thank you!
Marcin
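
For anyone who hits the same error, the naming logic can be reconstructed from the bash -x trace roughly as below. This is a sketch based on the trace, not a copy of automasscheck-minimal.sh: the function name is made up, and the behaviour for 'single-corpus' (no suffix at all) is an assumption, since the trace only shows the other branch.

```shell
#!/usr/bin/env bash
# Reconstruction of the log naming seen in the trace; not the real script.
# The first argument of run_masscheck is treated as a corpus NAME, so a
# target such as ham:dir:/path/ placed there ends up inside the file name.
masscheck_log_names() {
    local logprefix=$1 corpusname=$2 logsuffix=""
    # Assumption: the special name "single-corpus" means "no suffix".
    if [[ $corpusname != single-corpus ]]; then
        logsuffix="-$corpusname"
    fi
    local logname="${logprefix}${logsuffix}.log"
    echo "ham-$logname spam-$logname"
}

masscheck_log_names mmiroslaw single-corpus
masscheck_log_names mmiroslaw ham:dir:/dane/spam/HAM_DOUSUNIECIA/
```

The second call reproduces the broken names from the trace. Those names contain slashes, so the log files cannot be opened, and they contain colons, which rsync reads as host:path. That would explain the "source and destination cannot both be remote" message.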

Re: problem with setting up masscheck

Posted by Dave Jones <da...@apache.org>.

On 06/02/2017 02:47 AM, marcin@mejor.pl wrote:
> On 01.06.2017 at 16:48, David Jones wrote:
> 
> Hi David, hi all!
> 
>> From: marcin@mejor.pl <ma...@mejor.pl>
>>      
>>> On 01.06.2017 at 14:57, David Jones wrote:
>>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>>       
>>>>
>>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>>
>>
>>> I am having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>>> LOGPREFIX="YOUR-USERNAME"
>>> RSYNC_USERNAME="YOUR-USERNAME"
>>> RSYNC_PASSWORD="YOUR-PASSWORD"
>>
>> Replace "YOUR-USERNAME" with your own rsync username of "mmiroslaw" in both places.
>> Same for "YOUR-PASSWORD" with your own personal rsync password.  If you don't
>> remember your rsync password, I can send it to you off list.
> 
> I've got my password but I didn't put it into configuration because I wanted to test if everything works correctly.
> 
>   
>>> WORKDIR=~/sa-masscheck/tmp
>>> JOBS=1
>>> TRUSTED_NETWORKS=
>>> INTERNAL_NETWORKS=
>>> run_all_masschecks() {
>>>   run_masscheck  ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>>>                  spam:dir:/dane/spam/20170601/
>>> }
>>
>>> And this is what I got from script bash -x automasscheck-minimal.sh:
>>> [...]
>>> open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
>>> + LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>>> + set +x
>>
>>> The name of the log file is wrong.
>>
>> Did you use the automasscheck-minimal.sh that was updated in the past few days,
>> or is this an older version that worked in the past?
> 
> 
> I'm on r1797310; today I'm getting:
> 
> + run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ spam:dir:/dane/spam/20170601/
> + CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
> + shift
> + [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
> + LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
> + LOGNAME=mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
> + rm -f ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
> + set -x
> + ./mass-check --hamlog=ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log --spamlog=spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 --progress spam:dir:/dane/spam/20170601/
> status: starting scan stage                              now: 2017-06-02 09:41:49
> status: completed scan stage, 23 messages                now: 2017-06-02 09:41:49
> status: starting run stage                               now: 2017-06-02 09:41:49
> open of ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
> + LOGLIST=' ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
> + set +x
> rsync -qPcvz  ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log mmiroslaw@rsync.spamassassin.org::corpus/
> The source and destination cannot both be remote.
> rsync error: syntax or usage error (code 1) at main.c(1274) [Receiver=3.1.2]
> ^C
> 
> Why are the log file names built with slashes? Slashes are not allowed in file names.
> 
> Marcin
> 

In the .automasscheck.cf around line 53 there should be something like this:

run_all_masschecks() {
   ### sample: single corpus ###
   run_masscheck single-corpus \

Did you remove the 'single-corpus' from the run_masscheck argument? 
From what I can tell in your bash -x output (thanks for that, by the 
way), the first argument to the run_masscheck function is 
'ham:dir:/dane/spam/HAM_DOUSUNIECIA/' when it should probably be 
'single-corpus'.

Mine looks like this:

run_all_masschecks() {
   ### sample: single corpus ###
   run_masscheck single-corpus \
         ham:dir:$MAILDIR/.Ham/ \
         spam:dir:$MAILDIR/.Spam/
}

I have a wrapper script that sets the $MAILDIR then calls the 
automasscheck-minimal.sh script to do a couple of things before and 
afterwards for reporting and emailing the output.
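
For anyone hitting the same error, here is a rough sketch of how the log name seems to be derived, reconstructed only from the bash -x trace in this thread. It is not the real automasscheck-minimal.sh: build_logname is a made-up helper name, and the empty suffix for 'single-corpus' is an assumption. It shows why a corpus spec passed as the first argument ends up inside the file name. The colons also explain the rsync error, because rsync reads a name containing ':' as a remote host spec.

```shell
#!/bin/sh
# Sketch reconstructed from the bash -x trace; NOT the real script.
# build_logname is a hypothetical helper name used for illustration.
LOGPREFIX="mmiroslaw"

build_logname() {
    corpusname=$1
    if [ "$corpusname" = "single-corpus" ]; then
        logsuffix=""                 # assumption: no suffix for a single corpus
    else
        logsuffix="-$corpusname"     # matches LOGSUFFIX=-... in the trace
    fi
    # A corpus spec such as ham:dir:/path/ used as the name would put "/"
    # and ":" into the file name, so refuse it here instead of failing later.
    case "$logsuffix" in
        */*|*:*)
            echo "bad corpus name: $corpusname" >&2
            return 1
            ;;
    esac
    echo "$LOGPREFIX$logsuffix.log"
}
```

With 'single-corpus' as the first argument this gives mmiroslaw.log, which then gets the ham- and spam- prefixes seen in the trace.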

Dave

Re: problem with setting up masscheck (was: Ruleqa masscheck so close.)

Posted by "marcin@mejor.pl" <ma...@mejor.pl>.

On 01.06.2017 at 16:48, David Jones wrote:

Hi David, hi all!

> From: marcin@mejor.pl <ma...@mejor.pl>
>     
>> On 01.06.2017 at 14:57, David Jones wrote:
>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>      
>>>
>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
> 
> 
>> I'm having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>> LOGPREFIX="YOUR-USERNAME"
>> RSYNC_USERNAME="YOUR-USERNAME"
>> RSYNC_PASSWORD="YOUR-PASSWORD"
> 
> Replace "YOUR-USERNAME" with your own rsync username, "mmiroslaw", in both places.
> Do the same for "YOUR-PASSWORD" with your personal rsync password.  If you don't
> remember your rsync password, I can send it to you off list.

I have my password, but I didn't put it into the configuration because I first wanted to test whether everything works correctly.

 
>> WORKDIR=~/sa-masscheck/tmp
>> JOBS=1
>> TRUSTED_NETWORKS=
>> INTERNAL_NETWORKS=
>> run_all_masschecks() {
>>  run_masscheck  ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>>                 spam:dir:/dane/spam/20170601/
>> }
> 
>> And this is what I got from running bash -x automasscheck-minimal.sh:
>> [...]
>> open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
>> + LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>> + set +x
> 
>> The name of the log file is wrong.
> 
> Did you use the automasscheck-minimal.sh that was updated in the past few days,
> or is this an older version that worked in the past?


I'm on r1797310; today I'm getting:

+ run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ spam:dir:/dane/spam/20170601/
+ CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ shift
+ [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
+ LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ LOGNAME=mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ rm -f ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ set -x
+ ./mass-check --hamlog=ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log --spamlog=spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 --progress spam:dir:/dane/spam/20170601/
status: starting scan stage                              now: 2017-06-02 09:41:49
status: completed scan stage, 23 messages                now: 2017-06-02 09:41:49
status: starting run stage                               now: 2017-06-02 09:41:49
open of ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
+ LOGLIST=' ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
+ set +x
rsync -qPcvz  ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log mmiroslaw@rsync.spamassassin.org::corpus/
The source and destination cannot both be remote.
rsync error: syntax or usage error (code 1) at main.c(1274) [Receiver=3.1.2]
^C

Why are the log file names built with slashes? Slashes are not allowed in file names.

Marcin

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

On 06/01/2017 10:48 AM, David Jones wrote:
> From: marcin@mejor.pl <ma...@mejor.pl>
>      
>> On 01.06.2017 at 14:57, David Jones wrote:
>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>       
>>>
>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>
>> I'm having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>> LOGPREFIX="YOUR-USERNAME"
>> RSYNC_USERNAME="YOUR-USERNAME"
>> RSYNC_PASSWORD="YOUR-PASSWORD"
> Replace "YOUR-USERNAME" with your own rsync username, "mmiroslaw", in both places.
> Do the same for "YOUR-PASSWORD" with your personal rsync password.  If you don't
> remember your rsync password, I can send it to you off list.

Can you send me mine? I can't seem to authenticate with the username 
(jbrooks) and password I was given originally.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

From: marcin@mejor.pl <ma...@mejor.pl>
    
>On 01.06.2017 at 14:57, David Jones wrote:
>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>     
>> 
>> https://wiki.apache.org/spamassassin/NightlyMassCheck


>I'm having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>LOGPREFIX="YOUR-USERNAME"
>RSYNC_USERNAME="YOUR-USERNAME"
>RSYNC_PASSWORD="YOUR-PASSWORD"

Replace "YOUR-USERNAME" with your own rsync username, "mmiroslaw", in both places.
Do the same for "YOUR-PASSWORD" with your personal rsync password.  If you don't
remember your rsync password, I can send it to you off list.

>WORKDIR=~/sa-masscheck/tmp
>JOBS=1
>TRUSTED_NETWORKS=
>INTERNAL_NETWORKS=
>run_all_masschecks() {
>  run_masscheck  ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>                spam:dir:/dane/spam/20170601/
>}

>And this is what I got from running bash -x automasscheck-minimal.sh:
>[...]
>open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
>+ LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>+ set +x

>The name of the log file is wrong.

Did you use the automasscheck-minimal.sh that was updated in the past few days,
or is this an older version that worked in the past?

Dave

Re: Ruleqa masscheck so close.

Posted by "marcin@mejor.pl" <ma...@mejor.pl>.

On 01.06.2017 at 14:57, David Jones wrote:
>> From: Kevin A. McGrail <ke...@mcgrail.com>
>     
>> On 6/1/2017 8:45 AM, David Jones wrote:
>>>> I disabled my masscheck after a while because... well, it was pointless.
>>>> It's past the cron window for the day but if you need the data I can run
>>>> it manually, else it'll kick in again tomorrow.
>>> Why do you think it was pointless?
> 
>> I think he means while the server was offline.
> 
> Based on the last submissions I see on the server, they were in
> 2015 so we would love to have him and others "come back"...  :)
> 
> Also, I have updated the automasscheck-minimal.sh script slightly if
> anyone would like to update theirs and give some feedback.  I have
> been running it for the past few days just fine.
> 
> https://wiki.apache.org/spamassassin/NightlyMassCheck


I'm having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
LOGPREFIX="YOUR-USERNAME"
RSYNC_USERNAME="YOUR-USERNAME"
RSYNC_PASSWORD="YOUR-PASSWORD"
WORKDIR=~/sa-masscheck/tmp
JOBS=1
TRUSTED_NETWORKS=
INTERNAL_NETWORKS=
run_all_masschecks() {
  run_masscheck  ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
                spam:dir:/dane/spam/20170601/
}

And this is what I got from running bash -x automasscheck-minimal.sh:
[...]
+ run_all_masschecks
+ run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ spam:dir:/dane/spam/20170601/
+ CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ shift
+ [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
+ LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ LOGNAME=YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ rm -f ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ set -x
+ ./mass-check --hamlog=ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log --spamlog=spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 --progress spam:dir:/dane/spam/20170601/
status: starting scan stage                              now: 2017-06-01 15:31:47
status: completed scan stage, 21 messages                now: 2017-06-01 15:31:47
status: starting run stage                               now: 2017-06-01 15:31:47
open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
+ LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
+ set +x


The name of the log file is wrong.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

From: Kevin Golding <kp...@caomhin.org>
    
>On Thu, 01 Jun 2017 13:57:47 +0100, David Jones <dj...@ena.com.invalid> wrote:

>> From: Kevin A. McGrail <ke...@mcgrail.com>
>
>> On 6/1/2017 8:45 AM, David Jones wrote:
>>>> I disabled my masscheck after a while because... well, it was pointless.
>>>> It's past the cron window for the day but if you need the data I can run
>>>> it manually, else it'll kick in again tomorrow.
>>> Why do you think it was pointless?
>
>> I think he means while the server was offline.
>
> Based on the last submissions I see on the server, they were in
> 2015 so we would love to have him and others "come back"...  :)

Sorry.  That reply about 2015 was meant for marcin@mejor.pl, not Kevin.

Thanks marcin@mejor.pl for starting up your masschecks again.

>Apparently I started in 2010 and kept going until the system failure  
>earlier this year.

Thanks Kevin for running the masschecks for so long and starting up again.

Dave

Re: Ruleqa masscheck so close.

Posted by Kevin Golding <kp...@caomhin.org>.

On Thu, 01 Jun 2017 13:57:47 +0100, David Jones <dj...@ena.com.invalid>  
wrote:

>> From: Kevin A. McGrail <ke...@mcgrail.com>
>
>> On 6/1/2017 8:45 AM, David Jones wrote:
>>>> I disabled my masscheck after a while because... well, it was pointless.
>>>> It's past the cron window for the day but if you need the data I can run
>>>> it manually, else it'll kick in again tomorrow.
>>> Why do you think it was pointless?
>
>> I think he means while the server was offline.
>
> Based on the last submissions I see on the server, they were in
> 2015 so we would love to have him and others "come back"...  :)

Apparently I started in 2010 and kept going until the system failure  
earlier this year.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

>From: Kevin A. McGrail <ke...@mcgrail.com>

>On 6/1/2017 8:45 AM, David Jones wrote:
>>> I disabled my masscheck after a while because... well, it was pointless.
>>> It's past the cron window for the day but if you need the data I can run
>>> it manually, else it'll kick in again tomorrow.
>> Why do you think it was pointless?

>I think he means while the server was offline.

Based on the last submissions I see on the server, they were in
2015 so we would love to have him and others "come back"...  :)

Also, I have updated the automasscheck-minimal.sh script slightly if
anyone would like to update theirs and give some feedback.  I have
been running it for the past few days just fine.

https://wiki.apache.org/spamassassin/NightlyMassCheck

Dave

Re: Ruleqa masscheck so close.

Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.

On 6/1/2017 8:45 AM, David Jones wrote:
>> I disabled my masscheck after a while because... well, it was pointless.
>> It's past the cron window for the day but if you need the data I can run
>> it manually, else it'll kick in again tomorrow.
> Why do you think it was pointless?

I think he means while the server was offline.

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

From: Kevin Golding <kp...@caomhin.org>

>On Thu, 01 Jun 2017 03:52:42 +0100, David Jones <dj...@ena.com>
>wrote:

>> I am working pretty hard to get the ruleqa processing going again on our  
>> new server.   We are so close to having enough contributors and ham/spam  
>> to get some new rules generated:  This is from the run minutes ago:
>>
>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>
>> We need to recruit some more masscheck'ers to get over the hump so I can  
>> do some final testing of the rules updates and start the DNS updates  
>> again for sa-update.

>I disabled my masscheck after a while because... well, it was pointless.  
>It's past the cron window for the day but if you need the data I can run  
>it manually, else it'll kick in again tomorrow.

Why do you think it was pointless?  This does a couple of things:
1. It provides the feedback that rules need before they can be published to
the Internet via sa-update.
2. It adjusts 72_scores.cf based on recent spam/ham, which benefits
everyone around the world who runs sa-update regularly.

>> P.S. After spending the past month learning how this works, I have some  
>> ideas on how to make the nightly masschecks become hourly fairly easily  
>> so we can test and promote rule changes faster.

>You may need to explain the requirements for that. Are you asking for  
>hourly masscheck submissions?

Today the delay of up to 24 hours makes feedback and score updates pretty
slow.  I don't think this will ever update quickly enough to help with
zero-hour spam or to replace technologies that react quickly, like RBLs,
DCC, Pyzor, etc.

I have an idea that will allow masscheckers to cron the automasscheck.sh
script hourly which would only run the full masscheck when they detect a
new tagged ruleset to work with.  Basically it would do a quick rsync of the
latest tagged build dir like it does today but if there are no rsync changes,
it would simply exit.
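
A minimal sketch of that early-exit check, assuming the decision is made from the output of rsync --itemize-changes (a real rsync option that lists only the items rsync acted on). The helper name ruleset_changed and the RSYNC_SOURCE and STAGE_DIR variables are placeholders, not anything taken from the current script.

```shell
#!/bin/sh
# Sketch of the "exit early when nothing changed" idea; not the project's script.
# ruleset_changed is a hypothetical helper that inspects the text printed by
# rsync --itemize-changes.
ruleset_changed() {
    # ">" marks a file received, "c" a created item, "*deleting" a removal.
    # Lines starting with "." are attribute-only updates and are ignored.
    printf '%s\n' "$1" | grep -Eq '^(>|c|\*deleting)'
}

# Hourly cron entry point, left commented out because the source and
# destination below are placeholders:
# changes=$(rsync -a --delete --itemize-changes "$RSYNC_SOURCE" "$STAGE_DIR") || exit 1
# ruleset_changed "$changes" || exit 0   # no new tagged ruleset: nothing to do
# ./automasscheck-minimal.sh
```

Ignoring attribute-only lines is a design choice here, so that a directory timestamp change alone does not trigger a full masscheck.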

Everyone would still keep sorting ham/spam as they do today so there
would be no real change in that.  Hopefully everyone is sorting at least
every other day or every third day.  I try to sort some every day since I
also have this tied to local Bayes training to make this work a little more
worth the time and effort.

Dave

Re: Ruleqa masscheck so close.

Posted by Kevin Golding <kp...@caomhin.org>.

On Thu, 01 Jun 2017 03:52:42 +0100, David Jones <dj...@ena.com.invalid>  
wrote:

> I am working pretty hard to get the ruleqa processing going again on our  
> new server.   We are so close to having enough contributors and ham/spam  
> to get some new rules generated:  This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can  
> do some final testing of the rules updates and start the DNS updates  
> again for sa-update.

I disabled my masscheck after a while because... well, it was pointless.  
It's past the cron window for the day but if you need the data I can run  
it manually, else it'll kick in again tomorrow.

> P.S. After spending the past month learning how this works, I have some  
> ideas on how to make the nightly masschecks become hourly fairly easily  
> so we can test and promote rule changes faster.

You may need to explain the requirements for that. Are you asking for  
hourly masscheck submissions?

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

From: marcin@mejor.pl <ma...@mejor.pl>
    
>On 01.06.2017 at 04:52, David Jones wrote:
>> I am working pretty hard to get the ruleqa processing going again
>> on our new server.   We are so close to having enough contributors
>> and ham/spam to get some new rules generated:  This is from the run minutes ago:
>> 
>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>> 
>> We need to recruit some more masscheck'ers to get over the hump
>> so I can do some final testing of the rules updates and start the DNS
>> updates again for sa-update.

>Hi!
>Does my account "mmiroslaw" still exist?
 
Yes.  If you are still manually sorting ham and spam, please enable your
cron job to run just after 9:00 AM UTC.  This would help us a lot.

Thanks,
Dave

Re: Ruleqa masscheck so close.

Posted by "marcin@mejor.pl" <ma...@mejor.pl>.

On 01.06.2017 at 04:52, David Jones wrote:
> I am working pretty hard to get the ruleqa processing going again on our new server.   We are so close to having enough contributors and ham/spam to get some new rules generated:  This is from the run minutes ago:
> 
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
> 
> We need to recruit some more masscheck'ers to get over the hump so I can do some final testing of the rules updates and start the DNS updates again for sa-update.

Hi!
Does my account "mmiroslaw" still exist?

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

>From: John Brooks <jo...@fastquake.com>

>On 06/01/2017 02:05 PM, David Jones wrote:
>>
>> P.S. Based on some documentation I saw on the wiki, I have been moving
>> ham and spam older than 90 days into an archive folder.  But now that
>> I see the ham goes back 7 years, I guess I need to keep my ham in my
>> masscheck ham folder longer.
>>
>> Dave

>Where did you read that? This page says 6 years for ham: 
>https://wiki.apache.org/spamassassin/CorpusCleaning

I confused two different issues.  I am also training my Bayes DB with
the same sorted folders so I was limiting my Bayes DB training to 90
days.  I guess I can train my Bayes DB based on the same time periods
that the masscheck uses.

Dave

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

On 06/01/2017 02:05 PM, David Jones wrote:
>
> P.S. Based on some documentation I saw on the wiki, I have been moving
> ham and spam older than 90 days into an archive folder.  But now that
> I see the ham goes back 7 years, I guess I need to keep my ham in my
> masscheck ham folder longer.
>
> Dave

Where did you read that? This page says 6 years for ham: 
https://wiki.apache.org/spamassassin/CorpusCleaning

Re: Ruleqa masscheck so close.

Posted by David Jones <dj...@ena.com.INVALID>.

>From: John Brooks <jo...@fastquake.com>
    
>I'm now setting up my masscheck again (it wasn't running properly before
>and I didn't bother fixing it when the server went down for months). 
>It's just me on my mail server, and I don't get that much mail; maybe 
>5-10 ham/a few hundred spam per day, not counting mailing lists which I 
>don't include in my scans. So I was going to do weekly runs instead of 
>nightly.

>Would it be more useful to the project if I ran it nightly, despite the 
>low volume?

>John

Yes.  Please run it nightly even if you don't have a high volume of mail
or if you don't sort the ham/spam often.  The current logic that I found
while trying to get this rebuilt on the new server is the following:

10 minimum contributors on the latest "sa-update" tagged revision
AND
150,000 ham combined minimum over the past 84 months
AND
150,000 spam combined minimum over the past 2 months

This is all based on the latest tagged "sa-update" in this link:
https://svn.apache.org/viewvc/spamassassin/tags/?sortby=date#dirlist
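
The three conditions can be written out as a small check. This is only a sketch of the thresholds listed above, not the ruleqa server's code. The function and variable names are made up, and counting ham and spam contributors separately is an assumption based on the "HAM CONTRIBUTORS FOUND" and "SPAM CONTRIBUTORS FOUND" lines in the run output.

```shell
#!/bin/sh
# Sketch of the thresholds described above; NOT the ruleqa server's code.
# All names here are hypothetical.
MIN_CONTRIBUTORS=10
MIN_HAM=150000      # combined ham over the past 84 months
MIN_SPAM=150000     # combined spam over the past 2 months

enough_for_rule_update() {
    ham_contributors=$1
    spam_contributors=$2
    ham_total=$3
    spam_total=$4
    # Every condition must hold (logical AND), as in the list above.
    [ "$ham_contributors" -ge "$MIN_CONTRIBUTORS" ] &&
    [ "$spam_contributors" -ge "$MIN_CONTRIBUTORS" ] &&
    [ "$ham_total" -ge "$MIN_HAM" ] &&
    [ "$spam_total" -ge "$MIN_SPAM" ]
}
```

With 9 ham and 9 spam contributors, as in the latest run, the check fails no matter how many messages were submitted.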

The rsync staging dir with the latest "sa-update" tagged version is set up
shortly before 9:00 AM UTC, so it works best if automasscheck-minimal.sh
is cron'd to run a few minutes or so after the top of the hour.  Technically
it can be run anytime after 9:00 AM UTC for the next ~17 hours, but if
enough contributors ran in that first hour, we could potentially speed up
the sa-update process quite a bit without having to wait most of the day
like it does now.
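
If a wrapper script wants to guard against running outside that window, something like the sketch below could work. It assumes the window is exactly 09:00 UTC plus 17 hours, which is only the approximate figure given above, and in_masscheck_window is a made-up helper name.

```shell
#!/bin/sh
# Sketch only: treats the submission window as 09:00 UTC up to 02:00 UTC the
# next day (about 17 hours). in_masscheck_window is a hypothetical helper.
in_masscheck_window() {
    hour=${1#0}          # "09" -> "9" so it is never interpreted as octal
    hour=${hour:-0}      # a bare "0" would otherwise become empty
    [ "$hour" -ge 9 ] || [ "$hour" -lt 2 ]
}

# Possible use at the top of a wrapper script (commented out here):
# in_masscheck_window "$(date -u +%H)" || exit 0
```

Using date -u keeps the check independent of the local time zone of the machine running cron.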

P.S. Based on some documentation I saw on the wiki, I have been moving
ham and spam older than 90 days into an archive folder.  But now that
I see the ham goes back 7 years, I guess I need to keep my ham in my
masscheck ham folder longer.

Dave

Re: Ruleqa masscheck so close.

Posted by John Brooks <jo...@fastquake.com>.

I'm now setting up my masscheck again (it wasn't running properly before 
and I didn't bother fixing it when the server went down for months). 
It's just me on my mail server, and I don't get that much mail; maybe 
5-10 ham/a few hundred spam per day, not counting mailing lists which I 
don't include in my scans. So I was going to do weekly runs instead of 
nightly.

Would it be more useful to the project if I ran it nightly, despite the 
low volume?

John

On 05/31/2017 10:52 PM, David Jones wrote:
> I am working pretty hard to get the ruleqa processing going again on our new server.   We are so close to having enough contributors and ham/spam to get some new rules generated:  This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can do some final testing of the rules updates and start the DNS updates again for sa-update.
>
> P.S. After spending the past month learning how this works, I have some ideas on how to make the nightly masschecks become hourly fairly easily so we can test and promote rule changes faster.
>
> Dave
>