Posted to ruleqa@spamassassin.apache.org by David Jones <dj...@ena.com.INVALID> on 2017/06/01 02:52:42 UTC
Ruleqa masscheck so close.
I am working pretty hard to get the ruleqa processing going again on our new server. We are so close to having enough contributors and ham/spam to get some new rules generated. This is from the run a few minutes ago:
HAM CONTRIBUTORS FOUND: 9 (required 10)
SPAM CONTRIBUTORS FOUND: 9 (required 10)
We need to recruit some more masscheck'ers to get over the hump so I can do some final testing of the rules updates and start the DNS updates again for sa-update.
P.S. After spending the past month learning how this works, I have some ideas on how to make the nightly masschecks become hourly fairly easily so we can test and promote rule changes faster.
Dave
Re: Ruleqa masscheck so close.
Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.
Re: I was under the mistaken impression that we were supposed to be manually
sorting individual spam/ham to prevent duplicates
The biggest issue is that the spam corpus must be 100% spam and the ham corpus 100% ham.
You cannot trust automated or user submissions.
See https://wiki.apache.org/spamassassin/CorpusCleaning
Regards,
KAM
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
From: John Hardin <jh...@impsec.org>
On Fri, 2 Jun 2017, David Jones wrote:
>> > Wow! I just started collecting ham/spam for masscheck back in January
>> > and my (apparently) tiny corpus only takes under a minute to run on a
>> > low end 2 core VM. I didn't realize that there would be some that take
>> > a long time to run. Still pretty new to all of this backend processing.
>>
I was under the mistaken impression that we were supposed to be manually
sorting individual spam/ham to prevent duplicates which was very time
consuming. Now that I know how others are doing this, I have opened up
the spam/ham floodgates to sort them into staging folders for a quick review
and move into the final masscheck dirs.
>It shouldn't be *too* difficult to incorporate multi-core into the simple
>masscheck script. Take a look for a "-j" parameter in the docs, IIRC all I
>did was add "-j4" to the command line.
I watched my masscheck run a few minutes ago and it is definitely
threading out to take advantage of multiple cores. Now that I have a large
corpus to masscheck, I am going to bump up the RAM and the cores on
my VM. My run on about 8K messages took about 30 minutes on 2 cores.
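The timing above can be turned into a rough planning estimate. This is only a back-of-the-envelope sketch: it assumes scan time scales linearly with message count and core count, which real runs will not match exactly, and the function names are made up for illustration.

```python
def per_core_rate(messages, minutes, cores):
    """Messages scanned per minute per core, assuming cores are used evenly."""
    return messages / (minutes * cores)


def estimated_minutes(messages, rate, cores):
    """Wall-clock estimate under the optimistic assumption of linear scaling."""
    return messages / (rate * cores)


# Figures from the run above: about 8K messages in about 30 minutes on 2 cores.
rate = per_core_rate(8000, 30, 2)

# The same corpus on 4 cores, if scaling were perfectly linear.
print(round(estimated_minutes(8000, rate, 4)), "minutes (estimate)")
```

Under those assumptions, doubling the cores would bring the same 8K corpus down to roughly 15 minutes; disk and network checks would likely make the real figure worse.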
Dave
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
On 06/02/2017 03:23 PM, John Hardin wrote:
> On Fri, 2 Jun 2017, David Jones wrote:
>
>> On 06/01/2017 07:52 PM, John Hardin wrote:
>>> On Thu, 1 Jun 2017, David Jones wrote:
>>
>>> > I have not found enough details yet on the central masscheck so I have
>>> > started with getting the remote masscheck processing working first so
>>> > we can get sa-update going again.
>>>
>>> OK, I will focus on that instead.
>>>
>>> > > P.S. After spending the past month learning how this works, I have
>>> > > some ideas on how to make the nightly masschecks become hourly
>>> > > fairly easily so we can test and promote rule changes faster.
>>> >
>>> > > How do you guarantee all the contributors can perform the checks
>>> > > within an hour?
>>> >
>>> > Wow! I just started collecting ham/spam for masscheck back in January
>>> > and my (apparently) tiny corpus only takes under a minute to run on a
>>> > low end 2 core VM. I didn't realize that there would be some that take
>>> > a long time to run. Still pretty new to all of this backend processing.
>>>
>>> There are some contributors that run large honeypot networks.
>>>
>>> My fairly small corpus takes (IIRC, it's been a while since I ran a local
>>> masscheck) over an hour on a dedicated 4-core box...
>>>
>>> > Dave
>>
>> John,
>>
>> Can you watch your 4-core box when your local masscheck is running?
>> Unless you have written something extra or there is another masscheck
>> script that I haven't found, the masscheck processing is single threaded.
>
> I haven't been using the simple masscheck script because I was just
> doing it locally for my own consumption. It's a custom script.
>
>> It doesn't matter if you have 24 cores, it's still going to take over
>> an hour. We may need to look at writing a newer masscheck script that
>> will use async processing to get that time way down.
>
> There is a parameter you can set somewhere to say how many parallel
> scanning processes to run, and I was running 4. That box is powered down
> at the moment, I will need to reboot it to look at my scripting. I'll do
> that and post the details this evening or tomorrow.
>
> It shouldn't be *too* difficult to incorporate multi-core into the
> simple masscheck script. Take a look for a "-j" parameter in the docs,
> IIRC all I did was add "-j4" to the command line.
>
I recall that now. My memory is not what it used to be. I have to
triple check things these days and I didn't in this case. :)
--
Dave
Re: Ruleqa masscheck so close.
Posted by John Hardin <jh...@impsec.org>.
On Fri, 2 Jun 2017, David Jones wrote:
> On 06/01/2017 07:52 PM, John Hardin wrote:
>> On Thu, 1 Jun 2017, David Jones wrote:
>
>> > I have not found enough details yet on the central masscheck so I have
>> > started with getting the remote masscheck processing working first so
>> > we can get sa-update going again.
>>
>> OK, I will focus on that instead.
>>
>> > > P.S. After spending the past month learning how this works, I have
>> > > some ideas on how to make the nightly masschecks become hourly
>> > > fairly easily so we can test and promote rule changes faster.
>> >
>> > > How do you guarantee all the contributors can perform the checks
>> > > within an hour?
>> >
>> > Wow! I just started collecting ham/spam for masscheck back in January
>> > and my (apparently) tiny corpus only takes under a minute to run on a
>> > low end 2 core VM. I didn't realize that there would be some that take
>> > a long time to run. Still pretty new to all of this backend processing.
>>
>> There are some contributors that run large honeypot networks.
>>
>> My fairly small corpus takes (IIRC, it's been a while since I ran a local
>> masscheck) over an hour on a dedicated 4-core box...
>>
>> > Dave
>
> John,
>
> Can you watch your 4-core box when your local masscheck is running? Unless
> you have written something extra or there is another masscheck script that I
> haven't found, the masscheck processing is single threaded.
I haven't been using the simple masscheck script because I was just doing
it locally for my own consumption. It's a custom script.
> It doesn't matter if you have 24 cores, it's still going to take over an
> hour. We may need to look at writing a newer masscheck script that will
> use async processing to get that time way down.
There is a parameter you can set somewhere to say how many parallel
scanning processes to run, and I was running 4. That box is powered down
at the moment, I will need to reboot it to look at my scripting. I'll do
that and post the details this evening or tomorrow.
It shouldn't be *too* difficult to incorporate multi-core into the simple
masscheck script. Take a look for a "-j" parameter in the docs, IIRC all I
did was add "-j4" to the command line.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Our government should bear in mind the fact that the American
Revolution was touched off by the then-current government
attempting to confiscate firearms from the people.
-----------------------------------------------------------------------
4 days until the 73rd anniversary of D-Day
Re: Ruleqa masscheck so close.
Posted by Dave Jones <da...@apache.org>.
On 06/04/2017 01:18 PM, Jari Fredriksson wrote:
> Cool. I have been doing masschecks somewhere at 1700UTC but now changed
> it to take place at 1200EET. My corpus takes 3-4 hours on 4 core (Core
> i7 920) or 1/2h at Google Compute @32 cores and ramdisk. I'm not looking
> much at the planned hourly submission, but it remains to be seen, what
> would it be.
>
Daily masschecks are probably enough. I thought that was the only time
that rules were able to be updated. Still learning/discovering what
used to run on the previous servers a few months ago.
It's very possible that as I keep digging in the old server backups and
logs, I may find that buildbot was linked to SVN commits so the devs
could do on-demand rule updates.
> Ideally I would want to do this at night time when the electricity is
> cheap...
>
If we can get enough interest in running masschecks twice a day to allow
for some flexibility in the time for electricity costs, then we could
let people choose to do both or stay with a single run. I would be
willing to do a 9:00 UTC run and a 21:00 UTC run if anyone else thinks
this would be worth it.
For our current setup at 9:00 UTC, if we can get enough masscheck
submissions by 13:00 UTC, then I can set up a script to detect when we
have met the minimum requirements and run the score updates without
having to wait another ~20 hours as the cron job does today.
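A minimal sketch of that kind of readiness check might look like the following. It assumes submissions are log files named like ham-ena.log and spam-ena.log, and that the minimum is counted per log file; both the counting rule and the function names are assumptions for illustration, not what the server actually runs.

```python
REQUIRED_CONTRIBUTORS = 10  # minimum quoted in the status output earlier in the thread


def count_by_class(log_names):
    """Count submitted logs per class from names like 'ham-ena.log'."""
    counts = {"ham": 0, "spam": 0}
    for name in log_names:
        kind = name.split("-", 1)[0]  # text before the first dash
        if kind in counts:
            counts[kind] += 1
    return counts


def ready_for_scoring(log_names, required=REQUIRED_CONTRIBUTORS):
    """True once both ham and spam submissions meet the minimum."""
    counts = count_by_class(log_names)
    return counts["ham"] >= required and counts["spam"] >= required


nine = [f"{kind}-user{i}.log" for kind in ("ham", "spam") for i in range(9)]
print(ready_for_scoring(nine))  # 9 of each is one short of the minimum
```

A cron job could call something like this every few minutes after the run starts and kick off the score update the first time it returns True.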
Dave
Re: Ruleqa masscheck so close.
Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.
There is a way to set a ceiling. My masscheck is still messed up as I am working on foundational issues to launch things at a new data center.
Is there a score set anywhere for URI_WP_HACKED? I can look later but there are ways to force scores and especially force a ceiling. I equate it to doing manual square roots. We have to feed it an educated starting point and let it float from there.
It might be more pinned than it should be!
Regards,
KAM
On June 9, 2017 9:03:16 AM EDT, David Jones <dj...@ena.com.INVALID> wrote:
>Question about the URI_WP_HACKED rule. Why is it still at the default
>of 1.0 since its S/O on http://ruleqa.spamassassin.org has been 1.000
>for a long time?
>
>What sets the default scores in 50_scores.cf and what determines what goes
>into the nightly 72_scores.cf? Is there still something I need to find
>and get running again on the new server?
Re: Ruleqa masscheck so close.
Posted by John Hardin <jh...@impsec.org>.
On Fri, 9 Jun 2017, David Jones wrote:
> Question about the URI_WP_HACKED rule. Why is it still at the default of 1.0
> since its S/O on http://ruleqa.spamassassin.org has been 1.000 for a long
> time?
There's a limit of 3.000, the score generator decides on a score up to
that.
It takes into account total hits and score already on those messages as
well as S/O. A rule with an S/O of 1.00 that hits on few messages that
already score well, won't be scored very high.
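To make that last point concrete, here is a toy calculation. It is not the score generator, only an illustration of why a rule adds little when the messages it hits are already well over the spam threshold. The threshold of 5.0 is SpamAssassin's default required_score; the sample scores and the function name are invented.

```python
SPAM_THRESHOLD = 5.0  # SpamAssassin's default required_score


def newly_caught(scores_without_rule, rule_score, threshold=SPAM_THRESHOLD):
    """Count spam messages the rule would lift from below the threshold to at or above it."""
    return sum(
        1 for s in scores_without_rule
        if s < threshold <= s + rule_score
    )


already_high = [12.4, 9.8, 15.1, 7.3]  # invented: spam that other rules already catch
borderline = [3.1, 4.2, 2.5, 4.9]      # invented: spam sitting just under the threshold

print(newly_caught(already_high, 3.0), newly_caught(borderline, 3.0))
```

On the first list no score changes the outcome, so there is nothing to gain from raising it; on the second list the score matters a great deal.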
> What sets the default scores in 50_scores.cf
That's manual now.
> and what determines what goes into the nightly 72_scores.cf?
The masscheck score generation process. Running the "what hits"
distributed part of masscheck is only part of it. There's another phase
that takes that data from everybody and calculates rule scores.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
How can you reason with someone who thinks we're on a glidepath to
a police state and yet their solution is to grant the government a
monopoly on force? They are insane.
-----------------------------------------------------------------------
71 days since the first commercial re-flight of an orbital booster (SpaceX)
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
On 06/04/2017 01:18 PM, Jari Fredriksson wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Cool. I have been doing masschecks somewhere at 1700UTC but now changed
> it to take place at 1200EET. My corpus takes 3-4 hours on 4 core (Core
> i7 920) or 1/2h at Google Compute @32 cores and ramdisk. I'm not looking
> much at the planned hourly submission, but it remains to be seen, what
> would it be.
>
> Ideally I would want to do this at night time when the electricity is
> cheap...
>
My 'ena' corpus is now up to 85K (28K spam/57K ham), growing about 10-12K a
day and scoring consistently on the ham/spam rule hits:
Rule hit frequencies:
OVERALL    SPAM     HAM  NAME
  85428   28358   57070  (all messages)
   8647    8643       4  URI_WP_HACKED
   3914    3914       0  HELO_MISC_IP
   3138    3136       2  DATE_IN_FUTURE_06_12
   2897    2896       1  T_PDS_TO_EQ_FROM_NAME
   3105    3102       3  T_PDS_FROM_2_EMAILS
   2731    2729       2  DRUGS_ERECTILE
   2415    2415       0  URI_ONLY_MSGID_MALF
   2402    2402       0  DOS_OE_TO_MX
   2509    2507       2  LONGWORDS
   1928    1928       0  DRUGS_ERECTILE_OBFU
   3644    3622      22  MIMEOLE_DIRECT_TO_MX
   1820    1817       3  MISSING_SUBJECT
   1657    1657       0  FUZZY_PHARMACY
   1648    1648       0  DOS_OUTLOOK_TO_MX
   2709    2693      16  T_NAME_EMAIL_DIFF
   1545    1544       1  DATE_IN_FUTURE_03_06
   1514    1514       0  MISSING_MIME_HB_SEP
   1142    1142       0  SUBJECT_DRUG_GAP_L
   1128    1127       1  FUZZY_PRICES
   1064    1064       0  SUBJECT_DRUG_GAP_C
My masscheck processing is taking about 2 hours on my 4 core VM.
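For reference, the S/O (spam over overall) figure for a single rule can be worked out from numbers like the ones above. The sketch below shows two plausible definitions: a raw ratio of hit counts, and a ratio of hit rates that corrects for the ham and spam corpora being different sizes. Which one ruleqa displays is an assumption here, and the function names are invented.

```python
def s_o_raw(spam_hits, ham_hits):
    """Share of this rule's hits that landed on spam."""
    return spam_hits / (spam_hits + ham_hits)


def s_o_normalized(spam_hits, ham_hits, spam_total, ham_total):
    """Same idea, but comparing hit rates so corpus size imbalance cancels out."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    return spam_rate / (spam_rate + ham_rate)


# URI_WP_HACKED in this corpus: 8643 spam hits, 4 ham hits,
# out of 28358 spam and 57070 ham messages in total.
print(f"raw:        {s_o_raw(8643, 4):.4f}")
print(f"normalized: {s_o_normalized(8643, 4, 28358, 57070):.4f}")
```

Either way the rule lands very close to 1.0 on this corpus, in line with what ruleqa has been showing.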
Question about the URI_WP_HACKED rule. Why is it still at the default
of 1.0 since its S/O on http://ruleqa.spamassassin.org has been 1.000
for a long time?
What sets the default scores in 50_scores.cf and what determines what goes
into the nightly 72_scores.cf? Is there still something I need to find
and get running again on the new server?
--
Dave
Re: Ruleqa masscheck so close.
Posted by Jari Fredriksson <ja...@iki.fi>.
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
David Jones kirjoitti 3.6.2017 0:09:
> On 06/02/2017 04:01 PM, Kevin Golding wrote:
>> On Fri, 02 Jun 2017 21:06:37 +0100, David Jones <dj...@ena.com.invalid> wrote:
>>
>>> P.S. we are currently still at only 9 masscheck contributors in the past day so we need one or two more.
>>
>> Unfortunately with ruleqa down you're the only person who knows the upload status. I received confirmation from my system at 11:56:06 which said my uploads were successful.
>>
>> Given you reported 9 yesterday I don't know if there's an issue at the server with my upload or if someone else dropped off. If there is an issue with my upload then please tell me and I'll look into it, but I have no way of telling from the data you've provided.
>
> Woohoo! We now have 11...
>
> SVN tagged rev in nightly_mass_check: 1797329
>
> New masscheck submission listings in the past day:
> SVN rev (Match) File Name (Date)
> 1797329 (Yes) - spam-darxus.log (Jun 2 02:09)
> 1797329 (Yes) - ham-kgolding.log (Jun 2 05:00)
> 1797329 (Yes) - ham-darxus.log (Jun 2 02:09)
> 1797329 (Yes) - ham-grenier.log (Jun 2 02:02)
> 1797329 (Yes) - ham-ena.log (Jun 2 02:07)
> 1797329 (Yes) - spam-jbrooks.log (Jun 2 13:00)
> 1797329 (Yes) - spam-axb-generic.log (Jun 2 04:41)
> 1797329 (Yes) - spam-axb-ham-misc.log (Jun 2 04:41)
> 1797329 (Yes) - spam-grenier.log (Jun 2 02:02)
> 1797329 (Yes) - ham-axb-ham-misc.log (Jun 2 04:41)
> 1797329 (Yes) - spam-kgolding.log (Jun 2 05:00)
> 1797329 (Yes) - ham-axb-generic.log (Jun 2 04:41)
> 1797329 (Yes) - ham-axb-ninja.log (Jun 2 04:41)
> 1797329 (Yes) - spam-axb-ninja.log (Jun 2 04:41)
> 1797329 (Yes) - ham-jbrooks.log (Jun 2 13:00)
> 1797329 (Yes) - spam-jarif.log (Jun 2 12:50)
> 1797329 (Yes) - ham-thendrikx.log (Jun 2 02:04)
> 1797329 (Yes) - ham-jarif.log (Jun 2 12:50)
> 1797329 (Yes) - spam-axb-coi-bulk.log (Jun 2 04:41)
> 1797329 (Yes) - spam-thendrikx.log (Jun 2 02:04)
> 1797329 (Yes) - ham-axb-coi-bulk.log (Jun 2 04:41)
> 1797329 (Yes) - spam-ena.log (Jun 2 02:07)
>
> 22/22 matches (11 ham, 11 spam)
Cool. I have been doing masschecks somewhere at 1700UTC but now changed
it to take place at 1200EET. My corpus takes 3-4 hours on 4 core (Core
i7 920) or 1/2h at Google Compute @32 cores and ramdisk. I'm not looking
much at the planned hourly submission, but it remains to be seen, what
would it be.
Ideally I would want to do this at night time when the electricity is
cheap...
- --
jarif@iki.fi
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iEYEARECAAYFAlk0TtoACgkQKL4IzOyjSrYrxACfajqt2TqTJIEW7OWW2y4n9wuL
iZMAn0nwgL64OXMVdVaXGIQS5FvEgdDZ
=4BgA
-----END PGP SIGNATURE-----
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
On 06/02/2017 04:01 PM, Kevin Golding wrote:
> On Fri, 02 Jun 2017 21:06:37 +0100, David Jones <dj...@ena.com.invalid>
> wrote:
>
>> P.S. we are currently still at only 9 masscheck contributors in the
>> past day so we need one or two more.
>
> Unfortunately with ruleqa down you're the only person who knows the
> upload status. I received confirmation from my system at 11:56:06 which
> said my uploads were successful.
>
> Given you reported 9 yesterday I don't know if there's an issue at the
> server with my upload or if someone else dropped off. If there is an
> issue with my upload then please tell me and I'll look into it, but I
> have no way of telling from the data you've provided.
Woohoo! We now have 11...
SVN tagged rev in nightly_mass_check: 1797329
New masscheck submission listings in the past day:
SVN rev (Match) File Name (Date)
1797329 (Yes) - spam-darxus.log (Jun 2 02:09)
1797329 (Yes) - ham-kgolding.log (Jun 2 05:00)
1797329 (Yes) - ham-darxus.log (Jun 2 02:09)
1797329 (Yes) - ham-grenier.log (Jun 2 02:02)
1797329 (Yes) - ham-ena.log (Jun 2 02:07)
1797329 (Yes) - spam-jbrooks.log (Jun 2 13:00)
1797329 (Yes) - spam-axb-generic.log (Jun 2 04:41)
1797329 (Yes) - spam-axb-ham-misc.log (Jun 2 04:41)
1797329 (Yes) - spam-grenier.log (Jun 2 02:02)
1797329 (Yes) - ham-axb-ham-misc.log (Jun 2 04:41)
1797329 (Yes) - spam-kgolding.log (Jun 2 05:00)
1797329 (Yes) - ham-axb-generic.log (Jun 2 04:41)
1797329 (Yes) - ham-axb-ninja.log (Jun 2 04:41)
1797329 (Yes) - spam-axb-ninja.log (Jun 2 04:41)
1797329 (Yes) - ham-jbrooks.log (Jun 2 13:00)
1797329 (Yes) - spam-jarif.log (Jun 2 12:50)
1797329 (Yes) - ham-thendrikx.log (Jun 2 02:04)
1797329 (Yes) - ham-jarif.log (Jun 2 12:50)
1797329 (Yes) - spam-axb-coi-bulk.log (Jun 2 04:41)
1797329 (Yes) - spam-thendrikx.log (Jun 2 02:04)
1797329 (Yes) - ham-axb-coi-bulk.log (Jun 2 04:41)
1797329 (Yes) - spam-ena.log (Jun 2 02:07)
22/22 matches (11 ham, 11 spam)
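A listing in this shape is easy to tally by machine. The sketch below parses lines like the ones above and reproduces the "matches" summary; the regular expression is inferred from this one listing, and the "(No)" alternative for a mismatched revision is a guess, since only "(Yes)" appears here.

```python
import re

# Example line: "1797329 (Yes) - ham-axb-ham-misc.log (Jun 2 04:41)"
# "(No)" is assumed for mismatched revisions; it does not appear in the listing.
LINE = re.compile(r"^(\d+) \((Yes|No)\) - (ham|spam)-(\S+)\.log \((.+)\)$")


def summarize(listing):
    """Return (total log lines, matching logs per class) for a submission listing."""
    matches = {"ham": 0, "spam": 0}
    total = 0
    for line in listing.splitlines():
        m = LINE.match(line.strip())
        if not m:
            continue  # skip headers and blank lines
        total += 1
        if m.group(2) == "Yes":
            matches[m.group(3)] += 1
    return total, matches


sample = """\
SVN rev (Match) File Name (Date)
1797329 (Yes) - spam-darxus.log (Jun 2 02:09)
1797329 (Yes) - ham-kgolding.log (Jun 2 05:00)
1797329 (Yes) - ham-axb-ham-misc.log (Jun 2 04:41)
"""
total, matches = summarize(sample)
print(f"{sum(matches.values())}/{total} matches "
      f"({matches['ham']} ham, {matches['spam']} spam)")
```

Note that the class is taken from the start of the file name only, so a corpus called axb-ham-misc still counts as spam when its log is spam-axb-ham-misc.log.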
--
Dave
Re: Ruleqa masscheck so close.
Posted by Kevin Golding <kp...@caomhin.org>.
On Fri, 02 Jun 2017 21:06:37 +0100, David Jones <dj...@ena.com.invalid>
wrote:
> P.S. we are currently still at only 9 masscheck contributors in the past
> day so we need one or two more.
Unfortunately with ruleqa down you're the only person who knows the upload
status. I received confirmation from my system at 11:56:06 which said my
uploads were successful.
Given you reported 9 yesterday I don't know if there's an issue at the
server with my upload or if someone else dropped off. If there is an issue
with my upload then please tell me and I'll look into it, but I have no
way of telling from the data you've provided.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
On 06/01/2017 07:52 PM, John Hardin wrote:
> On Thu, 1 Jun 2017, David Jones wrote:
>> I have not found enough details yet on the central masscheck so I have
>> started with getting the remote masscheck processing working first so
>> we can get sa-update going again.
>
> OK, I will focus on that instead.
>
>>> P.S. After spending the past month learning how this works, I have some
>>> ideas on how to make the nightly masschecks become hourly fairly easily
>>> so we can test and promote rule changes faster.
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>>
>> Wow! I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM. I didn't realize that there would be some that take
>> a long time to run. Still pretty new to all of this backend processing.
>
> There are some contributors that run large honeypot networks.
>
> My fairly small corpus takes (IIRC, it's been a while since I ran a
> local masscheck) over an hour on a dedicated 4-core box...
>
>> Dave
>
John,
Can you watch your 4-core box when your local masscheck is running?
Unless you have written something extra or there is another masscheck
script that I haven't found, the masscheck processing is single
threaded. It doesn't matter if you have 24 cores, it's still going to
take over an hour. We may need to look at writing a newer masscheck
script that will use async processing to get that time way down.
P.S. we are currently still at only 9 masscheck contributors in the past
day so we need one or two more.
--
Dave
Re: Ruleqa masscheck so close.
Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.
On 6/1/2017 9:42 PM, Dave Jones wrote:
> How are these large honeypot contributors sorting the ham/spam
> at a large scale without spending all day manually doing it?
With the honeypot that I run, it's 100% spam. No need to sort.
Regards,
KAM
Re: Ruleqa masscheck so close.
Posted by Dave Jones <da...@apache.org>.
On 06/01/2017 07:52 PM, John Hardin wrote:
> On Thu, 1 Jun 2017, David Jones wrote:
>
>>> From: John Hardin <jh...@impsec.org>
>>
>>>> On Thu, 1 Jun 2017, David Jones wrote:
>>
>>>> I am working pretty hard to get the ruleqa processing going
>>>> again on our new server. We are so close to having enough
>>>> contributors and ham/spam to get some new rules generated:
>>>> This is from the run minutes ago:
>>>>
>>>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>>>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>>>
>>>> We need to recruit some more masscheck'ers to get over the hump so I
>>>> can
>>>> do some final testing of the rules updates and start the DNS updates
>>>> again for sa-update.
>>
>>> I upload my corpora, not my results - I suppose that my corpora didn't
>>> survive the migration, and I haven't yet brought my submission bot and
>>> rsync account up-to-date for the new hardware - my apologies. I may be
>>> able to give that some cycles this weekend.
>>
>> So, about that. I just started helping as a sysadmin a month or so
>> ago. We
>> had some hosting issues that we are trying to recover from with very
>> little
>> documentation of the infrastructure 3 months ago. I am having to dig
>> through logs, cron output, and old (outdated) documentation to try to put
>> the puzzle back together again.
>>
>> If anyone has any knowledge or documentation of how things were setup
>> in the past, I would love to talk with you.
>>
>> We do have backups from one of the servers but I think there were two
>> or three servers before based on the fact that I can't find any evidence
>> where buildbot was running that I think was running the centralized
>> masscheck.
>>
>>> However, I don't recall seeing confirmation that the central
>>> masscheck was
>>> actually working; can you confirm that? Or do I need to change over to
>>> local masscheck and uploading results like most others do?
>>
>> I have not found enough details yet on the central masscheck so I have
>> started with getting the remote masscheck processing working first so
>> we can get sa-update going again.
>
> OK, I will focus on that instead.
>
>>> P.S. After spending the past month learning how this works, I have some
>>> ideas on how to make the nightly masschecks become hourly fairly easily
>>> so we can test and promote rule changes faster.
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>>
>> Wow! I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM. I didn't realize that there would be some that take
>> a long time to run. Still pretty new to all of this backend processing.
>
> There are some contributors that run large honeypot networks.
>
Cool. How are these large honeypot contributors sorting the ham/spam
at a large scale without spending all day manually doing it? I have the
potential of getting large amounts of ham/spam but I don't have the time
to manually sort all of it.
> My fairly small corpus takes (IIRC, it's been a while since I ran a
> local masscheck) over an hour on a dedicated 4-core box...
>
So if we can get this to run as close to 9:00 UTC as possible, then maybe
we can get enough contributions by 11:00 or 12:00 UTC to roll out new
ruleqa and scores for sa-update sooner than in the past.
--
David Jones
Re: Ruleqa masscheck so close.
Posted by John Hardin <jh...@impsec.org>.
On Thu, 1 Jun 2017, David Jones wrote:
>> From: John Hardin <jh...@impsec.org>
>
>>> On Thu, 1 Jun 2017, David Jones wrote:
>
>>> I am working pretty hard to get the ruleqa processing going
>>> again on our new server. We are so close to having enough
>>> contributors and ham/spam to get some new rules generated:
>>> This is from the run minutes ago:
>>>
>>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>>
>>> We need to recruit some more masscheck'ers to get over the hump so I can
>>> do some final testing of the rules updates and start the DNS updates
>>> again for sa-update.
>
>> I upload my corpora, not my results - I suppose that my corpora didn't
>> survive the migration, and I haven't yet brought my submission bot and
>> rsync account up-to-date for the new hardware - my apologies. I may be
>> able to give that some cycles this weekend.
>
> So, about that. I just started helping as a sysadmin a month or so ago. We
> had some hosting issues that we are trying to recover from with very little
> documentation of the infrastructure 3 months ago. I am having to dig
> through logs, cron output, and old (outdated) documentation to try to put
> the puzzle back together again.
>
> If anyone has any knowledge or documentation of how things were setup
> in the past, I would love to talk with you.
>
> We do have backups from one of the servers but I think there were two
> or three servers before based on the fact that I can't find any evidence
> where buildbot was running that I think was running the centralized
> masscheck.
>
>> However, I don't recall seeing confirmation that the central masscheck was
>> actually working; can you confirm that? Or do I need to change over to
>> local masscheck and uploading results like most others do?
>
> I have not found enough details yet on the central masscheck so I have
> started with getting the remote masscheck processing working first so
> we can get sa-update going again.
OK, I will focus on that instead.
>> P.S. After spending the past month learning how this works, I have some
>> ideas on how to make the nightly masschecks become hourly fairly easily
>> so we can test and promote rule changes faster.
>
>> How do you guarantee all the contributors can perform the checks within
>> an hour?
>
> Wow! I just started collecting ham/spam for masscheck back in January
> and my (apparently) tiny corpus only takes under a minute to run on a
> low end 2 core VM. I didn't realize that there would be some that take
> a long time to run. Still pretty new to all of this backend processing.
There are some contributors that run large honeypot networks.
My fairly small corpus takes (IIRC, it's been a while since I ran a local
masscheck) over an hour on a dedicated 4-core box...
> Dave
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
If you trust the government, you obviously failed history class.
-- Don Freeman
-----------------------------------------------------------------------
5 days until the 73rd anniversary of D-Day
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
On 06/01/2017 08:29 PM, David Jones wrote:
>> From: John Brooks <jo...@fastquake.com>
>
>> On 2017-06-01 07:59 PM, David Jones wrote:
>>>> How do you guarantee all the contributors can perform the checks within
>>>> an hour?
>>> Wow! I just started collecting ham/spam for masscheck back in January
>>> and my (apparently) tiny corpus only takes under a minute to run on a
>>> low end 2 core VM. I didn't realize that there would be some that take
>>> a long time to run. Still pretty new to all of this backend processing.
>>>
>>> Dave
>> Today while setting it up, I did a test run, and it took 2 hours on my
>> single core VPS. My corpus is ~40k messages: ~30k spam from the last 3
>> months (spammers love me), the rest is ham from the last 4 years. But
>> that was a full run, clean slate. Maybe it doesn't have to re-scan every
>> message the next time, but I have no clue because I haven't done it yet.
> Impressive. Are you manually sorting those 40K messages in the past 3
> months? I have a couple of domains that attract a lot of spam but I
> couldn't sort all of that volume of mail so I use a few RBLs to block
> the "low hanging fruit" and then have rules put ham and spam into
> folders to make my manual sorting doable in 15 to 30 minutes a day.
>
> I could remove my RBLs and setup rules to automatically sort a ton of
> spam and ham into folders and have a pretty good accuracy but I don't
> think that is what we are supposed to be doing. This would have a lot
> of duplicates and could have some incorrectly categorized ham and
> spam.
>
> Dave
For the spam, almost all of it goes into my Junk folder right off the
bat (thanks to spamassassin). Then I search my junk folder in
Thunderbird for a couple of unique strings that are often present in
spam that I get, visually skim what comes up for any false positives,
and move the spam to the spam corpus folder. That covers most of the
spam that I get, and it doesn't take long because most of the work is
done by the string search giving me enough confidence to not have to
properly inspect every message. Anything that remains, I classify
manually in the normal way.
The amount of ham that I get isn't insane, like I said those 10k
messages are from the past 4 years. I sort them into folders (no
differently from how any organized email user would), and set those
folders to be scanned as ham.
Re: Ruleqa masscheck so close.
Posted by Kevin Golding <kp...@caomhin.org>.
On Fri, 02 Jun 2017 01:29:12 +0100, David Jones <dj...@ena.com.invalid>
wrote:
> I could remove my RBLs and setup rules to automatically sort a ton of
> spam and ham into folders and have a pretty good accuracy but I don't
> think that is what we are supposed to be doing. This would have a lot
> of duplicates and could have some incorrectly categorized ham and
> spam.
I use automatic sorting to make life easier, but then manually check:
USER_IN_DEF_WHITELIST is probably going to be ham. Put it in a whitelist
pile and scroll through it quickly to spot anything that doesn't belong.
Things like URIBL_BLACK and the Spamhaus rules tend to be a pretty good
sign of spam so they get put in the RBL pile for a quick once over.
There are certain accounts which should never get ham. They can go into
another nice pile to once more whizz through.
Anything misclassified in those filters tends to stand out a mile (and is
pretty rare too). If I'm short of time I can move anything I question in
those quick scans into a discard pile, or if I've got a bit longer I can
look at it more carefully. If I'm not 100% on anything it stays discarded.
I have more rules in place for moving things around into different piles.
I have different levels of trust for different piles, but probably 80-90%
of my corpora takes maybe 10 minutes per day using that approach. Granted
that's come from refining it over time and learning the mail flows I'm
using, probably when I started it was slower and clumsier. If I have a few
days off and come back to a mountain that needs sorting... well the easy
stuff gets dealt with and the tough stuff is moved to discard.
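The pile approach described above could be sketched roughly like this. It is only an illustration: the pile names, the order in which the checks are applied, and the trap address are invented, and the Spamhaus rules are left out because no specific rule names are given. Every pile still gets a manual skim before anything moves into the corpus.

```python
def choose_pile(rule_hits, recipient, trap_accounts):
    """Pick a review pile for one message from its rule hits and recipient."""
    hits = set(rule_hits)
    if recipient in trap_accounts:
        return "trap"        # accounts that should never receive ham
    if "USER_IN_DEF_WHITELIST" in hits:
        return "whitelist"   # probably ham, skim for anything that doesn't belong
    if "URIBL_BLACK" in hits:
        return "rbl"         # probably spam, quick once-over
    return "manual"          # no shortcut applies: inspect properly or discard


traps = {"old-address@example.com"}  # hypothetical trap account
print(choose_pile(["URIBL_BLACK", "HTML_MESSAGE"], "me@example.com", traps))
```

The point of the piles is not to classify automatically but to put similar messages together so that anything misfiled stands out during the quick scan.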
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
>From: John Brooks <jo...@fastquake.com>
>On 2017-06-01 07:59 PM, David Jones wrote:
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>> Wow! I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM. I didn't realize that there would be some that take
>> a long time to run. Still pretty new to all of this backend processing.
>>
>> Dave
>Today while setting it up, I did a test run, and it took 2 hours on my
>single core VPS. My corpus is ~40k messages: ~30k spam from the last 3
>months (spammers love me), the rest is ham from the last 4 years. But
>that was a full run, clean slate. Maybe it doesn't have to re-scan every
>message the next time, but I have no clue because I haven't done it yet.
Impressive. Are you manually sorting those 40K messages in the past 3
months? I have a couple of domains that attract a lot of spam but I
couldn't sort all of that volume of mail so I use a few RBLs to block
the "low hanging fruit" and then have rules put ham and spam into
folders to make my manual sorting doable in 15 to 30 minutes a day.
I could remove my RBLs and set up rules to automatically sort a ton of
spam and ham into folders and have a pretty good accuracy but I don't
think that is what we are supposed to be doing. This would have a lot
of duplicates and could have some incorrectly categorized ham and
spam.
Dave
Re: Ruleqa masscheck so close.
Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.
On 6/3/2017 2:46 PM, John Brooks wrote:
> Well, for me, this only matters for masschecks (and only weekly
> masschecks have the --net option set). My normal mail volume per day
> is small enough that the number of DNS queries isn't an issue since
> they're spread out over time. It's just scanning thousands of messages
> all at once that concerns me.
>
> If I were to disable most of the DNSBLs for my weekly masschecks, I
> might as well just remove the --net argument. But I don't want to
> diminish the quality of the masscheck data if I can avoid it. I also
> don't want to abuse free DNSBLs with heavy load, however.
Gents, I'm not aware of any RBL blocking anyone from the masschecking.
If it is a problem, I am sure we have contacts to reach out to and
request a larger limit, etc.
Regards,
KAM
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
On 2017-06-03 12:14 PM, David Jones wrote:
> On 06/03/2017 10:47 AM, John Brooks wrote:
>> On 2017-06-01 08:14 PM, John Brooks wrote:
>>> On 2017-06-01 07:59 PM, David Jones wrote:
>>>>
>>>>> How do you guarantee all the contributors can perform the checks
>>>>> within
>>>>> an hour?
>>>> Wow! I just started collecting ham/spam for masscheck back in January
>>>> and my (apparently) tiny corpus only takes under a minute to run on a
>>>> low end 2 core VM. I didn't realize that there would be some that
>>>> take
>>>> a long time to run. Still pretty new to all of this backend
>>>> processing.
>>>>
>>>> Dave
>>>
>>> Today while setting it up, I did a test run, and it took 2 hours on
>>> my single core VPS. My corpus is ~40k messages: ~30k spam from the
>>> last 3 months (spammers love me), the rest is ham from the last 4
>>> years. But that was a full run, clean slate. Maybe it doesn't have
>>> to re-scan every message the next time, but I have no clue because I
>>> haven't done it yet.
>>
>> And today the weekly one ran and it took about 4 hours because of all
>> the network tests. I could probably increase the job count to speed
>> that up. But now that I think about it, scanning 40 000 messages at
>> once with network tests enabled is a *lot* of DNS requests. Am I
>> going to get myself banned from the DNSBLs by doing that? Most of
>> them have rate limits on requests for non-paying users.
>>
>
> That's possible. How many "main/primary" DNSBLs have this
> restriction? I am aware of SpamHaus but I have disabled all of the
> others since they weren't that valuable anyway for my environment:
>
> score URIBL_WS_SURBL 0
> score URIBL_PH_SURBL 0
> score URIBL_MW_SURBL 0
> score URIBL_CR_SURBL 0
> score URIBL_ABUSE_SURBL 0
> score URIBL_RHS_DOB 0
> score URIBL_SBL 0
> score URIBL_SBL_A 0
> score URIBL_DBL_SPAM 0
> score URIBL_DBL_PHISH 0
> score URIBL_DBL_MALWARE 0
> score URIBL_DBL_BOTNETCC 0
> score URIBL_DBL_ABUSE_SPAM 0
> score URIBL_DBL_ABUSE_REDIR 0
> score URIBL_DBL_ABUSE_PHISH 0
> score URIBL_DBL_ABUSE_MALW 0
> score URIBL_DBL_ABUSE_BOTCC 0
> score URIBL_DBL_ERROR 0
> score URIBL_BLACK 0
> score URIBL_GREY 0
> score URIBL_RED 0
>
> Senderscore.org has been very good for the past couple of years and
> they don't seem to have limits. I had to setup my own rules:
>
> ifplugin Mail::SpamAssassin::Plugin::DNSEval
>
> header __RCVD_IN_SENDERSCORE_90_100
> eval:check_rbl('senderscore90-lastexternal','score.senderscore.com.','^127\.0\.4\.(9[0-9]|100)$')
> meta RCVD_IN_SENDERSCORE_90_100 SPF_PASS &&
> __RCVD_IN_SENDERSCORE_90_100
> describe RCVD_IN_SENDERSCORE_90_100 Senderscore.org score of 90
> to 100
> score RCVD_IN_SENDERSCORE_90_100 -2.2
> tflags RCVD_IN_SENDERSCORE_90_100 net
>
> header __RCVD_IN_SENDERSCORE_80_89
> eval:check_rbl('senderscorer80-lastexternal','score.senderscore.com.','^127\.0\.4\.(8[0-9])$')
> meta RCVD_IN_SENDERSCORE_80_89 SPF_PASS &&
> __RCVD_IN_SENDERSCORE_80_89
> describe RCVD_IN_SENDERSCORE_80_89 Senderscore.org score of 80
> to 89
> score RCVD_IN_SENDERSCORE_80_89 -1.2
> tflags RCVD_IN_SENDERSCORE_80_89 net
>
> header RCVD_IN_SENDERSCORE_70_79
> eval:check_rbl('senderscorer70-lastexternal','score.senderscore.com.','^127\.0\.4\.(7[0-9])$')
> describe RCVD_IN_SENDERSCORE_70_79 Senderscore.org score of 70
> to 79
> score RCVD_IN_SENDERSCORE_70_79 1.2
> tflags RCVD_IN_SENDERSCORE_70_79 net
>
> header RCVD_IN_SENDERSCORE_60_69
> eval:check_rbl('senderscorer60-lastexternal','score.senderscore.com.','^127\.0\.4\.(6[0-9])$')
> describe RCVD_IN_SENDERSCORE_60_69 Senderscore.org score of 60
> to 69
> score RCVD_IN_SENDERSCORE_60_69 2.2
> tflags RCVD_IN_SENDERSCORE_60_69 net
>
> header RCVD_IN_SENDERSCORE_50_59
> eval:check_rbl('senderscorer50-lastexternal','score.senderscore.com.','^127\.0\.4\.(5[0-9])$')
> describe RCVD_IN_SENDERSCORE_50_59 Senderscore.org score of 50
> to 59
> score RCVD_IN_SENDERSCORE_50_59 3.2
> tflags RCVD_IN_SENDERSCORE_50_59 net
>
> header RCVD_IN_SENDERSCORE_30_49
> eval:check_rbl('senderscorer30-lastexternal','score.senderscore.com.','^127\.0\.4\.([3-4][0-9])$')
> describe RCVD_IN_SENDERSCORE_30_49 Senderscore.org score of 30
> to 49
> score RCVD_IN_SENDERSCORE_30_49 4.2
> tflags RCVD_IN_SENDERSCORE_30_49 net
>
> header RCVD_IN_SENDERSCORE_0_29
> eval:check_rbl('senderscore0-lastexternal','score.senderscore.com.','^127\.0\.4\.([1-2]?[0-9])$')
> describe RCVD_IN_SENDERSCORE_0_29 Senderscore.org score of 0 to 29
> score RCVD_IN_SENDERSCORE_0_29 5.2
> tflags RCVD_IN_SENDERSCORE_0_29 net
>
> endif
>
> How would we go about putting these into the built-in SA rules? Do we
> need to get permission from senderscore.org and start testing them
> with tiny scores?
>
> I have paid rsync feeds for spamhaus-pbl, spamhaus-sbl, spamhaus-xbl,
> and invaluement. Uceprotect.net and SORBS provide a free rsync feed.
>
> Setup a local caching DNS server to help reduce the external queries?
> I just did on my server after you mentioned that.
>
Well, for me, this only matters for masschecks (and only weekly
masschecks have the --net option set). My normal mail volume per day is
small enough that the number of DNS queries isn't an issue since they're
spread out over time. It's just scanning thousands of messages all at
once that concerns me.
If I were to disable most of the DNSBLs for my weekly masschecks, I
might as well just remove the --net argument. But I don't want to
diminish the quality of the masscheck data if I can avoid it. I also
don't want to abuse free DNSBLs with heavy load, however.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
On 06/03/2017 10:47 AM, John Brooks wrote:
> On 2017-06-01 08:14 PM, John Brooks wrote:
>> On 2017-06-01 07:59 PM, David Jones wrote:
>>>
>>>> How do you guarantee all the contributors can perform the checks within
>>>> an hour?
>>> Wow! I just started collecting ham/spam for masscheck back in January
>>> and my (apparently) tiny corpus only takes under a minute to run on a
>>> low end 2 core VM. I didn't realize that there would be some that take
>>> a long time to run. Still pretty new to all of this backend processing.
>>>
>>> Dave
>>
>> Today while setting it up, I did a test run, and it took 2 hours on my
>> single core VPS. My corpus is ~40k messages: ~30k spam from the last 3
>> months (spammers love me), the rest is ham from the last 4 years. But
>> that was a full run, clean slate. Maybe it doesn't have to re-scan
>> every message the next time, but I have no clue because I haven't done
>> it yet.
>
> And today the weekly one ran and it took about 4 hours because of all
> the network tests. I could probably increase the job count to speed that
> up. But now that I think about it, scanning 40 000 messages at once with
> network tests enabled is a *lot* of DNS requests. Am I going to get
> myself banned from the DNSBLs by doing that? Most of them have rate
> limits on requests for non-paying users.
>
That's possible. How many "main/primary" DNSBLs have this restriction?
I am aware of Spamhaus but I have disabled all of the others since they
weren't that valuable anyway for my environment:
score URIBL_WS_SURBL 0
score URIBL_PH_SURBL 0
score URIBL_MW_SURBL 0
score URIBL_CR_SURBL 0
score URIBL_ABUSE_SURBL 0
score URIBL_RHS_DOB 0
score URIBL_SBL 0
score URIBL_SBL_A 0
score URIBL_DBL_SPAM 0
score URIBL_DBL_PHISH 0
score URIBL_DBL_MALWARE 0
score URIBL_DBL_BOTNETCC 0
score URIBL_DBL_ABUSE_SPAM 0
score URIBL_DBL_ABUSE_REDIR 0
score URIBL_DBL_ABUSE_PHISH 0
score URIBL_DBL_ABUSE_MALW 0
score URIBL_DBL_ABUSE_BOTCC 0
score URIBL_DBL_ERROR 0
score URIBL_BLACK 0
score URIBL_GREY 0
score URIBL_RED 0
Senderscore.org has been very good for the past couple of years and they
don't seem to have limits. I had to set up my own rules:
ifplugin Mail::SpamAssassin::Plugin::DNSEval
header __RCVD_IN_SENDERSCORE_90_100 eval:check_rbl('senderscore90-lastexternal','score.senderscore.com.','^127\.0\.4\.(9[0-9]|100)$')
meta RCVD_IN_SENDERSCORE_90_100 SPF_PASS && __RCVD_IN_SENDERSCORE_90_100
describe RCVD_IN_SENDERSCORE_90_100 Senderscore.org score of 90 to 100
score RCVD_IN_SENDERSCORE_90_100 -2.2
tflags RCVD_IN_SENDERSCORE_90_100 net
header __RCVD_IN_SENDERSCORE_80_89 eval:check_rbl('senderscorer80-lastexternal','score.senderscore.com.','^127\.0\.4\.(8[0-9])$')
meta RCVD_IN_SENDERSCORE_80_89 SPF_PASS && __RCVD_IN_SENDERSCORE_80_89
describe RCVD_IN_SENDERSCORE_80_89 Senderscore.org score of 80 to 89
score RCVD_IN_SENDERSCORE_80_89 -1.2
tflags RCVD_IN_SENDERSCORE_80_89 net
header RCVD_IN_SENDERSCORE_70_79 eval:check_rbl('senderscorer70-lastexternal','score.senderscore.com.','^127\.0\.4\.(7[0-9])$')
describe RCVD_IN_SENDERSCORE_70_79 Senderscore.org score of 70 to 79
score RCVD_IN_SENDERSCORE_70_79 1.2
tflags RCVD_IN_SENDERSCORE_70_79 net
header RCVD_IN_SENDERSCORE_60_69 eval:check_rbl('senderscorer60-lastexternal','score.senderscore.com.','^127\.0\.4\.(6[0-9])$')
describe RCVD_IN_SENDERSCORE_60_69 Senderscore.org score of 60 to 69
score RCVD_IN_SENDERSCORE_60_69 2.2
tflags RCVD_IN_SENDERSCORE_60_69 net
header RCVD_IN_SENDERSCORE_50_59 eval:check_rbl('senderscorer50-lastexternal','score.senderscore.com.','^127\.0\.4\.(5[0-9])$')
describe RCVD_IN_SENDERSCORE_50_59 Senderscore.org score of 50 to 59
score RCVD_IN_SENDERSCORE_50_59 3.2
tflags RCVD_IN_SENDERSCORE_50_59 net
header RCVD_IN_SENDERSCORE_30_49 eval:check_rbl('senderscorer30-lastexternal','score.senderscore.com.','^127\.0\.4\.([3-4][0-9])$')
describe RCVD_IN_SENDERSCORE_30_49 Senderscore.org score of 30 to 49
score RCVD_IN_SENDERSCORE_30_49 4.2
tflags RCVD_IN_SENDERSCORE_30_49 net
header RCVD_IN_SENDERSCORE_0_29 eval:check_rbl('senderscore0-lastexternal','score.senderscore.com.','^127\.0\.4\.([1-2]?[0-9])$')
describe RCVD_IN_SENDERSCORE_0_29 Senderscore.org score of 0 to 29
score RCVD_IN_SENDERSCORE_0_29 5.2
tflags RCVD_IN_SENDERSCORE_0_29 net
endif
How would we go about putting these into the built-in SA rules? Do we
need to get permission from senderscore.org and start testing them with
tiny scores?
I have paid rsync feeds for spamhaus-pbl, spamhaus-sbl, spamhaus-xbl,
and invaluement. Uceprotect.net and SORBS provide a free rsync feed.
Set up a local caching DNS server to help reduce the external queries?
I just did on my server after you mentioned that.
--
Dave
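One way to sanity-check the score ranges in the rules above without touching DNS is to run the same regular expressions against sample answers. The sketch below does only that: the function name senderscore_bucket is invented for this example, it reuses the patterns exactly as they appear in the rules, and it says nothing about the SPF_PASS condition that the two meta rules add.

```shell
#!/usr/bin/env bash
# Offline illustration: map a score.senderscore.com answer (127.0.4.N)
# to the score bucket whose regular expression matches it.
# Each entry is "<bucket>:<regex>", with the regexes copied from the rules.
senderscore_bucket() {
    local answer=$1 entry re
    for entry in \
        '90_100:^127\.0\.4\.(9[0-9]|100)$' \
        '80_89:^127\.0\.4\.(8[0-9])$' \
        '70_79:^127\.0\.4\.(7[0-9])$' \
        '60_69:^127\.0\.4\.(6[0-9])$' \
        '50_59:^127\.0\.4\.(5[0-9])$' \
        '30_49:^127\.0\.4\.([3-4][0-9])$' \
        '0_29:^127\.0\.4\.([1-2]?[0-9])$'
    do
        re=${entry#*:}                 # text after the first colon
        if [[ $answer =~ $re ]]; then
            echo "${entry%%:*}"        # text before the first colon
            return 0
        fi
    done
    echo "no-match"
}
```

Feeding it the boundary values (29, 30, 49, 50, 89, 90, 100) is a quick way to confirm that no score falls between two buckets.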
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
On 2017-06-01 08:14 PM, John Brooks wrote:
> On 2017-06-01 07:59 PM, David Jones wrote:
>>
>>> How do you guarantee all the contributors can perform the checks within
>>> an hour?
>> Wow! I just started collecting ham/spam for masscheck back in January
>> and my (apparently) tiny corpus only takes under a minute to run on a
>> low end 2 core VM. I didn't realize that there would be some that take
>> a long time to run. Still pretty new to all of this backend processing.
>>
>> Dave
>
> Today while setting it up, I did a test run, and it took 2 hours on my
> single core VPS. My corpus is ~40k messages: ~30k spam from the last 3
> months (spammers love me), the rest is ham from the last 4 years. But
> that was a full run, clean slate. Maybe it doesn't have to re-scan
> every message the next time, but I have no clue because I haven't done
> it yet.
And today the weekly one ran and it took about 4 hours because of all
the network tests. I could probably increase the job count to speed that
up. But now that I think about it, scanning 40 000 messages at once with
network tests enabled is a *lot* of DNS requests. Am I going to get
myself banned from the DNSBLs by doing that? Most of them have rate
limits on requests for non-paying users.
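To put a rough number on that concern, the load can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope helper: the function name is invented, and the figure of 20 DNS lookups per message is an assumption for illustration, since the real count depends on which network rules are enabled and on how much a local cache absorbs.

```shell
#!/usr/bin/env bash
# Rough DNS load estimate for a --net mass-check run.
#   $1  messages scanned
#   $2  DNS lookups per message (assumed value, not measured)
#   $3  run time in seconds
# Prints "<total queries> <average queries per second>" (integer division).
dns_load_estimate() {
    local messages=$1 lookups_per_msg=$2 seconds=$3
    local total=$((messages * lookups_per_msg))
    echo "$total $((total / seconds))"
}

# 40 000 messages, a hypothetical 20 lookups each, over a 4-hour run:
dns_load_estimate 40000 20 $((4 * 3600))    # prints "800000 55"
```

Under those assumptions the run averages about 55 queries per second, spread across all the lists being queried; comparing the per-list share against each list's published free-use limit would show whether a weekly run is a problem.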
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
On 2017-06-01 07:59 PM, David Jones wrote:
>
>> How do you guarantee all the contributors can perform the checks within
>> an hour?
> Wow! I just started collecting ham/spam for masscheck back in January
> and my (apparently) tiny corpus only takes under a minute to run on a
> low end 2 core VM. I didn't realize that there would be some that take
> a long time to run. Still pretty new to all of this backend processing.
>
> Dave
Today while setting it up, I did a test run, and it took 2 hours on my
single core VPS. My corpus is ~40k messages: ~30k spam from the last 3
months (spammers love me), the rest is ham from the last 4 years. But
that was a full run, clean slate. Maybe it doesn't have to re-scan every
message the next time, but I have no clue because I haven't done it yet.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
>From: John Hardin <jh...@impsec.org>
>>On Thu, 1 Jun 2017, David Jones wrote:
>> I am working pretty hard to get the ruleqa processing going
>> again on our new server. We are so close to having enough
>> contributors and ham/spam to get some new rules generated:
>> This is from the run minutes ago:
>>
>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>
>> We need to recruit some more masscheck'ers to get over the hump so I can
>> do some final testing of the rules updates and start the DNS updates
>> again for sa-update.
>I upload my corpora, not my results - I suppose that my corpora didn't
>survive the migration, and I haven't yet brought my submission bot and
>rsync account up-to-date for the new hardware - my apologies. I may be
>able to give that some cycles this weekend.
So, about that. I just started helping as a sysadmin a month or so ago. We
had some hosting issues 3 months ago that we are still trying to recover from,
with very little documentation of the infrastructure. I am having to dig
through logs, cron output, and old (outdated) documentation to try to put
the puzzle back together again.
If anyone has any knowledge or documentation of how things were set up
in the past, I would love to talk with you.
We do have backups from one of the servers, but I think there were two
or three servers before, because I can't find any evidence of where
buildbot was running, and I think buildbot was running the centralized
masscheck.
>However, I don't recall seeing confirmation that the central masscheck was
>actually working; can you confirm that? Or do I need to change over to
>local masscheck and uploading results like most others do?
I have not found enough details yet on the central masscheck so I have
started with getting the remote masscheck processing working first so
we can get sa-update going again.
> P.S. After spending the past month learning how this works, I have some
> ideas on how to make the nightly masschecks become hourly fairly easily
> so we can test and promote rule changes faster.
>How do you guarantee all the contributors can perform the checks within
>an hour?
Wow! I just started collecting ham/spam for masscheck back in January
and my (apparently) tiny corpus only takes under a minute to run on a
low end 2 core VM. I didn't realize that there would be some that take
a long time to run. Still pretty new to all of this backend processing.
Dave
Re: Ruleqa masscheck so close.
Posted by John Hardin <jh...@impsec.org>.
On Thu, 1 Jun 2017, David Jones wrote:
> I am working pretty hard to get the ruleqa processing going again on our new server. We are so close to having enough contributors and ham/spam to get some new rules generated: This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can
> do some final testing of the rules updates and start the DNS updates
> again for sa-update.
I upload my corpora, not my results - I suppose that my corpora didn't
survive the migration, and I haven't yet brought my submission bot and
rsync account up-to-date for the new hardware - my apologies. I may be
able to give that some cycles this weekend.
However, I don't recall seeing confirmation that the central masscheck was
actually working; can you confirm that? Or do I need to change over to
local masscheck and uploading results like most others do?
> P.S. After spending the past month learning how this works, I have some
> ideas on how to make the nightly masschecks become hourly fairly easily
> so we can test and promote rule changes faster.
How do you guarantee all the contributors can perform the checks within an
hour?
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Journalism is about covering important stories.
With a pillow, until they stop moving. -- David Burge
-----------------------------------------------------------------------
5 days until the 73rd anniversary of D-Day
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
>From: Kevin Golding <kp...@caomhin.org>
>On Thu, 01 Jun 2017 13:45:26 +0100, David Jones <dj...@ena.com.invalid> wrote:
>> Why do you think it was pointless?
>Because I got a daily email telling me that rsync failed while the system
>was offline. Running a masscheck every day for no purpose seemed a little
>pointless. I thought I'd suspend it until the system was back online. If
>that's a problem I apologise, it wasn't made clear that I should keep
>donating resources during a period when the system was offline.
Sure. I completely understand when the server was offline. Sounds like
we need a little enhancement in the automasscheck-minimal.sh script to
detect when the rsync fails and not waste processing resources.
>> I have an idea that will allow masscheckers to cron the automasscheck.sh
>> script hourly which would only run the full masscheck when they detect a
>> new tagged ruleset to work with. Basically it would do a quick rsync of
>> the latest tagged build dir like it does today but if there are no rsync
>> changes, it would simply exit.
>Presumably there would also be a 24hr window that meant even if no rules
>were updated we would rescore after that period to have recent score
>adjustments? That would seem more effective than maintaining the morning
>run and throwing in an additional one as needed since we could run checks
>an hour before the morning run.
Yes. We would still keep the daily tagged build of rules that we have today,
so the existing 24 hour processing would work just like it does today for those
who don't want to go to the hourly. Keep in mind, this wouldn't mean you
would need to run a full masscheck every hour for nothing. The script would run
hourly, "phone home", then exit if there was nothing new to masscheck against.
Even the current nightly masscheck would have benefited from this logic while
the server was down, since it would not have wasted resources.
>It would logically also require an amendment to suggest running sa-update
>hourly instead of daily too.
Sure. Fair point. With sa-update running "randomly" all over the Internet
from different locations and time zones, there could be some not getting
updates for up to 48 hours even when everything is running perfectly. The
average update around the world would go from 24 hours down to 12 hours
for any hourly updates assuming we could get enough masscheckers to go
hourly.
I am still in the planning stages of this after sorting through all of the scripts
so I am definitely open to ideas and suggestions like this. The idea is that
when we find an issue like the recent one with Yahoo changing their message ID
format (see FORGED_MUA_MOZILLA & FORGED_YAHOO_RCVD thread), the fix
could go out in hours instead of days.
>As a sidenote, not all of us use the main automasscheck.sh script so
>depending on how the changes are rolled out I can't promise an
>uninterrupted supply of masscheck data.
Thanks for that feedback. My goal is to add the hourly functionality
without changing the current directory structure or timing of cron jobs
so this would not impact the existing masscheck submissions.
I am still working on getting the current masscheck processing finished
up and we are probably months away from the hourly stuff so I will take
things slowly and try to fully understand things before adding the hourly
logic. I will test on my own masscheck processing for a while first.
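The hourly "phone home" idea, together with the rsync-failure check and the 24 hour refresh window discussed above, boils down to one small decision. The sketch below is only an illustration of that decision, not code from automasscheck-minimal.sh: the function name, its arguments, and the result strings are all hypothetical, and the caller is assumed to supply the rsync exit status and the revision markers.

```shell
#!/usr/bin/env bash
# Hypothetical decision helper for an hourly cron wrapper.
#   $1  exit status of the rsync that fetched the tagged build dir
#   $2  revision tag seen at the previous full run
#   $3  revision tag seen now
#   $4  epoch seconds of the previous full run
#   $5  current epoch seconds
masscheck_decision() {
    local rsync_status=$1 old_rev=$2 new_rev=$3 last_run=$4 now=$5
    local max_age=$((24 * 60 * 60))    # refresh at least once a day

    if [ "$rsync_status" -ne 0 ]; then
        echo "skip-server-unreachable"  # do not burn CPU while the server is down
    elif [ "$old_rev" != "$new_rev" ]; then
        echo "run-new-ruleset"
    elif [ $((now - last_run)) -ge "$max_age" ]; then
        echo "run-daily-refresh"
    else
        echo "skip-nothing-new"
    fi
}
```

A wrapper would run the existing rsync step, pass its exit status and the stored revision and timestamp to this function, and only start the full masscheck when the answer begins with "run". Checking the rsync status first is what keeps contributors from scanning their corpus while the server is offline.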
Re: Ruleqa masscheck so close.
Posted by Kevin Golding <kp...@caomhin.org>.
On Thu, 01 Jun 2017 13:45:26 +0100, David Jones <dj...@ena.com.invalid>
wrote:
> Why do you think it was pointless?
Because I got a daily email telling me that rsync failed while the system
was offline. Running a masscheck every day for no purpose seemed a little
pointless. I thought I'd suspend it until the system was back online. If
that's a problem I apologise, it wasn't made clear that I should keep
donating resources during a period when the system was offline.
> I have an idea that will allow masscheckers to cron the automasscheck.sh
> script hourly which would only run the full masscheck when they detect a
> new tagged ruleset to work with. Basically it would do a quick rsync of
> the
> latest tagged build dir like it does today but if there are no rsync
> changes,
> it would simply exit.
Presumably there would also be a 24hr window that meant even if no rules
were updated we would rescore after that period to have recent score
adjustments? That would seem more effective than maintaining the morning
run and throwing in an additional one as needed since we could run checks
an hour before the morning run.
It would logically also require an amendment to suggest running sa-update
hourly instead of daily too.
As a sidenote, not all of us use the main automasscheck.sh script so
depending on how the changes are rolled out I can't promise an
uninterrupted supply of masscheck data.
Re: problem with setting up masscheck
Posted by Marcin Mirosław <ma...@mejor.pl>.
On 2017-06-02 at 23:23, Dave Jones wrote:
> On 06/02/2017 02:47 AM, marcin@mejor.pl wrote:
>> On 01.06.2017 at 16:48, David Jones wrote:
>>
>> Hi David, hi all!
>>
>>> From: marcin@mejor.pl <ma...@mejor.pl>
>>>> On 01.06.2017 at 14:57, David Jones wrote:
>>>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>>>
>>>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>>>
>>>
>>>> I have some troubles with setting up masscheck. Here is what I set
>>>> in .automasscheck.cf :
>>>> LOGPREFIX="YOUR-USERNAME"
>>>> RSYNC_USERNAME="YOUR-USERNAME"
>>>> RSYNC_PASSWORD="YOUR-PASSWORD"
>>>
>>> Replace "YOUR-USERNAME" with your own rsync username of "mmiroslaw"
>>> in both places.
>>> Same for "YOUR-PASSWORD" with your own personal rsync password. If
>>> you don't
>>> remember your rsync password, I can send it to you off list.
>>
>> I've got my password but I didn't put it into configuration because I
>> wanted to test if everything works correctly.
>>
>>>> WORKDIR=~/sa-masscheck/tmp
>>>> JOBS=1
>>>> TRUSTED_NETWORKS=
>>>> INTERNAL_NETWORKS=
>>>> run_all_masschecks() {
>>>> run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>>>> spam:dir:/dane/spam/20170601/
>>>> }
>>>
>>>> And this is what I got from script bash -x automasscheck-minimal.sh:
>>>> [...]
>>>> open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>>>> failed: No such file or directory at ./mass-check line 617.
>>>> + LOGLIST='
>>>> ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>>>> spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>>>> + set +x
>>>
>>>> name of log file is wrong.
>>>
>>> Did you use the updated automasscheck-minimal.sh recently updated in
>>> the past few days
>>> or is this an older existing version that worked in the past?
>>
>>
>> I'm on r1797310, today i'm getting:
>>
>> + run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/
>> spam:dir:/dane/spam/20170601/
>> + CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
>> + shift
>> + [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
>> + LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
>> + LOGNAME=mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> + rm -f ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> + set -x
>> + ./mass-check
>> --hamlog=ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> --spamlog=spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1
>> --progress spam:dir:/dane/spam/20170601/
>> status: starting scan stage now:
>> 2017-06-02 09:41:49
>> status: completed scan stage, 23 messages now:
>> 2017-06-02 09:41:49
>> status: starting run stage now:
>> 2017-06-02 09:41:49
>> open of ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed:
>> No such file or directory at ./mass-check line 617.
>> + LOGLIST=' ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>> + set +x
>> rsync -qPcvz ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
>> mmiroslaw@rsync.spamassassin.org::corpus/
>> The source and destination cannot both be remote.
>> rsync error: syntax or usage error (code 1) at main.c(1274)
>> [Receiver=3.1.2]
>> ^C
>>
>> Why are the log file names being built with slashes? Slashes are not
>> allowed in file names.
>>
>> Marcin
>>
>
> In the .automasscheck.cf around line 53 there should be something like
> this:
>
> run_all_masschecks() {
> ### sample: single corpus ###
> run_masscheck single-corpus \
>
> Did you remove the 'single-corpus' from the run_masscheck argument? From
> what I can tell in your bash -x output (thanks for that by the way), it
> looks like the first argument to the run_masscheck function is
> 'ham:dir:/dane/spam/HAM_DOUSUNIECIA/' and it probably should be
> 'single-corpus'.
>
> Mine looks like this:
>
> run_all_masschecks() {
> ### sample: single corpus ###
> run_masscheck single-corpus \
> ham:dir:$MAILDIR/.Ham/ \
> spam:dir:$MAILDIR/.Spam/
>
> I have a wrapper script that sets the $MAILDIR then calls the
> automasscheck-minimal.sh script to do a couple of things before and
> afterwards for reporting and emailing the output.
Bingo!
I cleaned up the conf file and removed too much. I don't know why I removed
single-corpus. Now it works! I'll prepare the full configuration and will
send masscheck results after the weekend.
Thank you!
Marcin
Re: problem with setting up masscheck
Posted by Dave Jones <da...@apache.org>.
On 06/02/2017 02:47 AM, marcin@mejor.pl wrote:
> On 01.06.2017 at 16:48, David Jones wrote:
>
> Hi David, hi all!
>
>> From: marcin@mejor.pl <ma...@mejor.pl>
>>
>>> On 01.06.2017 at 14:57, David Jones wrote:
>>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>>
>>>>
>>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>>
>>
>>> I have some troubles with setting up masscheck. Here is what I set in .automasscheck.cf :
>>> LOGPREFIX="YOUR-USERNAME"
>>> RSYNC_USERNAME="YOUR-USERNAME"
>>> RSYNC_PASSWORD="YOUR-PASSWORD"
>>
>> Replace "YOUR-USERNAME" with your own rsync username of "mmiroslaw" in both places.
>> Same for "YOUR-PASSWORD" with your own personal rsync password. If you don't
>> remember your rsync password, I can send it to you off list.
>
> I've got my password but I didn't put it into configuration because I wanted to test if everything works correctly.
>
>
>>> WORKDIR=~/sa-masscheck/tmp
>>> JOBS=1
>>> TRUSTED_NETWORKS=
>>> INTERNAL_NETWORKS=
>>> run_all_masschecks() {
>>> run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>>> spam:dir:/dane/spam/20170601/
>>> }
>>
>>> And this is what I got from script bash -x automasscheck-minimal.sh:
>>> [...]
>>> open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
>>> + LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>>> + set +x
>>
>>> name of log file is wrong.
>>
>> Did you use the updated automasscheck-minimal.sh recently updated in the past few days
>> or is this an older existing version that worked in the past?
>
>
> I'm on r1797310, today i'm getting:
>
> + run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ spam:dir:/dane/spam/20170601/
> + CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
> + shift
> + [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
> + LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
> + LOGNAME=mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
> + rm -f ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
> + set -x
> + ./mass-check --hamlog=ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log --spamlog=spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 --progress spam:dir:/dane/spam/20170601/
> status: starting scan stage now: 2017-06-02 09:41:49
> status: completed scan stage, 23 messages now: 2017-06-02 09:41:49
> status: starting run stage now: 2017-06-02 09:41:49
> open of ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
> + LOGLIST=' ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
> + set +x
> rsync -qPcvz ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log mmiroslaw@rsync.spamassassin.org::corpus/
> The source and destination cannot both be remote.
> rsync error: syntax or usage error (code 1) at main.c(1274) [Receiver=3.1.2]
> ^C
>
> Why are the log file names being built with slashes? Slashes are not allowed in file names.
>
> Marcin
>
In the .automasscheck.cf around line 53 there should be something like this:

  run_all_masschecks() {
  ### sample: single corpus ###
  run_masscheck single-corpus \

Did you remove the 'single-corpus' from the run_masscheck argument?
From what I can tell in your bash -x output (thanks for that by the
way), it looks like the first argument to the run_masscheck function is
'ham:dir:/dane/spam/HAM_DOUSUNIECIA/' and it probably should be
'single-corpus'.

Mine looks like this:

  run_all_masschecks() {
  ### sample: single corpus ###
  run_masscheck single-corpus \
  ham:dir:$MAILDIR/.Ham/ \
  spam:dir:$MAILDIR/.Spam/

I have a wrapper script that sets the $MAILDIR then calls the
automasscheck-minimal.sh script to do a couple of things before and
afterwards for reporting and emailing the output.
Dave
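To illustrate why the missing 'single-corpus' argument produces log names containing slashes, here is a minimal sketch of the naming step. It is reconstructed only from the bash -x trace quoted in this thread and is not copied from automasscheck-minimal.sh; the function name build_log_names is hypothetical, and the empty suffix for 'single-corpus' is an assumption.

```shell
#!/bin/sh
# Hypothetical reconstruction of the log-naming step seen in the bash -x trace.
# $1 = LOGPREFIX from .automasscheck.cf, $2 = first argument to run_masscheck.
# The function body runs in a subshell so its variables stay local.
build_log_names() (
  logprefix="$1"
  corpusname="$2"
  if [ "$corpusname" = "single-corpus" ]; then
    logsuffix=""               # assumption: single-corpus adds no suffix
  else
    logsuffix="-$corpusname"   # matches LOGSUFFIX=-ham:dir:/... in the trace
  fi
  logname="${logprefix}${logsuffix}.log"
  printf '%s\n' "ham-$logname" "spam-$logname"
)

# With the corpus name present, the log names are plain file names.
build_log_names mmiroslaw single-corpus
# Without it, the ham target is taken as the corpus name, slashes included.
build_log_names mmiroslaw 'ham:dir:/dane/spam/HAM_DOUSUNIECIA/'
```

In the second call the ham target is consumed as the corpus name, so the resulting names point into a directory path that does not exist, which is consistent with the "No such file or directory" error in the trace.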
Re: problem with setting up masscheck (was: Ruleqa masscheck so close.)
Posted by "marcin@mejor.pl" <ma...@mejor.pl>.
On 01.06.2017 at 16:48, David Jones wrote:
Hi David, hi all!
> From: marcin@mejor.pl <ma...@mejor.pl>
>
>> On 01.06.2017 at 14:57, David Jones wrote:
>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>
>>>
>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>
>
>> I am having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>> LOGPREFIX="YOUR-USERNAME"
>> RSYNC_USERNAME="YOUR-USERNAME"
>> RSYNC_PASSWORD="YOUR-PASSWORD"
>
> Replace "YOUR-USERNAME" with your own rsync username, "mmiroslaw", in both places.
> Do the same for "YOUR-PASSWORD" with your own personal rsync password. If you don't
> remember your rsync password, I can send it to you off list.
I have my password, but I didn't put it into the configuration because I first wanted to test whether everything works correctly.
>> WORKDIR=~/sa-masscheck/tmp
>> JOBS=1
>> TRUSTED_NETWORKS=
>> INTERNAL_NETWORKS=
>> run_all_masschecks() {
>> run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
>> spam:dir:/dane/spam/20170601/
>> }
>
>> And this is what I got from running bash -x automasscheck-minimal.sh:
>> [...]
>> open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
>> + LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>> + set +x
>
>> The name of the log file is wrong.
>
> Did you use the automasscheck-minimal.sh that was updated in the past few days,
> or is this an older existing version that worked in the past?
I'm on r1797310; today I'm getting:
+ run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ spam:dir:/dane/spam/20170601/
+ CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ shift
+ [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
+ LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ LOGNAME=mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ rm -f ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ set -x
+ ./mass-check --hamlog=ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log --spamlog=spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 --progress spam:dir:/dane/spam/20170601/
status: starting scan stage now: 2017-06-02 09:41:49
status: completed scan stage, 23 messages now: 2017-06-02 09:41:49
status: starting run stage now: 2017-06-02 09:41:49
open of ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
+ LOGLIST=' ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
+ set +x
rsync -qPcvz ham-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-mmiroslaw-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log mmiroslaw@rsync.spamassassin.org::corpus/
The source and destination cannot both be remote.
rsync error: syntax or usage error (code 1) at main.c(1274) [Receiver=3.1.2]
^C
Why are the log file names being built with slashes? Slashes are not allowed in file names.
Marcin
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
On 06/01/2017 10:48 AM, David Jones wrote:
> From: marcin@mejor.pl <ma...@mejor.pl>
>
>> On 01.06.2017 at 14:57, David Jones wrote:
>>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>>
>>>
>>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>
>> I am having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>> LOGPREFIX="YOUR-USERNAME"
>> RSYNC_USERNAME="YOUR-USERNAME"
>> RSYNC_PASSWORD="YOUR-PASSWORD"
> Replace "YOUR-USERNAME" with your own rsync username, "mmiroslaw", in both places.
> Do the same for "YOUR-PASSWORD" with your own personal rsync password. If you don't
> remember your rsync password, I can send it to you off list.
Can you send me mine? I can't seem to authenticate with the username
(jbrooks) and password I was given originally.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
From: marcin@mejor.pl <ma...@mejor.pl>
>On 01.06.2017 at 14:57, David Jones wrote:
>>> From: Kevin A. McGrail <ke...@mcgrail.com>
>>
>>
>> https://wiki.apache.org/spamassassin/NightlyMassCheck
>I am having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
>LOGPREFIX="YOUR-USERNAME"
>RSYNC_USERNAME="YOUR-USERNAME"
>RSYNC_PASSWORD="YOUR-PASSWORD"
Replace "YOUR-USERNAME" with your own rsync username, "mmiroslaw", in both places.
Do the same for "YOUR-PASSWORD" with your own personal rsync password. If you don't
remember your rsync password, I can send it to you off list.
>WORKDIR=~/sa-masscheck/tmp
>JOBS=1
>TRUSTED_NETWORKS=
>INTERNAL_NETWORKS=
>run_all_masschecks() {
> run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
> spam:dir:/dane/spam/20170601/
>}
>And this is what I got from running bash -x automasscheck-minimal.sh:
>[...]
>open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
>+ LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
>+ set +x
>The name of the log file is wrong.
Did you use the automasscheck-minimal.sh that was updated in the past few days,
or is this an older existing version that worked in the past?
Dave
Re: Ruleqa masscheck so close.
Posted by "marcin@mejor.pl" <ma...@mejor.pl>.
On 01.06.2017 at 14:57, David Jones wrote:
>> From: Kevin A. McGrail <ke...@mcgrail.com>
>
>> On 6/1/2017 8:45 AM, David Jones wrote:
>>>> I disabled my masscheck after a while because... well, it was pointless.
>>>> It's past the cron window for the day but if you need the data I can run
>>>> it manually, else it'll kick in again tomorrow.
>>> Why do you think it was pointless?
>
>> I think he means while the server was offline.
>
> Based on the last submissions I see on the server, they were in
> 2015 so we would love to have him and others "come back"... :)
>
> Also, I have updated the automasscheck-minimal.sh script slightly if
> anyone would like to update theirs and give some feedback. I have
> been running it for the past few days just fine.
>
> https://wiki.apache.org/spamassassin/NightlyMassCheck
I am having some trouble setting up masscheck. Here is what I set in .automasscheck.cf:
LOGPREFIX="YOUR-USERNAME"
RSYNC_USERNAME="YOUR-USERNAME"
RSYNC_PASSWORD="YOUR-PASSWORD"
WORKDIR=~/sa-masscheck/tmp
JOBS=1
TRUSTED_NETWORKS=
INTERNAL_NETWORKS=
run_all_masschecks() {
run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ \
spam:dir:/dane/spam/20170601/
}
And this is what I got from running bash -x automasscheck-minimal.sh:
[...]
+ run_all_masschecks
+ run_masscheck ham:dir:/dane/spam/HAM_DOUSUNIECIA/ spam:dir:/dane/spam/20170601/
+ CORPUSNAME=ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ shift
+ [[ ham:dir:/dane/spam/HAM_DOUSUNIECIA/ == \s\i\n\g\l\e\-\c\o\r\p\u\s ]]
+ LOGSUFFIX=-ham:dir:/dane/spam/HAM_DOUSUNIECIA/
+ LOGNAME=YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ rm -f ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log
+ set -x
+ ./mass-check --hamlog=ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log --spamlog=spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log -j 1 --progress spam:dir:/dane/spam/20170601/
status: starting scan stage now: 2017-06-01 15:31:47
status: completed scan stage, 21 messages now: 2017-06-01 15:31:47
status: starting run stage now: 2017-06-01 15:31:47
open of ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log failed: No such file or directory at ./mass-check line 617.
+ LOGLIST=' ham-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log spam-YOUR-USERNAME-ham:dir:/dane/spam/HAM_DOUSUNIECIA/.log'
+ set +x
The name of the log file is wrong.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
From: Kevin Golding <kp...@caomhin.org>
>On Thu, 01 Jun 2017 13:57:47 +0100, David Jones <dj...@ena.com.invalid> wrote:
>> From: Kevin A. McGrail <ke...@mcgrail.com>
>
>> On 6/1/2017 8:45 AM, David Jones wrote:
>>>> I disabled my masscheck after a while because... well, it was
>>>> pointless.
>>>> It's past the cron window for the day but if you need the data I
>>>> can run
>>>> it manually, else it'll kick in again tomorrow.
>>> Why do you think it was pointless?
>
>> I think he means while the server was offline.
>
> Based on the last submissions I see on the server, they were in
> 2015 so we would love to have him and others "come back"... :)
Sorry. That reply about 2015 was meant for marcin@mejor.pl, not Kevin.
Thanks marcin@mejor.pl for starting up your masschecks again.
>Apparently I started in 2010 and kept going until the system failure
>earlier this year.
Thanks Kevin for running the masschecks for so long and starting up again.
Dave
Re: Ruleqa masscheck so close.
Posted by Kevin Golding <kp...@caomhin.org>.
On Thu, 01 Jun 2017 13:57:47 +0100, David Jones <dj...@ena.com.invalid>
wrote:
>> From: Kevin A. McGrail <ke...@mcgrail.com>
>
>> On 6/1/2017 8:45 AM, David Jones wrote:
>>>> I disabled my masscheck after a while because... well, it was
>>>> pointless.
>>>> It's past the cron window for the day but if you need the data I
>>>> can run
>>>> it manually, else it'll kick in again tomorrow.
>>> Why do you think it was pointless?
>
>> I think he means while the server was offline.
>
> Based on the last submissions I see on the server, they were in
> 2015 so we would love to have him and others "come back"... :)
Apparently I started in 2010 and kept going until the system failure
earlier this year.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
>From: Kevin A. McGrail <ke...@mcgrail.com>
>On 6/1/2017 8:45 AM, David Jones wrote:
>>> I disabled my masscheck after a while because... well, it was pointless.
>>> It's past the cron window for the day but if you need the data I can run
>>> it manually, else it'll kick in again tomorrow.
>> Why do you think it was pointless?
>I think he means while the server was offline.
Based on the last submissions I see on the server, they were in
2015 so we would love to have him and others "come back"... :)
Also, I have updated the automasscheck-minimal.sh script slightly if
anyone would like to update theirs and give some feedback. I have
been running it for the past few days just fine.
https://wiki.apache.org/spamassassin/NightlyMassCheck
Dave
Re: Ruleqa masscheck so close.
Posted by "Kevin A. McGrail" <ke...@mcgrail.com>.
On 6/1/2017 8:45 AM, David Jones wrote:
>> I disabled my masscheck after a while because... well, it was pointless.
>> It's past the cron window for the day but if you need the data I can run
>> it manually, else it'll kick in again tomorrow.
> Why do you think it was pointless?
I think he means while the server was offline.
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
From: Kevin Golding <kp...@caomhin.org>
>On Thu, 01 Jun 2017 03:52:42 +0100, David Jones <dj...@ena.com>
>wrote:
>> I am working pretty hard to get the ruleqa processing going again on our
>> new server. We are so close to having enough contributors and ham/spam
>> to get some new rules generated: This is from the run minutes ago:
>>
>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>
>> We need to recruit some more masscheck'ers to get over the hump so I can
>> do some final testing of the rules updates and start the DNS updates
>> again for sa-update.
>I disabled my masscheck after a while because... well, it was pointless.
>It's past the cron window for the day but if you need the data I can run
>it manually, else it'll kick in again tomorrow.
Why do you think it was pointless? This does a couple of things:
1. It provides needed feedback on rules before they can be published to
the Internet via sa-update.
2. It adjusts 72_scores.cf based on recent spam/ham, which benefits
everyone around the world who uses SpamAssassin and runs sa-update regularly.
>> P.S. After spending the past month learning how this works, I have some
>> ideas on how to make the nightly masschecks become hourly fairly easily
>> so we can test and promote rule changes faster.
>You may need to explain the requirements for that. Are you asking for
>hourly masscheck submissions?
Today, the delay of up to 24 hours makes feedback and score updates
pretty slow. I don't think this will ever be intended to update quickly
enough to help with zero-hour spam or to replace technologies that react
quickly, like RBLs, DCC, Pyzor, etc.
I have an idea that would allow masscheckers to cron the automasscheck.sh
script hourly; it would only run the full masscheck when it detects a
new tagged ruleset to work with. Basically, it would do a quick rsync of the
latest tagged build dir as it does today, but if there are no rsync changes,
it would simply exit.
Everyone would still keep sorting ham/spam as they do today so there
would be no real change in that. Hopefully everyone is sorting at least
every other day or every third day. I try to sort some every day since I
also have this tied to local Bayes training to make this work a little more
worth the time and effort.
Dave
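The "run hourly, exit if nothing changed" idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of automasscheck.sh: it assumes the latest tagged build dir has already been rsync'd to a local directory, and it detects changes by comparing a checksum fingerprint with a saved stamp file. The function name ruleset_changed and the demo paths are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: report whether a staged ruleset directory has changed
# since the previous run. $1 = local copy of the tagged build dir,
# $2 = stamp file kept outside that directory.
# The function body runs in a subshell, so "exit" only sets its return status.
ruleset_changed() (
  stage_dir="$1"
  stamp_file="$2"
  # Fingerprint: checksum over the sorted per-file checksums.
  new="$(cd "$stage_dir" && find . -type f -exec cksum {} + | sort | cksum)"
  old="$(cat "$stamp_file" 2>/dev/null)"
  [ "$new" = "$old" ] && exit 1     # unchanged: the caller can simply exit
  printf '%s\n' "$new" > "$stamp_file"
  exit 0                            # changed: the caller runs the full masscheck
)

# Demo with throwaway directories standing in for the rsync'd build dir.
stage="$(mktemp -d)"
stamp="$(mktemp -d)/stamp"
echo 'rule v1' > "$stage/rules.cf"
ruleset_changed "$stage" "$stamp" && echo "first run: new ruleset, run masscheck"
ruleset_changed "$stage" "$stamp" || echo "second run: no change, exit early"
```

An hourly cron job would call such a check right after the rsync step and skip the expensive mass-check run whenever it reports no change.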
Re: Ruleqa masscheck so close.
Posted by Kevin Golding <kp...@caomhin.org>.
On Thu, 01 Jun 2017 03:52:42 +0100, David Jones <dj...@ena.com.invalid>
wrote:
> I am working pretty hard to get the ruleqa processing going again on our
> new server. We are so close to having enough contributors and ham/spam
> to get some new rules generated: This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can
> do some final testing of the rules updates and start the DNS updates
> again for sa-update.
I disabled my masscheck after a while because... well, it was pointless.
It's passed the cron window for the day but if you need the data I can run
it manually, else it'll kick in again tomorrow.
> P.S. After spending the past month learning how this works, I have some
> ideas on how to make the nightly masschecks become hourly fairly easily
> so we can test and promote rule changes faster.
You may need to explain the requirements for that. Are you asking for
hourly masscheck submissions?
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
From: marcin@mejor.pl <ma...@mejor.pl>
>On 01.06.2017 at 04:52, David Jones wrote:
>> I am working pretty hard to get the ruleqa processing going again
>> on our new server. We are so close to having enough contributors
>> and ham/spam to get some new rules generated: This is from the run minutes ago:
>>
>> HAM CONTRIBUTORS FOUND: 9 (required 10)
>> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>>
>> We need to recruit some more masscheck'ers to get over the hump
>>so I can do some final testing of the rules updates and start the DNS
>>updates again for sa-update.
>Hi!
>Does my account "mmiroslaw" still exist?
Yes. If you are still manually sorting ham and spam, please enable your
cron job to run just after 9:00 AM UTC. This would help us a lot.
Thanks,
Dave
Re: Ruleqa masscheck so close.
Posted by "marcin@mejor.pl" <ma...@mejor.pl>.
On 01.06.2017 at 04:52, David Jones wrote:
> I am working pretty hard to get the ruleqa processing going again on our new server. We are so close to having enough contributors and ham/spam to get some new rules generated: This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can do some final testing of the rules updates and start the DNS updates again for sa-update.
Hi!
Does my account "mmiroslaw" still exist?
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
>From: John Brooks <jo...@fastquake.com>
>On 06/01/2017 02:05 PM, David Jones wrote:
>>
>> P.S. Based on some documentation I saw on the wiki, I have been moving
>> ham and spam older than 90 days into an archive folder. But now that
>> I see the ham goes back 7 years, I guess I need to keep my ham in my
>> masscheck ham folder longer.
>>
>> Dave
>Where did you read that? This page says 6 years for ham:
>https://wiki.apache.org/spamassassin/CorpusCleaning
I confused two different issues. I am also training my Bayes DB with
the same sorted folders so I was limiting my Bayes DB training to 90
days. I guess I can train my Bayes DB based on the same time periods
that the masscheck uses.
Dave
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
On 06/01/2017 02:05 PM, David Jones wrote:
>
> P.S. Based on some documentation I saw on the wiki, I have been moving
> ham and spam older than 90 days into an archive folder. But now that
> I see the ham goes back 7 years, I guess I need to keep my ham in my
> masscheck ham folder longer.
>
> Dave
Where did you read that? This page says 6 years for ham:
https://wiki.apache.org/spamassassin/CorpusCleaning
Re: Ruleqa masscheck so close.
Posted by David Jones <dj...@ena.com.INVALID>.
>From: John Brooks <jo...@fastquake.com>
>I'm now setting up my masscheck again (it wasn't running properly before
>and I didn't bother fixing it when the server went down for months).
>It's just me on my mail server, and I don't get that much mail; maybe
>5-10 ham/a few hundred spam per day, not counting mailing lists which I
>don't include in my scans. So I was going to do weekly runs instead of
>nightly.
>Would it be more useful to the project if I ran it nightly, despite the
>low volume?
>John
Yes. Please run it nightly even if you don't have a high volume of mail
or if you don't sort the ham/spam often. The current logic that I found
while trying to get this rebuilt on the new server is the following:
10 minimum contributors on the latest "sa-update" tagged revision
AND
150,000 ham combined minimum over the past 84 months
AND
150,000 spam combined minimum over the past 2 months
This is all based on the latest tagged "sa-update" in this link:
https://svn.apache.org/viewvc/spamassassin/tags/?sortby=date#dirlist
The rsync stage dir with the latest "sa-update" tagged version is set up
shortly before 9:00 AM UTC, so it works best if automasscheck-minimal.sh
is cron'd to run a few minutes or so after the top of the hour. Technically
it can be run anytime after 9:00 AM UTC for the next ~17 hours, but if we
could get enough contributors to run in that first hour, we could potentially
speed up the sa-update process quite a bit without having to wait most of the
day as we do now.
P.S. Based on some documentation I saw on the wiki, I have been moving
ham and spam older than 90 days into an archive folder. But now that
I see the ham goes back 7 years, I guess I need to keep my ham in my
masscheck ham folder longer.
Dave
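The three conditions listed above can be written as a single check. This is a minimal sketch of the logic as described in this message, not the actual ruleqa server code; the function name enough_for_rules is hypothetical, and the counts passed in the demo are made-up values for illustration.

```shell
#!/bin/sh
# Hypothetical check of the thresholds described above.
# $1 = contributors on the latest tagged revision
# $2 = combined ham count over the past 84 months
# $3 = combined spam count over the past 2 months
enough_for_rules() (
  contributors="$1"
  ham="$2"
  spam="$3"
  # All three conditions must hold (the AND in the description).
  [ "$contributors" -ge 10 ] && [ "$ham" -ge 150000 ] && [ "$spam" -ge 150000 ]
)

# Demo: 9 contributors is one short, whatever the message counts are.
if enough_for_rules 9 200000 200000; then
  echo "thresholds met"
else
  echo "thresholds not met"
fi
```

Because the conditions are joined with AND, one more contributor matters as much as message volume, which is why low-volume nightly runs still help.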
Re: Ruleqa masscheck so close.
Posted by John Brooks <jo...@fastquake.com>.
I'm now setting up my masscheck again (it wasn't running properly before
and I didn't bother fixing it when the server went down for months).
It's just me on my mail server, and I don't get that much mail; maybe
5-10 ham/a few hundred spam per day, not counting mailing lists which I
don't include in my scans. So I was going to do weekly runs instead of
nightly.
Would it be more useful to the project if I ran it nightly, despite the
low volume?
John
On 05/31/2017 10:52 PM, David Jones wrote:
> I am working pretty hard to get the ruleqa processing going again on our new server. We are so close to having enough contributors and ham/spam to get some new rules generated: This is from the run minutes ago:
>
> HAM CONTRIBUTORS FOUND: 9 (required 10)
> SPAM CONTRIBUTORS FOUND: 9 (required 10)
>
> We need to recruit some more masscheck'ers to get over the hump so I can do some final testing of the rules updates and start the DNS updates again for sa-update.
>
> P.S. After spending the past month learning how this works, I have some ideas on how to make the nightly masschecks become hourly fairly easily so we can test and promote rule changes faster.
>
> Dave
>