Posted to sysadmins@spamassassin.apache.org by Paul Stead <pa...@gmail.com> on 2019/05/25 12:00:23 UTC

Disappearing corpus

I'm investigating the problem with the disappearing corpus - see the bug
report here -

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715

Whilst that is an issue, I've realised this might not be everything
involved.

I'm on the system but I can't find the process that is "cleaning" up the
directory at

/usr/local/spamassassin/automc/rsync/corpus

At first I thought it was the hourly script but I don't think this is true.

I've checked through the cron.d run scripts and just can't seem to find it -
I've a feeling something is deleting logs from the corpus directory
prematurely, which then stops them being captured during the hourly run when
they should be - it's a matter of less than an hour.

It's possible this script has code to figure out if it's running at UTC or
needs an offset similar to the one in the bug.

It seems that the script is aware if it is running a nightly or weekly and
doesn't run the nightly on a Saturday.

Hope you might have an idea of which script I'm referring to?

I've "fixed" my problem by moving my corpus check to make sure it completes
after 10:00 UTC - this will likely fix everyone's but I'd like to make sure
that when we say mass check after 09:00 UTC we mean it.
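
A minimal sketch of how a submitter's wrapper could check the cutoff before
uploading, assuming a POSIX shell. CUTOFF_HOUR and past_cutoff are names made
up for illustration; they are not part of the SpamAssassin scripts.

```sh
#!/bin/sh
# Sketch only: CUTOFF_HOUR and past_cutoff are hypothetical names.

CUTOFF_HOUR=9   # 09:00 UTC, the submission time mentioned in the thread

# past_cutoff HOUR [CUTOFF]
# Returns 0 when HOUR (0-23, UTC) is at or after the cutoff, 1 otherwise.
past_cutoff() {
    # Strip one leading zero so values such as "09" stay safe in any
    # numeric context ($(( )) would treat 09 as invalid octal).
    hour=${1#0}
    cutoff=${2:-$CUTOFF_HOUR}
    [ "${hour:-0}" -ge "$cutoff" ]
}

if past_cutoff "$(date -u +%H)"; then
    echo "after cutoff: safe to submit"
else
    echo "before cutoff: wait until ${CUTOFF_HOUR}:00 UTC"
fi
```

Using `date -u` keeps the check independent of the local timezone of the
machine doing the submission.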

Paul

Re: Disappearing corpus

Posted by David Jones <dj...@ena.com>.
On 5/30/19 2:32 PM, Kevin A. McGrail wrote:
> Thanks for working on this.  I'm +1 without a technical review.
> 
> On 5/30/2019 3:27 PM, Paul Stead wrote:
>> I'm at the root of the issue and ready to commit changes around this:
>>
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>>
>> The changes will not affect how ruleqa works or how submissions should
>> be done - please continue to submit *after 0900 UTC*
>>
>> Any feedback appreciated, will be applying after 1st June unless
>> feedback received.
>>
>> Paul
> 
> 

+1

The concept sounds good and needed.  I am not a perl coder and don't 
have the time to dive into this in detail as it would probably take me a 
while not being good with perl.  When I looked at all of this a few 
years ago to get the masscheck running again, I seem to recall that I 
wanted to enable a triggered approach to doing something very similar to 
this.

For other similar file transfers in the past, I would set up swatch to
monitor log files and trigger file moves or renames, so subsequent
uploads would not overwrite previous files even if uploaded seconds later.
A simple shell script of a few lines triggered by swatch, along with a
cron entry to clean up files more than X days old, would do the trick.
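
The idea above could look roughly like this. It is a sketch under
assumptions: archive_upload and purge_old are hypothetical helpers, and how
swatch or cron would call them is left out.

```sh
#!/bin/sh
# Sketch only: archive_upload and purge_old are hypothetical helpers.

# archive_upload FILE ARCHIVE_DIR
# Moves FILE into ARCHIVE_DIR under a name that includes a UTC timestamp,
# adding a numeric suffix if that name is already taken. Prints the new path.
archive_upload() {
    src=$1
    dest_dir=$2
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    dest="$dest_dir/$(basename "$src").$stamp"
    n=0
    while [ -e "$dest" ]; do
        n=$((n + 1))
        dest="$dest_dir/$(basename "$src").$stamp.$n"
    done
    mkdir -p "$dest_dir" && mv "$src" "$dest" && echo "$dest"
}

# purge_old ARCHIVE_DIR DAYS
# Deletes regular files under ARCHIVE_DIR whose age in whole days exceeds DAYS.
purge_old() {
    find "$1" -type f -mtime +"$2" -exec rm -f {} +
}
```

The existence check and the move are not atomic, so this only protects
against overwrites when a single process does the archiving.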

Thanks to all who took the time to look into these scripts.  Trust me, I 
know they are a mess.  This really should be rewritten to be more 
modular and have better logging at each step to help track down 
problems.  Many of them simply turn on shell tracing with "set -x", which is
not as nice as having true debug logging.
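
As a rough illustration of what levelled logging could look like in a shell
script instead of tracing, here is a sketch. log_msg, level_num and LOG_LEVEL
are hypothetical names, not something the existing scripts provide.

```sh
#!/bin/sh
# Sketch only: log_msg, level_num and LOG_LEVEL are hypothetical names.

LOG_LEVEL=${LOG_LEVEL:-info}   # one of: debug, info, error

# level_num NAME
# Prints a numeric rank for the level name; unknown names rank as info.
level_num() {
    case $1 in
        debug) echo 0 ;;
        info)  echo 1 ;;
        error) echo 2 ;;
        *)     echo 1 ;;
    esac
}

# log_msg LEVEL MESSAGE...
# Writes a UTC-timestamped line to stderr when LEVEL ranks at or above
# the configured LOG_LEVEL.
log_msg() {
    lvl=$1
    shift
    if [ "$(level_num "$lvl")" -ge "$(level_num "$LOG_LEVEL")" ]; then
        printf '%s [%s] %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$lvl" "$*" >&2
    fi
}

log_msg debug "scanning corpus directory"   # hidden at the default level
log_msg info "hourly run started"
```

Running with LOG_LEVEL=debug in the environment would show the debug line
as well, without changing the script.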

-- 
David Jones

Re: Disappearing corpus

Posted by "Kevin A. McGrail" <km...@apache.org>.
Thanks for working on this.  I'm +1 without a technical review.

On 5/30/2019 3:27 PM, Paul Stead wrote:
> I'm at the root of the issue and ready to commit changes around this:
>
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>
> The changes will not affect how ruleqa works or how submissions should
> be done - please continue to submit *after 0900 UTC*
>
> Any feedback appreciated, will be applying after 1st June unless
> feedback received.
>
> Paul


-- 
Kevin A. McGrail
Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project
https://www.linkedin.com/in/kmcgrail - 703.798.0171


Re: Disappearing corpus

Posted by Paul Stead <pa...@gmail.com>.
I'm at the root of the issue and ready to commit changes around this:

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715

The changes will not affect how ruleqa works or how submissions should be
done - please continue to submit *after 0900 UTC*

Any feedback appreciated, will be applying after 1st June unless feedback
received.

Paul

Re: Disappearing corpus

Posted by "Kevin A. McGrail" <km...@apache.org>.
Hah, do not be too hard on yourself.  There are like 4 people on the planet
that have really dug into these scripts so I appreciate you working on it.

On Sat, May 25, 2019, 16:09 Paul Stead <pa...@gmail.com> wrote:

> I'm chasing my tail here....
>
> OF COURSE files are "disappearing" from the corpus directory, they get
> updated with today's/this week's content, they don't get renamed/deleted they
> get changed to logs from today - I've been looking in the wrong place.
>
> Looks like corpus-hourly shouldn't be working from the corpus directory
> when re-calculating the class files for previous days but I clearly need to
> have a break and relax
>
>
> Paul
>
> On Sat, 25 May 2019 at 18:05, Paul Stead <pa...@gmail.com> wrote:
>
> > The 14:05 run has finished, here's the before and after in terms of output
> > on ruleqa (attached)
> >
> > I saw files disappear in the /usr/local/spamassassin/automc/rsync/corpus
> > from 18 May but still can't find the trigger that is removing these files.
> >
> > Will come back to this later if no one has any ideas
> >
> > On Sat, 25 May 2019 at 17:54, Paul Stead <pa...@gmail.com> wrote:
> >
> >> TLDR;
> >> Any pointers on what might be clearing up the old or "invalid" files in
> >> /usr/local/spamassassin/automc/rsync/corpus?
> >>
> >> ----
> >>
> >> I'm going on the opinion that some function is cleaning up the
> >>
> >> /usr/local/spamassassin/automc/rsync/corpus
> >>
> >> directory underneath the corpus-hourly script - though I've so far been
> >> unable to distinguish what. There seems to be a lot of superfluous scripts
> >> hanging around in the svn directories.
> >>
> >> As far as I can tell it isn't the corpus-hourly cron, nor the
> >> /usr/local/bin/checkMasscheckContribs.sh script.
> >>
> >> During my investigations I've noticed that the hourly does seem to take
> >> more than an hour to run, thus two processes can run at the same time
> >>
> >> automc    7749 13.9  0.1  40632 19040 ?        RN   15:05   3:27
> >> /usr/bin/perl -w
> >> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
> >> --dir=/usr/local/spamassassin/automc/rsync/corpus
> >> automc    8708 99.7  0.8 164560 145008 ?       RN   15:09  20:10
> >> /usr/bin/perl -w ./hit-frequencies -TxpaP -o
> >> /usr/local/spamassassin/automc/tmp/spam.log.25383
> >> /usr/local/spamassassin/automc/tmp/ham.log.25383
> >> automc   25383  9.3  0.1  38880 17480 ?        SN   14:05   7:56
> >> /usr/bin/perl -w
> >> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
> >> --dir=/usr/local/spamassassin/automc/rsync/corpus
> >>
> >> I'm not 100% sure this is causing a problem, I see some protection
> >> against this for the running files, but I'm not sure about the resulting
> >> class files that are output.
> >>
> >> Paul
> >>
> >> On Sat, 25 May 2019 at 13:00, Paul Stead <pa...@gmail.com> wrote:
> >>
> >>> I'm investigating the problem with disappearing corpus - see the bug
> >>> report here -
> >>>
> >>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
> >>>
> >>> Whilst that is an issue, I've realised this might not be everything
> >>> involved.
> >>>
> >>> I'm on the system but I can't find the process that is "cleaning" up the
> >>> directory at
> >>>
> >>> /usr/local/spamassassin/automc/rsync/corpus
> >>>
> >>> At first I thought it was the hourly script but I don't think this is
> >>> true.
> >>>
> >>> I've checked through cron.d run scripts and just can't seem to find it -
> >>> I've a feeling something is deleting logs from the corpus directory
> >>> prematurely, which then stops it being captured during the hourly when it
> >>> should - it's a case of < 1 hour.
> >>>
> >>> It's possible this script has code to figure out if it's running at UTC
> >>> or needs an offset similar to the one in the bug.
> >>>
> >>> It seems that the script is aware if it is running a nightly or weekly
> >>> and doesn't run the nightly on a Saturday.
> >>>
> >>> Hope you might have an idea of which script I'm referring to?
> >>>
> >>> I've "fixed" my problem by moving my corpus check to make sure it
> >>> completes after 10:00 UTC - this will likely fix everyone's but I'd like
> >>> to make sure that when we say mass check after 09:00 UTC we mean it.
> >>>
> >>> Paul
> >>>
> >>
>

Re: Disappearing corpus

Posted by Paul Stead <pa...@gmail.com>.
I'm chasing my tail here....

OF COURSE files are "disappearing" from the corpus directory: they get
updated with today's/this week's content. They don't get renamed or deleted,
they get overwritten with logs from today - I've been looking in the wrong
place.

Looks like corpus-hourly shouldn't be working from the corpus directory
when re-calculating the class files for previous days, but I clearly need to
have a break and relax.


Paul

On Sat, 25 May 2019 at 18:05, Paul Stead <pa...@gmail.com> wrote:

> The 14:05 run has finished, here's the before and after in terms of output
> on ruleqa (attached)
>
> I saw files disappear in the /usr/local/spamassassin/automc/rsync/corpus
> from 18 May but still can't find the trigger that is removing these files.
>
> Will come back to this later if no one has any ideas
>
> On Sat, 25 May 2019 at 17:54, Paul Stead <pa...@gmail.com> wrote:
>
>> TLDR;
>> Any pointers on what might be clearing up the old or "invalid" files in
>> /usr/local/spamassassin/automc/rsync/corpus?
>>
>> ----
>>
>> I'm going on the opinion that some function is cleaning up the
>>
>> /usr/local/spamassassin/automc/rsync/corpus
>>
>> directory underneath the corpus-hourly script - though I've so far been
>> unable to distinguish what. There seems to be a lot of superfluous scripts
>> hanging around in the svn directories.
>>
>> As far as I can tell it isn't the corpus-hourly cron, nor the
>> /usr/local/bin/checkMasscheckContribs.sh script.
>>
>> During my investigations I've noticed that the hourly does seem to take
>> more than an hour to run, thus two processes can run at the same time
>>
>> automc    7749 13.9  0.1  40632 19040 ?        RN   15:05   3:27
>> /usr/bin/perl -w
>> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
>> --dir=/usr/local/spamassassin/automc/rsync/corpus
>> automc    8708 99.7  0.8 164560 145008 ?       RN   15:09  20:10
>> /usr/bin/perl -w ./hit-frequencies -TxpaP -o
>> /usr/local/spamassassin/automc/tmp/spam.log.25383
>> /usr/local/spamassassin/automc/tmp/ham.log.25383
>> automc   25383  9.3  0.1  38880 17480 ?        SN   14:05   7:56
>> /usr/bin/perl -w
>> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
>> --dir=/usr/local/spamassassin/automc/rsync/corpus
>>
>> I'm not 100% sure this is causing a problem, I see some protection
>> against this for the running files, but I'm not sure about the resulting
>> class files that are output.
>>
>> Paul
>>
>> On Sat, 25 May 2019 at 13:00, Paul Stead <pa...@gmail.com> wrote:
>>
>>> I'm investigating the problem with disappearing corpus - see the bug
>>> report here -
>>>
>>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>>>
>>> Whilst that is an issue, I've realised this might not be everything
>>> involved.
>>>
>>> I'm on the system but I can't find the process that is "cleaning" up the
>>> directory at
>>>
>>> /usr/local/spamassassin/automc/rsync/corpus
>>>
>>> At first I thought it was the hourly script but I don't think this is
>>> true.
>>>
>>> I've checked through cron.d run scripts and just can't seem to find it -
>>> I've a feeling something is deleting logs from the corpus directory
>>> prematurely, which then stops it being captured during the hourly when it
>>> should - it's a case of < 1 hour.
>>>
>>> It's possible this script has code to figure out if it's running at UTC
>>> or needs an offset similar to the one in the bug.
>>>
>>> It seems that the script is aware if it is running a nightly or weekly
>>> and doesn't run the nightly on a Saturday.
>>>
>>> Hope you might have an idea of which script I'm referring to?
>>>
>>> I've "fixed" my problem by moving my corpus check to make sure it
>>> completes after 10:00 UTC - this will likely fix everyone's but I'd like to
>>> make sure that when we say mass check after 09:00 UTC we mean it.
>>>
>>> Paul
>>>
>>

Re: Disappearing corpus

Posted by Paul Stead <pa...@gmail.com>.
The 14:05 run has finished; here's the before and after in terms of output
on ruleqa (attached).

I saw files disappear in /usr/local/spamassassin/automc/rsync/corpus
from 18 May but still can't find the trigger that is removing these files.

Will come back to this later if no one has any ideas.

On Sat, 25 May 2019 at 17:54, Paul Stead <pa...@gmail.com> wrote:

> TLDR;
> Any pointers on what might be clearing up the old or "invalid" files in
> /usr/local/spamassassin/automc/rsync/corpus?
>
> ----
>
> I'm going on the opinion that some function is cleaning up the
>
> /usr/local/spamassassin/automc/rsync/corpus
>
> directory underneath the corpus-hourly script - though I've so far been
> unable to distinguish what. There seems to be a lot of superfluous scripts
> hanging around in the svn directories.
>
> As far as I can tell it isn't the corpus-hourly cron, nor the
> /usr/local/bin/checkMasscheckContribs.sh script.
>
> During my investigations I've noticed that the hourly does seem to take
> more than an hour to run, thus two processes can run at the same time
>
> automc    7749 13.9  0.1  40632 19040 ?        RN   15:05   3:27
> /usr/bin/perl -w
> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
> --dir=/usr/local/spamassassin/automc/rsync/corpus
> automc    8708 99.7  0.8 164560 145008 ?       RN   15:09  20:10
> /usr/bin/perl -w ./hit-frequencies -TxpaP -o
> /usr/local/spamassassin/automc/tmp/spam.log.25383
> /usr/local/spamassassin/automc/tmp/ham.log.25383
> automc   25383  9.3  0.1  38880 17480 ?        SN   14:05   7:56
> /usr/bin/perl -w
> /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
> --dir=/usr/local/spamassassin/automc/rsync/corpus
>
> I'm not 100% sure this is causing a problem, I see some protection against
> this for the running files, but I'm not sure about the resulting class
> files that are output.
>
> Paul
>
> On Sat, 25 May 2019 at 13:00, Paul Stead <pa...@gmail.com> wrote:
>
>> I'm investigating the problem with disappearing corpus - see the bug
>> report here -
>>
>> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>>
>> Whilst that is an issue, I've realised this might not be everything
>> involved.
>>
>> I'm on the system but I can't find the process that is "cleaning" up the
>> directory at
>>
>> /usr/local/spamassassin/automc/rsync/corpus
>>
>> At first I thought it was the hourly script but I don't think this is
>> true.
>>
>> I've checked through cron.d run scripts and just can't seem to find it -
>> I've a feeling something is deleting logs from the corpus directory
>> prematurely, which then stops it being captured during the hourly when it
>> should - it's a case of < 1 hour.
>>
>> It's possible this script has code to figure out if it's running at UTC
>> or needs an offset similar to the one in the bug.
>>
>> It seems that the script is aware if it is running a nightly or weekly
>> and doesn't run the nightly on a Saturday.
>>
>> Hope you might have an idea of which script I'm referring to?
>>
>> I've "fixed" my problem by moving my corpus check to make sure it
>> completes after 10:00 UTC - this will likely fix everyone's but I'd like to
>> make sure that when we say mass check after 09:00 UTC we mean it.
>>
>> Paul
>>
>

Re: Disappearing corpus

Posted by Paul Stead <pa...@gmail.com>.
TLDR;
Any pointers on what might be clearing up the old or "invalid" files in
/usr/local/spamassassin/automc/rsync/corpus?

----

I'm working on the assumption that something is cleaning up the

/usr/local/spamassassin/automc/rsync/corpus

directory underneath the corpus-hourly script - though I've so far been
unable to determine what. There seem to be a lot of superfluous scripts
hanging around in the svn directories.

As far as I can tell it isn't the corpus-hourly cron, nor the
/usr/local/bin/checkMasscheckContribs.sh script.

During my investigations I've noticed that the hourly does seem to take
more than an hour to run, thus two processes can run at the same time:

automc    7749 13.9  0.1  40632 19040 ?        RN   15:05   3:27
    /usr/bin/perl -w
    /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
    --dir=/usr/local/spamassassin/automc/rsync/corpus
automc    8708 99.7  0.8 164560 145008 ?       RN   15:09  20:10
    /usr/bin/perl -w ./hit-frequencies -TxpaP -o
    /usr/local/spamassassin/automc/tmp/spam.log.25383
    /usr/local/spamassassin/automc/tmp/ham.log.25383
automc   25383  9.3  0.1  38880 17480 ?        SN   14:05   7:56
    /usr/bin/perl -w
    /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly
    --dir=/usr/local/spamassassin/automc/rsync/corpus

I'm not 100% sure this is causing a problem, I see some protection against
this for the running files, but I'm not sure about the resulting class
files that are output.
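
For reference, one common way to stop an hourly job from overlapping with
its previous run is a lock directory created with mkdir, which is atomic. A
sketch, not what corpus-hourly currently does: LOCK_DIR, acquire_lock,
release_lock and run_exclusive are hypothetical names.

```sh
#!/bin/sh
# Sketch only: LOCK_DIR and the three functions are hypothetical names.

LOCK_DIR=${LOCK_DIR:-/tmp/corpus-hourly.lock}

# acquire_lock
# mkdir either creates the directory or fails, so only one caller wins.
acquire_lock() {
    mkdir "$LOCK_DIR" 2>/dev/null || return 1
    echo $$ > "$LOCK_DIR/pid"
}

# release_lock
# Removes the lock directory so the next run can start.
release_lock() {
    rm -rf "$LOCK_DIR"
}

# run_exclusive COMMAND [ARGS...]
# Runs COMMAND only if the lock can be taken. Returns 75 (EX_TEMPFAIL in
# sysexits.h) when another run holds the lock, else COMMAND's exit status.
run_exclusive() {
    if ! acquire_lock; then
        echo "previous run still active, skipping" >&2
        return 75
    fi
    status=0
    "$@" || status=$?
    release_lock
    return $status
}

# Example invocation, using the paths from the process listing above:
#   run_exclusive /usr/local/spamassassin/automc/svn/masses/rule-qa/corpus-hourly \
#       --dir=/usr/local/spamassassin/automc/rsync/corpus
```

If the job is killed before release_lock runs, the lock directory is left
behind and has to be removed by hand or by a staleness check on the pid file.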

Paul

On Sat, 25 May 2019 at 13:00, Paul Stead <pa...@gmail.com> wrote:

> I'm investigating the problem with disappearing corpus - see the bug
> report here -
>
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7715
>
> Whilst that is an issue, I've realised this might not be everything
> involved.
>
> I'm on the system but I can't find the process that is "cleaning" up the
> directory at
>
> /usr/local/spamassassin/automc/rsync/corpus
>
> At first I thought it was the hourly script but I don't think this is true.
>
> I've checked through cron.d run scripts and just can't seem to find it -
> I've a feeling something is deleting logs from the corpus directory
> prematurely, which then stops it being captured during the hourly when it
> should - it's a case of < 1 hour.
>
> It's possible this script has code to figure out if it's running at UTC or
> needs an offset similar to the one in the bug.
>
> It seems that the script is aware if it is running a nightly or weekly and
> doesn't run the nightly on a Saturday.
>
> Hope you might have an idea of which script I'm referring to?
>
> I've "fixed" my problem by moving my corpus check to make sure it
> completes after 10:00 UTC - this will likely fix everyone's but I'd like to
> make sure that when we say mass check after 09:00 UTC we mean it.
>
> Paul
>
