Posted to users@spamassassin.apache.org by Dave Pooser <da...@pooserville.com> on 2014/06/16 15:42:27 UTC

Re: sa-update NOT updating.

On 5/30/14 11:11 AM, "Kevin A. McGrail" <KM...@PCCC.com> wrote:

>Good time for an update to the users list about the issue.  The box that
>processed the updates at the ASF collo failed catastrophically during a
>power surge that took down some other boxes as well. Unfortunately, while
>the project requested backups in 2009, they were not implemented.

Now that the update box is back online (and thanks for all your hard work
on that! Systems archaeology is no fun at all), is there anything useful
the community can do to help prevent another such catastrophe? I'd be
willing to contribute hardware and/or VM space at $WORKPLACE for an
offsite replica as long as we wouldn't need to sync more than 2-4GB/day
after the initial setup completed.
-- 
Dave Pooser
Cat-Herder-in-Chief, Pooserville.com

Re: sa-update NOT updating.

Posted by John Hardin <jh...@impsec.org>.

On Mon, 16 Jun 2014, francis picabia wrote:

> If this is open source, why not take advantage of all of the repositories
> available for this?  Git?  Sourceforge?  Mirrors?  The problem isn't merely
> the lack of a backup, but a single point of failure waiting to happen.

The point of failure isn't simply a set of files, it's a processing 
resource.

Something like a VM snapshot that could be easily restored (e.g. on an 
Amazon cloud server) to get a working environment would be a good way 
to back up the centralized part of the masscheck infrastructure.

Kevin, is this running on a VM now, such that this is an option?

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Microsoft is not a standards body.
-----------------------------------------------------------------------
  2 days until SWMBO's Birthday

Re: sa-update NOT updating.

Posted by Axb <ax...@gmail.com>.

On 06/16/2014 08:39 PM, francis picabia wrote:
> If this is open source, why not take advantage of all of the repositories
> available for this?  Git?  Sourceforge?  Mirrors?  The problem isn't merely
> the lack of a backup, but a single point of failure waiting to happen.
> When something goes wrong with kernel.org or the like, it isn't backup
> tapes that get them online again quickly.  I don't know enough about it,
> but I know the general principle that if you have a problem you're likely
> not the first to encounter it, and you usually don't need to invent a
> solution, but look into how it has been solved elsewhere.

This is not about where to store code...

This is about a system that does all the rule score manipulation, prepares 
updates, etc, etc, etc.
Software repositories don't handle this.
There is no similar processing going on elsewhere.

Re: sa-update NOT updating.

Posted by Ben <be...@list-subs.com>.

> At the ASF, there is an infrastructure team that manages those types of
> issues.  They work hard and do a lot of good but unfortunately, there
> was a disconnect back in 2009 and a backup request was not implemented
> correctly.
>

An untested backup is not a backup.  Some people only ever seem to learn 
that the hard way.

To be honest, given the popularity of SA and the relative importance of 
the sa-update service, it's astonishing that such a SPOF with such a lax 
management routine was allowed to occur.

Re: sa-update NOT updating.

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 6/16/2014 2:39 PM, francis picabia wrote:
> If this is open source, why not take advantage of all of the 
> repositories available for this?  Git?  Sourceforge? Mirrors?  The 
> problem isn't merely the lack of a backup, but a single point of 
> failure waiting to happen.  When something goes wrong with kernel.org 
> <http://kernel.org> or the like, it isn't backup tapes that get them 
> online again quickly.  I don't know enough about it, but I know the 
> general principle that if you have a problem you're likely not the 
> first to encounter it, and you usually don't need to invent a 
> solution, but look into how it has been solved elsewhere.
At the ASF, there is an infrastructure team that manages those types of 
issues.  They work hard and do a lot of good but unfortunately, there 
was a disconnect back in 2009 and a backup request was not implemented 
correctly.

So this is less about inventing a wheel and more about making sure that 
we have air in the spare tire...

regards,
KAM

Re: sa-update NOT updating.

Posted by francis picabia <fp...@gmail.com>.

On Mon, Jun 16, 2014 at 12:36 PM, Kevin A. McGrail <KM...@pccc.com>
wrote:

> On 6/16/2014 9:49 AM, Joe Quinn wrote:
>
>> On 6/16/2014 9:42 AM, Dave Pooser wrote:
>>
>>> On 5/30/14 11:11 AM, "Kevin A. McGrail" <KM...@PCCC.com> wrote:
>>>
>>>  Good time for an update to the users list about the issue.  The box that
>>>> processed the updates at the ASF collo failed catastrophically during a
>>>> power surge that took down some other boxes as well. Unfortunately, while
>>>> the project requested backups in 2009, they were not implemented.
>>>>
>>> Now that the update box is back online (and thanks for all your hard work
>>> on that! Systems archaeology is no fun at all), is there anything useful
>>> the community can do to help prevent another such catastrophe? I'd be
>>> willing to contribute hardware and/or VM space at $WORKPLACE for an
>>> offsite replica as long as we wouldn't need to sync more than 2-4GB/day
>>> after the initial setup completed.
>>>
>>
>> If you have access to any SA boxes, make sure they have a scheduled
>> backup (and make sure the backup works and has all important data!). If any
>> systems do not have backups, report it to the appropriate list.
>>
>> Also make sure every task the box is designed to handle is appropriately
>> documented, including user accounts required, libraries required and their
>> versions, what crontabs should be, etc.
>>
> I think Joe's answer is correct but at the same time doesn't answer the
> question of what the community at large can do to help.
>
> First, the overall takeaway for me is that documentation is important.
>  This was a hard task when it was just a box failure. When it became a box
> failure with missing backups and large documentation issues, it effectively
> became a personal mission to get the box working.  A little documentation
> on things went a long way for me, and especially in the OS world, people
> burn out or go on to other projects, so it's helpful if you try and document.
>
> Second, to answer your question less philosophically, from the community
> we always need:
>
> - Masscheckers - You run automated code nightly against hand-sorted
> spam/ham corpora to improve our rule scoring.  Once you get it set up and
> get good, sorted email, the system is very automated.
>
> - Rule writers - Spam evolves and we need people to write rules.  If you
> like balancing your checkbook, doing Sudoku, or seeing patterns in gibberish,
> this might be perfect for you.  And what I love doing is evolving the rule
> writing from manual to automated processing.  It's quite a science really!
>
> - Coders - The life blood of a project really.  If you want to help write
> code, become a committer and help drive this project on the PMC, speak up!
>
> - Testers - People who will use trunk on production systems and give
> constructive feedback on real-world mail flow.  NOTE: trunk is usually in
> good shape and runs on many of the committers' systems. Because of the way
> the system is plugin-based, the experimental stuff is usually not enabled
> by default.
>
> - RBL Stuff - I'm also still working with the ASF to see if we can run a
> distributed RBL under the project's umbrella so stay tuned on that...
>
>
If this is open source, why not take advantage of all of the repositories
available for this?  Git?  Sourceforge?  Mirrors?  The problem isn't merely
the lack of a backup, but a single point of failure waiting to happen.
When something goes wrong with kernel.org or the like, it isn't backup
tapes that get them online again quickly.  I don't know enough about it,
but I know the general principle that if you have a problem you're likely
not the first to encounter it, and you usually don't need to invent a
solution, but look into how it has been solved elsewhere.

Re: sa-update NOT updating.

Posted by "Kevin A. McGrail" <KM...@PCCC.com>.

On 6/16/2014 9:49 AM, Joe Quinn wrote:
> On 6/16/2014 9:42 AM, Dave Pooser wrote:
>> On 5/30/14 11:11 AM, "Kevin A. McGrail" <KM...@PCCC.com> wrote:
>>
>>> Good time for an update to the users list about the issue.  The box 
>>> that
>>> processed the updates at the ASF collo failed catastrophically during a
>>> power surge that took down some other boxes as well. Unfortunately, 
>>> while
>>> the project requested backups in 2009, they were not implemented.
>> Now that the update box is back online (and thanks for all your hard 
>> work
>> on that! Systems archaeology is no fun at all), is there anything useful
>> the community can do to help prevent another such catastrophe? I'd be
>> willing to contribute hardware and/or VM space at $WORKPLACE for an
>> offsite replica as long as we wouldn't need to sync more than 2-4GB/day
>> after the initial setup completed.
>
> If you have access to any SA boxes, make sure they have a scheduled 
> backup (and make sure the backup works and has all important data!). 
> If any systems do not have backups, report it to the appropriate list.
>
> Also make sure every task the box is designed to handle is 
> appropriately documented, including user accounts required, libraries 
> required and their versions, what crontabs should be, etc.
I think Joe's answer is correct but at the same time doesn't answer the 
question of what the community at large can do to help.

First, the overall takeaway for me is that documentation is important.  
This was a hard task when it was just a box failure. When it became a 
box failure with missing backups and large documentation issues, it 
effectively became a personal mission to get the box working.  A little 
documentation on things went a long way for me, and especially in the OS 
world, people burn out or go on to other projects, so it's helpful if you 
try and document.

Second, to answer your question less philosophically, from the community 
we always need:

- Masscheckers - You run automated code nightly against hand-sorted 
spam/ham corpora to improve our rule scoring.  Once you get it 
set up and get good, sorted email, the system is very automated.

- Rule writers - Spam evolves and we need people to write rules.  If you 
like balancing your checkbook, doing Sudoku, or seeing patterns in gibberish, 
this might be perfect for you.  And what I love doing is evolving the 
rule writing from manual to automated processing.  It's quite a science 
really!

- Coders - The life blood of a project really.  If you want to help 
write code, become a committer and help drive this project on the PMC, 
speak up!

- Testers - People who will use trunk on production systems and give 
constructive feedback on real-world mail flow.  NOTE: trunk is usually 
in good shape and runs on many of the committers' systems. Because of the 
way the system is plugin-based, the experimental stuff is usually not 
enabled by default.

- RBL Stuff - I'm also still working with the ASF to see if we can run a 
distributed RBL under the project's umbrella so stay tuned on that...

Regards,
KAM

Re: sa-update NOT updating.

Posted by Joe Quinn <jq...@pccc.com>.

On 6/16/2014 9:42 AM, Dave Pooser wrote:
> On 5/30/14 11:11 AM, "Kevin A. McGrail" <KM...@PCCC.com> wrote:
>
>> Good time for an update to the users list about the issue.  The box that
>> processed the updates at the ASF collo failed catastrophically during a
>> power surge that took down some other boxes as well. Unfortunately, while
>> the project requested backups in 2009, they were not implemented.
> Now that the update box is back online (and thanks for all your hard work
> on that! Systems archaeology is no fun at all), is there anything useful
> the community can do to help prevent another such catastrophe? I'd be
> willing to contribute hardware and/or VM space at $WORKPLACE for an
> offsite replica as long as we wouldn't need to sync more than 2-4GB/day
> after the initial setup completed.

If you have access to any SA boxes, make sure they have a scheduled 
backup (and make sure the backup works and has all important data!). If 
any systems do not have backups, report it to the appropriate list.

Also make sure every task the box is designed to handle is appropriately 
documented, including user accounts required, libraries required and 
their versions, what crontabs should be, etc.
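
One way to act on "make sure the backup works and has all important data" is to restore the backup to a scratch location and compare it with the live tree. The following is only a minimal sketch of that idea, not anything the SpamAssassin project or ASF infrastructure is known to use: the function names and the directory layout are hypothetical, and it assumes the backup has already been restored to a plain directory by whatever backup tool is in place.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> Dict[str, str]:
    """Map each regular file under root (by relative path) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_restore(live_root: Path, restored_root: Path) -> Dict[str, List[str]]:
    """Compare a live tree with a test restore of its backup.

    "missing" lists files present live but absent from the restore;
    "changed" lists files present in both whose contents differ.
    """
    live = tree_digests(live_root)
    restored = tree_digests(restored_root)
    return {
        "missing": sorted(set(live) - set(restored)),
        "changed": sorted(
            name for name in live.keys() & restored.keys()
            if live[name] != restored[name]
        ),
    }
```

Called as verify_restore(Path("/path/to/live"), Path("/path/to/test-restore")) with hypothetical paths, an empty "missing" list means every live file made it into the backup. Files that legitimately changed after the backup ran will show up under "changed", so the report needs a human to read it rather than a blind pass/fail.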