Posted to dev@spamassassin.apache.org by Adam Katz <an...@khopis.com> on 2010/04/24 20:36:43 UTC

hardware on ruleqa

The ruleqa system is very slow to crunch its results, and even slow to
display them.  I'd like to see it have a caching system for data it has
completed processing and maybe find some way to improve its ability to
handle the masscheck calculations.

Setting up a caching system within an apache.org server should be pretty
trivial ... I'm rusty but can help if needed.

As to the nightly runs, while a code revamp might help, it's typically
easier to throw more hardware at it.  What are the stats of that system,
where does it live, and what is the process for donating a replacement?
 (Are there other volunteers for this, or am I the only impatient one?)
 Hopefully with more power, additional statistical crunching can ensue...

(This is not a commitment, just an initial exploration.  ... which
reminds me; if this counts as a charitable donation, I can see about
getting my employer to pay for some of it, or perhaps donate a 1U server
I'm virtualizing.)


Re: hardware on ruleqa

Posted by Justin Mason <jm...@jmason.org>.
On Mon, Apr 26, 2010 at 21:15, Warren Togami <wt...@gmail.com> wrote:
> On Mon, Apr 26, 2010 at 6:00 AM, Justin Mason <jm...@jmason.org> wrote:
>>
>> On Sat, Apr 24, 2010 at 19:36, Adam Katz <an...@khopis.com> wrote:
>> > The ruleqa system is very slow to crunch its results, and even slow to
>> > display them.  I'd like to see it have a caching system for data it has
>> > completed processing and maybe find some way to improve its ability to
>> > handle the masscheck calculations.
>> >
>> > Setting up a caching system within an apache.org server should be pretty
>> > trivial ... I'm rusty but can help if needed.
>> >
>> > As to the nightly runs, while a code revamp might help, it's typically
>> > easier to throw more hardware at it.  What are the stats of that system,
>> > where does it live, and what is the process for donating a replacement?
>> >  (Are there other volunteers for this, or am I the only impatient one?)
>> >  Hopefully with more power, additional statistical crunching can ensue...
>>
>> Right now, it all lives on spamassassin2.zones.apache.org.  This is a
>> pretty hefty Solaris "zone" (a jail-style isolated environment) running
>> on ASF hardware.
>>
>> It seems to be quite beefy, but we have a lot of data to crunch and as
>> you've noted it tends to get very backlogged, particularly on the
>> OVERLAP data, which is very memory- and I/O-hungry.
>>
>
> Watching the logs on ruleqa, it appears that it was repeatedly reprocessing
> old datasets long after they had stopped being current.  It appears that
> caching wasn't working properly?

Ah, I was unaware of that.  Sounds like a bug :(

Re: hardware on ruleqa

Posted by Warren Togami <wt...@gmail.com>.
On Mon, Apr 26, 2010 at 6:00 AM, Justin Mason <jm...@jmason.org> wrote:

> On Sat, Apr 24, 2010 at 19:36, Adam Katz <an...@khopis.com> wrote:
> > The ruleqa system is very slow to crunch its results, and even slow to
> > display them.  I'd like to see it have a caching system for data it has
> > completed processing and maybe find some way to improve its ability to
> > handle the masscheck calculations.
> >
> > Setting up a caching system within an apache.org server should be pretty
> > trivial ... I'm rusty but can help if needed.
> >
> > As to the nightly runs, while a code revamp might help, it's typically
> > easier to throw more hardware at it.  What are the stats of that system,
> > where does it live, and what is the process for donating a replacement?
> >  (Are there other volunteers for this, or am I the only impatient one?)
> >  Hopefully with more power, additional statistical crunching can ensue...
>
> Right now, it all lives on spamassassin2.zones.apache.org.  This is a
> pretty hefty Solaris "zone" (a jail-style isolated environment) running
> on ASF hardware.
>
> It seems to be quite beefy, but we have a lot of data to crunch and as
> you've noted it tends to get very backlogged, particularly on the
> OVERLAP data, which is very memory- and I/O-hungry.
>
>
Watching the logs on ruleqa, it appears that it was repeatedly reprocessing
old datasets long after they had stopped being current.  It appears that
caching wasn't working properly?

Warren
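
The caching idea raised in this thread (skip any dataset whose results
have already been computed) can be sketched roughly as follows.  This is
a minimal illustration only, not the ruleqa code: the names fingerprint,
process_once, and analyze, the JSON result format, and the cache
directory layout are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(log_path: Path) -> str:
    """Return a SHA-256 digest of the log file's contents."""
    digest = hashlib.sha256()
    with log_path.open("rb") as handle:
        # Read in 1 MiB chunks so large logs are never held in memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def process_once(log_path: Path, cache_dir: Path, analyze) -> dict:
    """Run analyze(log_path) only if this exact content has not been seen.

    `analyze` is a hypothetical stand-in for the expensive crunching step;
    it must return something JSON-serializable.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / (fingerprint(log_path) + ".json")
    if cache_file.exists():
        # Cache hit: reuse the stored result instead of recomputing it.
        return json.loads(cache_file.read_text())
    result = analyze(log_path)
    # Write to a temporary name and rename, so an interrupted run never
    # leaves a truncated cache entry behind.
    tmp_file = cache_file.with_suffix(".tmp")
    tmp_file.write_text(json.dumps(result))
    tmp_file.replace(cache_file)
    return result
```

Keying on a content hash means an unchanged dataset is never crunched
twice, at the cost of one full read of each log.  Since the logs are
described as quite large, a cheaper key such as file size plus
modification time may be a better trade-off in practice.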

Re: hardware on ruleqa

Posted by Justin Mason <jm...@jmason.org>.
On Sat, Apr 24, 2010 at 19:36, Adam Katz <an...@khopis.com> wrote:
> The ruleqa system is very slow to crunch its results, and even slow to
> display them.  I'd like to see it have a caching system for data it has
> completed processing and maybe find some way to improve its ability to
> handle the masscheck calculations.
>
> Setting up a caching system within an apache.org server should be pretty
> trivial ... I'm rusty but can help if needed.
>
> As to the nightly runs, while a code revamp might help, it's typically
> easier to throw more hardware at it.  What are the stats of that system,
> where does it live, and what is the process for donating a replacement?
>  (Are there other volunteers for this, or am I the only impatient one?)
>  Hopefully with more power, additional statistical crunching can ensue...

Right now, it all lives on spamassassin2.zones.apache.org.  This is a pretty
hefty Solaris "zone" (a jail-style isolated environment) running on ASF
hardware.

It seems to be quite beefy, but we have a lot of data to crunch and as you've
noted it tends to get very backlogged, particularly on the OVERLAP data,
which is very memory- and I/O-hungry.

> (This is not a commitment, just an initial exploration.  ... which
> reminds me; if this counts as a charitable donation, I can see about
> getting my employer to pay for some of it, or perhaps donate a 1U server
> I'm virtualizing.)

I wonder if we could use something like that -- maybe farm out the
analysis of OVERLAP data to a dedicated server VM?  The hard part might
be shipping the logs around -- they are quite large. :(

Feel free to investigate.  All the code is in SVN.  I can set up zones user
accounts for you if you like....

--j.

Re: hardware on ruleqa ... and perceptrons

Posted by Justin Mason <jm...@jmason.org>.
On Sun, Apr 25, 2010 at 00:57, Sidney Markowitz <si...@sidney.com> wrote:
> Adam Katz wrote, On 25/04/10 8:22 AM:
>> Today, I saw this in svn at masses/README.perceptron:
>
> See this message that Justin posted to sa-dev, which explains the
> history of our using GA, then perceptron, then back to GA.
>
> It also links to Duncan Findlay's thesis work on using logistic
> regression as a faster algorithm that gets better results, but I don't
> know what ended up happening with that.
>
> http://mail-archives.apache.org/mod_mbox/spamassassin-dev/200707.mbox/%3C20070701224117.F1A7732D60@radish.jmason.org%3E
>
> or if that link gets garbled, also archived at
>
> http://www.mail-archive.com/dev@spamassassin.apache.org/msg21162.html

Yep.  Basically, the perceptron implementation seems to require a lot
of hand-tuning to produce decent results.  The GA is a lot more "fire
and forget", if slower.

--j.
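
For readers unfamiliar with the term, the textbook perceptron update
rule can be sketched as below.  This is a generic illustration and not
the implementation described in masses/README.perceptron; the function
names, the 0/1 rule-hit encoding, the +1/-1 labels, and the default
epochs and learning rate are all assumptions.  The learning rate and
epoch count are examples of the kind of knobs that can need hand-tuning.

```python
from typing import List, Sequence, Tuple


def train_perceptron(
    samples: Sequence[Tuple[Sequence[int], int]],
    n_rules: int,
    epochs: int = 20,
    learning_rate: float = 0.1,
) -> Tuple[List[float], float]:
    """Learn one weight per rule plus a bias term.

    Each sample is (hits, label): hits[i] is 1 if rule i fired on the
    message and 0 otherwise; label is +1 for spam and -1 for ham.
    """
    weights = [0.0] * n_rules
    bias = 0.0
    for _ in range(epochs):
        for hits, label in samples:
            activation = bias + sum(w * h for w, h in zip(weights, hits))
            # Update only on a mistake (or a zero activation): nudge the
            # weights of the rules that fired toward the correct label.
            if label * activation <= 0:
                for i, h in enumerate(hits):
                    weights[i] += learning_rate * label * h
                bias += learning_rate * label
    return weights, bias


def classify(weights: Sequence[float], bias: float, hits: Sequence[int]) -> int:
    """Return +1 (spam) if the weighted sum is positive, else -1 (ham)."""
    total = bias + sum(w * h for w, h in zip(weights, hits))
    return 1 if total > 0 else -1
```

Each pass over the data costs time proportional to the number of rule
hits, which is consistent with a perceptron running far faster than an
evolutionary search over score vectors.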

Re: hardware on ruleqa ... and perceptrons

Posted by Sidney Markowitz <si...@sidney.com>.
Adam Katz wrote, On 25/04/10 8:22 AM:
> Today, I saw this in svn at masses/README.perceptron:

See this message that Justin posted to sa-dev, which explains the
history of our using GA, then perceptron, then back to GA.

It also links to Duncan Findlay's thesis work on using logistic
regression as a faster algorithm that gets better results, but I don't
know what ended up happening with that.

http://mail-archives.apache.org/mod_mbox/spamassassin-dev/200707.mbox/%3C20070701224117.F1A7732D60@radish.jmason.org%3E

or if that link gets garbled, also archived at

http://www.mail-archive.com/dev@spamassassin.apache.org/msg21162.html


Re: hardware on ruleqa ... and perceptrons

Posted by Adam Katz <an...@khopis.com>.
On 04/24/2010 02:36 PM, Adam Katz wrote:
> The ruleqa system is very slow to crunch its results [...] As to the
> nightly runs, while a code revamp might help, it's typically easier
> to throw more hardware at it.

Hm.  Digging through the repository brought to mind some of the
conversations I had at the MIT Spam Conference this year with some
Cisco (IronPort) developers working there under Henry Stern.

Specifically, I was challenged on the differences between SA's
genetic algorithm and perceptrons, which is beyond my current
mathematical prowess.  After I talked it over with my girlfriend (who
knows far more about these things than I do), we concluded that since
the Cisco group specialized in perceptrons, they likely suffered from the
"when you have a hammer, everything looks like a nail" problem and that
switching was probably a negligible gain not worth the needed rewriting.

Today, I saw this in svn at masses/README.perceptron:
> The advantage of this program over that of the genetic algorithm
> (GA) implementation in spamassassin/masses/craig_evolve.c is that
> while the GA requires several hours to run on high-end machines, the
> perceptron requires only about 15 seconds of CPU time on an Athlon XP
> 1700+ system.

Written by Henry Stern, 2004-01-08.  If I recall correctly from my
conversation with him last month, he abandoned the project and his PhD
pursuit when offered a job at IronPort.

Henry:  I've Bcc'd you in case you're not on the dev list anymore.
Apologies if you get this twice.