Posted to dev@spamassassin.apache.org by Justin Mason <jm...@jmason.org> on 2005/10/27 02:32:41 UTC

Rules Project: need corpus fodder for the "preflight" mass-check

So, this is coming along nicely. ;)


STORY SO FAR:

http://spamassassin.zones.apache.org:8011/ is the main UI -- each time a
checkin occurs in SVN, a set of mass-checks is triggered.

There are 4 mass-checks at the moment: mc-fast, mc-med, mc-slow and
mc-slower.   The idea is that the "fast" one completes first, with only a
few thousand messages (right now, it mass-checks 5700 mails in 3
minutes!), providing a quick, rough look at freqs:

http://spamassassin.zones.apache.org:8011/mc-fast/builds/28/configure_2/0
(Mass-check results from "mc-fast")

then, gradually, the other 3 complete and provide their results
as well:

http://spamassassin.zones.apache.org:8011/mc-slower/builds/21/configure_2/0
(Mass-check results from "mc-slower")

The results page presents a basic look at the "freqs" output.  At
the same time, it starts generating the data for the next step,
the rule-QA app:

http://buildbot.spamassassin.org/ruleqa/ruleqa?daterev=20051025/r328495
(the rule-QA view of the same data)

This allows us to "drill down" to more details about a rule:

http://buildbot.spamassassin.org/ruleqa/ruleqa?daterev=20051025%2Fr328495&rule=T_SUBJ_RE_NUM&s_detail=1
(drilled-down for details about "T_SUBJ_RE_NUM")
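The two rule-QA URLs above differ only in their query parameters, so
a drill-down link can be put together mechanically.  A minimal sketch,
assuming only the three parameters visible in those URLs ("daterev",
"rule", "s_detail"); the helper name ruleqa_url is made up for
illustration and is not part of the rule-QA app:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL as it appears in the post.
RULEQA_BASE = "http://buildbot.spamassassin.org/ruleqa/ruleqa"


def ruleqa_url(daterev, rule=None, detail=False):
    """Build a rule-QA URL (hypothetical helper, for illustration only)."""
    params = {"daterev": daterev}
    if rule is not None:
        params["rule"] = rule
    if detail:
        params["s_detail"] = 1
    # urlencode percent-encodes the "/" in the daterev value as %2F.
    return RULEQA_BASE + "?" + urlencode(params)


url = ruleqa_url("20051025/r328495", rule="T_SUBJ_RE_NUM", detail=True)

# Decoding the query string gives back the original values.
query = parse_qs(urlsplit(url).query)
```

Here url comes out identical to the drill-down link quoted above.  The
overview link leaves the "/" in daterev unencoded; both forms decode to
the same value.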


Now, I have a couple of things on the todo list remaining for this
app:

  - message hits-over-time graphs
  - hitrates on messages by score (does this rule hit high-scoring
    spams only?)

So they're in the pipeline.


HELP NEEDED!

In addition, we need another thing: mail!   There are a few issues that we
have to worry about, though:

  - the privacy of submitted ham: in other words, I think most of us might
    have a hard time uploading our freshest, unchecked ham mail, since
    there could be private stuff in there.

  - the freshness of submitted spam: old spam is only partially useful,
    and in fact can be misleading (i.e. a rule can fire well on it but be
    useless against current and future spam).

  - the hand-filteredness, i.e. whether the mail has been classified
    by hand.

We need fresh spam and private ham.  Ham doesn't need to be quite
as up-to-the-minute fresh, but spam does.

So, next question -- can you provide a corpus that you're prepared to
update frequently?

What I'm thinking is, up to about 20k ham/20k spam messages from a few
people should be plenty.  (This is only the "preflight" mass-check,
for quick checking, so it doesn't have to be comprehensive; anything
from a few thousand up would be perfect.)

It's important that the ham be pristine ham, and that the spam
be reasonably pristine; spam needs to be up to date, ham not so
much.
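
One way a contributor might pick out such a corpus is sketched below.
It assumes one message per file in a directory, and uses file
modification time as a stand-in for message age; both are assumptions,
as are the function name and the 30-day cutoff in the usage note.  The
20000 default is the "up to about 20k" figure from above.

```python
import os
import time


def select_recent(directory, max_messages=20000, max_age_days=None, now=None):
    """Return up to max_messages file paths from directory, newest first.

    Hypothetical helper: assumes one message per file, and treats the
    file's mtime as the message's age.
    """
    now = time.time() if now is None else now
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            if not entry.is_file():
                continue
            mtime = entry.stat().st_mtime
            # Skip messages older than the cutoff, if one was given.
            if max_age_days is not None and now - mtime > max_age_days * 86400:
                continue
            entries.append((mtime, entry.path))
    entries.sort(reverse=True)  # newest first
    return [path for _, path in entries[:max_messages]]
```

For spam, something like select_recent("corpus/spam", max_age_days=30)
would keep only recent mail; for ham the age cutoff could simply be
left off.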

I think the easiest way to transfer it is via rsync.
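
A rough sketch of what a periodic rsync push could look like, wrapped
in Python so it can run from a scheduled job.  The destination shown in
the usage note is a placeholder, not a real upload target, and the
function names are made up; the rsync options used (-a, -z, --delete)
are standard ones.

```python
import subprocess


def rsync_command(local_dir, remote, delete=True):
    """Build an rsync argv list (hypothetical helper; nothing is run here)."""
    cmd = ["rsync", "-az"]  # archive mode, compress in transit
    if delete:
        # Remove files on the receiving side that no longer exist locally,
        # so the uploaded corpus tracks the local one as old mail is dropped.
        cmd.append("--delete")
    # A trailing slash on the source copies the directory's contents.
    cmd += [local_dir.rstrip("/") + "/", remote]
    return cmd


def push_corpus(local_dir, remote):
    """Run rsync and raise CalledProcessError if it exits non-zero."""
    subprocess.run(rsync_command(local_dir, remote), check=True)
```

For example, push_corpus("corpus/spam", "user@example.org:corpus/spam/")
would mirror the local spam directory to that (placeholder) account.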

--j.