Posted to dev@spamassassin.apache.org by Duncan Findlay <du...@debian.org> on 2005/07/24 05:36:58 UTC

Hackathon summary

Just thought I'd post a quick note about the hackathon that took place
today at Stanford University. "We" below refers to Justin Mason,
Daniel Quinlan, Michael Parker and me. Matt Sergeant was also present
for a while, so "we" can include him too for some of the
following items. :-)

Discussion

 * We discussed at length the ideas for the new rules project, and we
came up with some ideas, which we're trying to track at
http://wiki.apache.org/spamassassin/RulesProjectPlan (Please give us
your feedback)

 * We discussed the 3.2 release goals
(http://wiki.apache.org/spamassassin/ReleaseGoals)

 * Dr. Andrew Ng gave us a brief presentation on how logistic
regression may be an algorithm we could use in the future to replace
the perceptron.
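
   (As a rough illustration of the idea, not the presentation itself:
   logistic regression models the probability that a message is spam as

       P(spam | x) = 1 / (1 + exp(-(w1*x1 + ... + wn*xn + b)))

   where xi is 1 if rule i hits the message and 0 otherwise; the fitted
   weights wi would then play the role the perceptron-generated scores
   play today.)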

Development

 * We came up with a plan to restructure PerMsgStatus.pm so it's not
so unwieldy and out of
control. (http://bugzilla.spamassassin.org/show_bug.cgi?id=4497)

 * We branched the tree so we could start committing stuff to
HEAD. Strangely, however, we got almost no coding done.

QA/Bugs

 * We went through all the bugs targeted for 3.1.0 and triaged
them. (All the Bugzilla comments from me today were really from all of
us who were present.)

 * We added a "moreinfo" keyword to Bugzilla for bugs that are in need
of more info. One side effect of this is that we'll need to remove
that keyword once the info is actually provided. :-)


That's about all.

-- 
Duncan Findlay

Re: Streamlining Rules Process

Posted by Daniel Quinlan <qu...@pathname.com>.
Robert Menschel <Ro...@Menschel.net> writes:

> One item not mentioned on this page yet is how to score rules headed
> either for core and rapid distribution (such as via sa-update) or for
> the extra rule sets.

Current practice is that new rules temporarily get the default score of
1.0.  We plan to rescore much more often in the future, though.  The new
scoring method makes rescoring much easier, and once we get the kinks
worked out, I think we'll be able to do it far more often than we have
in the past.

One option to bridge the gap would be to score new rules based on a
nightly run, using that run's more limited corpora.  This would be done
by setting the old rules' scores to be immutable and only scoring the
new ones.  That would not be too hard and would be more accurate than
any estimation technique.  There is definitely a correlation between
hit rates, S/O ratio, RANK, etc. and the ultimate perceptron-generated
score, but the correlations are not all that high, unfortunately.

> The ideal would be to find some way to incorporate new rules into a
> GA/Perceptron-like mechanism, perhaps a Perceptron run which a)
> assumes whatever hit frequency applied to the last full scoring run,
> b) freezes all scores in all score sets according to the most recent
> distribution, and then c) incorporates an sa-update scoring run and
> calculates appropriate scores for the new rules.  [...]

Ah, very good.  I should have read your entire message.  ;-)
 
-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Streamlining Rules Process

Posted by Robert Menschel <Ro...@Menschel.net>.
Saturday, July 23, 2005, 8:36:58 PM, Duncan wrote:

DF>  * We discussed at length the ideas for the new rules project, and we
DF> came up with some ideas, which we're trying to track at
DF> http://wiki.apache.org/spamassassin/RulesProjectPlan (Please give us
DF> your feedback)

http://wiki.apache.org/spamassassin/RulesProjStreamlining

One item not mentioned on this page yet is how to score rules headed
either for core and rapid distribution (such as via sa-update) or for
the extra rule sets.

The ideal would be to find some way to incorporate new rules into a
GA/Perceptron-like mechanism, perhaps a Perceptron run which a)
assumes whatever hit frequency applied to the last full scoring run,
b) freezes all scores in all score sets according to the most recent
distribution, and then c) incorporates an sa-update scoring run and
calculates appropriate scores for the new rules.

If that's not practical, then perhaps we can use some standardized
algorithms to determine provisional scores. The algorithms we use
for general purpose rules within SARE seem to work very well, adding
significantly to spam scores without causing any significant number of
FPs.

Would it be appropriate for me to post those algorithms in the wiki
as part of a "scoring" discussion? I'm thinking this could easily grow
to warrant a page of its own...

Bob Menschel




Re[2]: Hackathon summary

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Daniel,

Sunday, July 24, 2005, 11:34:33 PM, you wrote:

DQ> Robert Menschel <Ro...@Menschel.net> writes:

>>> TODO: criteria for overlap with existing rules?
>>> BobMenschel: The method I used for weeding out SARE rules that
>>> overlapped 3.0.0 rules was to run a full mass-check with overlap
>>> analysis, and throw away anything where the overlap is less than
>>> 50%.

DQ> By "throw away", do you mean put into the bucket that is retained going
DQ> forward or did you mean to say "greater than 50%"?

By "throw away anything where the overlap is less than 50%" I meant
to discard (exclude from the final file) anything where the overlap
was (IMO) insignificant.

This would leave those overlaps where RULE_A hit all the emails that
RULE_B also hit (100%), and RULE_B hit somewhere between 50% and 100%
of the emails that RULE_A hit.

It'd also be good to identify overlaps where RULE_A hit 90% of what
RULE_B hit, and RULE_B hit 90% of what RULE_A hit, but neither hit
100% of the other's ...
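
For illustration (toy message IDs, not real mass-check data), the two
percentages for a pair of rules would be computed along these lines:

  use strict;
  use warnings;

  # toy hit sets keyed by message ID; a real run would parse the
  # mass-check log files instead
  my %hits = (
      RULE_A => { map { $_ => 1 } qw(m1 m2 m3 m4 m5 m6 m7 m8 m9 m10) },
      RULE_B => { map { $_ => 1 } qw(m1 m2 m3 m4 m5 m6 m7 m8 m9 m11) },
  );

  my $a_total = keys %{ $hits{RULE_A} };
  my $b_total = keys %{ $hits{RULE_B} };
  my $both    = grep { $hits{RULE_B}{$_} } keys %{ $hits{RULE_A} };

  printf "B covers %.0f%% of A's hits; A covers %.0f%% of B's hits\n",
      100 * $both / $a_total, 100 * $both / $b_total;
  # prints 90% both ways: the near-duplicate case described above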

Bob Menschel




Re: Hackathon summary

Posted by Daniel Quinlan <qu...@pathname.com>.
Robert Menschel <Ro...@Menschel.net> writes:

>> TODO: criteria for overlap with existing rules?
>> BobMenschel: The method I used for weeding out SARE rules that
>> overlapped 3.0.0 rules was to run a full mass-check with overlap
>> analysis, and throw away anything where the overlap is less than
>> 50%.

By "throw away", do you mean put into the bucket that is retained going
forward or did you mean to say "greater than 50%"?

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Hackathon summary

Posted by Robert Menschel <Ro...@Menschel.net>.
Saturday, July 23, 2005, 8:36:58 PM, Duncan wrote:

DF>  * We discussed at length the ideas for the new rules project, and we
DF> came up with some ideas, which we're trying to track at
DF> http://wiki.apache.org/spamassassin/RulesProjectPlan (Please give us
DF> your feedback)

Added to http://wiki.apache.org/spamassassin/RulesProjStreamlining :

> TODO: criteria for overlap with existing rules?
> BobMenschel: The method I used for weeding out SARE rules that
> overlapped 3.0.0 rules was to run a full mass-check with overlap
> analysis, and throw away anything where the overlap is less than
> 50%. Manually reviewing the remaining (significantly) overlapping
> rules was fairly easy. The command I use is: perl ./overlap
> ../rules/tested/$testfile.ham.log ../rules/tested/$testfile.spam.log
> | grep -v mid= | awk ' NR == 1 { print } ; $2 + 0 == 1.000 && $3 + 0
> >= 0.500 { print } ' >../rules/tested/$testfile.overlap.out




Re[2]: Hackathon summary

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Justin,

Sunday, July 24, 2005, 11:30:49 PM, you wrote:

JM> Daniel Quinlan writes:
>> I would hope to include most language-specific rule sets in core (with
>> usual stipulations about quality), though.

JM> btw yeah, I'm still pretty conflicted about this -- I'm not sure what to
JM> do if we can't mass-check them ourselves reliably to get an idea what good
JM> scores would be.

There's also the problem SARE has run into with many of our
obfuscation rules: they hit beautifully on English spam but have
horrible S/O ratios against German ham (to pick an example).

That's why we use 70_sare_name_eng.cf files, to indicate that these
rules work well only on systems which expect almost 100% English ham,
and little to no ham in other languages.

I've begun to wonder whether it might be worthwhile having
50_scores.cf for English emails, and then 50_scores_de.cf for German
emails, and having SA pick the score appropriately depending upon the
language of the email...
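
For illustration only (the rule name and the language-selection step
are hypothetical; SA has no such mechanism today), the two files might
look something like:

  # 50_scores.cf (systems expecting almost entirely English ham)
  score SARE_OBFU_EXAMPLE 2.5

  # 50_scores_de.cf (systems with a significant amount of German ham)
  score SARE_OBFU_EXAMPLE 0.1

with some additional step that detects the message's language and
applies the matching score set.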

Bob Menschel




Re: Hackathon summary

Posted by Daniel Quinlan <qu...@pathname.com>.
Robert Menschel <Ro...@Menschel.net> writes:

> Question -- at the bottom of that page, we have:

>> Repository Organization
>> * rules/core/ = standard rules directory
>> * rules/sandbox/<username>/ = per-user sandboxes
>> * rules/extra/<directory>/ = extra rule sets not in core
> 
> I understand the first two.  What is the intent of the third? Would
> that be a collection of non-Apache/CLA rule sets submitted by users,
> for users?  Or would this be a collection of Apache/CLA rule sets
> which don't qualify for rules/core for some reason (ie:
> language-specific, too many ham hits, not enough spam hits, etc)? Or
> some other collection?  Or a combination of these?

The second: a collection of rule sets that do not qualify for rules/core.  For
example, SpamAssassin intentionally doesn't filter virus bounces (yet,
at least), but there is a good virus bounce ruleset out there.

I would hope to include most language-specific rule sets in core (with
usual stipulations about quality), though.

Daniel

-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Hackathon summary

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Duncan, All,

Saturday, July 23, 2005, 8:36:58 PM, you wrote:

DF>  * We discussed at length the ideas for the new rules project, and we
DF> came up with some ideas, which we're trying to track at
DF> http://wiki.apache.org/spamassassin/RulesProjectPlan (Please give us
DF> your feedback)

Question -- at the bottom of that page, we have:
> Repository Organization
> * rules/core/ = standard rules directory
> * rules/sandbox/<username>/ = per-user sandboxes
> * rules/extra/<directory>/ = extra rule sets not in core

I understand the first two.  What is the intent of the third? Would
that be a collection of non-Apache/CLA rule sets submitted by users,
for users?  Or would this be a collection of Apache/CLA rule sets
which don't qualify for rules/core for some reason (ie:
language-specific, too many ham hits, not enough spam hits, etc)? Or
some other collection?  Or a combination of these?

Bob Menschel