Posted to dev@spamassassin.apache.org by Timo Volz <Ti...@gmx.de> on 2005/02/08 23:48:36 UTC

Overview of the scripts in the masses folder

Hi,

I'm working on a student project about spam detection at the University of
Stuttgart. The project will also include calculating new scores for
SpamAssassin using some machine learning approaches.

As a first step, I created an overview of the scripts in the SpamAssassin
/masses folder. I'm not sure if it's a good idea to send attachments to
this list, so I put it here:
http://w3studi.informatik.uni-stuttgart.de/~volzto/SAMassesFolder.txt

It would be very nice if someone could have a look at it and give me some
feedback (corrections or additions). Of course, feel free to do whatever
you like with it. Maybe it's useful for others who want to learn how
to work with these scripts.

I also have some questions I need to resolve before I can continue with my
work. It would be of great help to me if someone could answer them:

Which rules are considered as immutable and why?

Which ones are ignored in the rescore process or set to 0 and why?

Which rules are ignored by the logs-to-c script when creating a statistic
and why are they ignored?

What are tflags, and what is the meaning of the value "nice"?

Thank you in advance!

Best regards,
Timo Volz


Re: Overview of the scripts in the masses folder

Posted by Daniel Quinlan <qu...@pathname.com>.
"Timo Volz" <Ti...@gmx.de> writes:

> http://w3studi.informatik.uni-stuttgart.de/~volzto/SAMassesFolder.txt

I'd STRONGLY recommend adding your information to the development
information on our Wiki.  Then I'll comment on it and add information.
;-)

Probably a new page named "MassesOverview" indexed under this page:

  http://wiki.apache.org/spamassassin/DevelopmentStuff

> Which rules are considered as immutable and why?

Rules not marked as mutable in 50_scores.cf, either because they can't be
accurately scored or because we don't want the score changed:

  GTUBE - always want this to trigger spam
  "we dare you" rules - things we never want to let spammers get away
    with doing
  locale or language specific rules
  intentionally low scoring rules (HTML_MESSAGE, some virus-related stuff)
  userconf rules - things that require individual user settings to work,
    so they can't be accurately scored
  very low-hit rate, but accurate rules like Habeas, DomainKeys, etc.
  disabled-by-default rules like MAPS

> Which ones are ignored in the rescore process or set to 0 and why?

See above.
 
> Which rules are ignored by the logs-to-c script when creating a statistic
> and why are they ignored?

I'd have to check the code.  ;-)

> What are tflags, and what is the meaning of the value "nice"?

"nice" means it's a ham rule rather than a spam rule.  Nice rules are the
only rules that can be assigned negative scores.
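
As an illustration of the "nice" flag, here is a minimal Python sketch. It
is not taken from the masses scripts. It assumes rule files contain
`tflags NAME flag...` and `score NAME value...` lines; the helper names and
the EXAMPLE_* rule names are hypothetical. It lists rules that carry a
negative score without being flagged nice:

```python
import re

# Only the "tflags" and "score" keywords come from SpamAssassin config files.
# All function names and the sample rule names below are hypothetical.
TFLAGS_RE = re.compile(r"^\s*tflags\s+(\S+)\s+(\S.*?)\s*$")
SCORE_RE = re.compile(r"^\s*score\s+(\S+)\s+(\S.*?)\s*$")


def strip_comment(line):
    """Drop everything after a '#' so comments are not parsed as values."""
    return line.split("#", 1)[0]


def parse_rules(text):
    """Return (flags, scores): rule -> set of tflags, rule -> list of floats."""
    flags, scores = {}, {}
    for raw in text.splitlines():
        line = strip_comment(raw)
        m = TFLAGS_RE.match(line)
        if m:
            flags.setdefault(m.group(1), set()).update(m.group(2).split())
            continue
        m = SCORE_RE.match(line)
        if m:
            # A score line may list one value or several (one per score set).
            scores[m.group(1)] = [float(x) for x in m.group(2).split()]
    return flags, scores


def misplaced_negative_scores(text):
    """Names of rules with a negative score but no 'nice' tflag, sorted."""
    flags, scores = parse_rules(text)
    return sorted(
        rule
        for rule, vals in scores.items()
        if min(vals) < 0 and "nice" not in flags.get(rule, set())
    )


SAMPLE = """\
tflags EXAMPLE_HAM_RULE nice
score  EXAMPLE_HAM_RULE -1.5
score  EXAMPLE_SPAM_RULE 2.0
score  EXAMPLE_BAD_RULE -0.5
"""

print(misplaced_negative_scores(SAMPLE))  # prints ['EXAMPLE_BAD_RULE']
```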

Daniel
 
-- 
Daniel Quinlan
http://www.pathname.com/~quinlan/

Re: Overview of the scripts in the masses folder

Posted by Robert Menschel <Ro...@Menschel.net>.
Hello Timo,

Tuesday, February 8, 2005, 2:48:36 PM, you wrote:

TV> http://w3studi.informatik.uni-stuttgart.de/~volzto/SAMassesFolder.txt

TV> It would be very nice if someone could have a look at it and give
TV> me some feedback (corrections or additions). ...

Yes, as Daniel suggested, please add this to the wiki.

One addition/correction:  the hit-frequencies script works only with
*.cf rules files whose names begin with a numeric digit. Rules in
local.cf or alphaname.cf will be ignored.  I believe this is a limitation
of the called script parse-rules-for-masses, but my memory might be
wrong about that.
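
The file-name restriction described above can be sketched in Python as
follows. This is only an illustration: the regular expression is an
assumption about what "begins with a numeric digit" means, not the check the
masses scripts actually perform, and the function name is hypothetical.

```python
import re

# Assumed pattern: the name starts with a digit and ends in ".cf".
# The real check in the masses scripts was not verified.
NUMERIC_CF_RE = re.compile(r"^[0-9].*\.cf$")


def split_rule_files(names):
    """Split file names into (picked, skipped) according to the assumed pattern."""
    picked, skipped = [], []
    for name in names:
        (picked if NUMERIC_CF_RE.match(name) else skipped).append(name)
    return picked, skipped


picked, skipped = split_rule_files(
    ["20_head_tests.cf", "50_scores.cf", "local.cf", "alphaname.cf"]
)
print("picked: ", picked)   # the two numerically prefixed files
print("skipped:", skipped)  # local.cf and alphaname.cf
```

Under this assumption, renaming a local rules file so that it starts with a
digit would be enough for it to be picked up.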

Bob Menschel