Posted to users@spamassassin.apache.org by Dan <a...@patnode.net> on 2006/04/09 23:39:50 UTC

Learning SpamAssassin

Newbie here,

I've filtered email for years with Declude and am adding SpamAssassin  
to my arsenal.  I want to build something from scratch and am having  
problems getting started.  Thing is, all the guides (books, manual,  
web pages) seem to be geared toward using the standard configuration  
or making slight modifications to it - assuming the reader doesn't  
want to see under the hood.  I can't find anything that describes the  
configuration structure that SA uses, its requirements or rules.

I have a sense of 10_, 20_, 50_ file names in terms of controlling  
order and I know how to build SA rules.  But I want to know, if I  
delete every config/rule file, what files would I absolutely have to  
have and what would have to be in them?  For example, do I just need  
a local.cf file with specific global lines and then everything else  
can be body/description/scores, inside any file with any name?  Are  
new rules in new files automatically added when present?

For the purposes of these questions, assume a single configuration  
level with all rules in the same directory.  Any links or  
descriptions would be appreciated.

Thanks,
Dan


Re: Learning SpamAssassin

Posted by Dan <a...@patnode.net>.
> No, I'm saying 15 digits. That's *total* combined between integer  
> and decimal
> places.
>
> However, because the floating point is stored in a scientific  
> notation, you can
> add a bunch of extra zeros to push those 15 digits around.
>
>
> So you can have:
>  (15 digits) + (307 zeros).0
>
> or
> 0.(307 zeros) +(15 digits)
>
> (note: I've simplified the math a lot here, because none of this is  
> really
> stored in terms of decimal places. It's really stored as powers-of- 
> two, binary
> format. It's really 2^51 +/- 2^e , where e can be up to anywhere up  
> to 1023)

Okay, I'm getting you.  Maximum 15 unique/consecutive digits,  
positioned/valued by additional zeros.

Thanks!

Re: Learning SpamAssassin

Posted by Loren Wilton <lw...@earthlink.net>.
> I want to flag all messages as spam, then configure various rules as
> exceptions, marking them as ham.  But how do I universally mark
> messages one way in SpamAssassin and then unmark them in the other?
>
> I realize this is unorthodox, but I would appreciate any suggestions.

From experience, it isn't a particularly good way to go, as it opens you to
various specific spam attacks on your rules.

That said, what you want to do is moderately trivial.  Add one rule with a
huge positive score that will hit on anything, then fight against it with
other rules with negative scores.

header    MUST_BE_SPAM    ALL =~ /./
score    MUST_BE_SPAM    1000


        Loren


Re: Learning SpamAssassin

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
> Follow up question (even more odd than weight limits):
>
> I want to flag all messages as spam, then configure various rules as
> exceptions, marking them as ham.  But how do I universally mark
> messages one way in SpamAssassin and then unmark them in the other?
>
> I realize this is unorthodox, but I would appreciate any suggestions.
Hmm.. this might be a bit tough. Since SA rules by nature detect
something, it might be difficult to detect "something or nothing"..

Perhaps this:

body L_DEFAULT_ALL   /.?/
score L_DEFAULT_ALL 5.0

would work.. That should detect even a message with no body..

>


Re: Learning SpamAssassin

Posted by Dan <a...@patnode.net>.
Thanks for the follow-ups guys, it's much appreciated.  I'm building  
and testing furiously   :)


On Apr 11, 2006, at 6:05, Matt Kettler wrote:

> Dan wrote:
>> Follow up question (even more odd than weight limits):
>>
>> I want to flag all messages as spam, then configure various rules as
>> exceptions, marking them as ham.  But how do I universally mark
>> messages one way in SpamAssassin and then unmark them in the other?
>>
>> I realize this is unorthodox, but I would appreciate any suggestions.
>
> Better suggestion than my previous rule-based suggestion..
>
> Set your required_score to 0. Then use negative scoring rules to back
> email out of the tag range.


Re: Learning SpamAssassin

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
> Follow up question (even more odd than weight limits):
>
> I want to flag all messages as spam, then configure various rules as
> exceptions, marking them as ham.  But how do I universally mark
> messages one way in SpamAssassin and then unmark them in the other?
>
> I realize this is unorthodox, but I would appreciate any suggestions.

Better suggestion than my previous rule-based suggestion..

Set your required_score to 0. Then use negative scoring rules to back
email out of the tag range.
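
To make the idea concrete, here is a toy model in Python of how scoring plays out with a threshold of 0. It is only a sketch, not SpamAssassin code: the rule names, addresses and scores are made up, and it assumes a message is tagged when its total score is greater than or equal to required_score.

```python
# Toy model of threshold-based scoring; NOT SpamAssassin code.
# Assumption: a message is tagged as spam when total >= required_score.
REQUIRED_SCORE = 0.0

# Hypothetical ham-exception rules: name -> (test function, score).
RULES = {
    "L_FROM_PARTNER": (lambda msg: msg["from"].endswith("@partner.example"), -5.0),
    "L_HAS_TICKET_ID": (lambda msg: "[ticket #" in msg["subject"].lower(), -2.0),
}


def total_score(msg):
    """Sum the scores of every rule whose test matches the message."""
    return sum(score for test, score in RULES.values() if test(msg))


def is_spam(msg):
    """With a threshold of 0, a message that hits no rule is still tagged."""
    return total_score(msg) >= REQUIRED_SCORE


unknown = {"from": "someone@elsewhere.example", "subject": "Hello"}
partner = {"from": "alice@partner.example", "subject": "Invoice"}

print(is_spam(unknown))  # no rule hits, so the total stays at 0.0
print(is_spam(partner))  # one negative rule pulls the total below 0
```

In an actual setup this would correspond to setting required_score to 0 in local.cf and writing ordinary rules with negative scores for the exceptions.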

Re: Learning SpamAssassin

Posted by Dan <a...@patnode.net>.
Follow up question (even more odd than weight limits):

I want to flag all messages as spam, then configure various rules as  
exceptions, marking them as ham.  But how do I universally mark  
messages one way in SpamAssassin and then unmark them in the other?

I realize this is unorthodox, but I would appreciate any suggestions.

Thanks,
Dan

Re: Learning SpamAssassin

Posted by Matt Kettler <mk...@evi-inc.com>.
Dan wrote:
>> The total range for the mantissa of a double-precision float is
>> 52-bits, with 1
>> bit for sign. This means that the range between your most significant
>> and least
>> significant digit of the final summed answer cannot be greater than
>> 2^51, or
>> you'll loose precision.
>>
>> The total range for the exponent of a double-precision float is
>> 2^1023, so you
>> cannot express any numbers larger than 2^51 + 2^1023.
>>
> 
> 
> I'm not that experienced with math/compsci, but Excel describes
> 
> 2^51 as having 15 digits (2,251,799,813,685,250)
> 
> 2^1023 comes in with 307 digits (won't display above 255)
> 
> 
> So you're saying 307 integers and 15 decimals?:

No, I'm saying 15 digits. That's *total* combined between integer and decimal
places.

However, because the floating point is stored in a scientific notation, you can
add a bunch of extra zeros to push those 15 digits around.


So you can have:
 (15 digits) + (307 zeros).0

or
0.(307 zeros) +(15 digits)

(note: I've simplified the math a lot here, because none of this is really
stored in terms of decimal places. It's really stored as powers-of-two, binary
format. It's really 2^51 +/- 2^e, where e can be anywhere up to 1023)
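
The "sliding digits" effect is easy to try in any language that uses IEEE doubles. The sketch below uses Python rather than Perl, purely for illustration, and the values are arbitrary. It assumes the platform's float is an IEEE 754 double, which is the common case.

```python
# About 15 significant decimal digits survive at either extreme of magnitude.
big = 1.23456789012345e300
small = 1.23456789012345e-300
print(f"{big:.14e}")
print(f"{small:.14e}")

# Precision is relative: a small term survives next to 1000, but is
# swallowed entirely when added to a very large number.
print(1000.0 + 0.0001 > 1000.0)  # True: both terms fit within the precision
print(1e16 + 0.0001 == 1e16)     # True: the 0.0001 is lost
print(2.0**53 + 1 == 2.0**53)    # True: integers above 2**53 are not all representable
```

For scoring, this means a sum stays exact enough as long as the largest and smallest scores involved are within roughly 15 decimal digits of each other.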



Re: Learning SpamAssassin

Posted by Loren Wilton <lw...@earthlink.net>.
> So you're saying 307 integers and 15 decimals?:

307 total digits if you happen to have the decimal point on the right.  The
first 15 of those will be non-zero, all the rest will be zero.

Or alternately 280+ leading zeros and 15 trailing digits if you go to the
right of the decimal point.

        Loren


Re: Learning SpamAssassin

Posted by Dan <a...@patnode.net>.
> The total range for the mantissa of a double-precision float is 52- 
> bits, with 1
> bit for sign. This means that the range between your most  
> significant and least
> significant digit of the final summed answer cannot be greater than  
> 2^51, or
> you'll loose precision.
>
> The total range for the exponent of a double-precision float is  
> 2^1023, so you
> cannot express any numbers larger than 2^51 + 2^1023.
>


I'm not that experienced with math/compsci, but Excel describes

2^51 as having 15 digits (2,251,799,813,685,250)

2^1023 comes in with 307 digits (won't display above 255)


So you're saying 307 integers and 15 decimals?:

100000000000000000000000000000000000000000000000000000000000000000000000 
000000000000000000000000000000000000000000000000000000000000000000000000 
000000000000000000000000000000000000000000000000000000000000000000000000 
000000000000000000000000000000000000000000000000000000000000000000000000 
00000000000000000000.000000000000001

Thanks,
Dan

Re: Learning SpamAssassin

Posted by Loren Wilton <lw...@earthlink.net>.
> The total range for the mantissa of a double-precision float is 52-bits,
> with 1 bit for sign. This means that the range between your most
> significant and least

Minor quibble: the number of mantissa bits stored is 52, but as the mantissa
is assumed to be normalized except in some very special cases, there is an
assumed leading 1 bit in the mantissa, making it actually 53 bits, except
when denormalization and underflow occur.
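
These figures can be checked directly on any system that uses IEEE doubles. A minimal check in Python (chosen only for illustration; Perl uses the same format on common platforms) reads them from sys.float_info:

```python
import math
import sys

info = sys.float_info
print(info.mant_dig)  # mantissa bits, counting the implied leading 1
print(info.dig)       # decimal digits that always survive a round trip
print(info.max_exp)   # largest e such that 2**(e-1) is a finite double
print(info.max)       # largest finite double

# Going past the maximum overflows to infinity rather than raising an error.
print(math.isinf(info.max * 2))
```

On an IEEE 754 system this reports 53 mantissa bits, 15 decimal digits, and a maximum of about 1.8e308.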

        Loren


Re: Learning SpamAssassin

Posted by Matt Kettler <mk...@evi-inc.com>.
Dan wrote:
> Good approach Herb, thanks
> 
> 
> To anyone:
> 
> 1) What is the highest weight value (in number of digits) supported by
> SpamAssassin?
> 
> 2) What is the smallest weigh value (in decimal places) supported by
> SpamAssassin?

In current practice, the range is 1000 to 0.0001. The code that prints results
is heavily biased toward this input range, and also rounds or truncates to the
nearest tenth. I think this is pretty much the widest dynamic range you should
ever have need to use.

However, when it comes down to parsing and internal mathematics, spamassassin is
likely only limited by the capacity of IEEE double-precision floating point
numbers, which perl uses for all floating point math.

These limits do not result in a static number of digits. The more you use to the
left of the decimal place, the fewer you can use to the right without losing
precision.

The total range for the mantissa of a double-precision float is 52-bits, with 1
bit for sign. This means that the range between your most significant and least
significant digit of the final summed answer cannot be greater than 2^51, or
you'll lose precision.

The total range for the exponent of a double-precision float is 2^1023, so you
cannot express any numbers larger than 2^51 + 2^1023.



Re: Learning SpamAssassin

Posted by Dan <a...@patnode.net>.
Good approach Herb, thanks


To anyone:

1) What is the highest weight value (in number of digits) supported  
by SpamAssassin?

2) What is the smallest weight value (in decimal places) supported by  
SpamAssassin?


These might look like:

10000000000000

.00000000000001


Thanks,
Dan

RE: Learning SpamAssassin

Posted by Herb Martin <He...@learnquick.com>.
Dan wrote:

> Thanks guys,
> > There isn't a lot of description of this because most people don't  
> > want to
> > do this - most f the value of SA comes in the rules packaged with  
> > it that
> > have been tested to hit current spam.

> Must be to many years of building legos, one brick at a time.
> 
> I may well use most of the well tested/supported cf files (including  
> antidrug.cf), but I want to use them in my own way.  It appears that  
> most configurations handle few domains with a specific kind of email  
> and various exceptions.  This lends itself to starting with the  
> public config and then modifying it.  I want to handle a range of  

Generally you will find your time best invested if you
do something like the above, along these lines:

	1) Adjust your Spam scores so that you experience
		few false positives IF you are concerned about this

	2) Identify rules which are incorrect or mis-scored for YOU

	3) Override these rule scores in your local or other config
		(0 disables the rule)

	4) Write your own supplementary rules

	5) Make sure you write BOTH POSITIVE and NEGATIVE rules

	6) RE-adjust thresholds in light of your changes

	7) Join the SARE or SA Rules writers if you are serious
		about writing rules and can offer help

If you find that you are writing a "lot" of rules in a certain
context or domain then offer this for others that might have
similar requirements.

You can develop as much skill and understanding using this
method but reap far larger benefits.

For #2 you will want to especially concentrate on those
rules which hit the MOST spam or the MOST ham using one
of the analysis tools (e.g., sa-stats) -- focus your changes
and reading/understanding on those rules that offer the
most chance for (incremental) improvement.

I went through the above and discovered that SARE did most
of the work for ME (your mileage may vary) and that my time
was better spent on getting AUXILIARY spam methods to work:

	1) Exim (my email server) filters

	2) DNS whitelists and blacklists

	3) CRM (a statistical method similar to but different from Bayes)

	4) SPF (not much real help but it seems the right thing to do)

	5) AND ESPECIALLY:  GREYLISTING (best done outside of SA,
		and actually BEFORE SA in the email receiving process)


--
Herb Martin



Re: Learning SpamAssassin

Posted by Dan <a...@patnode.net>.
Thanks guys,


> There isn't a lot of description of this because most people don't  
> want to
> do this - most f the value of SA comes in the rules packaged with  
> it that
> have been tested to hit current spam.

> The main reason why you're finding little documentation about  
> "starting
> from the ground up" is that it's a MASSIVE amount of work. I probably
> sunk about 100 man-hours of my free time into creating the rules in
> antidrug.cf, maybe more. That's for a very small number of rules.

Must be too many years of building Legos, one brick at a time.

I may well use most of the well tested/supported cf files (including  
antidrug.cf), but I want to use them in my own way.  It appears that  
most configurations handle few domains with a specific kind of email  
and various exceptions.  This lends itself to starting with the  
public config and then modifying it.  I want to handle a range of  
types, all at the same time and all with the same configuration.   
Short term it will be more work to get started and long term it will  
mean more work keeping up to date (cross checking every line of each  
file that's added/updated), but I'm that fanatical about FP/FN  
performance.

My biggest challenge is actually getting an adequate spam/ham corpus  
on which to perfect everything.


> In theory, if you delete all the rule/config files, then you'd have a
> perfectly working SA that would always generate scores of 0.
>
> That said, this is not a well tested configuration. Some of the in- 
> code
> defaults may not match those established by 10_misc.cf, and some  
> things
> may not have defaults at all and may cause errors.  You might want to
> consider keeping this file, if for no other reason then ensuring  
> all the
> settings get reasonable defaults.

> ...init.pre to enable some plugins
> and a local.cf with some minimal configuration in it.  And the one  
> or two
> rules you want to run.  Additional rules can be put into  
> anything.cf in the
> same directory with local.cf, and the files will be read  
> alphabetically.

Will do.  The approach I'm taking is to separate every file into what  
I understand (ie filters) and what I don't (configs), push all the  
filters aside, and try to understand every line of each config file.   
At this point (I've been learning for a week now), most of what I  
don't understand is in the misc and .pre files.  Once I learn them and build my  
scoring foundation, I will add the filter content (if not the actual  
files) back in.


> Yes, rules can be in any file with any name that ends in .cf
>
> Just remember that SA parses rule files in alphabetic order, so if you
> want to reference rules across files, name them to fit the parse-order
> you need.

Nice, that's more flexibility than I'm used to, but flexibility that  
forces us to use our own structure(s), lest we get lost.  I just have  
a habit of pushing things to their limits, so I want to know what  
they are.  This is looking like a robust platform (a la Unix).


>> Are new rules in new files automatically added when present?
> Yes, but if you use spamd, spamd only parses these two directories  
> when
> it loads, so you'd have to restart spamd.

Will do.  I'll work out an update regimen.

Dan



Re: Learning SpamAssassin

Posted by Matt Kettler <mk...@comcast.net>.
Dan wrote:
> Newbie here,
>
> I've filtered email for years with Declude and am adding SpamAssassin
> to my arsenal.  I want to build something from scratch and am having
> problems getting started.
Before we get further: I'd really suggest starting off playing with SA's
default setup for a while, then try tinkering with customizations, THEN
move on to building a whole ruleset from scratch.

The main reason why you're finding little documentation about "starting
from the ground up" is that it's a MASSIVE amount of work. I probably
sunk about 100 man-hours of my free time into creating the rules in
antidrug.cf, maybe more. That's for a very small number of rules.
> Thing is, all the guides (books, manual, web pages) seem to be geared
> toward using the standard configuration or making slight modifications
> to it - assuming the reader doesn't want to see under the hood.  I
> can't find anything that describes the configuration structure that SA
> uses, its requirements or rules.
man Mail::SpamAssassin::Conf

>
> I have a sense of 10_, 20_, 50_ file names in terms of controlling
> order and I know how to build SA rules.  But I want to know, if I
> delete every config/rule file, what files would I absolutely have to
> have and what would have to be in them?
In theory, if you delete all the rule/config files, then you'd have a
perfectly working SA that would always generate scores of 0.

That said, this is not a well tested configuration. Some of the in-code
defaults may not match those established by 10_misc.cf, and some things
may not have defaults at all and may cause errors.  You might want to
consider keeping this file, if for no other reason than ensuring all the
settings get reasonable defaults.
>   For example, do I just need a local.cf file with specific global
> lines and then everything else can be body/description/scores, inside
> any file with any name?  
Yes, rules can be in any file with any name that ends in .cf, provided
it's placed in one of two directories:

Rules can be in /usr/share/spamassassin/*.cf, which is the intended home
of the "default" ruleset. This directory gets obliterated and
repopulated when you upgrade spamassassin.

Rules can also be in /etc/mail/spamassassin/*.cf, which is the intended
home of add-on and customized rules.

Just remember that SA parses rule files in alphabetic order, so if you
want to reference rules across files, name them to fit the parse-order
you need.
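
As a rough illustration of what alphabetic order means for file naming, the Python sketch below lists the .cf files in each directory, sorted by name. It is not SpamAssassin's own loader: the function name cf_load_order is made up, and it assumes the default rules directory is read before the site directory.

```python
from pathlib import Path


def cf_load_order(*directories):
    """Return the .cf files of each directory in turn, sorted by file name.

    Illustrative only: this mimics "alphabetic order within a directory"
    and is not how SpamAssassin is actually implemented.
    """
    ordered = []
    for directory in directories:
        folder = Path(directory)
        if not folder.is_dir():
            continue  # skip directories that do not exist on this system
        files = [p for p in folder.glob("*.cf") if p.is_file()]
        ordered.extend(sorted(files, key=lambda p: p.name))
    return ordered


if __name__ == "__main__":
    # The two directories mentioned above; adjust for your installation.
    for path in cf_load_order("/usr/share/spamassassin", "/etc/mail/spamassassin"):
        print(path)
```

With plain string sorting, a file such as 10_misc.cf comes before 50_scores.cf, and both come before local.cf, which is why numeric prefixes are a convenient way to control the order.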

> Are new rules in new files automatically added when present?
Yes, but if you use spamd, spamd only parses these two directories when
it loads, so you'd have to restart spamd.
>
> For the purposes of these questions, assume a single configuration
> level will all rules in the same directory.  Any links or descriptions
> would be appreciated.
>
> Thanks,
> Dan
>
>


Re: Learning SpamAssassin

Posted by Loren Wilton <lw...@earthlink.net>.
There isn't a lot of description of this because most people don't want to
do this - most of the value of SA comes in the rules packaged with it that
have been tested to hit current spam.

That said, these days you probably need an init.pre to enable some plugins
and a local.cf with some minimal configuration in it.  And the one or two
rules you want to run.  Additional rules can be put into anything.cf in the
same directory with local.cf, and the files will be read alphabetically.

        Loren