Posted to users@spamassassin.apache.org by Tom Allison <to...@tacocat.net> on 2007/06/30 02:52:30 UTC
A different approach to scoring spamassassin hits
For some years now there has been a lot of effective spam filtering
using statistical approaches with variations on Bayesian theory. Some
of these are inverse chi-square modifications to Naive Bayes, and
CRM114 and other "languages" have been developed to improve the
statistical scoring of spam. For all of these statistical
processes the spamicity is always between 0 and 1.
Before this, and alongside it, has been the approach of
spamassassin, wherein every email is evaluated against a library of
rules and each rule that hits is assigned a number of points.
Given enough points, the email is classified as spam; otherwise it
is ham. To accommodate the Bayesian process, SA was modified with a
Bayes engine and the ability to add points depending on where the
Bayesian score fell (>.85, >.95...). And for all of these processes
the score is between something negative and something positive
depending on the total number of hits and the points assigned to them.
It occurred to me that this process of assigning points to each
"HIT" (either addition or subtraction of points) is slightly
arbitrary. There is a long process of evaluating for the "most
effective score" for each rule and then providing that as the
default. The Mail Admin has the option to retune these various
parameters as needed. To me, this looks like a lot of knobs I can
turn on a very complex machine I will probably never really
understand. In short, if I touch it, I will break it. But the
arbitrary part of the process is this manual balancing act between
how many points to apply to something and getting the call from the
CEO about his overabundance of East European teenage solicitors (or
lack thereof).
The thought I had, and have been working on for a while, is changing
how the scoring is done. Rather than making Bayes a part of the
scoring process, make the scoring process a part of the Bayes
statistical Engine. As an example, you would simply feed the
indications of scoring hits (binary yes/no) into the Bayesian process
as tokens, where they would be examined next to the other tokens in
the message.
It would be the Bayes process that determines the effective number of
points you assign for each HIT based on what it's learned about it
from you. So the tags of: ADVANCE_FEE_1, ADVANCE_FEE_2 would be
represented as a token of format:
ADVANCE_FEE_1=YES or NO
ADVANCE_FEE_2=YES or NO
and each of these tokens would then be evaluated based on your
learning process.
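To make the token idea concrete, here is a minimal sketch in Python. It is not SpamAssassin code: the function name rule_tokens and the way the rule lists are passed in are assumptions made purely for illustration.

```python
def rule_tokens(all_rules, hits):
    """Return one token per rule, e.g. 'ADVANCE_FEE_1=YES'.

    all_rules: every rule that was run (hypothetical input).
    hits:      the subset of rules that matched this message.
    """
    hit_set = set(hits)
    return [
        f"{name}={'YES' if name in hit_set else 'NO'}"
        for name in sorted(all_rules)
    ]


# These tokens would be handed to the Bayes engine alongside
# the ordinary word tokens taken from the message itself.
tokens = rule_tokens(["ADVANCE_FEE_1", "ADVANCE_FEE_2"], ["ADVANCE_FEE_2"])
print(tokens)
```

The Bayes engine then learns a probability for each token from training feedback, instead of using a hand-assigned point value.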
An advantage of this would be the elimination of the process to
determine the best number of points to assign or to determine if you
even want a rule included.
Point assignments would be determined based on the statistical hits
(number of spam, number of ham) and would be tuned on a per-site
or per-user basis depending on the bayes engine configuration. Each
user, by means of their feedback, would tune the importance of each
rule applied.
Whether to include a rule would be determined for you
automatically, based on the resulting scoring. If you have a
rule that has an overall historical performance of 0.499 then it's
pretty obvious that it's incapable of "seeing" your kind of spam/ham.
But if you throw together a rule and run it for a week and find
it's scoring 0.001 or 0.999 then you have evidence of how effective
the rule is and can continue to use it. It is conceivable that you
could start with all known rules and later on remove all the rules
that are nominally 0.500, improving performance by an objective
process. It would also apply to any of the networked rules like
botnet, dcc, razor because they just have a tagline and a YES/NO
indication.
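A rough sketch of how that pruning could look, in Python. The probability formula here is the simple Graham-style ratio of hit rates, which is an assumption; the post does not say which estimator the engine uses. The rule names, counts and the 0.05 band are made up for illustration.

```python
def rule_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | rule hit) from training counts."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # rule never hit: no evidence either way
    return spam_rate / (spam_rate + ham_rate)


def uninformative_rules(counts, n_spam, n_ham, band=0.05):
    """Names of rules whose probability sits within `band` of 0.5."""
    return sorted(
        name
        for name, (spam_hits, ham_hits) in counts.items()
        if abs(rule_probability(spam_hits, ham_hits, n_spam, n_ham) - 0.5) < band
    )


# Hypothetical counts: (hits in spam, hits in ham) over 100 of each.
counts = {
    "ADVANCE_FEE_1": (90, 1),
    "HTML_MESSAGE": (50, 50),
    "NEW_RULE": (0, 0),
}
print(uninformative_rules(counts, n_spam=100, n_ham=100))
```

Rules returned by uninformative_rules are the candidates for removal; anything far from 0.5 in either direction is doing useful work.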
I've been working on something like this myself with great effect,
but it would be far more practical to utilize much of the knowledge
and capability that already exists in spamassassin. But I'm not
familiar enough with spamassassin to know how to gain visibility into
all the rules run and all their results (hits are easy in
PerMsgStatus, but misses are not). If someone would be willing to
give me a pointer to a roadmap of sorts, it would be appreciated.
Many Thanks for those of you who have read this far for your patience
and consideration.
Re: A different approach to scoring spamassassin hits
Posted by arni <ma...@arni.name>.
Tom Allison schrieb:
>
> Many Thanks for those of you who have read this far for your patience
> and consideration.
>
Sorry for only giving you such a short reply to your long and great
post, but I have to say this now:
The proposal is brilliant and I thought about this before myself but
never got around to putting it into words.
arni
Re: A different approach to scoring spamassassin hits
Posted by Marc Perkel <ma...@perkel.com>.
Tom Allison wrote:
>
> On Jun 30, 2007, at 1:20 AM, Marc Perkel wrote:
>
>>
>>
>>
>> Tom Allison wrote:
>>> For some years now there has been a lot of effective spam filtering
>>> using statistical approaches with variations on Bayesian theory,
>>> some of these are inverse Chi Square modifications to Naive Bayes or
>>> even CRM114 and other "languages" have been developed to improve the
>>> scoring of statistical analysis of spam. For all statistical
>>> processes the spamicity is always between 0 and 1.
>>> <snip>
>>>
>>> Many Thanks for those of you who have read this far for your
>>> patience and consideration.
>>
>> Tom, I suggested something similar to that years ago and I'd still
>> like to see it tried out. I wonder what would happen if you stripped
>> out the body and ran bayes just on the headers and the rules and let
>> bayes figure it out. You do have to have some points to start with to
>> get bayes pointed in the right direction. But you could use black
>> lists and white lists to do bayes training. Also needs more rules to
>> identify ham and not just rules to identify spam.
>
> I was under the belief that there were Ham-centric tests that would
> result in negative point scorings.
>
> Ham doesn't try to be evasive. It's pretty easy to identify. Without
> SA tagging much of it falls to <<0.5 and whitelisting would capture
> much of the exceptions.
>
> As for headers only testing -- The first five lines of stock spam is
> very telling...
>
> My question about SA is the PerMsgStatus (I think) Is this the place
> to retrieve all the rules information? I know today you can get a
> list of all the rules that HIT, but is there where you would look to
> find all the rules that were attempted? Or is there a better place
> for it?
>
There are some ham tests in SA but not nearly enough.
Re: A different approach to scoring spamassassin hits
Posted by Tom Allison <to...@tacocat.net>.
On Jun 30, 2007, at 1:20 AM, Marc Perkel wrote:
>
>
>
> Tom Allison wrote:
>> For some years now there has been a lot of effective spam
>> filtering using statistical approaches with variations on Bayesian
>> theory, some of these are inverse Chi Square modifications to
>> Naive Bayes or even CRM114 and other "languages" have been
>> developed to improve the scoring of statistical analysis of spam.
>> For all statistical processes the spamicity is always between 0
>> and 1.
>> <snip>
>>
>> Many Thanks for those of you who have read this far for your
>> patience and consideration.
>
> Tom, I suggested something similar to that years ago and I'd still
> like to see it tried out. I wonder what would happen if you
> stripped out the body and ran bayes just on the headers and the
> rules and let bayes figure it out. You do have to have some points
> to start with to get bayes pointed in the right direction. But you
> could use black lists and white lists to do bayes training. Also
> needs more rules to identify ham and not just rules to identify spam.
I was under the belief that there were Ham-centric tests that would
result in negative point scorings.
Ham doesn't try to be evasive. It's pretty easy to identify.
Without SA tagging much of it falls to <<0.5 and whitelisting would
capture much of the exceptions.
As for headers-only testing -- the first five lines of stock spam are
very telling...
My question about SA concerns PerMsgStatus (I think). Is this the place
to retrieve all the rules information? I know today you can get a
list of all the rules that HIT, but is that where you would look to
find all the rules that were attempted? Or is there a better place
for it?
Re: A different approach to scoring spamassassin hits
Posted by Marc Perkel <ma...@perkel.com>.
Tom Allison wrote:
> For some years now there has been a lot of effective spam filtering
> using statistical approaches with variations on Bayesian theory, some
> of these are inverse Chi Square modifications to Naive Bayes or even
> CRM114 and other "languages" have been developed to improve the
> scoring of statistical analysis of spam. For all statistical
> processes the spamicity is always between 0 and 1.
>
> <snip>
>
> Many Thanks for those of you who have read this far for your patience
> and consideration.
Tom, I suggested something similar to that years ago and I'd still like
to see it tried out. I wonder what would happen if you stripped out the
body and ran bayes just on the headers and the rules and let bayes
figure it out. You do have to have some points to start with to get
bayes pointed in the right direction. But you could use black lists and
white lists to do bayes training. Also needs more rules to identify ham
and not just rules to identify spam.
Re: A different approach to scoring spamassassin hits
Posted by Loren Wilton <lw...@earthlink.net>.
> Just a thought - what if we had some central servers for real time
> reporting where the SA rule hits and scores were reported in real time for
> some sort of live scoring or analysis or dynamic adjusting? Just thinking
> out loud here.
Something I've wanted to see for about 4 years now; ie: as long as I've been
using SA. You could think of it as a super mass-check in realtime.
There are arguments that large hosting companies wouldn't let the data out
because it would compromise their mail stream. That would of course be true
if they sent the mail. If they just send the cumulative scores over the last
hour or whatever I don't see that being true; although doubtless some would
still consider that to be the case and wouldn't send it.
However, I'd bet that enough info would arrive from all parts of the globe to
be able to do weekly or maybe even every few hours rescoring runs and
publish new scores, pretty much like the virus guys publish new signatures
pretty quickly.
There is the question of how to integrate the new scores with local
rescoring, and even with local rules that were scored based on the original
score of the stock rules.
I think there are a half-dozen solutions to this that would be moderately
easy to implement. The most obvious would be sending score updates either
in the form of a multiplier or an adder to the original rule score rather
than as a raw score; this would preserve local overrides while still
adjusting the score to match daily hit rates. (Don't bother me with the
obvious point of adjusting zeroed scores off of zero. That is an exception
that simply has to be handled in the score readjustment; it isn't a
concept-breaker.)
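As a rough illustration of the multiplier form, here is a sketch in Python. Nothing like this exists in SA as far as this thread says; the function apply_update, the dictionaries and the rule names are all hypothetical. Zeroed rules are simply left at zero here, which is the exception case that would need its own handling.

```python
def apply_update(stock, local_overrides, multipliers):
    """Combine stock scores, local overrides and published multipliers.

    A local override replaces the stock score as the base value;
    the published multiplier then scales whichever base is in use.
    """
    effective = {}
    for rule, stock_score in stock.items():
        base = local_overrides.get(rule, stock_score)
        if base == 0:
            # Zeroed rule: a multiplier cannot move it off zero.
            effective[rule] = 0.0
        else:
            effective[rule] = round(base * multipliers.get(rule, 1.0), 3)
    return effective


stock = {"RULE_A": 2.0, "RULE_B": 1.0, "RULE_C": 3.0}
local = {"RULE_B": 4.0, "RULE_C": 0.0}  # admin raised B, disabled C
update = {"RULE_A": 1.5, "RULE_B": 0.5, "RULE_C": 2.0}
print(apply_update(stock, local, update))
```

Because the update is relative, the admin's choice to score RULE_B higher than stock survives the rescoring run; only its size is adjusted.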
If the rescoring client at a site wanted to be fancy, it could even send an
optional email to the mail admin telling him that some local rule is bad for
his health or that some zeroed rule has now become useful and should be
unzeroed. Or the like.
Loren
Re: A different approach to scoring spamassassin hits
Posted by Marc Perkel <ma...@perkel.com>.
Loren Wilton wrote:
>>> You have a bit of a chicken and egg problem at the start. Until
>>> some learning takes place in the system.
>
> Two possibilities. The rules exist and have scores. Assume they are
> maintained, for whatever reason.
>
> 1. Until Bayes has enough info to kick in, classification is done
> by the scores. Then when Bayes kicks in the scores turn off (insofar
> as adding to the message score, they might still show up as tokens in
> the message that Bayes will process).
>
> 2. Divide all the scores by 10 or 20. Then leave them on. Pretty
> soon bayes will override almost any reasonable score combination.
>
> BTW, while ham rules are possible, SA has almost no ham rules; perhaps
> two or so. Spammers long ago found they could write their spams to
> match ham rules and thus bypass SA. Thus, no ham rules, no spammer
> workarounds. Of course personal or site-specific ham rules will
> generally still work, since they will not be public knowledge and
> spammers won't be able to target them.
>
> I suspect you can find all rule names in PerMsgStatus. However the
> latest SA versions have implemented a 'check' plugin that actually
> runs the rules and accumulates the score. The rule running was moved
> to a plugin so that people could, at least in theory, change the order
> or the way that rules are run. It sounds like that is what you want
> to do, so a modified Check plugin may well be the way to go.
>
> I don't understand though why you are interested in the names of all
> rules run; I don't see what it buys you. Currently ALL rules are run,
> unless short-circuiting is in effect, and by default it mostly isn't.
> In any case, if a rule doesn't hit on a message, the name of the rule
> is probably irrelevant. It might have missed because the message is
> ham, but it even more likely missed because it simply targets a
> different kind of spam. So assuming that "rules not hit" === "good
> tokens" is unlikely to be the case.
>
> You should be able to get Bayes to scan the rule names hit pretty
> easily. Bayes is just about the last rule; I think Awl comes after
> it. You might want to change that order, which I suspect you can do
> in the Check plugin. You could then modify the Check code to push the
> rule names into a special header line before calling Bayes. This
> could probably be done in Check, and could certainly be done by a
> one-off plugin that you wrote. It would be called by a special rule
> just before Bayes is called, and again, it would add the current rule
> names to a special header bayes could see.
>
> Of course you have to modify Check to drop out the scores for the
> non-Bayes rules. Either that or rescore all of the rules.
>
Just a thought - what if we had some central servers for real time
reporting where the SA rule hits and scores were reported in real time
for some sort of live scoring or analysis or dynamic adjusting? Just
thinking out loud here.
Re: A different approach to scoring spamassassin hits
Posted by Tom Allison <to...@tacocat.net>.
On Jun 30, 2007, at 11:55 PM, Loren Wilton wrote:
>
>> Unfortunately I'm not on the SpamAssassin Bayes modules -- I wrote
>> my own Bayes Engine because I wanted to do that and then thought
>> about including the Rules results from SpamAssassin. I don't
>> know where this might be going, but it seems to be working
>> extremely well for me based on a training set of just a couple
>> hundred emails in total.
>
> Don't see this as a problem. Someone, I forget who, has a Bayes
> chained to an SA setup, I think the Bayes comes first, but I don't
> recall. He was claiming good results from chained classifiers
> using slightly different data and methods. This seems like a
> reasonably possible contention to me.
>
> If you have a pre-existing Bayes mail filter, and it runs as a
> filter in a pipe or the like, then basically what you want to do
> seems very simple to me, at least conceptually. Just run the mail
> through SA first and then into your classifier. The rule names hit
> along with their scores will be in the header of the mail you
> process in your classifier, and thus, as long as you don't ignore
> header data, the rule names are there to process. No need even to
> modify SA. In fact you can get a header with just the rule names
> hit without the scores, so you don't have the score values being
> scored as tokens.
>
> The only case where you would have to modify SA in I think either
> Check or PMS is if you really did want to bloat every mail with the
> names of all of the rules in the SA database, rather than just
> those pertinent to the mail at hand.
>
> I think the trick is simply looking at your mail chain and figuring
> out how to insert a call to SA before the call to your own Bayes
> module.
Actually I have this but I don't have it writing the headers into
the email. It's sending the SA data as attached information so I
can keep track of where it came from (header/body/metadata). I'm not
sure that the scoring is going to cost me anything or cause any
performance issues compared to getting the hits/misses. I think
we're debating the cpu involved to determine a number for the score,
not the scoring process itself.
I have a question about the sub rules -- are they themselves adding
up to an overall rule by means of hit/miss?
Is there any conceptual advantage to pulling in rules and sub_rules
to this process.
And the more I think about it, the more I don't need to "bloat every
mail with the names of all the rules".
But sub_rules might be more useful.
---
By not putting in all the SA rules it might make it easier to
establish the contribution of the scoring, but you have to know the
intended target (RULE => spam or RULE => ham) which isn't an issue
with today's rules (but you never know). Once you know this, the
effectiveness of a rule would be measured by its distance in
probability from 0.500 toward 1.00. I can track this eventually, but
I think I need to reset my database to be certain of its value. Not
a problem, I am my own admin.
But the real challenge for me, as has always been the case with SA,
is the proper care and feeding of the application when not using the
standard spamc/spamd and spamassassin scripts. I suspect this starts
with a lot of RTFM and then I can get to some real questions. The
difficulty for me is trimming out all the steps in the application
that I won't be benefitting from. I would like to start with
something that is approximately: local "static" rules only, no user
specific preferences, no learning or bayes or white/black listing.
By local "static" I mean to use the rules based on email content
analysis without network consultation (DNS, RBL, DCC...)
Re: A different approach to scoring spamassassin hits
Posted by Loren Wilton <lw...@earthlink.net>.
> Unfortunately I'm not on the SpamAssassin Bayes modules -- I wrote my own
> Bayes Engine because I wanted to do that and then thought about including
> the Rules results from SpamAssassin. I don't know where this might be
> going, but it seems to be working extremely well for me based on a
> training set of just a couple hundred emails in total.
Don't see this as a problem. Someone, I forget who, has a Bayes chained to
an SA setup, I think the Bayes comes first, but I don't recall. He was
claiming good results from chained classifiers using slightly different data
and methods. This seems like a reasonably possible contention to me.
If you have a pre-existing Bayes mail filter, and it runs as a filter in a
pipe or the like, then basically what you want to do seems very simple to
me, at least conceptually. Just run the mail through SA first and then into
your classifier. The rule names hit along with their scores will be in the
header of the mail you process in your classifier, and thus, as long as you
don't ignore header data, the rule names are there to process. No need even
to modify SA. In fact you can get a header with just the rule names hit
without the scores, so you don't have the score values being scored as
tokens.
The only case where you would have to modify SA in I think either Check or
PMS is if you really did want to bloat every mail with the names of all of
the rules in the SA database, rather than just those pertinent to the mail
at hand.
I think the trick is simply looking at your mail chain and figuring out how
to insert a call to SA before the call to your own Bayes module.
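A minimal sketch of the downstream side, in Python, assuming the mail has already passed through SA and carries the usual X-Spam-Status header with a tests= field. The function name rules_hit and the sample message are made up for illustration; a site that writes the rule names into a different header would need to adjust the header name.

```python
import re
from email import message_from_string


def rules_hit(raw_message):
    """Extract the rule names from the tests= field of X-Spam-Status."""
    msg = message_from_string(raw_message)
    status = msg.get("X-Spam-Status", "")
    # The tests list may be folded across lines, so capture everything
    # up to the next lowercase "key=" field or the end of the header.
    match = re.search(r"tests=(.*?)(?:\s+[a-z]+=|$)", status, re.S)
    if not match:
        return []
    names = [name.strip() for name in match.group(1).split(",")]
    return [name for name in names if name and name != "none"]


raw = (
    "From: a@example.com\n"
    "X-Spam-Status: Yes, score=7.2 required=5.0 tests=BAYES_99,\n"
    "\tHTML_MESSAGE autolearn=no version=3.2.0\n"
    "Subject: hi\n"
    "\n"
    "body\n"
)
print(rules_hit(raw))
```

The returned names can then be added to the token list of whatever classifier runs after SA in the pipe.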
Loren
Re: A different approach to scoring spamassassin hits
Posted by Tom Allison <to...@tacocat.net>.
On Jun 30, 2007, at 6:29 PM, Loren Wilton wrote:
>
>> And after typing all this I'm thinking you might be right. But
>> part of this approach is to run all these rules in YES/NO fashion
>> and see if the probability is significant. For example: If I
>> tested for SOME_TEST=NO and found it was scoring a probability of
>> ~0.500 then it's indisputable that you are right.
>
> Well, this still doesn't make any real sense to me; it seems
> equivalent to the attempts at bayes poison that spammers stick into
> their spams: a bunch of words totally unrelated to the mail in the
> hopes of outweighing the useful terms. Now their trick works as a
> good spam indication because the words they pick aren't common to
> my ham mails, so it is really a good spam indication rather than
> poison. I'm not immediately convinced that will hold for the usage
> you intend. Maybe. Maybe not.
>
> However, if you want to do this, remember that bayes works on
> tokens and has a tokenizer. So SOME_RULE=YES is probably either
> two or three tokens, and you will end up scoring on the probability
> of YES and NO, along with the frequency of the rule names, which
> will be 1. So you probably want to do NO_SOME_RULE and
> YES_OTHER_RULE or the like when you build the insert list. Again
> though I'm not sure I see the point in the yes and no factors; the
> presence or absence of a word in the mail seems like a pretty good
> yes/no indication to me.
>
> Were I doing it I'd try it both ways and see if there is any
> difference in results.
I agree with you that it's probably not going to be very effective to
use a binary token (eg: SOME_RULE=YES vs SOME_RULE=NO) compared to
the presence of the rule (SOME_RULE exists implies SOME_RULE=YES).
So the method:
$list = $status->get_names_of_tests_hit();
may cover everything that is required to evaluate this approach.
Unfortunately I'm not using the SpamAssassin Bayes modules -- I wrote my
own Bayes Engine because I wanted to do that and then thought about
including the Rules results from SpamAssassin. I don't know where
this might be going, but it seems to be working extremely well for me
based on a training set of just a couple hundred emails in total.
Re: A different approach to scoring spamassassin hits
Posted by Loren Wilton <lw...@earthlink.net>.
> And after typing all this I'm thinking you might be right. But part of
> this approach is to run all these rules in YES/NO fashion and see if the
> probability is significant. For example: If I tested for SOME_TEST=NO
> and found it was scoring a probability of ~0.500 then it's indisputable
> that you are right.
Well, this still doesn't make any real sense to me; it seems equivalent to
the attempts at bayes poison that spammers stick into their spams: a bunch
of words totally unrelated to the mail in the hopes of outweighing the
useful terms. Now their trick works as a good spam indication because the
words they pick aren't common to my ham mails, so it is really a good spam
indication rather than poison. I'm not immediately convinced that will hold
for the usage you intend. Maybe. Maybe not.
However, if you want to do this, remember that bayes works on tokens and has
a tokenizer. So SOME_RULE=YES is probably either two or three tokens, and
you will end up scoring on the probability of YES and NO, along with the
frequency of the rule names, which will be 1. So you probably want to do
NO_SOME_RULE and YES_OTHER_RULE or the like when you build the insert list.
Again though I'm not sure I see the point in the yes and no factors; the
presence or absence of a word in the mail seems like a pretty good yes/no
indication to me.
Were I doing it I'd try it both ways and see if there is any difference in
results.
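The tokenizer point can be seen with a toy example in Python. The regex tokenizer below is only a stand-in and an assumption; the real SA Bayes tokenizer is more involved. The function names are hypothetical.

```python
import re


def naive_tokenize(text):
    """Split on anything that is not a letter, digit or underscore."""
    return re.findall(r"[A-Za-z0-9_]+", text)


def fused_tokens(all_rules, hits):
    """Build YES_/NO_ prefixed names that survive tokenizing intact."""
    hit_set = set(hits)
    return [("YES_" if rule in hit_set else "NO_") + rule for rule in all_rules]


# "SOME_RULE=YES" falls apart into a rule name plus a bare YES or NO,
# so the classifier ends up scoring the words YES and NO themselves.
print(naive_tokenize("SOME_RULE=YES OTHER_RULE=NO"))

# The prefixed form stays one token per rule.
fused = fused_tokens(["SOME_RULE", "OTHER_RULE"], ["SOME_RULE"])
print(naive_tokenize(" ".join(fused)))
```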
Loren
Re: A different approach to scoring spamassassin hits
Posted by Tom Allison <to...@tacocat.net>.
On Jun 30, 2007, at 8:07 AM, Loren Wilton wrote:
>
>>> You have a bit of a chicken and egg problem at the start. Until
>>> some learning takes place in the system.
>
> Two possibilities. The rules exist and have scores. Assume they
> are maintained, for whatever reason.
>
> 1. Until Bayes has enough info to kick in, classification is
> done by the scores. Then when Bayes kicks in the scores turn off
> (insofar as adding to the message score, they might still show up as
> tokens in the message that Bayes will process).
>
> 2. Divide all the scores by 10 or 20. Then leave them on.
> Pretty soon bayes will override almost any reasonable score
> combination.
>
> BTW, while ham rules are possible, SA has almost no ham rules;
> perhaps two or so. Spammers long ago found they could write their
> spams to match ham rules and thus bypass SA. Thus, no ham rules,
> no spammer workarounds. Of course personal or site-specific ham
> rules will generally still work, since they will not be public
> knowledge and spammers won't be able to target them.
>
> I suspect you can find all rule names in PerMsgStatus. However the
> latest SA versions have implemented a 'check' plugin that actually
> runs the rules and accumulates the score. The rule running was
> moved to a plugin so that people could, at least in theory, change
> the order or the way that rules are run. It sounds like that is
> what you want to do, so a modified Check plugin may well be the way
> to go.
>
> I don't understand though why you are interested in the names of
> all rules run; I don't see what it buys you. Currently ALL rules
> are run, unless short-circuiting is in effect, and by default it
> mostly isn't. In any case, if a rule doesn't hit on a message, the
> name of the rule is probably irrelevant. It might have missed
> because the message is ham, but it even more likely missed because
> it simply targets a different kind of spam. So assuming that
> "rules not hit" === "good tokens" is unlikely to be the case.
But in Bayes, you can't score on the absence of a token. Just
because the email I'm writing does not contain a certain word does
not mean it is "good". The listing of ALL rules run with a binary
YES/NO indication applied to each one would permit you to accrue
points for both the presence of and lack of a specific rule. But
this would allow you to start applying pro Ham rules as well.
But you may have a point that "rules not hit" is sufficient for
determining "good tokens" in the same manner that "viagra" is bad and
not having "viagra" permits the email to score on the other tokens
available. To further prove this out, the practice of spammers (who
I'm sure are reading this list) is to try to apply enough skew to the
Bayes to push it low and skip enough rules to keep from scoring any
hits -- the net effect is to come up with Unsure email (I work in a
ternary system). Under pure bayesian statistics, the cutoff points
for ham/spam tend to move pretty quickly from a nominal 0.3/0.7 to
0.3/0.5 giving the entire probability range of 0.500 to 1.00 over to
Spam and 0.00 to 0.300 (or even lower) to specifically Ham with a
belt of uncertainty in the middle.
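For anyone unfamiliar with a ternary setup, a minimal Python sketch of those cutoffs follows. The function name classify is made up; the default values are just the nominal 0.3/0.7 mentioned above.

```python
def classify(probability, ham_cutoff=0.3, spam_cutoff=0.7):
    """Three-way decision: ham, spam, or unsure in between."""
    if probability >= spam_cutoff:
        return "spam"
    if probability <= ham_cutoff:
        return "ham"
    return "unsure"


# With the nominal cutoffs a 0.6 message is unsure; once the spam
# cutoff has drifted down to 0.5 the same message is called spam.
print(classify(0.6))
print(classify(0.6, spam_cutoff=0.5))
```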
And after typing all this I'm thinking you might be right. But part
of this approach is to run all these rules in YES/NO fashion and see
if the probability is significant. For example: If I tested for
SOME_TEST=NO and found it was scoring a probability of ~0.500 then
it's indisputable that you are right.
The only area of exception to this would be some kind of AWL factor
rather than a hard coded AWL override. Creative Regex can handle
this by capturing the email addresses in FROM: and providing a very
strong probability for that. Not a Whitelist, but an indication.
Not sure, haven't considered it as I never found AWL to be really
useful compared against the impact of Bayes on headers.
As for the start up effectiveness. There are a variety of ways to do
this. I consider this similar to installing linux. It might be
harder to do than buying a computer with Windows installed for you,
but the long term benefits outweigh the short term gains and how
often do you really install Linux or SpamAssassin? You can always
seed the data from captured emails.
Thank you for the information on Check. I will look into that and
see if I can come up with something that will do the trick. I have
to confess I'm coming into this backwards: I wrote a Bayesian spam
filter and then started looking into SpamAssassin, so my Bayes
statistical engine is not SpamAssassin's. But the results will be the
same for either approach (I hope) if you simply push rules in as
metadata tokens into the statistical process.
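A rough sketch of what pushing rules in as metadata tokens could look like. The tokenizer regex and the function name are assumptions for illustration; neither SpamAssassin nor my own filter necessarily tokenizes this way.

```python
import re

def message_tokens(body, all_rules, rules_hit):
    """Return word tokens plus one RULE=YES/NO token per rule that was run."""
    # Plain word tokens, lower-cased (a deliberately simple tokenizer).
    words = set(re.findall(r"[a-z0-9$'-]+", body.lower()))
    # One metadata token per rule, recording whether it hit or not.
    rule_tokens = {
        f"{rule}={'YES' if rule in rules_hit else 'NO'}" for rule in all_rules
    }
    return words | rule_tokens
```

The statistical engine then learns a probability for ADVANCE_FEE_1=YES and ADVANCE_FEE_1=NO the same way it does for any word.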
Re: A different approach to scoring spamassassin hits
Posted by Loren Wilton <lw...@earthlink.net>.
>> You have a bit of a chicken and egg problem at the start. Until
>> some learning takes place in the system.
Two possibilities. The rules exist and have scores. Assume they are
maintained, for whatever reason.
1. Until Bayes has enough info to kick in, classification is done by the
scores. Then when Bayes kicks in the scores turn off (insofar as adding to
the message score; they might still show up as tokens in the message that
Bayes will process).
2. Divide all the scores by 10 or 20. Then leave them on. Pretty soon
Bayes will override almost any reasonable score combination.
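The two possibilities can be sketched roughly as follows. This is an illustration only: the function, its parameters, and the mode names are hypothetical and do not correspond to anything in SA's code.

```python
def message_score(rule_scores, bayes_points, bayes_ready, mode, divisor=10):
    """Total message score under one of the two transition schemes."""
    rule_total = sum(rule_scores)
    if mode == "switch":
        # Option 1: rule scores count only until Bayes has enough data.
        return bayes_points if bayes_ready else rule_total
    if mode == "scale":
        # Option 2: rule scores stay on, but divided down so Bayes dominates.
        return rule_total / divisor + bayes_points
    raise ValueError(f"unknown mode: {mode!r}")
```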
BTW, while ham rules are possible, SA has almost no ham rules; perhaps two
or so. Spammers long ago found they could write their spams to match ham
rules and thus bypass SA. Thus, no ham rules, no spammer workarounds. Of
course personal or site-specific ham rules will generally still work, since
they will not be public knowledge and spammers won't be able to target them.
I suspect you can find all rule names in PerMsgStatus. However the latest
SA versions have implemented a 'check' plugin that actually runs the rules
and accumulates the score. The rule running was moved to a plugin so that
people could, at least in theory, change the order or the way that rules are
run. It sounds like that is what you want to do, so a modified Check plugin
may well be the way to go.
I don't understand though why you are interested in the names of all rules
run; I don't see what it buys you. Currently ALL rules are run, unless
short-circuiting is in effect, and by default it mostly isn't. In any case,
if a rule doesn't hit on a message, the name of the rule is probably
irrelevant. It might have missed because the message is ham, but it is even
more likely that it missed because it simply targets a different kind of
spam. So the assumption that "rules not hit" === "good tokens" is unlikely
to hold.
You should be able to get Bayes to scan the rule names hit pretty easily.
Bayes is just about the last rule; I think AWL comes after it. You might
want to change that order, which I suspect you can do in the Check plugin.
You could then modify the Check code to push the rule names into a special
header line before calling Bayes. This could probably be done in Check, and
could certainly be done by a one-off plugin that you wrote. It would be
called by a special rule just before Bayes is called, and again, it would
add the current rule names to a special header Bayes could see.
Of course you have to modify Check to drop out the scores for the non-Bayes
rules. Either that or rescore all of the rules.
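To illustrate the header idea only: the sketch below uses Python's standard email library rather than SA's Perl plugin API, and the header name X-Rule-Hits is made up for the example. A real implementation would live in a Check-style plugin.

```python
from email.message import EmailMessage

def add_rule_header(msg, rule_hits, header="X-Rule-Hits"):
    """Record the names of the rules that hit in a single header line."""
    # Deleting a header that is not present is a no-op, so this is safe
    # to call more than once on the same message.
    del msg[header]
    msg[header] = " ".join(sorted(rule_hits))
    return msg
```

A tokenizer that reads headers would then see each rule name as an ordinary token.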
Loren
Re: A different approach to scoring spamassassin hits
Posted by Tom Allison <to...@tacocat.net>.
On Jun 30, 2007, at 4:46 AM, John Andersen wrote:
>
> On Friday 29 June 2007, Tom Allison wrote:
>
>> It would be the Bayes process that determines the effective number of
>> points you assign for each HIT based on what it's learned about it
>> from you. So the tags of: ADVANCE_FEE_1, ADVANCE_FEE_2 would be
>> represented as a token of format:
>> ADVANCE_FEE_1=YES or NO
>> ADVANCE_FEE_2=YES or NO
>> and each of these tokens would then be evaluated based on your
>> learning process.
>
> Sort of like a multiple linear regression analysis, where you
> simply start
> dropping terms with low coefficients to simplify the calculation.
>
> Interesting Idea.
>
> You have a bit of a chicken and egg problem at the start. Until
> some learning takes place in the system.
>
For a purely bayesian filter this is always the case.
But I have found through mailing lists and personal experience that
this can be mitigated through a variety of approaches.
The first approach is to implement SA after you have trained it from
some past corpus of mail you've captured. Opinions on how many
messages you need to be effective vary from tens to thousands. This is
strictly a YMMV issue.
Personally, I use a train-on-error approach (never auto-train or
train on everything, only the minimum needed to get it right), and
training on 10 emails gets me above 90%. But my scoring is a little
vague -- I use a ternary Yes, No, Maybe scoring process. If I exclude
the Maybe I have 100% success in very short order. Including Maybe I
have 98% success after training on ~100 messages. But the worst is
over in the first day.
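Train on error, sketched under the assumption of a classifier object with classify(message) and train(message, label) methods. That interface is hypothetical and not any specific filter's API.

```python
def train_on_error(classifier, labelled_messages):
    """Train only on messages the classifier currently gets wrong.

    labelled_messages: iterable of (message, true_label) pairs.
    Returns the number of messages that had to be trained.
    """
    corrections = 0
    for message, true_label in labelled_messages:
        # A "maybe" verdict never equals the true label, so unsure
        # messages are trained as well.
        if classifier.classify(message) != true_label:
            classifier.train(message, true_label)
            corrections += 1
    return corrections
```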
Another method would be to simply seed the data from a SQL script to
preload certain tokens and values. Kind of a "hack" in my opinion
but it would be effective and any discrepancies would be quickly
resolved by training. In the case of SA I would seed the rules into
the tables for the simplest, yet effective results.
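A seeding script might look something like this sketch. The tokens table and its columns are a hypothetical schema, shown here with SQLite; a real token store will differ.

```python
import sqlite3

def seed_tokens(conn, seeds):
    """Preload (token, spam_count, ham_count) rows without touching
    tokens that have already been learned."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        " token TEXT PRIMARY KEY,"
        " spam_count INTEGER NOT NULL,"
        " ham_count INTEGER NOT NULL)"
    )
    # INSERT OR IGNORE keeps existing rows, so real training data wins
    # over seed values.
    conn.executemany("INSERT OR IGNORE INTO tokens VALUES (?, ?, ?)", seeds)
    conn.commit()
```

For example, seed_tokens(sqlite3.connect(":memory:"), [("ADVANCE_FEE_1=YES", 20, 0)]) preloads one rule token as strongly spammy; later training then adjusts the counts.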
Re: A different approach to scoring spamassassin hits
Posted by John Andersen <js...@pen.homeip.net>.
On Friday 29 June 2007, Tom Allison wrote:
> It would be the Bayes process that determines the effective number of
> points you assign for each HIT based on what it's learned about it
> from you. So the tags of: ADVANCE_FEE_1, ADVANCE_FEE_2 would be
> represented as a token of format:
> ADVANCE_FEE_1=YES or NO
> ADVANCE_FEE_2=YES or NO
> and each of these tokens would then be evaluated based on your
> learning process.
Sort of like a multiple linear regression analysis, where you simply start
dropping terms with low coefficients to simplify the calculation.
Interesting Idea.
You have a bit of a chicken and egg problem at the start. Until
some learning takes place in the system.
--
_____________________________________
John Andersen
Re: A different approach to scoring spamassassin hits
Posted by Tom Allison <to...@tacocat.net>.
On Jun 30, 2007, at 2:55 PM, Bart Schaefer wrote:
>
> On 6/29/07, Tom Allison <to...@tacocat.net> wrote:
>>
>> The thought I had, and have been working on for a while, is changing
>> how the scoring is done. Rather than making Bayes a part of the
>> scoring process, make the scoring process a part of the Bayes
>> statistical Engine. As an example you would simply feed into the
>> Bayesian process, as tokens, the indications of scoring hits (binary
>> yes/no) would be examined next to the other tokens in the message.
>
> There are a few problems with this.
>
> (1) It assumes that Bayesian (or similar) classification is more
> accurate than SA's scoring system. Either that, or you're willing to
> give up accuracy in the name of removing all those confusing knobs you
> don't want to touch, but it would seem to me to be better to have the
> knobs and just not touch them.
>
I know that without SA you can have >99.9% accuracy with pure
bayesian classification.
But there are specific non Bayes things that are made visible through
spamassassin rules that a typical bayes process can't catch (very
well or at all). The whole issue of "knobs" is moot under a
statistical approach because each user's scoring will determine the
real importance of each particular rule hit.
> (2) For many SA rules you would be, in effect, double-counting some
> tokens. An SA scoring rule that matches a phrase, for example, is
> effectively matching a collection of tokens that are also being fed
> individually to the Bayes engine. In theory, you should not
> second-guess the system by passing such compound tokens to Bayes;
> instead it should be allowed to learn what combinations of tokens are
> meaningful when they appear together.
Bayes does not match a phrase, only words. At least that is what
most Bayes filters do.
There are some approaches that do use multiple words, but not a
"phrase". Therefore I think the intersection of Bayes and
Spamassassin rules is going to be small.
> (It might be worthwhile, though, to e.g. add tokens that are not
> otherwise present in the message, such as for the results of network
> tests.)
This is what I'm interested in and mentioned in paragraph one. There
are a lot of things you can do with SpamAssassin that just Bayes will
never do. It is exactly this type of work that I think would be most
interesting to pursue.
> (3) It introduces a bootstrapping problem, as has already been noted.
> Everyone has to train the engine and re-train it when new rules are
> developed.
>
> I've thought of a few more, but they all have to do with the benefits
> of having all those "knobs" and if you've already adopted the basic
> premise that they should be removed there doesn't seem to be any
> reason to argue that part.
>
> To summarize my opinion: If what you want is to have a Bayesian-type
> engine make all the decisions, then you should install a Bayesian
> engine and work on ways to feed it the right tokens; you should not
> install SpamAssassin and then work on ways to remove the scoring.
It makes sense to do this approach. However it would not make sense
to try and reinvent the fantastic amount of useful work that has come
from SpamAssassin. That would take a very long time to address.
SpamAssassin has some really great ways of finding the right tokens.
Why would I consider trying to duplicate all that effort?
Re: A different approach to scoring spamassassin hits
Posted by Bart Schaefer <ba...@gmail.com>.
On 6/29/07, Tom Allison <to...@tacocat.net> wrote:
>
> The thought I had, and have been working on for a while, is changing
> how the scoring is done. Rather than making Bayes a part of the
> scoring process, make the scoring process a part of the Bayes
> statistical Engine. As an example you would simply feed into the
> Bayesian process, as tokens, the indications of scoring hits (binary
> yes/no) would be examined next to the other tokens in the message.
There are a few problems with this.
(1) It assumes that Bayesian (or similar) classification is more
accurate than SA's scoring system. Either that, or you're willing to
give up accuracy in the name of removing all those confusing knobs you
don't want to touch, but it would seem to me to be better to have the
knobs and just not touch them.
(2) For many SA rules you would be, in effect, double-counting some
tokens. An SA scoring rule that matches a phrase, for example, is
effectively matching a collection of tokens that are also being fed
individually to the Bayes engine. In theory, you should not
second-guess the system by passing such compound tokens to Bayes;
instead it should be allowed to learn what combinations of tokens are
meaningful when they appear together.
(It might be worthwhile, though, to e.g. add tokens that are not
otherwise present in the message, such as for the results of network
tests.)
(3) It introduces a bootstrapping problem, as has already been noted.
Everyone has to train the engine and re-train it when new rules are
developed.
I've thought of a few more, but they all have to do with the benefits
of having all those "knobs" and if you've already adopted the basic
premise that they should be removed there doesn't seem to be any
reason to argue that part.
To summarize my opinion: If what you want is to have a Bayesian-type
engine make all the decisions, then you should install a Bayesian
engine and work on ways to feed it the right tokens; you should not
install SpamAssassin and then work on ways to remove the scoring.