Posted to users@spamassassin.apache.org by Magnus Anderson <ma...@sonic2000.org> on 2007/10/13 17:40:12 UTC

Global Bayes and AWL

Hi,

I have read this thread,
http://www.nabble.com/forum/ViewPost.jtp?post=819176&framed=y

This is also what I am trying to do: make SpamAssassin score against
both a per-user AWL/Bayes and a system-wide AWL/Bayes.

What I was thinking of was to make a new set of rules for SA that checks
against the AWL and Bayes again, but this time as a specific user, like
<default>.

I copied the /usr/share/spamassassin/60_awl.cf and 23_bayes.cf to
/etc/mail/spamassassin and renamed all BAYES_* and AWL to GLOBAL_BAYES_* and
GLOBAL_AWL.

Then I added "user_awl_sql_override_username" and
"user_bayes_sql_override_username" to the new rules.

This, however, made CGPSA, which I use with CommuniGate Pro, run AWL
saves against the MySQL table as <default> too.

It also wrote output like "Merging duplicate GLOBAL_AWL and AWL".

Is this not possible at all? Has anyone made this work?


RE: Global Bayes and AWL

Posted by Giampaolo Tomassoni <g....@libero.it>.
> -----Original Message-----
> From: Magnus Anderson [mailto:magnus@sonic2000.org]
> Sent: Saturday, October 13, 2007 5:40 PM
> 
> 
> Hi,
> 
> I have read this thread,
> http://www.nabble.com/forum/ViewPost.jtp?post=819176&framed=y
> 
> This is also what I am searching for to do. Make SpamAssassin score
> against
> both a AWL/Bayes by the user and a AWL/Bayes by the system.
> 
> What I was thinking on was to make a new set of rules for SA that
> checks
> agains the AWL and Bayes again, but this time as a specific user, like
> <default>.
> 
> I copied the /usr/share/spamassassin/60_awl.cf and 23_bayes.cf to
> /etc/mail/spamassassin and renamed all BAYES_* and AWL to
> GLOBAL_BAYES_* and
> GLOBAL_AWL.
> 
> Then I added "user_awl_sql_override_username" and
> "user_bayes_sql_override_username" to the new rules.
> 
> This however made CGPSA, that I use against CommuniGate Pro, to run AWL
> saves against the MySQL table as <default> to.
> 
> It also wrote output like "Merging duplicate GLOBAL_AWL and AWL".
> 
> Is this not possible at all, has someone made this work?

It is not impossible, but it would come with its own speed cost.

Some time ago I wished to have a three-level layered Bayes: site level,
organization level and mailbox level.

The idea was to reshape the Bayes DB store code and, probably, the scoring
code, such that

a) during mail scanning, a token unknown to the user would get "scored"
thanks to the organizational or site level (if any);

b) new tokens learned (or auto-learned) by Bayes would contribute to all
three levels.

From a storage standpoint, this means that tokens shouldn't carry any
ham/spam count anymore; instead there should be a table listing the tokens
belonging to a given mail and a table listing the mails received by each
user. This latter table should carry a ham/spam flag.

When an incoming mail is scanned and tokens are extracted, for each token
the code should "count" how many times the user (auto-)reported that token
as being ham or spam or, if there are no occurrences of that token in the
user layer, how many times that token has been reported as ham or spam at
the organizational level (that is: by all users in a domain/organization).
Then, if there is still no occurrence of the token, it should count how many
times that token has been reported as ham or spam at the site level (that
is: by every user in every organization), if any.
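
The fallback order described above can be sketched in a few lines of Python.
Plain dictionaries stand in for the three layers here; the names learn and
layered_counts are hypothetical and not part of SpamAssassin.

```python
from typing import Dict, List, Optional, Tuple

Counts = Tuple[int, int]      # (ham_count, spam_count)
Layer = Dict[str, Counts]     # token -> counts, one dict per level

def learn(token: str, is_spam: bool, layers: List[Layer]) -> None:
    """Record one report of a token in every given layer (point b)."""
    for layer in layers:
        ham, spam = layer.get(token, (0, 0))
        layer[token] = (ham, spam + 1) if is_spam else (ham + 1, spam)

def layered_counts(token: str, layers: List[Layer]) -> Optional[Counts]:
    """Return counts from the first layer that knows the token (point a).

    Layers are passed most specific first: user, organization, site.
    """
    for layer in layers:
        counts = layer.get(token)
        if counts is not None and sum(counts) > 0:
            return counts
    return None

alice: Layer = {}
bob: Layer = {}
org: Layer = {}
site: Layer = {}

# bob reports a spam token, alice reports a ham token.
learn("refinance", True, [bob, org, site])
learn("meeting", False, [alice, org, site])

# alice never saw "refinance", so the organization layer answers.
from_org = layered_counts("refinance", [alice, org, site])
from_user = layered_counts("meeting", [alice, org, site])
unknown = layered_counts("lottery", [alice, org, site])
```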

This reasoning could even be changed so as to statistically prioritize user
preferences over organizational ones, and organizational ones over site
ones. That would be much preferable to the previous idea, since simply
splitting the mail corpus into three levels would easily result in an
unreliably small user corpus, and even a too-small organizational one.
However, this would mean tuning the well-known Bayes classification
equations to this need, which should be done carefully and not released
before a review by some Bayes-theory-savvy person.
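
One possible way to express such a prioritization, shown only as an
unreviewed sketch: apply Robinson-style smoothing, f = (s*x + n*p) / (s + n),
at each level and use the broader level's estimate as the prior x of the
narrower one. The chaining itself, the function names and the simplified
p = spam / (spam + ham) are assumptions of this example, not an existing
SpamAssassin feature.

```python
from typing import Tuple

Counts = Tuple[int, int]  # (spam_count, ham_count)

def smoothed(counts: Counts, prior: float, strength: float = 1.0) -> float:
    """Robinson-style estimate: pull the raw ratio towards a prior.

    With no observations the prior is returned unchanged; with many
    observations the raw spam ratio dominates.
    """
    spam, ham = counts
    n = spam + ham
    if n == 0:
        return prior
    p = spam / n  # simplified raw ratio, ignores corpus-size normalization
    return (strength * prior + n * p) / (strength + n)

def layered_probability(user: Counts, org: Counts, site: Counts,
                        strength: float = 1.0) -> float:
    """Chain the levels: site feeds org as prior, org feeds user."""
    p_site = smoothed(site, prior=0.5, strength=strength)
    p_org = smoothed(org, prior=p_site, strength=strength)
    return smoothed(user, prior=p_org, strength=strength)

# Token unknown everywhere: stays neutral.
neutral = layered_probability((0, 0), (0, 0), (0, 0))
# Token known only at site level: the site estimate passes through.
site_only = layered_probability((0, 0), (0, 0), (9, 1))
# Strong user evidence for ham outweighs a spammy site-level history.
user_wins = layered_probability((0, 100), (0, 0), (100, 0))
```

A user with few observations inherits most of the broader estimate, while a
user with many observations overrides it, which avoids relying on a tiny
per-user corpus.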

A further benefit stemming from a multi-layer approach would be easy and
reliable expiration of Bayes entries, by simply deleting mails that arrived
before the expiry period, then tokens no longer referred to by any e-mail.
This is something most serious SQL servers could even do automatically after
deleting any token whose last-seen time is before a given threshold.
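
As a rough illustration of that expiry, here is a sketch using Python's
sqlite3 with a foreign key declared ON DELETE CASCADE, so the database
removes the mail-to-token links itself. The schema (mails, tokens,
mail_tokens) and the expire function are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE mails (
    id          INTEGER PRIMARY KEY,
    received_at INTEGER NOT NULL
);
CREATE TABLE tokens (
    id    INTEGER PRIMARY KEY,
    token TEXT UNIQUE NOT NULL
);
CREATE TABLE mail_tokens (
    mail_id  INTEGER NOT NULL REFERENCES mails(id) ON DELETE CASCADE,
    token_id INTEGER NOT NULL REFERENCES tokens(id),
    PRIMARY KEY (mail_id, token_id)
);
"""

def expire(conn, cutoff):
    """Delete mails older than cutoff, then tokens no mail refers to.

    Returns (mails_deleted, tokens_deleted).
    """
    with conn:  # one transaction for both steps
        mails = conn.execute(
            "DELETE FROM mails WHERE received_at < ?", (cutoff,)
        ).rowcount
        tokens = conn.execute(
            "DELETE FROM tokens "
            "WHERE id NOT IN (SELECT token_id FROM mail_tokens)"
        ).rowcount
    return mails, tokens

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this for cascades
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO mails VALUES (?, ?)", [(1, 100), (2, 200)])
conn.executemany("INSERT INTO tokens VALUES (?, ?)",
                 [(1, "free"), (2, "meeting")])
# "free" appears only in the old mail, "meeting" in both.
conn.executemany("INSERT INTO mail_tokens VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 2)])

deleted = expire(conn, cutoff=150)
remaining = [r[0] for r in conn.execute("SELECT token FROM tokens")]
```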

Also, AWL currently owns its own table to do its work. This design could
instead use two further fields in the "mails" table holding the source mail
address and IP address, and a further field in the usermails table holding
the computed SA score. AWL could use this data to do its dirty job, thereby
obtaining data expiration for free.
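
A sketch of how an AWL-like adjustment could be computed from those fields.
The column names (sender, sender_ip, score), the awl_adjust function and the
factor parameter are assumptions; the adjustment simply pulls the current
score part of the way towards the sender's historical mean.

```python
import sqlite3

SCHEMA = """
CREATE TABLE mails (
    id        INTEGER PRIMARY KEY,
    sender    TEXT NOT NULL,
    sender_ip TEXT NOT NULL
);
CREATE TABLE usermails (
    username TEXT NOT NULL,
    mail_id  INTEGER NOT NULL REFERENCES mails(id),
    score    REAL NOT NULL
);
"""

def awl_adjust(conn, username, sender, sender_ip, score, factor=0.5):
    """Move score towards the mean of this sender's past scores.

    A sender with no history for this user leaves the score unchanged.
    """
    mean = conn.execute(
        """
        SELECT AVG(um.score)
        FROM usermails um
        JOIN mails m ON m.id = um.mail_id
        WHERE um.username = ? AND m.sender = ? AND m.sender_ip = ?
        """,
        (username, sender, sender_ip),
    ).fetchone()[0]
    if mean is None:
        return score
    return score + (mean - score) * factor

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO mails VALUES (?, ?, ?)",
                 [(1, "friend@example.org", "192.0.2.1"),
                  (2, "friend@example.org", "192.0.2.1")])
conn.executemany("INSERT INTO usermails VALUES (?, ?, ?)",
                 [("alice", 1, 1.0), ("alice", 2, 3.0)])

# Historical mean is 2.0, so a raw score of 10.0 is pulled halfway to it.
adjusted = awl_adjust(conn, "alice", "friend@example.org", "192.0.2.1", 10.0)
unseen = awl_adjust(conn, "alice", "stranger@example.org", "192.0.2.9", 10.0)
```

Since the history lives in the same mails rows, expiring old mails also
expires the AWL data.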

Of course, since this would have so much impact on the Bayes code, I would
surely prefer this design to be in the mainstream SA code, in order to avoid
"reinventing the wheel" each time I had to update SA.

The problem is that this design would be much more complex than the current
one and the question is: would it be usable by anybody but the tiniest
ISPs using SA? It probably would be good for me, with some hundreds of
e-mails received per day. But what if one has to scan 10,000,00 mails/day?
Sure, one can use smart SQL servers with statistical query optimizers and the
like, but even this way computing the Bayes score of an incoming mail would
probably take a couple of seconds on average, as opposed to the current
few tenths of a second...

So, flexibility often comes at the expense of speed, and I guess many on
this list would not appreciate that.

Giampaolo