Posted to users@spamassassin.apache.org by micah anderson <mi...@riseup.net> on 2020/12/07 20:56:44 UTC

per-user bayes

Hi all,

I've got a site-wide bayes mysql setup. It keeps getting poisoned
quickly, because the user patterns are far too divergent from each
other. One person's spam is another person's ham, nobody is happy.

A per-user setup would let each user do their own thing, but I don't see
how I can do that because our system doesn't have individual system
users, and I don't see any options in the bayes sql configuration for
per-user data, or any way to have per-user tables.

There is this bayes_sql_override_username configuration option, but it
can only be set once and is not dynamic. The documentation hints that
you can also use this option to trick sa-learn into learning data as a
specific user, but there is not much more information.

Has someone out there done this, and can show how you have done it?

At this point my options are to turn down the score for bayes, so it has
less of an impact, maybe turn off bayes auto-learning, or simply
disable bayes altogether.

thanks for any information

-- 
        micah

Re: per-user bayes

Posted by Kris Deugau <kd...@vianet.ca>.
micah anderson wrote:
> Kris Deugau <kd...@vianet.ca> writes:
> 
>> There will only be one database and set of tables, but one of the fields
>> in each table is the user identifier.  Fair warning - if you go full
>> per-user on a large system, this will MASSIVELY balloon the size of your
>> Bayes database, and most users will idle below the learning thresholds
>> for quite a long time.
> 
> Can you give an idea of the size calculation? I'm wanting to do this,
> but I need to figure out how much space I need to allocate per user!

The SA docs estimate 5-10M per user for file-based per-user Bayes with 
the default token expiry settings.  I'd expect about the same in SQL, 
with anywhere up to 3x bloat over time due to token churn.  (Checking my 
personal mailbox, I have just over 5M in bayes_token, but bayes_seen 
has grown over time to 83M.  However, the message-ids stored there 
aren't being expired.)

Sitewide, with ~1.7M active tokens (expiry set at 2.1M currently), the 
database occupies about 342M on disk here, with a 156M SQL dump.  This 
comes out to about 200 bytes per token of used storage.  A single user 
with default settings (and plenty of learning) will probably settle down 
to somewhere between ~110K and ~140K tokens, so you can probably expect 
their data to occupy anywhere from the minimal 5M on up to close to 30M.
Multiply by the number of users and that's what you would have to look
at provisioning for storage.  Even at a minimal steady-state you're
likely looking at 100G for 20K users.
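
The arithmetic behind those figures can be checked with a short Python sketch. It only restates the numbers quoted above (342M for ~1.7M tokens, a rounded 200 bytes per token, 110K to 140K tokens per well-trained user, and the 5M minimum for 20K users); the helper names are illustrative and not part of SpamAssassin, and decimal megabytes are assumed.

```python
# Back-of-the-envelope Bayes sizing, using only the figures quoted above.
# Helper names are hypothetical; decimal units (1 MB = 10**6 bytes) are assumed.

MB = 1_000_000
GB = 1_000_000_000


def bytes_per_token(db_bytes, tokens):
    """Average on-disk bytes used per stored token."""
    return db_bytes / tokens


def per_user_bytes(tokens_per_user, token_bytes):
    """Estimated storage for one user's token set."""
    return tokens_per_user * token_bytes


def total_bytes(users, bytes_each):
    """Estimated storage for all users at a given per-user size."""
    return users * bytes_each


sitewide = bytes_per_token(342 * MB, 1_700_000)  # sitewide average
low = per_user_bytes(110_000, 200)               # well-trained user, low end
high = per_user_bytes(140_000, 200)              # well-trained user, high end
floor_total = total_bytes(20_000, 5 * MB)        # 20K users at the 5M minimum

print(f"{sitewide:.0f} bytes per token sitewide")
print(f"{low / MB:.0f}-{high / MB:.0f} MB per well-trained user")
print(f"{floor_total / GB:.0f} GB for 20K users at 5 MB each")
```

Swapping in your own token counts and user numbers gives a first estimate; real usage will vary with expiry settings and token churn.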

If you have more than a handful of users, you're probably better off 
looking for ways to group your users with a small number of Bayes 
datasets rather than full-on per-user.  I haven't tried, but you might 
be able to use bayes_sql_override_username in userprefs (also storable 
in SQL) to assign users to a particular dataset, with a fallback to a 
global default.  The documentation reads to me like this should work 
(note the last sentence):

        bayes_sql_override_username
            Used by BayesStore::SQL storage implementation.

            If this option is set the BayesStore::SQL module will
            override the set username with the value given.  This could
            be useful for implementing global or group bayes databases.
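
A rough model of that grouping idea, as a Python sketch using an in-memory SQLite table. It only illustrates the intended lookup (the user's own bayes_sql_override_username row if one exists, otherwise a global default); it does not show that SpamAssassin honors this option from SQL userprefs, which is untested here. The three-column table is modeled on the usual userpref layout, "@GLOBAL" is assumed as the site-wide username, and the dataset names and addresses are made up.

```python
import sqlite3

# Assumed name for site-wide rows in the userpref table.
GLOBAL_USER = "@GLOBAL"
PREF = "bayes_sql_override_username"


def make_prefs_db(rows):
    """Build an in-memory stand-in for a userpref table (hypothetical data)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE userpref (username TEXT, preference TEXT, value TEXT)"
    )
    db.executemany("INSERT INTO userpref VALUES (?, ?, ?)", rows)
    return db


def bayes_dataset_for(db, username):
    """Return the user's own override, else the global one, else None."""
    cur = db.execute(
        "SELECT value FROM userpref "
        "WHERE preference = ? AND username IN (?, ?) "
        # Rows matching the user sort first; the global row is the fallback.
        "ORDER BY (username = ?) DESC LIMIT 1",
        (PREF, username, GLOBAL_USER, username),
    )
    row = cur.fetchone()
    return row[0] if row else None


db = make_prefs_db([
    (GLOBAL_USER, PREF, "bayes_default"),
    ("anna@example.ch", PREF, "bayes_german"),
    ("luc@example.ch", PREF, "bayes_french"),
])

print(bayes_dataset_for(db, "anna@example.ch"))     # user with a group assigned
print(bayes_dataset_for(db, "someone@example.ch"))  # falls back to the default
```

With a handful of datasets like this, storage scales with the number of groups rather than the number of mailboxes.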

-kgd

Re: per-user bayes

Posted by Benny Pedersen <me...@junc.eu>.
hg user wrote on 2020-12-09 08:57:
> I believe that a SA plugin (like bayes) is able to know the envelope
> MAIL FROM and RCPT TO values... is it correct? If it is possible we
> "just" have to modify the bayes plugin

Provide the patch first and ask later :=)

Bayes does not focus on specific headers.

Re: per-user bayes

Posted by hg user <me...@gmail.com>.
I believe that a SA plugin (like bayes) is able to know the envelope MAIL
FROM and RCPT TO values... is it correct? If it is possible we "just" have
to modify the bayes plugin

Re: per-user bayes

Posted by Benny Pedersen <me...@junc.eu>.
micah anderson wrote on 2020-12-08 21:54:
> Kris Deugau <kd...@vianet.ca> writes:
> 
>> There will only be one database and set of tables, but one of the 
>> fields
>> in each table is the user identifier.  Fair warning - if you go full
>> per-user on a large system, this will MASSIVELY balloon the size of 
>> your
>> Bayes database, and most users will idle below the learning thresholds
>> for quite a long time.
> 
> Can you give an idea of the size calculation? I'm wanting to do this,
> but I need to figure out how much space I need to allocate per user!
> 
> Thanks for the clarifications, this is super helpful.

I use fuglu, where per-user bayes is simple. Now that fuglu has solved
the problem of the recipient's envelope address being used
case-insensitively in the bayes user database, it just works with fuglu.

But there is more on my wish list: I do not yet have per-user retraining
of incorrectly classified mails; currently only autolearn is working.

With global bayes one should keep the database as big as possible, and
well trained for all users. Manual training would be best, but it takes
a lot of the users' time for very little gain.

fuglu does use spamd, and if I recall correctly also spamc. I have
verified that it is running per-user now.

Let's see if mimedefang can do it better.

In amavisd you can make sasl usermaps to use bayes user maps. I know it
exists, but I have never successfully got it to work.

Re: per-user bayes

Posted by Dean Carpenter <de...@areyes.com>.
 

On 2020-12-09 9:48 am, deano-spamassassin@areyes.com wrote:

> On 2020-12-09 4:41 am, @lbutlr wrote:
>
>> On 08 Dec 2020, at 13:54, micah anderson <mi...@riseup.net> wrote:
>>> Kris Deugau <kd...@vianet.ca> writes:
>>>> There will only be one database and set of tables, but one of the
>>>> fields in each table is the user identifier. Fair warning - if you go
>>>> full per-user on a large system, this will MASSIVELY balloon the size
>>>> of your Bayes database, and most users will idle below the learning
>>>> thresholds for quite a long time.
>>>
>>> Can you give an idea of the size calculation? I'm wanting to do this,
>>> but I need to figure out how much space I need to allocate per user!
>>
>> That would be pretty hard to predict as it would vary a lot based on the
>> users and the mail.
>>
>> I don't think Bayes is really that big (a few MB max?)
>
> It's not big. Here's my personal spamassassin database (just a few
> users, but SA has been running for years and years). About 48MB:
>
> mysql> SELECT TABLE_NAME AS `Table`,
>               ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024) AS `Size (KB)`
>          FROM information_schema.TABLES
>         WHERE TABLE_SCHEMA = "spamassassin"
>         ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
> +-------------------+-----------+
> | Table             | Size (KB) |
> +-------------------+-----------+
> | bayes_token       |     48160 |
> | awl               |      1040 |
> | bayes_vars        |        32 |
> | bayes_seen        |        16 |
> | bayes_global_vars |        16 |
> | bayes_expire      |        16 |
> +-------------------+-----------+
> 6 rows in set (0.00 sec)

I did it again on a test server - same corpus, latest SA etc. It's been
trained on ham/spam. 

MariaDB [spamassassin]> SELECT TABLE_NAME AS `Table`,
                               ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
                          FROM information_schema.TABLES
                         WHERE TABLE_SCHEMA = "spamassassin"
                         ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
+-------------------+-----------+
| Table             | Size (MB) |
+-------------------+-----------+
| bayes_token       |       118 |
| txrep             |        17 |
| bayes_seen        |         3 |
| bayes_vars        |         0 |
| awl               |         0 |
| bayes_expire      |         0 |
| bayes_global_vars |         0 |
+-------------------+-----------+
7 rows in set (0.001 sec)

So a bit bigger. 

Re: per-user bayes

Posted by de...@areyes.com.
 

On 2020-12-09 4:41 am, @lbutlr wrote:

> On 08 Dec 2020, at 13:54, micah anderson <mi...@riseup.net> wrote:
>> Kris Deugau <kd...@vianet.ca> writes:
>>> There will only be one database and set of tables, but one of the
>>> fields in each table is the user identifier. Fair warning - if you go
>>> full per-user on a large system, this will MASSIVELY balloon the size
>>> of your Bayes database, and most users will idle below the learning
>>> thresholds for quite a long time.
>>
>> Can you give an idea of the size calculation? I'm wanting to do this,
>> but I need to figure out how much space I need to allocate per user!
>
> That would be pretty hard to predict as it would vary a lot based on the
> users and the mail.
>
> I don't think Bayes is really that big (a few MB max?)

It's not big. Here's my personal spamassassin database (just a few
users, but SA has been running for years and years). About 48MB:

mysql> SELECT TABLE_NAME AS `Table`,
              ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024) AS `Size (KB)`
         FROM information_schema.TABLES
        WHERE TABLE_SCHEMA = "spamassassin"
        ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
+-------------------+-----------+
| Table             | Size (KB) |
+-------------------+-----------+
| bayes_token       |     48160 |
| awl               |      1040 |
| bayes_vars        |        32 |
| bayes_seen        |        16 |
| bayes_global_vars |        16 |
| bayes_expire      |        16 |
+-------------------+-----------+
6 rows in set (0.00 sec)
 

Re: per-user bayes

Posted by "@lbutlr" <kr...@kreme.com>.
On 08 Dec 2020, at 13:54, micah anderson <mi...@riseup.net> wrote:
> Kris Deugau <kd...@vianet.ca> writes:

>> There will only be one database and set of tables, but one of the fields 
>> in each table is the user identifier.  Fair warning - if you go full 
>> per-user on a large system, this will MASSIVELY balloon the size of your 
>> Bayes database, and most users will idle below the learning thresholds 
>> for quite a long time.

> Can you give an idea of the size calculation? I'm wanting to do this,
> but I need to figure out how much space I need to allocate per user!

That would be pretty hard to predict as it would vary a lot based on the users and the mail.

I don't think Bayes is really that big (a few MB max?)

-- 
Varium et mutabile semper Femina.


Re: per-user bayes

Posted by micah anderson <mi...@riseup.net>.
Kris Deugau <kd...@vianet.ca> writes:

> There will only be one database and set of tables, but one of the fields 
> in each table is the user identifier.  Fair warning - if you go full 
> per-user on a large system, this will MASSIVELY balloon the size of your 
> Bayes database, and most users will idle below the learning thresholds 
> for quite a long time.

Can you give an idea of the size calculation? I'm wanting to do this,
but I need to figure out how much space I need to allocate per user!

Thanks for the clarifications, this is super helpful.

-- 
        micah

Re: per-user bayes

Posted by Kris Deugau <kd...@vianet.ca>.
Benoit Panizzon wrote:
> Hi
> 
>> This may help
>>
>> <http://svn.apache.org/repos/asf/spamassassin/branches/duncf_masses/sql/README.bayes>
> 
> I sort of have the same issue. Unfortunately that does not help, it
> merely explains how to store bayes data in a database. But there is
> still only one 'global' database on your mail platform which applies to
> all your customers.
> 
> Especially in Switzerland, with four national languages, this causes
> the bayes filter not to be very efficient.
> 
> What we would need is a way for the bayes module to store
> bayes data per 'recipient', not just globally.
> 
> So SpamAssassin would need to somehow pass the recipient(s) to the bayes
> module.

When using spamc/spamd, this is the default, so long as each user 
calling spamc has a unique argument for -u (or a distinct local Unix 
user on the calling system, since spamc will automatically set the user 
to the local Unix username when called without -u).
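
As a minimal sketch of that, here is how a filter written in Python might call spamc with a per-recipient -u value. The -u option is the one described above; the helper names, the lowercasing of the address (so that differently cased spellings of one mailbox share a dataset) and the example address are assumptions for illustration, and any spamd-side setup for virtual users is out of scope.

```python
import subprocess
from typing import List


def spamc_argv(recipient: str) -> List[str]:
    """Build a spamc command line that scans as the given recipient.

    Lowercasing is an assumption of this sketch, not something spamc does.
    """
    user = recipient.strip().lower()
    return ["spamc", "-u", user]


def scan(message: bytes, recipient: str) -> bytes:
    """Pipe one message through spamc and return whatever it writes back.

    Requires spamc on PATH and a reachable spamd; not exercised here.
    """
    result = subprocess.run(
        spamc_argv(recipient),
        input=message,
        stdout=subprocess.PIPE,
        check=False,
    )
    return result.stdout


print(spamc_argv("Alice@Example.ORG"))
```

For a multi-recipient message this means one spamc call per recipient, which is the efficiency cost mentioned further down.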

There will only be one database and set of tables, but one of the fields 
in each table is the user identifier.  Fair warning - if you go full 
per-user on a large system, this will MASSIVELY balloon the size of your 
Bayes database, and most users will idle below the learning thresholds 
for quite a long time.

Here, we get per-user behaviours when calling SA from MIMEDefang for 
outbound mail by replacing the stock library-level integration with a 
custom call to spamc.  (As it happens we share a Bayes DB between 
inbound and outbound mail, and use bayes_sql_override_username to force 
it to be sitewide instead of per-user.)

IIRC Amavis has some support for doing this when calling SA through the 
library interface, but you lose the efficiency benefit of only calling 
SA once on multirecipient messages.

After a bit of searching and reading I suspect you'd either have to just 
convert the library call into a spamc call, or port huge chunks of spamd 
internals into Amavis or MIMEDefang to get them to do library-level 
per-user SA processing.

-kgd

Re: per-user bayes

Posted by "@lbutlr" <kr...@kreme.com>.
On 08 Dec 2020, at 08:36, Benoit Panizzon <be...@imp.ch> wrote:
> Adding the list back to CC as I believe this is an interesting topic
> many have pondered over.

Forgot to fix the reply-to on this list for some reason. Fixed now.

> Yes, I see that it states 'per user', but I still don't see how that
> 'bayes user' is being set on a per-recipient basis.
> 
> On the email platform there is ONE config file for spamassassin. So if I
> set the user with: 
> 
> bayes_sql_override_username	   someusername
> 
> That is the username under which the bayes data is being stored for all
> recipients (thousands of mailboxes on a big ISP mailserver)


It can be. It can also be, for example, %u (It may be more complicated than that). Or perhaps sa_username_maps?

> How do I tell SpamAssassin to pass the recipient to the bayes
> filter while scanning an email?

Through the SQL query, IIRC. 

-- 
Nothing like grilling a kosher dog over human hair to bring out the
	subtle flavors.


Re: per-user bayes

Posted by Benoit Panizzon <be...@imp.ch>.
Hi

Adding the list back to CC as I believe this is an interesting topic
many have pondered over.

Yes, I see that it states 'per user', but I still don't see how that
'bayes user' is being set on a per-recipient basis.

On the email platform there is ONE config file for spamassassin. So if I
set the user with: 

bayes_sql_override_username	   someusername

That is the username under which the bayes data is being stored for all
recipients (thousands of mailboxes on a big ISP mailserver)

How do I tell SpamAssassin to pass the recipient to the bayes
filter while scanning an email?

Kind regards

-Benoît Panizzon-
-- 
I m p r o W a r e   A G    -    Leiter Commerce Kunden
______________________________________________________

Zurlindenstrasse 29             Tel  +41 61 826 93 00
CH-4133 Pratteln                Fax  +41 61 826 93 01
Schweiz                         Web  http://www.imp.ch
______________________________________________________

Re: per-user bayes

Posted by Benoit Panizzon <be...@imp.ch>.
Hi

> This may help
> 
> <http://svn.apache.org/repos/asf/spamassassin/branches/duncf_masses/sql/README.bayes>

I sort of have the same issue. Unfortunately that does not help, it
merely explains how to store bayes data in a database. But there is
still only one 'global' database on your mail platform which applies to
all your customers.

Especially in Switzerland, with four national languages, this causes
the bayes filter not to be very efficient.

What we would need is a way for the bayes module to store
bayes data per 'recipient', not just globally.

So SpamAssassin would need to somehow pass the recipient(s) to the bayes
module.

Kind regards

-Benoît Panizzon-

Re: per-user bayes

Posted by "@lbutlr" <kr...@kreme.com>.
On 07 Dec 2020, at 13:56, micah anderson <mi...@riseup.net> wrote:
> A per-user setup would let each user do their own thing, but I don't see
> how I can do that because our system doesn't have individual system
> users and I don't see that there are options in the bayes sql
> configuration or per-user tables possible.

This may help

<http://svn.apache.org/repos/asf/spamassassin/branches/duncf_masses/sql/README.bayes>

-- 
"Dignity intact! Dignity intact!" -- Aisling Bee, dancing on a pier in her pants.