Posted to users@spamassassin.apache.org by LuKreme <kr...@kreme.com> on 2014/10/08 23:26:25 UTC

Site-wide bayes and individual bayes

Is it possible to have a site-wide bayes AND individual bayes for some users (or all users)?

And, if not, is it generally better to do sitewide?

And, is it possible to take all the individual bayes and combine them into a sitewide db?

-- 
"You've got to dance like nobody's watching." - Kathy Mattea

Re: Site-wide bayes and individual bayes

Posted by Reindl Harald <h....@thelounge.net>.

On 12.10.2014 at 18:59, LuKreme wrote:
> On 10 Oct 2014, at 06:49 , RW <rw...@googlemail.com> wrote:
>>> And, if not, is it generally better to do sitewide?
>>
>> It's hard to say, there are advantages and disadvantages either way.
>
> OK, so specific example then.
>
> Small server with a few dozen email users spread over several domains. Almost none of these users does any spam training at all, the rest just delete unwanted messages (not even marking them as junk) or even worse, just ignore them. One user is very aggressive in marking Spam and in keeping the Inbox clear of all spam.
>
> I am of two minds. First, that everyone else would benefit from this user’s actions or, alternatively, that the user’s aggressive tagging will actually ‘poison’ the bayes db for the other users who maybe do not think that endless emails from pinterest or some political candidate are actually spam.

If nobody trains his user-specific bayes (as here), site-wide is the
way to go: until a user has flagged 200 ham messages, his bayes won't
be used at all, regardless of how many messages he has marked as spam.
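
The threshold rule above can be sketched as a small check. This is only an illustration, not SpamAssassin code: the function name is made up, and the defaults of 200 are assumed to match the stock bayes_min_spam_num and bayes_min_ham_num settings.

```python
def bayes_is_active(num_spam, num_ham, min_spam=200, min_ham=200):
    """Return True once both learned-message counts reach their minimum.

    min_spam and min_ham are assumed to mirror SpamAssassin's
    bayes_min_spam_num and bayes_min_ham_num (both 200 by default).
    """
    return num_spam >= min_spam and num_ham >= min_ham


# A user who only ever marks spam never crosses the ham threshold.
print(bayes_is_active(num_spam=5000, num_ham=12))
print(bayes_is_active(num_spam=250, num_ham=200))
```

In other words, a per-user db fed only with spam stays inactive no matter how large it grows, which is the argument for a site-wide db when most users do not train.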

Merging one user's aggressive training into the site-wide db means you
need to trust that user's actions: he needs to be careful and not just
flag anything he doesn't want to see as spam.

If it is really only one or two users, as here, I would stay with a
normal site-wide bayes. I implemented that with IMAP shared folders:
those users see a ham folder and a spam folder to move messages into,
and are advised to be careful that ham samples do not leak sensitive
content.

I review that stuff, save the eml messages to the training folders on
the mailserver and call the sa-learn script. So far the result is
nearly 100% over 8 weeks of production (99% of spam caught, no false
positives).

Re: Site-wide bayes and individual bayes

Posted by Ted Mittelstaedt <te...@ipinc.net>.

On 10/12/2014 9:59 AM, LuKreme wrote:
> On 10 Oct 2014, at 06:49 , RW<rw...@googlemail.com>  wrote:
>>> And, if not, is it generally better to do sitewide?
>>
>> It's hard to say, there are advantages and disadvantages either
>> way.
>
> OK, so specific example then.
>
> Small server with a few dozen email users spread over several
> domains. Almost none of these users does any spam training at all,
> the rest just delete unwanted messages (not even marking them as
> junk) or even worse, just ignore them. One user is very aggressive in
> marking Spam and in keeping the Inbox clear of all spam.
>
> I am of two minds. First, that everyone else would benefit from this
> user’s actions or, alternatively, that the user’s aggressive tagging
> will actually ‘poison’ the bayes db for the other users who maybe do
> not think that endless emails from pinterest or some political
> candidate are actually spam.
>

For starters your problem isn't SPAM it's HAM.

You can get all the spam you want.  Just parse the mail log file every
day for a few weeks, looking for delivery attempts to nonexistent
mailboxes.  When you see repeated delivery attempts to a specific
nonexistent mailbox, create an email address for it and redirect all
the email sent to it into a spam box.
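
A minimal sketch of that log scan, under stated assumptions: the log lines are assumed to be Postfix-style rejects that contain "User unknown" followed by a to=<address> field, and the function name and the threshold of 3 attempts are made up for illustration. Other MTAs log rejected recipients differently, so the pattern would need adjusting.

```python
import re
from collections import Counter

# Assumed Postfix-style reject line: "... User unknown ... to=<addr> ..."
UNKNOWN_RCPT = re.compile(r"User unknown.*?\bto=<([^>]+)>", re.IGNORECASE)


def probed_addresses(log_lines, min_attempts=3):
    """Count rejected deliveries per nonexistent recipient.

    Returns only addresses seen at least min_attempts times; those are
    the honeypot candidates.
    """
    counts = Counter()
    for line in log_lines:
        match = UNKNOWN_RCPT.search(line)
        if match:
            counts[match.group(1).lower()] += 1
    return {addr: n for addr, n in counts.items() if n >= min_attempts}


if __name__ == "__main__":
    import sys

    # Usage: python probed.py < /path/to/mail.log
    for addr, n in sorted(probed_addresses(sys.stdin).items()):
        print(n, addr)
```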

My experience is that once spammers think they have "discovered" an
email address they will never leave it alone, they will send increasing
amounts of spam to that address.

If you are lucky enough to never have spammers trying to probe your
server, you can create your honeypot email addresses, just make them up,
and then take these email addresses and post them into the Unsubscribe 
links on spam.  That is a good way to contaminate spammers' mailing lists
with honeypot addresses.  A legitimate mail sender will ignore these; a
spammer will happily pull addresses out of unsubscribe replies.

That's your centralized spam source.  Do this for a couple dozen 
nonexistent email addresses on your server domains and you will have
all the input you want for the Bayes learner.

By definition ANY email to a nonexistent address (not an old address
that was closed down years ago) is unsolicited, AKA SPAM.

As for desired political mail, on my servers I classify all of it as
spam, I can think of maybe only 2 users over the last decade who have
complained about not getting it and for those it's easy to do an
all_spam_to to them and then tell them they will have to do their own
spam filtering.

Since overwhelmingly the political email I have seen coming in is the
offensive conservative anti-women, anti-blacks, anti-latinos, beg for
more money email, I have to say that I'm not particularly concerned 
about the wishes of customers who WANT that kind of mail - I'm quite
happy if they go find another provider.

And, naturally, that kind of email is never ever appropriate for a
business and no employee in a business is ever going to dare complain to 
their bosses that they aren't getting it.

If the politicos want to drown people in hate mail, they have paper
mail to do it - might as well make them help reduce my taxes by
subsidizing the US Post Office with their hate mail, that's about the 
only thing that's good about it.

Anyway, as I said HAM is the problem.  If you don't have large 
quantities of ham, Bayes won't work.  Of course, nothing is preventing
you from copying people's folders  (if they are using IMAP) into one
giant mailbox and using that as a HAM source.  You can probably assume
that if a user has gone to the trouble of saving mail to a folder that
it is ham.

Ted

Re: Site-wide bayes and individual bayes

Posted by LuKreme <kr...@kreme.com>.

On 10 Oct 2014, at 06:49 , RW <rw...@googlemail.com> wrote:
>> And, if not, is it generally better to do sitewide?
> 
> It's hard to say, there are advantages and disadvantages either way.

OK, so specific example then.

Small server with a few dozen email users spread over several domains. Almost none of these users does any spam training at all, the rest just delete unwanted messages (not even marking them as junk) or even worse, just ignore them. One user is very aggressive in marking Spam and in keeping the Inbox clear of all spam.

I am of two minds. First, that everyone else would benefit from this user’s actions or, alternatively, that the user’s aggressive tagging will actually ‘poison’ the bayes db for the other users who maybe do not think that endless emails from pinterest or some political candidate are actually spam.

-- 
"You see, in this world there's two kinds of people, my friend: Those
with loaded guns and those who dig. You dig."

Re: Site-wide bayes and individual bayes

Posted by John Hardin <jh...@impsec.org>.

On Fri, 10 Oct 2014, RW wrote:

> On Wed, 8 Oct 2014 15:26:25 -0600
> LuKreme wrote:
>
>> Is it possible to have a site-wide bayes AND individual bayes for
>> some users (or all users)?
>
> Not as things stand.

Not as things stand, possibly absent a hack like: any user who wants to 
use the site-wide bayes has symlinks to the shared bayes database files in 
their local dir.

Not sure how well that would work in practice (locking if you autolearn), 
and it would be somewhat tedious to maintain.
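
A rough sketch of the symlink hack described above, not a tested deployment recipe. It assumes the default DBM storage, where the database files are named bayes_toks and bayes_seen and live in each user's ~/.spamassassin directory; the function name and the paths in the usage comment are hypothetical. It does nothing about the locking concern.

```python
import os
from pathlib import Path


def link_shared_bayes(shared_dir, user_dir, names=("bayes_toks", "bayes_seen")):
    """Symlink the shared bayes files into one user's SpamAssassin dir.

    The file names are an assumption (default DBM storage). Existing
    files or links in user_dir are left alone, so a user who already
    has an individual db keeps it.
    """
    user_dir = Path(user_dir)
    user_dir.mkdir(parents=True, exist_ok=True)
    linked = []
    for name in names:
        link = user_dir / name
        if link.is_symlink() or link.exists():
            continue  # do not clobber an existing per-user db
        os.symlink(Path(shared_dir) / name, link)
        linked.append(link)
    return linked


# Hypothetical usage:
# link_shared_bayes("/var/spamassassin/bayes", "/home/alice/.spamassassin")
```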

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Maxim VI: If violence wasn’t your last resort, you failed to resort
   to enough of it.
-----------------------------------------------------------------------
  862 days since the first successful private support mission to ISS (SpaceX)

Re: Site-wide bayes and individual bayes

Posted by RW <rw...@googlemail.com>.

On Wed, 8 Oct 2014 15:26:25 -0600
LuKreme wrote:

> Is it possible to have a site-wide bayes AND individual bayes for
> some users (or all users)?

Not as things stand. You could use Bayes for one and a separate filter
for the other.

> And, if not, is it generally better to do sitewide?

It's hard to say, there are advantages and disadvantages either way.

> And, is it possible to take all the individual bayes and combine them
> into a sitewide db?

It should be fairly straightforward to combine the results from running 
sa-learn --backup on multiple accounts. It's just a matter of
combining the total ham/spam message counts and the counts for each
token.
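
The combining step could look roughly like this. It is a sketch under assumptions, not a supported tool: the dump layout is assumed to be tab-separated lines of the form "v<TAB>count<TAB>num_spam", "v<TAB>count<TAB>num_nonspam" and "t<TAB>spam_count<TAB>ham_count<TAB>atime<TAB>token", which should be checked against real sa-learn --backup output first. The function name is made up, and the "seen" message-ID lines are ignored.

```python
from collections import defaultdict


def merge_backups(dumps):
    """Sum message totals and per-token counts across backup dumps.

    dumps is an iterable of strings, one per account. Returns
    (num_spam, num_ham, tokens) where tokens maps each token to
    [spam_count, ham_count, newest_atime].
    """
    num_spam = num_ham = 0
    tokens = defaultdict(lambda: [0, 0, 0])
    for dump in dumps:
        for line in dump.splitlines():
            fields = line.split("\t")
            if fields[0] == "v" and len(fields) >= 3:
                # The name may carry a trailing comment; keep the first word.
                words = fields[2].split()
                name = words[0] if words else ""
                if name == "num_spam":
                    num_spam += int(fields[1])
                elif name == "num_nonspam":
                    num_ham += int(fields[1])
            elif fields[0] == "t" and len(fields) >= 5:
                entry = tokens[fields[4]]
                entry[0] += int(fields[1])  # spam count
                entry[1] += int(fields[2])  # ham count
                entry[2] = max(entry[2], int(fields[3]))  # newest access time
    return num_spam, num_ham, dict(tokens)
```

If several users learned the same message, it is counted once per account here, so the merged totals may overstate how many distinct messages were trained.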