Posted to users@spamassassin.apache.org by mouss <mo...@ml.netoyen.net> on 2009/03/03 23:16:21 UTC

Bye Bye Bayes

I finally disabled Bayes, because I think it doesn't bring me what I want:

- train on error doesn't seem enough, and I can understand it

- train on everything isn't reasonable. even I wouldn't do that,
because while I can see spam and feed sa, I don't check all my mail to
be sure the messages I didn't see are ham.

- it's too fragile in my opinion. and I got to this conclusion a long
time ago when testing dspam. By fragile, I mean that it depends too much
on how/when/... you train it

- in a site wide setup, it's hard to come up with a "serious" system
(get feedback but stay safe against dumb users)

- in a per user setup, you get the storage cost. but that's not all:
you're just ignoring the problem. lusers can't/don't train bayes...

of course, if I'm writing this, it's to get opinions.

Re: Bye Bye Bayes

Posted by Kai Schaetzl <ma...@conactive.com>.

Karsten Bräckelmann wrote on Wed, 04 Mar 2009 02:25:51 +0100:

> That's bayes_auto_learn_threshold_spam and nonspam respectively, I
> guess? Keep in mind that threshold is not the actual score, so you
> aren't learning all spam with a score of 8+ then.

Right, I know. That's where the spam quarantine comes into play. All spam 
in it (= everything with score 5 or higher) gets learned in the night.
That's absolutely necessary as we don't get much spam. 96% of the mail 
that is accepted is ham (or spam that comes in because the user opted out, 
there's no distinction because there's no detection). The remainder is 
either a virus or other bad content or High Scoring spam. Low scoring spam 
is almost non-existent.

> 
> Kai, given a nonspam threshold of -2, how exactly do you (manually)
> learn ham? That would be interesting. And what's the ham/spam ratio?

I just checked and have to admit we must have removed the
bayes_auto_learn_threshold_nonspam -2
some time ago as 0.01 seems to be reliable enough. Only the 
bayes_auto_learn_threshold_spam 8
is in effect now.
But I believe -2 would also deliver enough ham for autolearning. Score 
distribution of the last 40,000 or so messages on the same server:

Score   Messages
  -15          6
   -4      3,364
   -3      4,249
   -2      9,982
   -1      4,760
    0     13,995
    1      1,267
    2        789
    3        387
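
As a rough check on whether a -2 threshold would still leave enough ham for autolearning, the distribution above can be summed up. This is only a sketch: the counts are copied from the table, treating each row as "messages with that rounded score" is an assumption, and the autolearn decision is not made on the final score anyway (as Karsten points out), so the result is an approximation at best.

```python
# Score buckets copied from the table above: {score: message count}.
score_counts = {
    -15: 6, -4: 3364, -3: 4249, -2: 9982, -1: 4760,
    0: 13995, 1: 1267, 2: 789, 3: 387,
}


def share_at_or_below(counts, threshold):
    """Fraction of messages whose score bucket is <= threshold."""
    total = sum(counts.values())
    below = sum(n for score, n in counts.items() if score <= threshold)
    return below / total


total = sum(score_counts.values())
share = share_at_or_below(score_counts, -2)
print(f"{total} messages, {share:.1%} in buckets at or below -2")
```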

Bayes from that machine:

0.000          0      66285          0  non-token data: nspam
0.000          0      85888          0  non-token data: nham
0.000          0    1864402          0  non-token data: ntokens
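
The three lines above look like the counters printed by `sa-learn --dump magic`. A minimal sketch for turning them into numbers, assuming that layout; `parse_magic` is a hypothetical helper and the sample text is simply the lines quoted above:

```python
import re

# The three counter lines quoted above.
DUMP = """\
0.000          0      66285          0  non-token data: nspam
0.000          0      85888          0  non-token data: nham
0.000          0    1864402          0  non-token data: ntokens
"""

# Columns: probability, spam count, value, atime, then the label.
LINE = re.compile(r"^\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (\w+)$")


def parse_magic(text):
    """Return {name: value} for each 'non-token data' line."""
    values = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


magic = parse_magic(DUMP)
ham_share = magic["nham"] / (magic["nham"] + magic["nspam"])
print(f"learned ham share: {ham_share:.1%}")
```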

As you see, because of the structure of the incoming mail, the ham exceeds 
the spam and the gap is probably steadily growing. This is also reflected 
in the rule hits. The no. 1 rule that hits is BAYES_00 (it hits 99.7% of 
all ham). BAYES_99 is only at around position 25 overall, but it has 100% 
accuracy and is the no. 1 rule hitting spam (it hits about 50% of spam).

On servers where I get in some spam trap email and let part of it flow 
through the MTA rejection the picture is very different. For instance the 
server for my own domains has only 25% ham. BAYES_99 is the no. 1 hitting 
rule with an accuracy of 95.8% (again, not checked if the remainder really 
was ham), with all the URIBL rules and BAYES_00 (accuracy 99.9%) as 
runners-up.

So, all in all Bayes works very much for me. Especially in those cases 
where no other rule hits (typically some spamvertized site not yet on a 
URIBL) it's most often the only rule that hits. That's why I moved it to 
5.0 a while ago. Works very well. I think if you use DCC or Razor you may 
get similar results for these rules and may not need to rely so much on 
Bayes. I do not use *any* network rules except for the URIBL stuff which 
isn't shut off by "skip_rbl_checks 1".

(Figures are taken from mailwatch rule hits tables.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com

Re: Bye Bye Bayes

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Wed, 2009-03-04 at 01:39 +0100, Kai Schaetzl wrote:
> Karsten Bräckelmann wrote on Wed, 04 Mar 2009 00:00:25 +0100:
> 
> > Works really well for me.

While that indeed is true, it's kind of out-of-context. :)  It probably
should be read in connection with my specific learning flavor, explained
in my long-ish post.

> Indeed, very well for us. Not just for one domain. In my opinion site-wide 
> Bayes is the only one that makes sense unless your *single* users really 
> send and receive *lots* of mail. The ordinary domain/user just doesn't get 
> enough mail for a nicely working bayes db.
> 
> We auto-learn all spam with scores 8 or higher. Users usually do not 
> learn. The score for autolearning ham has been lowered to -2 to avoid 
> learning spam that slips through as ham.

That's bayes_auto_learn_threshold_spam and nonspam respectively, I
guess? Keep in mind that threshold is not the actual score, so you
aren't learning all spam with a score of 8+ then.

Kai, given a nonspam threshold of -2, how exactly do you (manually)
learn ham? That would be interesting. And what's the ham/spam ratio?

On a related note, IIRC the ham threshold once was lowered to -1, and
subsequently raised back to a positive zero due to complaints.

Nice to read good news and "works for me" comments even site-wide.
Generally, only those who have problems vent. ;)

-- 
char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Re: Bye Bye Bayes

Posted by Kai Schaetzl <ma...@conactive.com>.

Karsten Bräckelmann wrote on Wed, 04 Mar 2009 00:00:25 +0100:

> Works really well for me.

Indeed, very well for us. Not just for one domain. In my opinion site-wide 
Bayes is the only one that makes sense unless your *single* users really 
send and receive *lots* of mail. The ordinary domain/user just doesn't get 
enough mail for a nicely working bayes db.

We auto-learn all spam with scores 8 or higher. Users usually do not 
learn. The score for autolearning ham has been lowered to -2 to avoid 
learning spam that slips through as ham.

We learn all quarantined spam once per night, so that we get low-scoring 
spam into Bayes as well. That works fine because the FP rate for spam is 
almost non-existent. We also learn spam from some spam-traps.

The only problems I've ever had with Bayes are expiry problems (search 
this list). There's something wrong with the algorithm getting used for 
the estimate method. I only got rid of them by going to SQL storage and 
manually expiring the SQL db with my own queries.

Kai

Re: Bye Bye Bayes

Posted by Ned Slider <ne...@unixmail.co.uk>.

Karsten Bräckelmann wrote:
> On Tue, 2009-03-03 at 23:16 +0100, mouss wrote:
>> I finally disabled Bayes, because I think it doesn't bring me what I want:
> 
> Works really well for me. Quick guesstimate is 99% of my spam hits
> BAYES_80 or higher, most of them 95+. Ham typically scores 00, IIRC
> almost always below 50. And no, these numbers are not made up. :)
> 

I see exactly the same. Low volume server, a few domains, not many 
users. I manually train all spam and auto-learning does just fine for 
learning ham. Bayes is site-wide, not per user.

Here's my Bayes stats for last week (typical, spam only):

## Bayes Statistics ##
    1253 BAYES_99
       9 BAYES_95
      13 BAYES_80
       4 BAYES_60
      28 BAYES_50
    1307 Total Spam
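
For comparison with the "99% hits BAYES_80 or higher" guesstimate, the weekly figures above can be turned into shares. A small sketch, with the counts copied from the statistics and `share_at_least` as a hypothetical helper name:

```python
# Rule hit counts copied from the statistics above (spam only).
bayes_hits = {
    "BAYES_99": 1253, "BAYES_95": 9, "BAYES_80": 13,
    "BAYES_60": 4, "BAYES_50": 28,
}
total_spam = 1307


def share_at_least(hits, minimum, total):
    """Share of messages that hit BAYES_<minimum> or a higher bucket."""
    matched = sum(n for rule, n in hits.items()
                  if int(rule.split("_")[1]) >= minimum)
    return matched / total


# The listed buckets should account for every spam message.
assert sum(bayes_hits.values()) == total_spam

high = share_at_least(bayes_hits, 80, total_spam)
top = share_at_least(bayes_hits, 99, total_spam)
print(f"BAYES_80 or higher: {high:.1%}")
print(f"BAYES_99 alone:     {top:.1%}")
```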

Re: Bye Bye Bayes

Posted by Karsten Bräckelmann <gu...@rudersport.de>.

On Tue, 2009-03-03 at 23:16 +0100, mouss wrote:
> I finally disabled Bayes, because I think it doesn't bring me what I want:

Works really well for me. Quick guesstimate is 99% of my spam hits
BAYES_80 or higher, most of them 95+. Ham typically scores 00, IIRC
almost always below 50. And no, these numbers are not made up. :)

This is per-user, err, human, though. Multiple addresses, single me.

> - train on error doesn't seem enough, and I can understand it

Agreed. Of course I do that, but read on...

> - train on everything isn't reasonable. even I wouldn't do that,
> because while I can see spam and feed sa, I don't check all my mail to
> be sure the messages I didn't see are ham.

Definitely. I do NOT even bother to scan, let alone train, mailing list
posts, bugzilla bulk, etc. These are filtered early. It's pretty much Inbox
(direct, personal mail) or spam here.

I developed some special habits for training long ago.

First of all, auto-learn *is* enabled. For ham. The auto-learn spam
threshold is way up to never trigger, effectively disabled. I do train
all my non-auto-learned ham -- occasionally. That's like once or twice a
year... I'm too lazy. Cause I do paranoidly review the ham before
learning. Got a ham backup folder for that, populated automatically.
Auto-learning ham generally performs just great for me.

Then I do learn spam manually. Aided by mail-filters. For example, all
16+ scoring spam with low-ish Bayes scores below 80 are getting dumped
to a copy folder, for quick review, training and flaming.

Every now and then I do train lower scoring spam, too. Funnily enough,
these usually tend to score high on Bayes anyway, there are other hits
missing for a solid 15.

FWIW, I am likely to eventually implement *my* flavor of auto-learning
low scorers to be done automatically while they come in.

Why do I do it that way?  Easy. There's no way I can keep up with learning
800 spams a day. Don't get that many hams. Remember, ham == Inbox here,
no mailing lists, etc.  So I don't bother training the lion's share that
easily triggers 95+ anyway.

It's basically an attempt to limit learning spam, to not bias my Bayes
beyond necessity. Has performed really well for me for years.

> - it's too fragile in my opinion. and I got to this conclusion a long
> time ago when testing dspam. By fragile, I mean that it depends too much
> on how/when/... you train it

I mentioned I plan to implement my selective auto-learning flavor. This
is related to exactly this.

Some days I wake up and find a bunch of "new" spam that slipped through.
Bummer. After learning these, I'll often never see one of their type
again -- known to Bayes, end up in the header-logging folder...

This implies that I plan to switch to "train 10+ scoring spam with
low-ish Bayes automatically". I am prepared to UN-learn FPs. Have never
ever seen one with that score anyway. :)  And the remaining < 0.5% that
scores below 10 is easy enough to train manually.
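
The selection rule described here ("10+ scoring spam with low-ish Bayes") could be sketched roughly as follows. This is an illustration, not the actual filter setup: it assumes the usual `X-Spam-Status` header layout (`score=...`, `tests=...`), `wants_training` is a hypothetical helper, and the 10 and 80 cut-offs are simply the figures mentioned in this post.

```python
import re
from email import message_from_string

SCORE = re.compile(r"score=(-?\d+(?:\.\d+)?)")
BAYES = re.compile(r"\bBAYES_(\d+)\b")


def wants_training(raw_message, min_score=10.0, max_bayes=80):
    """True if the message scored >= min_score but Bayes stayed below max_bayes."""
    status = message_from_string(raw_message).get("X-Spam-Status", "")
    status = " ".join(status.split())  # undo header folding
    score = SCORE.search(status)
    if score is None or float(score.group(1)) < min_score:
        return False
    bayes = BAYES.search(status)
    # No BAYES_* hit at all also counts as "Bayes did not recognise it".
    return bayes is None or int(bayes.group(1)) < max_bayes


sample = (
    "Subject: test\n"
    "X-Spam-Status: Yes, score=17.2 required=5.0\n"
    "\ttests=BAYES_50,URIBL_BLACK autolearn=no version=3.2.5\n"
    "\n"
    "body\n"
)
print(wants_training(sample))
```

Messages selected this way would still need the "prepared to UN-learn FPs" safety net mentioned above; the sketch only makes the decision and does not learn anything itself.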

> - in a site wide setup, it's hard to come up with a "serious" system
> (get feedback but stay safe against dumb users)
> 
> - in a per user setup, you get the storage cost. but that's not all:
> you're just ignoring the problem. lusers can't/don't train bayes...
> 
> of course, if I'm writing this, it's to get opinions.

There you go. :)  Hope you like it.  Hey, I just leaked my s3creet
training strategy! ;)

  guenther  -- hrm, I just know there was something I wanted to add,
               but just forgot...

Re: Bye Bye Bayes

Posted by RW <rw...@googlemail.com>.

On Tue, 03 Mar 2009 23:16:21 +0100
mouss <mo...@ml.netoyen.net> wrote:

> I finally disabled Bayes, because I think it doesn't bring me what I
> want:
> 
> - train on error doesn't seem enough, and I can understand it
> 
> - train on everything isn't reasonable. even I wouldn't do that,
> because while I can see spam and feed sa, I don't check all my mail to
> be sure the messages I didn't see are ham.

A good compromise is to augment autolearning by filing ham that doesn't
hit BAYES_00 but has autolearn=no in an unsure folder. Ham is very
easily learned so this doesn't involve much work in the long-term.
Obviously you learn any spam too, but that's the easy part.
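
A rough sketch of the "unsure folder" test suggested above, assuming the usual `X-Spam-Status` header layout. `needs_review` is a hypothetical helper name, and how the matching mail is then filed (procmail, sieve, ...) is left open:

```python
import re


def needs_review(spam_status):
    """True for non-spam that was not autolearned and did not hit BAYES_00."""
    # Undo header folding and re-join a tests= list that was folded at commas.
    status = re.sub(r",\s+", ",", " ".join(spam_status.split()))
    if not status.startswith("No"):
        return False  # tagged as spam: learning spam is handled separately
    tests = re.search(r"tests=(\S*)", status)
    rules = tests.group(1).split(",") if tests else []
    autolearn = re.search(r"autolearn=(\w+)", status)
    not_learned = autolearn is not None and autolearn.group(1) == "no"
    return not_learned and "BAYES_00" not in rules


example = ("No, score=0.3 required=5.0 tests=BAYES_50,HTML_MESSAGE "
           "autolearn=no version=3.2.5")
print(needs_review(example))
```

Note that a message with no BAYES_* hit at all also lands in the review set with this check, which seems in the spirit of the suggestion but is a choice made for the sketch.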

You can also aid autolearning by adding some personalised negative
scoring rules e.g. pattern matching on your message-id's in
reference/in-reply-to headers.

Re: Bye Bye Bayes

Posted by Matus UHLAR - fantomas <uh...@fantomas.sk>.

On 04.03.09 06:17, John Hardin wrote:
> I used to have a couple of users who treated their Trash folder as 
> long-term read-message storage. After reading most messages they'd move 
> them to Trash, and _never_ _purge_ _it_. I couldn't break them of this 
> habit, even after purging their Trash folder from the server a couple of 
> times. ("Oops! Disk failure! Well, that was trash, you can afford to lose 
> that.")

We set up Courier's IMAP server to remove files after they have been in
Trash for more than 7 days... Luckily, we documented that a long time ago,
so they cannot complain...

-- 
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Boost your system's speed by 500% - DEL C:\WINDOWS\*.*

Re: Bye Bye Bayes

Posted by Martin Gregorie <ma...@gregorie.org>.

On Wed, 2009-03-04 at 16:31 +0100, Kai Schaetzl wrote:
> John Hardin wrote on Wed, 4 Mar 2009 06:17:16 -0800 (PST):
> 
> > ("Oops! Disk failure! Well, that was trash, you can afford to lose 
> > that.")
> 
> thanks for the laugh :-)
>
How many of you have seen the BOFH (Bastard Operator From Hell) stories?
http://www.theregister.co.uk/odds/bofh/

They may amuse some of you...


Martin

Re: Bye Bye Bayes

Posted by Kai Schaetzl <ma...@conactive.com>.

John Hardin wrote on Wed, 4 Mar 2009 06:17:16 -0800 (PST):

> ("Oops! Disk failure! Well, that was trash, you can afford to lose 
> that.")

thanks for the laugh :-)

Kai

Re: Bye Bye Bayes

Posted by Dave Pooser <da...@pooserville.com>.

> I used to have a couple of users who treated their Trash folder as
> long-term read-message storage.

I have a user like that at $DAYJOB. I used to ask him if he kept his car
title and other important documents in the wastebasket under his desk at
home.
-- 
Dave Pooser
Cat-Herder-in-Chief, Pooserville.com
"Sarcasm Error:
    Abort, Retry, Bite Me?"
                    -Legostar Galactica

Re: Bye Bye Bayes

Posted by John Hardin <jh...@impsec.org>.

On Wed, 4 Mar 2009, Kai Schaetzl wrote:

> LuKreme wrote on Tue, 3 Mar 2009 19:02:06 -0700:
>
>> How is it the same? Already read messages in inbox means the user has
>> "accepted" those messages without trashing them or junking them.
>
> If you can make sure that your users *really* delete or move spam to the 
> right places, then it works, yes.

That, of course, is the crux of the biscuit.

I used to have a couple of users who treated their Trash folder as 
long-term read-message storage. After reading most messages they'd move 
them to Trash, and _never_ _purge_ _it_. I couldn't break them of this 
habit, even after purging their Trash folder from the server a couple of 
times. ("Oops! Disk failure! Well, that was trash, you can afford to lose 
that.")

> But I fear there is a chance that users just "walk" over spam and let it 
> stay as (depending on the mail client) it may just not be visible 
> anymore which may be good enough for them.

Or delete it rather than moving it to .Junk

I'll modify my earlier comment - it sounds good, assuming you have a high 
degree of users behaving the way you want them to.

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Failure to plan ahead on someone else's part does not constitute
   an emergency on my part.                 -- David W. Barts in a.s.r
-----------------------------------------------------------------------
  4 days until Daylight Saving Time begins in U.S. - Spring Forward

Re: Bye Bye Bayes

Posted by Kai Schaetzl <ma...@conactive.com>.

LuKreme wrote on Tue, 3 Mar 2009 19:02:06 -0700:

> How is it the same? Already read messages in inbox means the user has  
> "accepted" those messages without trashing them or junking them.

and the message may not have been learned by score.
If you can make sure that your users *really* delete or move spam to the 
right places, then it works, yes. But I fear there is a chance that users 
just "walk" over spam and let it stay as (depending on the mail client) it 
may just not be visible anymore which may be good enough for them.
So, there's a chance of undesired "infection" with spam.

> False junk would get pulled out of .Junk into the inbox and relearned  
> as ham.

How? By the user? When? What about vacation?
I wouldn't trust too much that users "do the right thing". Depends on your 
user base.

Kai

Re: Bye Bye Bayes

Posted by mouss <mo...@ml.netoyen.net>.

LuKreme wrote:
> On 6-Mar-2009, at 15:10, mouss wrote:
>>> How is it the same? Already read messages in inbox means the user has
>>> "accepted" those messages without trashing them or junking them.
>>
>> This is wrong. it is not true for my own mail. I visit my mailbox,
>> looking for important messages. and only when I have the time (which
>> may be days later), I move missed spam to the Junk folder.
> 
> Right, but those messages that ARE Spam wouldn't be marked as read, would
> they? 

They will. Think "Thunderbird", not "outlook". when you select a message
in TB (and this need not be manual. when you open a folder in TB, last
unread message is generally selected), it is marked as read. whether this is a
feature or bug is not the question: any approach used here should work
with the MUAs that are "supported" (and I can't say TB is not supported!).

> Also, *your* use of mail is rather more sophisticated than the
> regular user.
> 

I've seen "normal" users do similar things:  open a message, wonder what
to do with it, take some time to ask, then they say "ok, so that was
spam"....

others read the spam message and skip it ("what are these admins doing
here? they keep telling us what not to do, instead of stopping this junk...").

>> and on the other side, I voluntarily mark some messages as unread. to
>> see them in bold.
> 
> So those messages would not be auto-learned as ham until such time that
> they were marked as read.
> 

which may be days later, or even never. yes, I overload "read/unread"
flag, but I am not the only one. and this flag wasn't designed for
training a filter either, so I can't come and say "the read/unread flag
will be used for training the filter".

>>> .Junk means the user, or the user's MUA, has flagged a message that is
>>> not tagged as spam.
>>>
>>> False junk would get pulled out of .Junk into the inbox and relearned as
>>> ham.
>>
>> this one is ok. The problem is with
>> - missed spam not yet moved to the junk folder
>> - false positives, which may be missed (if the junk folder is full of
>> junk, ya know what...), not yet found, ... etc.
> 
> Right, false positives in the junk folder would get learned as spam, and
> only get unlearned if the user moves the message to the Inbox.
> 
> but as I think more and more, I change what I want to do slightly.  My
> current thinking is this:
> 
> If there is a .sa-ham folder, learn the messages in it as ham.
> else learn the read messages in INBOX

I use Junk/Innocent for this. The reason is that in many clients, the
Junk folder is special:
- It is localized. My TB shows "Indésirables"
- it is listed near the top. you don't have to scroll down

Also, SA isn't the only piece to train.

> 
> If there is an .sa-spam folder, learn the messages in it as spam
> else learn the messages in Junk
> 

and Junk/Spam/ for this (same reason as above).

> This gives people who set up .sa-{ham,spam} folders complete control over
> what is learned and still does some learning from everything else.
> 
> Keeping in mind that Junk is only mail that the MUA or the user thinks
> is spam that was not tagged as spam by SA.
> 

yes. this is one of the problems.

anyway, Thanks everybody for the feedback (of course, more feedback is
welcome, maybe offlist to avoid annoying everybody ;-p)

Re: Bye Bye Bayes

Posted by LuKreme <kr...@kreme.com>.

On 6-Mar-2009, at 15:10, mouss wrote:
>> How is it the same? Already read messages in inbox means the user has
>> "accepted" those messages without trashing them or junking them.
>
> This is wrong. it is not true for my own mail. I visit my mailbox,
> looking for important messages. and only when I have the time (which
> may be days later), I move missed spam to the Junk folder.

Right, but those messages that ARE Spam wouldn't be marked as read,  
would they? Also, *your* use of mail is rather more sophisticated than  
the regular user.

> and on the other side, I voluntarily mark some messages as unread. to
> see them in bold.

So those messages would not be auto-learned as ham until such time  
that they were marked as read.

>> .Junk means the user, or the user's MUA, has flagged a message that  
>> is
>> not tagged as spam.
>>
>> False junk would get pulled out of .Junk into the inbox and  
>> relearned as
>> ham.
>
> this one is ok. The problem is with
> - missed spam not yet moved to the junk folder
> - false positives, which may be missed (if the junk folder is full of
> junk, ya know what...), not yet found, ... etc.

Right, false positives in the junk folder would get learned as spam,  
and only get unlearned if the user moves the message to the Inbox.

but as I think more and more, I change what I want to do slightly.  My  
current thinking is this:

If there is a .sa-ham folder, learn the messages in it as ham.
else learn the read messages in INBOX

If there is an .sa-spam folder, learn the messages in it as spam
else learn the messages in Junk

This gives people who set up .sa-{ham,spam} folders complete control  
over what is learned and still does some learning from everything else.

Keeping in mind that Junk is only mail that the MUA or the user thinks  
is spam that was not tagged as spam by SA.
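
The fallback rules above could be sketched like this for a Maildir++ layout. It only selects files and learns nothing itself. Assumptions: subfolders are dot-directories under the user's Maildir (`.sa-ham`, `.sa-spam`, `.Junk`), a message counts as read when the `:2,` flag list in its filename contains `S`, and all function names are hypothetical.

```python
from pathlib import Path


def seen(path):
    """A Maildir message counts as read if its ':2,' flag list contains 'S'."""
    _, sep, flags = path.name.rpartition(":2,")
    return bool(sep) and "S" in flags


def messages(folder):
    """All message files in a Maildir folder (cur/ and new/)."""
    found = []
    for sub in ("cur", "new"):
        directory = folder / sub
        if directory.is_dir():
            found.extend(p for p in sorted(directory.iterdir()) if p.is_file())
    return found


def training_sets(maildir):
    """Pick ham and spam files following the fallback rules described above."""
    maildir = Path(maildir)
    ham_folder, spam_folder = maildir / ".sa-ham", maildir / ".sa-spam"
    if ham_folder.is_dir():
        ham = messages(ham_folder)
    else:
        ham = [p for p in messages(maildir) if seen(p)]  # read mail in INBOX
    if spam_folder.is_dir():
        spam = messages(spam_folder)
    else:
        spam = messages(maildir / ".Junk")
    return ham, spam
```

The concern raised elsewhere in the thread still applies: some clients set the seen flag as soon as a message is highlighted, so the INBOX fallback can pick up unreviewed spam.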

-- 
if you ever get that chimp of your back, if you ever find the thing
	you lack, ah but you know you're only having a laugh.  Oh, oh
	here we go again -- until the end.

Re: Bye Bye Bayes

Posted by mouss <mo...@ml.netoyen.net>.

LuKreme wrote:
> On Mar 3, 2009, at 17:07, John Hardin <jh...@impsec.org> wrote:
> 
>> On Tue, 3 Mar 2009, LuKreme wrote:
>>
>>> I am considering the following:
>>>
>>> Autolearn read mail in the inbox as ham
>>>  Autolearn mail in .Junk and .SPAM as spam
>>>
> This is pretty easy with maildir.
>>
>> How is that different from using the built-in autolearning based on
>> message score?
> 
> How is it the same? Already read messages in inbox means the user has
> "accepted" those messages without trashing them or junking them.

This is wrong. it is not true for my own mail. I visit my mailbox,
looking for important messages. and only when I have the time (which
may be days later), I move missed spam to the Junk folder.

and on the other side, I voluntarily mark some messages as unread. to
see them in bold.

> 
> .Junk means the user, or the user's MUA, has flagged a message that is
> not tagged as spam.
> 
> False junk would get pulled out of .Junk into the inbox and relearned as
> ham.

this one is ok. The problem is with
- missed spam not yet moved to the junk folder
- false positives, which may be missed (if the junk folder is full of
junk, ya know what...), not yet found, ... etc.

Re: Bye Bye Bayes

Posted by LuKreme <kr...@kreme.com>.

On 4-Mar-2009, at 07:06, John Hardin wrote:
> On Tue, 3 Mar 2009, LuKreme wrote:
>> On Mar 3, 2009, at 17:07, John Hardin <jh...@impsec.org> wrote:
>>> On Tue, 3 Mar 2009, LuKreme wrote:
>>> > I am considering the following:
>>> >
>>> > Autolearn read mail in the inbox as ham
>>> > Autolearn mail in .Junk and .SPAM as spam
>>> >
>>> > This is pretty easy with maildir.
>>> How is that different from using the built-in autolearning based  
>>> on message score?
>>
>> How is it the same? Already read messages in inbox means the user  
>> has "accepted" those messages without trashing them or junking them.
>
> Sorry, I didn't register that part. I thought it was just "messages  
> in the inbox".
>
> Bear in mind some mail clients will mark a message "read" if you  
> only highlight the title line. Auto-preview can be annoying that way  
> sometimes.

Yep, and I think THAT has caused me to decide against doing this.   
Instead I have changed to thinking about  having it autolearn as ham  
messages that are read and are NOT in .Junk* .SPAM* /cur /new   
or .Trash* -- but again, just mulling it over.

>> .Junk means the user, or the user's MUA, has flagged a message that  
>> is not tagged as spam.
>
> Okay, I was assuming that was your SA spam quarantine, not your  
> equivalent of the user's spam training folder.

I believe both the mozilla email programs (Tbird, Netscrape, Postbox)  
and Apple Mail.app use "Junk" for messages the MUAs think are  
spammish, not sure about any other clients.  Our SA spam quarantine  
is .SPAM

>> False junk would get pulled out of .Junk into the inbox and  
>> relearned as ham.
>>
>> Haven't done it, still mulling.
>
> Now that you've explained it in more detail it sounds better.

Better, but not good, perhaps.  I've half a mind to simply forget  
auto-learning for the virtual users completely and make them use  
sa-ham/sa-spam to manually train, and if they don't?  Yeah, too bad.  
OTOH, I'm a little tired and cranky today...

-- 
But just because you've seen me on your TV Doesn't mean I'm any
	more enlightened than you

Re: Bye Bye Bayes

Posted by John Hardin <jh...@impsec.org>.

On Tue, 3 Mar 2009, LuKreme wrote:

> On Mar 3, 2009, at 17:07, John Hardin <jh...@impsec.org> wrote:
>
>> On Tue, 3 Mar 2009, LuKreme wrote:
>> 
>> > I am considering the following:
>> > 
>> > Autolearn read mail in the inbox as ham
>> > Autolearn mail in .Junk and .SPAM as spam
>> > 
>> > This is pretty easy with maildir.
>> 
>> How is that different from using the built-in autolearning based on 
>> message score?
>
> How is it the same? Already read messages in inbox means the user has 
> "accepted" those messages without trashing them or junking them.

Sorry, I didn't register that part. I thought it was just "messages in the 
inbox".

Bear in mind some mail clients will mark a message "read" if you only 
highlight the title line. Auto-preview can be annoying that way sometimes.

> .Junk means the user, or the user's MUA, has flagged a message that is 
> not tagged as spam.

Okay, I was assuming that was your SA spam quarantine, not your equivalent 
of the user's spam training folder.

> False junk would get pulled out of .Junk into the inbox and relearned as 
> ham.
>
> Haven't done it, still mulling.

Now that you've explained it in more detail it sounds better.

Re: Bye Bye Bayes

Posted by LuKreme <kr...@kreme.com>.

On Mar 3, 2009, at 17:07, John Hardin <jh...@impsec.org> wrote:

> On Tue, 3 Mar 2009, LuKreme wrote:
>
>> I am considering the following:
>>
>> Autolearn read mail in the inbox as ham
>>  Autolearn mail in .Junk and .SPAM as spam
>>
>> This is pretty easy with maildir.
>
> How is that different from using the built-in autolearning based on  
> message score?

How is it the same? Already read messages in inbox means the user has  
"accepted" those messages without trashing them or junking them.

.Junk means the user, or the user's MUA, has flagged a message that is  
not tagged as spam.

False junk would get pulled out of .Junk into the inbox and relearned  
as ham.

Haven't done it, still mulling.

> Failure to plan ahead on someone else's part does not constitute
>  an emergency on my part.                 -- David W. Barts in a.s.r

A.s.r? I've seen this on a sign at my print shop.

Re: Bye Bye Bayes

Posted by John Hardin <jh...@impsec.org>.

On Tue, 3 Mar 2009, LuKreme wrote:

> I am considering the following:
>
>  Autolearn read mail in the inbox as ham
>   Autolearn mail in .Junk and .SPAM as spam
>
> This is pretty easy with maildir.

How is that different from using the built-in autolearning based on 
message score?

>> - in a site wide setup, it's hard to come up with a "serious" system
>> (get feedback but stay safe against dumb users)
>
> That's why I'm looking at autolearn options.

It's not that hard if you're willing to accept the fact that manual 
training is _manual_ training.

Set up ham and spam training mail folders for your users. If you have a 
large user base you may only want to set that up for a subset of users - 
i.e. just the clueful ones. The users are given instruction in how to put 
FP and FN messages into their training folders. If you trust the user, you 
can script sa-learn to learn directly from their training folders. If not, 
the admin (or a clueful subordinate) has to review the users' training 
folders periodically and move the messages the user correctly classified 
to the folders that SA trains from.
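
For the "trusted user" case, scripting sa-learn against the training folders might look roughly like the sketch below. Assumptions: a Maildir layout, training folders named `.sa-ham` and `.sa-spam` (names borrowed from elsewhere in this thread, not prescribed here), and `sa-learn` on the PATH of the account that owns the Bayes database. The function names are hypothetical; `--ham` and `--spam` are the real sa-learn options.

```python
import subprocess
from pathlib import Path


def sa_learn_commands(user_maildir, ham_folder=".sa-ham", spam_folder=".sa-spam"):
    """Build one sa-learn call per training folder that actually exists."""
    commands = []
    for flag, name in (("--ham", ham_folder), ("--spam", spam_folder)):
        folder = Path(user_maildir) / name / "cur"
        if folder.is_dir():
            commands.append(["sa-learn", flag, str(folder)])
    return commands


def train(user_maildir, dry_run=True):
    """Print the commands, and run them only when dry_run is False."""
    for command in sa_learn_commands(user_maildir):
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)
```

For untrusted users the same script could be pointed at the reviewed folders instead, after the admin has moved the correctly classified messages there.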

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   Failure to plan ahead on someone else's part does not constitute
   an emergency on my part.                 -- David W. Barts in a.s.r
-----------------------------------------------------------------------
  5 days until Daylight Saving Time begins in U.S. - Spring Forward

Re: Bye Bye Bayes

Posted by LuKreme <kr...@kreme.com>.

On Mar 3, 2009, at 15:16, mouss <mo...@ml.netoyen.net> wrote:

> I finally disabled Bayes, because I think it doesn't bring me what I  
> want:
>
> - train on error doesn't seem enough, and I can understand it
>
> - train on everything isn't reasonable. even I wouldn't do that,
> because while I can see spam and feed sa, I don't check all my mail to
> be sure the messages I didn't see are ham.

I am considering the following:

   Autolearn read mail in the inbox as ham
    Autolearn mail in .Junk and .SPAM as spam

This is pretty easy with maildir.

>
>
> - it's too fragile in my opinion. and I got to this conclusion a long
> time ago when testing dspam. By fragile, I mean that it depends too  
> much
> on how/when/... you train it
>
> - in a site wide setup, it's hard to come up with a "serious" system
> (get feedback but stay safe against dumb users)
>
> - in a per user setup, you get the storage cost. but that's not all:
> you're just ignoring the problem. lusers can't/don't train

That's why I'm looking at autolearn options.