Posted to users@spamassassin.apache.org by GRP Productions <gr...@hotmail.com> on 2005/03/13 10:21:12 UTC
Bayes DB does not grow anymore
Hello,
For some days now my Bayes DB has not seemed to grow; its size remains
constant. SA 3.0.2 reads it without problems, but nothing new is written.
If I send an email to myself, it is classified as BAYES_50. I sa-learn it as
spam, send it again, and it is still BAYES_50 (I expected BAYES_99).
I use SpamAssassin 3.0.2. No configuration change has been made recently, and
it used to work fine.
I've tried --sync and --force-expire, but no luck.
Any help would be appreciated.
Thanks
Greg
_________________________________________________________________
Express yourself instantly with MSN Messenger! Download today it's FREE!
http://messenger.msn.click-url.com/go/onm00200471ave/direct/01/
Re: Bayes DB does not grow anymore
Posted by jdow <jd...@earthlink.net>.
From: "Kai Schaetzl" <ma...@conactive.com>
> > in a degree I have set my SA score to be more or less equal with the
> > BAYES_99 score (around 8).
>
> Your BAYES_99 score is 8? I would never do this. General rule is that no
> single rule should be able to mark a message as ham or spam. That cries
> for false positives.
I'd not do that with Bayes scores. However, there are a few rules that
are ironclad spam detectors here, and they get VERY high scores. They
are unique to me and usable only by me, so I don't bother to pass
them along. (I have a string of wrong names associated with products
people spam me about that I use to send a score well over 5 to SA. And
I have some additional PayPal antispam of my own, which involves some
fancy dancing with meta rules that get an automatic 105 to make sure
they never get through to anything but my spam folder.) I do scan the
spam folder, though. If I didn't scan it, I'd not be so vicious about
some of my spam scores.
{^_-}
Re: Bayes DB does not grow anymore
Posted by Kai Schaetzl <ma...@conactive.com>.
GRP Productions wrote on Fri, 18 Mar 2005 10:38:29 +0200:
> It seems SURBL is now enabled by default. It has also changed its name to
> URIDNSBL :-)
SURBL refers generally to those xx_SURBL rules and to URIDNSBL, since the only
other distributed rule is SBL, and SURBL started it all.
> I do not use SARE rules (although I am trying to find time to
> look at them, as I am aware of their credibility). I use Gray's rules
> (http://files.grayonline.id.au), they seem quite efficient.
I wasn't aware of that site, but now that I have visited it, I remember having
been there at least once before. Use whatever works for you. After all, this
stuff isn't meant to make you try things out again and again, but to help you
focus your time on the important things.
> I understand what you say. The point is, what should be the criteria to
> understand if the time for an expiration has come? I mean, supposing we take
> only the size in consideration, could be a problem. What if some old tokens
> are still common nowadays in spam mail?
This is not a problem. Expiry isn't done by "addition time", but by access time
(short: atime). So, items which didn't occur recently drop to the "end" of the
db and get removed by expiry. There's always the chance that old tokens which
haven't been seen for a long time "come back". But the chance is slimmer the
older the atime of that token is. There's probably some statistical curve
algorithm which could be used to determine the best "break point". Because of
the way dbx databases work expiry can't be done this way, though.
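
As an illustration of expiry by access time, here is a minimal Python sketch.
It is not SpamAssassin's actual expiry code: the function name
expire_by_atime, the in-memory token dictionary, and the 30-day cutoff are all
assumptions made up for the example.

```python
from typing import Dict

DAY = 86400  # seconds per day


def expire_by_atime(tokens: Dict[str, int], newest_atime: int, max_age: int) -> Dict[str, int]:
    """Keep only tokens last accessed within max_age seconds of newest_atime.

    tokens maps a token string to its last access time (Unix seconds).
    When the token was first added plays no role here, only its atime.
    """
    cutoff = newest_atime - max_age
    return {tok: atime for tok, atime in tokens.items() if atime >= cutoff}


# Hypothetical token store: two tokens seen recently, one not seen for 90 days.
tokens = {"viagra": 100 * DAY, "mortgage": 95 * DAY, "oldword": 10 * DAY}
kept = expire_by_atime(tokens, newest_atime=100 * DAY, max_age=30 * DAY)
print(sorted(kept))
```

A token that keeps showing up in mail has its atime refreshed and so survives
every expiry run, which is why old but still common spam tokens are not lost.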
> As I told you, since my last post I have reset everything. It seems to me
> it works fine, and it learns rapidly. It gives me no reason not to trust it,
> in a degree I have set my SA score to be more or less equal with the
> BAYES_99 score (around 8).
Your BAYES_99 score is 8? I would never do this. General rule is that no single
rule should be able to mark a message as ham or spam. That cries for false
positives.
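
The rule of thumb above can be checked mechanically. The following Python
sketch is only an illustration: the function name single_rule_risks and the
score table are hypothetical, and the threshold of 5 is taken from the "score
lower than 5" figure mentioned elsewhere in this thread.

```python
from typing import Dict, List


def single_rule_risks(scores: Dict[str, float], required: float = 5.0) -> List[str]:
    """Return the rules whose score alone reaches the spam threshold."""
    return sorted(name for name, score in scores.items() if score >= required)


# Hypothetical score tables: one with BAYES_99 at 8, one with it at 3.0.
aggressive = {"BAYES_99": 8.0, "URIBL_JP_SURBL": 4.0}
moderate = {"BAYES_99": 3.0, "URIBL_JP_SURBL": 4.0}

print(single_rule_risks(aggressive))  # rules that can mark spam on their own
print(single_rule_risks(moderate))    # no single rule crosses the threshold
```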
> Of course I keep doing mistake-based learning,
> but most of the times I feed it with 'subjective' spam mail (ie. mail that
> my users don't want to receive, but is definitely not spam).
What kind of mail is that? Newsletters they once subscribed to and don't like
anymore? They should unsubscribe instead of declaring it as spam.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: Bayes DB does not grow anymore
Posted by GRP Productions <gr...@hotmail.com>.
>Thanks for the offer. You can send it to the email address I use for this
>list, or you could just send me an FTP URL for retrieval.
Sorry I did not find the time to do this, but I will try to send it during
the weekend.
>Oh, yes. You need to have SURBL switched on via the init.pre (I think it's
>off by default) and you should use custom rules. I use a set of carefully
>chosen rulesets mostly from SARE and updated via rulesdujour and some more
>rules of my own accumulated over time.
It seems SURBL is now enabled by default. It has also changed its name to
URIDNSBL :-) I do not use SARE rules (although I am trying to find time to
look at them, as I am aware of their credibility). I use Gray's rules
(http://files.grayonline.id.au), they seem quite efficient.
>I think on a heavy-traffic machine it's preferable to have it off,
>especially when using MailScanner. Otherwise the expiry can kick in at
>random times every few hours (you can set a minimum time, though, for
>instance one day). Some people run a scheduled expiry three times a day.
>That's advice which often comes up on the MailScanner list (which is a very
>helpful list, btw).
>
>Depends on how often you need it (whether it reaches the limit you want to
>hold more often or not). Starting with one expiry per night should be fine,
>but you should occasionally expire manually and look at the output, in case
>there are problems.
>
>No. One should get rid of really old tokens, they are only "ballast" in the
>db. I don't know how a big db behaves on a busy site. Ours contain 1 million
>tokens and have a size of 40 MB. They work very well with no resource
>hogging. But I have only a few thousand messages running through each of our
>servers, there's probably none which gets more than 10,000 a day. If you
>get 100,000 it may be different.
I understand what you say. The point is, what should the criteria be for
deciding that the time for an expiry has come? Taking only the size into
consideration could be a problem. What if some old tokens are still common in
today's spam? You could say it doesn't matter, because Bayes will start over
and learn to recognize all the bad stuff again. In that sense, we could just
stop maintaining Bayes completely.
>That's what we do. I only learn messages which were categorized wrong. Not
>by Bayes, but by SA. Most messages which get a score lower than 5 still get
>a BAYES_99 which means that Bayes identifies them all. Nevertheless, I learn
>these messages because they are spam and it reassures Bayes that they are
>spam.
>BTW: I have set BAYES_99 to 3.0, because it's so accurate for us.
As I told you, since my last post I have reset everything. It seems to me
that it works fine, and it learns rapidly. It gives me no reason not to trust
it, to the degree that I have set my SA score to be more or less equal to the
BAYES_99 score (around 8). Of course I keep doing mistake-based learning,
but most of the time I feed it 'subjective' spam (i.e. mail that my users
don't want to receive, but which is definitely not spam). I monitor it
constantly and I am happy with it.
>No problem :-) I tend to be a bit snappy on first messages which look to me
>like the author could have done a bit more research, but once we are over
>that stage I hope I can give some good advice based on my experience.
I have to admit that our communication was valuable to me, I learned so much
about how the whole thing works. Once again, I appreciate it.
Greg
Re: Bayes DB does not grow anymore
Posted by Kai Schaetzl <ma...@conactive.com>.
GRP Productions wrote on Tue, 15 Mar 2005 01:12:53 +0200:
> >I have been trying to get something from CVS for several days now, no luck.
>
> Send me your email in private (grpprod@hotmail.com) to send it to you.
Thanks for the offer. You can send it to the email address I use for this list,
or you could just send me an FTP URL for retrieval.
> I will probably start again from scratch. One point: Do you think I should
> put custom rules inside /etc/mail/spamassassin or the default installation
> is enough?
Oh, yes. You need to have SURBL switched on via the init.pre (I think it's off
by default) and you should use custom rules. I use a set of carefully chosen
rulesets mostly from SARE and updated via rulesdujour and some more rules of my
own accumulated over time.
> Yes I just added this. Should auto_expire remain always at 0?
I think on a heavy-traffic machine it's preferable to have it off, especially
when using MailScanner. Otherwise the expiry can kick in at random times every
few hours (you can set a minimum time, though, for instance one day). Some
people run a scheduled expiry three times a day. That's advice which often
comes up on the MailScanner list (which is a very helpful list, btw).
It depends on how often you need it (whether the db reaches the limit you want
to hold more often or not). Starting with one expiry per night should be fine,
but you should occasionally expire manually and look at the output, in case
there are problems.
> Also, do you
> think it would be better if the db NEVER expired?
No. One should get rid of really old tokens; they are only "ballast" in the db.
I don't know how a big db behaves on a busy site. Ours contain 1 million tokens
and have a size of 40 MB. They work very well with no resource hogging. But I
have only a few thousand messages running through each of our servers; there's
probably none which gets more than 10,000 a day. If you get 100,000 it may be
different.
> Would this value of 500000
> achieve that? I don't want to come at work some day and see my tokens were
> lost again :-(
Just look at what the dump says about your oldest token. If your Bayes
"performance" is good, then the hold time is probably of no interest. But if
spam detection from Bayes is bad and you have a short hold time, the hold time
is one of the first things I would look at.
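
For reference, the hold time can be estimated from the oldest and newest
access times in the "sa-learn --dump magic" output. The Python sketch below is
a hedged illustration: the sample text imitates the dump layout from memory,
its numbers are invented, and the helper names parse_dump_magic and
hold_time_days are hypothetical.

```python
from typing import Dict

# Invented sample in the style of "sa-learn --dump magic" output.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       4000          0  non-token data: nspam
0.000          0       3000          0  non-token data: nham
0.000          0     120000          0  non-token data: ntokens
0.000          0 1107302400          0  non-token data: oldest atime
0.000          0 1110758400          0  non-token data: newest atime
"""


def parse_dump_magic(text: str) -> Dict[str, int]:
    """Map each 'non-token data' label to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        if "non-token data:" not in line:
            continue
        fields, label = line.split("non-token data:", 1)
        values[label.strip()] = int(fields.split()[2])
    return values


def hold_time_days(values: Dict[str, int]) -> float:
    """Days between the oldest and the newest token access time."""
    return (values["newest atime"] - values["oldest atime"]) / 86400


values = parse_dump_magic(SAMPLE_DUMP)
print(hold_time_days(values))
```

With the invented numbers above the hold time comes out at 40 days; a real
dump will of course give different values.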
>
> In general, should I do as you said, ie. trust the autolearn system and
> never use sa-learn again, provided that I do not have the time to do full
> training.
That's what we do. I only learn messages which were categorized wrong. Not by
Bayes, but by SA. Most messages which get a score lower than 5 still get a
BAYES_99 which means that Bayes identifies them all. Nevertheless, I learn
these messages because they are spam and it reassures Bayes that they are spam.
BTW: I have set BAYES_99 to 3.0, because it's so accurate for us.
>
> Thanks for giving me so much of your time, and being so patient with my
> silly questions.
No problem :-) I tend to be a bit snappy on first messages which look to me
like the author could have done a bit more research, but once we are over that
stage I hope I can give some good advice based on my experience.
Kai
RE: Sudden spam to this email address
Posted by Rob McEwen <ro...@powerviewsystems.com>.
David B Funk said:
>geocities is pretty good about taking crap down once they're notified,
Yes... but it often takes them a couple of days to get this done... even
when kiddy pron is involved.
I wish geocities would respond faster to such complaints.
Also, much higher volumes of spam mail with geocities.com URLs hit my server
than legit mail with geocities.com URLs.
Rob McEwen
Re: Sudden spam to this email address
Posted by David B Funk <db...@engineering.uiowa.edu>.
On Mon, 14 Mar 2005, Jeff Chan wrote:
> Well when they can sell spams that don't advertise a web site
> for the same price as those that do, let us know. Until
> then SURBLs have them.
>
> Jeff C.
OK, how about 419'ers or stock scammers?
The child porn sites that use: http://beam.to/adultworld
or http://angels.hk.to or a page at geocities?
geocities is pretty good about taking crap down once they're
notified, but that angels.hk.to site has been around for months.
--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Sudden spam to this email address
Posted by Matt Kettler <mk...@evi-inc.com>.
Stuart Johnston wrote:
> Hey, SURBLs are GREAT, no doubt about it, but let's not kid ourselves.
> It is a long way from a 100% spam solution.
I think Jeff's point is that SURBL is one test spammers have a limited
ability to adapt to without cutting into their bottom line. Not that
it's perfect.
Re: Sudden spam to this email address
Posted by Stuart Johnston <st...@ebby.com>.
Jeff Chan wrote:
> On Tuesday, March 15, 2005, 9:02:44 AM, Stuart Johnston wrote:
>
>>SURBLs have them... most of the time... eventually... Er, yeah.
>
>
> Just to check, are you using ob.surbl.org and jp.surbl.org
> in multi.surbl.org, i.e.:
In the last ~24 hours:
All SA > 5: 32540
*_SURBL: 22361 (69%)
JP_SURBL: 20157 (62%)
OB_SURBL: 19900 (61%)
This is after a couple of DNSBLs at SMTP which may skew my stats.
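
The percentages above are each list's hits divided by the total number of
messages scoring above 5. A short Python check of that arithmetic, using only
the counts quoted in this message (the variable names are made up for the
example):

```python
# Counts quoted above for the last ~24 hours.
total_spam = 32540
hits = {"*_SURBL": 22361, "JP_SURBL": 20157, "OB_SURBL": 19900}

# Share of all messages scoring above 5 that each rule group hit, in percent.
shares = {name: round(100 * count / total_spam) for name, count in hits.items()}

for name, pct in shares.items():
    print(f"{name}: {pct}%")
```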
Re: Sudden spam to this email address
Posted by Jeff Chan <je...@surbl.org>.
On Tuesday, March 15, 2005, 9:02:44 AM, Stuart Johnston wrote:
> SURBLs have them... most of the time... eventually... Er, yeah.
Just to check, are you using ob.surbl.org and jp.surbl.org
in multi.surbl.org, i.e.:
urirhssub URIBL_JP_SURBL multi.surbl.org. A 64
body URIBL_JP_SURBL eval:check_uridnsbl('URIBL_JP_SURBL')
describe URIBL_JP_SURBL Has URI in JP at http://www.surbl.org/lists.html
tflags URIBL_JP_SURBL net
score URIBL_JP_SURBL 4.0
They tend to catch new domains pretty quickly.
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
Re: Sudden spam to this email address
Posted by Stuart Johnston <st...@ebby.com>.
Jeff Chan wrote:
>
> Well when they can sell spams that don't advertise a web site
> for the same price as those that do, let us know. Until
> then SURBLs have them.
SURBLs have them... most of the time... eventually... Er, yeah.
Hey, SURBLs are GREAT, no doubt about it, but let's not kid ourselves. It
is a long way from a 100% spam solution.
Re: Sudden spam to this email address
Posted by Jeff Chan <je...@surbl.org>.
On Monday, March 14, 2005, 10:31:29 PM, Matt Kettler wrote:
> I am 100% certain that there are spammers subscribed to this list, or are
> getting the messages in some manner or another. It's rather obvious why
> they do it. Spam tools seem to quickly adapt to subjects discussed here.
> List harvesting is a bonus.
Well when they can sell spams that don't advertise a web site
for the same price as those that do, let us know. Until
then SURBLs have them.
Jeff C.
--
Jeff Chan
mailto:jeffc@surbl.org
http://www.surbl.org/
Re: Sudden spam to this email address
Posted by Matt Kettler <mk...@comcast.net>.
At 11:35 PM 3/14/2005, Greg Allen wrote:
>Does posting to this list open me up to dweebs harvesting email addresses?
Without a doubt, yes.
I am 100% certain that there are spammers subscribed to this list, or are
getting the messages in some manner or another. It's rather obvious why
they do it. Spam tools seem to quickly adapt to subjects discussed here.
List harvesting is a bonus.
RE: Sudden spam to this email address
Posted by Matt Kettler <mk...@comcast.net>.
At 11:53 PM 3/14/2005, Greg Allen wrote:
>Yep, I just found the culprit.
>
>The below 2 websites volunteer SA users-list email addresses for all the
>world to harvest. I found my email address in Google from posting here on
>this list.
One of many.. As I pointed out before, there's probably multiple spammers
who are directly subscribed to the list.
>Be warned, if you post to this list use a throw-away email address unless
>you are looking to have a good test account for SA. :-)
That should be said of *any* mailing list that's open to public
subscription. Period. They're all vulnerable to being mined regardless of
what web archives they have. All the spammer needs to do is subscribe a
"legitimate" account to the list and run all the messages through their
list-mining software. As long as that address is only used for harvesting,
and not used as a drop box for spam runs, nobody is ever likely to be the
wiser to it.
Let's face it, telling the difference between a lurker who subscribes but
never posts, and a spammer who mines but never posts is pretty much
impossible. It's like trying to tell if a stranger is a spammer. The guy at
the table behind you at lunch could be a spammer, and you'd not know. Only
a few of the really big-time spammers get their pictures circulated.
The spammer is not easily recognized. The spammer is among us, and looks
very much like us. Don't be fooled into thinking the spammer isn't there
just because you can't see him. It's in his best interest to be here, and
it's also in his best interest to blend in and not be noticed. Don't
underestimate the spammers, some may be stupid, but some are also clever
(albeit morally deficient).
Spying on one's adversaries is a battle tactic which is thousands of years
old. It goes on all the time between governments, militaries, police and
criminals, companies, neighbors. Why not here?
I'm sure at least some spammers know to spy on their adversaries... to spy
on us... here... on this list.
And I'm SURE they have no moral problems with doing so.
Re: Sudden spam to this email address
Posted by Bob Proulx <bo...@proulx.com>.
Mike Burger wrote:
> The second link definitely gets you to, what appear to be, the raw list
> archive files.
I did not see any "raw list archives" at this moment. But I did see
the mail address in the mail archives here. This one for example.
http://spamassassin.apache.org/mail/users/200503
> In addition, the actual "archives", that are viewable to the world, show
> the senders' email addresses.
Yes, but so does the mailing list. Anyone can subscribe to the
mailing list. Mailing lists that provide anonymity have been
around before, but usually they have their own set of really bad
problems. Basically, web forums are the anonymous medium today.
There can be no illusion that your mail address is secret after
posting to a public mailing list. So any spammer could get it from
there directly by subscribing regardless of how it was handled in mail
archives. I think obfuscating addresses is just closing the barn door
after the animals have already escaped. It just frustrates you and
annoys the pig.[1] But even mailing addresses only known by friends
will get leaked out because a friend will sign you up for an email
greeting card or some other such frivolous thing and get you on a
spammer's list.
However, I think the true leak is web pages. I have seen studies
showing that between one and four weeks after an email address shows up
on a web site, it will start collecting spam. And almost all
mailing lists are gatewayed to web pages somewhere on the 'net these
days.
When I web search for my email address, it is scary how many hits come
back. I have old addresses from the late 1980s that are still found
by web searches. Yet I still get very little spam to my mailbox.
RBLs, greylisting, virus filtering, spamassassin. Sad that those are
needed. But that is the way of things. Fortunately they are very
effective.
Bob
[1] Let's see how long the OT followup thread goes about that analogy. :-)
Re: Sudden spam to this email address
Posted by Mike Burger <mb...@bubbanfriends.org>.
Not his point.
The second link definitely gets you to, what appear to be, the raw list
archive files.
The first link got me a blank page.
In addition, the actual "archives", that are viewable to the world, show
the senders' email addresses.
Seems to me that whatever's generating the list archives, the raw files
should be hidden from the world. It also occurs to me that apache.org
should either be using a list manager whose archives feature hides the
email addresses (MailMan comes to mind) or a tool that properly masks the
addresses...I believe mail2html, or somesuch.
But that's just my 2 cents worth.
--
Mike Burger
http://www.bubbanfriends.org
Visit the Dog Pound II BBS
telnet://dogpound2.citadel.org or http://dogpound2.citadel.org
To be notified of updates to the web site, visit
http://www.bubbanfriends.org/mailman/listinfo/site-update, or send a
message to:
site-update-request@bubbanfriends.org
with a message of:
subscribe
Re: Sudden spam to this email address
Posted by Thomas Cameron <th...@camerontech.com>.
I don't post terribly frequently, but I certainly do post to this list (and
many others). Ditto for Usenet. No throw-away addresses for me.
I use SpamAssassin with Pyzor, Razor, DCC, and network checks, ClamAV, and
greylisting.
I can remember one spam message that made it into my Inbox this year. One.
I can't shout from the roof tops loudly or often enough: "SpamAssassin
works!" :-)
Thomas
RE: Sudden spam to this email address
Posted by Greg Allen <ga...@netrox.net>.
Yep, I just found the culprit.
The below 2 websites volunteer SA users-list email addresses for all the
world to harvest. I found my email address in Google from posting here on
this list.
aspn.activestate.com/ASPN/ Mail/Message/spamassassin-users
spamassassin.apache.org/mail/users
Be warned, if you post to this list use a throw-away email address unless
you are looking to have a good test account for SA. :-)
Sudden spam to this email address
Posted by Greg Allen <ga...@netrox.net>.
Does posting to this list open me up to dweebs harvesting email addresses?
I'm suddenly getting BS spams to this email address, and they have to be
coming from one of two sources. This list being one of the options.
Thanks.
Re: Bayes DB does not grow anymore
Posted by GRP Productions <gr...@hotmail.com>.
>I have been trying to get something from CVS for several days now, no luck.
Send me your email in private (grpprod@hotmail.com) to send it to you.
>Bayes needs constant training, but this doesn't mean it needs any manual
>training. Once it's up and running and "well-greased" it should take care
>of itself by auto-learning (bayes_auto_learn 1, don't know if on by
>default). About 70 or 80% of our spam and ham (especially the spam) is
>autolearned.
I will probably start again from scratch. One point: Do you think I should
put custom rules inside /etc/mail/spamassassin or the default installation
is enough?
>Actually, with those "few" tokens you won't lose much if you throw it away
>;-)
>As I said upping that should help, no need to throw it away unless you
>think that's easier (if most spam you get scores at BAYES_50 it might be
>better to start over than to convince the db that it's spam).
I'll probably do it.
> > bayes_auto_expire 0
>
> > bayes_expiry_max_db_size 500000
>I assume you just added/changed that?
Yes, I just added this. Should auto_expire always remain at 0? Also, do you
think it would be better if the db NEVER expired? Would this value of 500000
achieve that? I don't want to come to work some day and see that my tokens
were lost again :-(
In general, should I do as you said, i.e. trust the autolearn system and
never use sa-learn again, given that I do not have the time to do full
training?
Thanks for giving me so much of your time, and being so patient with my
silly questions.
Best regards,
Greg
Re: Bayes DB does not grow anymore
Posted by Kai Schaetzl <ma...@conactive.com>.
GRP Productions wrote on Mon, 14 Mar 2005 03:41:40 +0200:
> Indeed, this is the CVS version :-)
I have been trying to get something from CVS for several days now, no luck.
> This is perhaps because I have been using only 'mistake-based' training (ie
> training only when false classification happens). However this used to work
> fine.
Bayes needs constant training, but this doesn't mean it needs any manual
training. Once it's up and running and "well-greased" it should take care of
itself by auto-learning (bayes_auto_learn 1, don't know if on by default).
About 70 or 80% of our spam and ham (especially the spam) is autolearned.
>
> >your "hold time" is quite low, it's about a month. I think we have tokens
> >from even a year ago. That's maybe a bit too much, but I strongly suggest
> >upping your bayes_expiry_max_db_size to something like 500,000 or so.
> >Since you have a much higher flux of messages than we have on that machine
> >you are literally "burning" your db to uselessness.
>
> So what would you suggest? I certainly dont want to lose everything that has
> been learned till now.
Actually, with those "few" tokens you won't lose much if you throw it away ;-)
As I said, upping that should help; there's no need to throw it away unless you
think that's easier (if most spam you get scores at BAYES_50, it might be
better to start over than to convince the db that it's spam).
> Nope, there is definitely only the one coming with MS. I never use SA from
> the command line anyway.
Well, let's go back:
you sa-learn a message, it says it learned, you dump magic and see there's no
change, you look in the directory and there's no journal. There *has* to be at
least one additional Bayes db. Or something happens which I haven't heard of in
my about three years of using SA+Bayes. What's the output of "sa-learn --dump
magic"? Don't specify a config file!
> bayes_path /var/spool/MailScanner/bayes/bayes
and what's in your /etc/mail/spamassassin/local.conf?
> bayes_auto_expire 0
ok, that means it won't expire. Of course, if it doesn't grow this isn't
necessary ... ;-)
> bayes_expiry_max_db_size 500000
I assume you just added/changed that?
> If I get it you mean that the tokens are lost very quickly?
Yes. However, now that I know that your bayes_auto_expire is off, we have a
different case. Since when has it been off? Since Feb. 11, as your dump magic
suggests? Your oldest token is from Feb. 2. So that either means you started
the db that day or you are burning your tokens in 10 days. That's one problem;
upping to a higher ceiling, as you already did, should take care of that. The
other problem is that it's apparently not growing. One of the reasons is, of
course, that you only learn from mistakes. So, how often is that done? How many
messages do you actually add this way? The second part of this other problem is
that even if you learn, it doesn't seem to learn. I don't see any possibility
other than that it uses different dbs.
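
One way to test the "different dbs" theory is to search the likely locations
for more than one set of Bayes files. The Python sketch below is an assumption
-laden illustration: the helper name find_bayes_dirs is hypothetical, the file
name bayes_toks assumes the default "bayes" prefix with DBM storage, and the
two candidate directories are guesses based on the bayes_path quoted in this
thread and on the per-user ~/.spamassassin directory.

```python
import os
from typing import Iterable, List


def find_bayes_dirs(roots: Iterable[str], marker: str = "bayes_toks") -> List[str]:
    """Return the sorted directories under roots that contain a token file."""
    found = set()
    for root in roots:
        # os.walk silently yields nothing for directories that do not exist.
        for dirpath, _dirnames, filenames in os.walk(os.path.expanduser(root)):
            if marker in filenames:
                found.add(dirpath)
    return sorted(found)


if __name__ == "__main__":
    # Candidate locations; adjust to your own installation.
    candidates = ["/var/spool/MailScanner/bayes", "~/.spamassassin"]
    dirs = find_bayes_dirs(candidates)
    print(dirs)
    if len(dirs) > 1:
        print("More than one Bayes database found; check which one sa-learn uses.")
```

If two directories turn up, learning from the command line and scanning via
MailScanner may be reading and writing different databases.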
> I think am
> confused , if bayes works with tokens, why does it need nspam and nham? Or
> are they just counters?
It's just the number of spam and ham messages you have trained it with. Yes,
it's more or less informational only.
>
> In general, do you think that setting bayes_expiry_max_db_size would be
> enough?
To cure the fast expiration, yes, but you haven't expired for the last 30 days
anyway.
> One final thing: Why even if i manually expire, the date of last expiration
> remains old?
Same reason as above: you work on different dbs. What does the expire output
show?
Kai
Re: Bayes DB does not grow anymore
Posted by GRP Productions <gr...@hotmail.com>.
>That's okay, the problem just is one cannot be sure how accurate it is.
>Knowing that you use MS would have been useful, anyway :-)
>(BTW: my version of Mailwatch can't show this, do you use a CVS version?)
Indeed, this is the CVS version :-)
>See the number of tokens, we have ten times yours with less learned mail.
>That means that our db has much more tokens to qualify an email as ham or
>spam. Also
This is perhaps because I have been using only 'mistake-based' training (i.e.
training only when a false classification happens). However, this used to work
fine.
>your "hold time" is quite low, it's about a month. I think we have tokens
>from even a year ago. That's maybe a bit too much, but I strongly suggest
>upping your bayes_expiry_max_db_size to something like 500,000 or so. Since
>you have a much higher flux of messages than we have on that machine you are
>literally "burning" your db to uselessness.
So what would you suggest? I certainly don't want to lose everything that has
been learned till now.
>And you learned by specifying the config file? I suspect that you are at
>least occasionally using two SA configurations, the one coming with MS and
>the one coming with SA.
Nope, there is definitely only the one coming with MS. I never use SA from
the command line anyway.
>Oh. Still possible, though. You don't need to have one, but on high volume
>systems it's highly recommended. Check your SA config (wherever it is :-)
>for bayes_learn_to_journal 1. I don't know if it is 1 by default, though.
>What do you have starting with bayes in your config file?
# grep bayes /opt/MailScanner/etc/spam.assassin.prefs.conf
# be created as /var/spool/spamassassin/bayes_msgcount, etc.
#bayes_path /var/spool/spamassassin/bayes
#bayes_file_mode 0600
bayes_path /var/spool/MailScanner/bayes/bayes
bayes_file_mode 0666
# MailScanner: big bayes_toks.new files wasting space.
bayes_auto_expire 0
bayes_expiry_max_db_size 500000
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
# use_bayes 0
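To read the grep output at a glance, the active (uncommented) settings can be separated from the commented ones with a few lines of script. This is only a sketch in Python; the function name active_settings is made up for illustration, and the input is simply the grep output pasted above.

```python
# Lines copied from the grep output above; a leading '#' marks a comment.
GREP_OUTPUT = """\
# be created as /var/spool/spamassassin/bayes_msgcount, etc.
#bayes_path /var/spool/spamassassin/bayes
#bayes_file_mode 0600
bayes_path /var/spool/MailScanner/bayes/bayes
bayes_file_mode 0666
# MailScanner: big bayes_toks.new files wasting space.
bayes_auto_expire 0
bayes_expiry_max_db_size 500000
bayes_ignore_header X-MailScanner
bayes_ignore_header X-MailScanner-SpamCheck
bayes_ignore_header X-MailScanner-SpamScore
bayes_ignore_header X-MailScanner-Information
# use_bayes 0
"""


def active_settings(text):
    """Return {option: [values]} for every line that is not a comment."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition(" ")
        settings.setdefault(name, []).append(value.strip())
    return settings


settings = active_settings(GREP_OUTPUT)
for name, values in settings.items():
    print(name, "=", ", ".join(values))
```

Two things stand out in the result: bayes_expiry_max_db_size is already set to 500000, and bayes_auto_expire is 0, which means SpamAssassin will not expire tokens by itself while scanning, so expiry only happens when it is run explicitly (for example with sa-learn --force-expire).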
>Don't know if this would be of any help. As I said, I suspect you are using
>at least two different bayes dbs. At least when you do it from the command
>line. Run an "updatedb" and then "locate bayes" (this may not locate all
>files, f.i. not in /var !).
I think there is only one.
>MS, of course, can only use one and doesn't have a chance of confusing
>that, so when it uses SA that learns and checks the same db. And so far that
>part seems to be okay (except for the bigger size of bayes_seen, but as I
>said, this may be normal for your setup, I really don't know). But you burn
>your tokens too fast. At least that's what I think.
If I get it, you mean that the tokens are lost very quickly? I think I am
confused: if Bayes works with tokens, why does it need nspam and nham? Or
are they just counters?
In general, do you think that setting bayes_expiry_max_db_size would be
enough?
One final thing: why does the date of the last expiration remain old even if
I expire manually?
Re: Bayes DB does not grow anymore
Posted by Kai Schaetzl <ma...@conactive.com>.
GRP Productions wrote on Mon, 14 Mar 2005 00:32:42 +0200:
> You are right, I am using MailWatch. I just posted this output to be easy
> for one to see the actual dates without having to convert.
That's okay, the problem just is one cannot be sure how accurate it is. Knowing
that you use MS would have been useful, anyway :-)
(BTW: my version of Mailwatch can't show this, do you use a CVS version?)
> Here is the
> actual output:
>
> # /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump magic
> 0.000 0 3 0 non-token data: bayes db version
> 0.000 0 49740 0 non-token data: nspam
> 0.000 0 47167 0 non-token data: nham
> 0.000 0 123325 0 non-token data: ntokens
I didn't look at this closely before, but I think this ratio indicates a
problem. For instance, this is from our own mail server (just getting our own
mail, not our clients'):
0.000 0 30089 0 non-token data: nspam
0.000 0 12515 0 non-token data: nham
0.000 0 1001630 0 non-token data: ntokens
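The two sets of counts can be put side by side with a little arithmetic. The sketch below (Python, purely illustrative; tokens_per_message is a made-up helper name) divides ntokens by the number of learned messages for each server, and also works out the span between the oldest and newest token atime, using the values from the full --dump magic output posted in this thread.

```python
from datetime import timedelta


def tokens_per_message(nspam, nham, ntokens):
    """Average number of stored tokens per learned message."""
    return ntokens / (nspam + nham)


# Counts quoted above: the busy MailScanner server and the smaller mail server.
busy = tokens_per_message(nspam=49740, nham=47167, ntokens=123325)
small = tokens_per_message(nspam=30089, nham=12515, ntokens=1001630)

# "oldest atime" and "newest atime" from the busy server's full dump.
hold = timedelta(seconds=1110636450 - 1107319073)

print(f"busy server : {busy:.2f} tokens per learned message")
print(f"small server: {small:.2f} tokens per learned message")
print(f"busy server token window: {hold.days} days")
```

The busy server keeps little more than one token per learned message, while the smaller one keeps over twenty, and the busy server's tokens span only a little over five weeks.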
Look at the number of tokens: we have roughly eight times yours with less
learned mail. That means that our db has many more tokens to qualify an email
as ham or spam. Also your "hold time" is quite low, it's about a month. I think
we have tokens from even a year ago. That's maybe a bit too much, but I strongly
suggest upping your bayes_expiry_max_db_size to something like 500,000 or so.
Since you have a much higher flux of messages than we have on that machine, you
are literally "burning" your db to uselessness.
> No it isn't. This is exactly the point I mentioned.
But you didn't prove it ;-)
> But as I said earlier,
> sa-learn claims it has learned, even from the web interface:
> >SA Learn: Learned from 1 message(s) (1 message(s) examined).
And you learned by specifying the config file? I suspect that you are at least
occasionally using two SA configurations, the one coming with MS and the one
coming with SA.
> This is getting more suspicious: there is no bayes_journal file!
Oh. Still possible, though. You don't need to have one, but on high volume
systems it's highly recommended. Check your SA config (wherever it is :-) for
bayes_learn_to_journal 1. I don't know if it is 1 by default, though. What do
you have starting with bayes in your config file?
> -rw-rw-rw- 1 root nobody 1236 Mar 14 00:22 bayes.mutex
> -rw-rw-rw- 1 root nobody 10452992 Mar 14 00:22 bayes_seen
> -rw-rw-rw- 1 root nobody 5509120 Mar 14 00:02 bayes_toks
bayes_seen is quite large. I have never seen it larger than bayes_toks
on our systems. But maybe that's normal for high volume systems, I don't know.
On the MailScanner list many people complain about very big bayes_seen files.
Someone else on this list should comment on the size.
> I can assure you noone has touched anything inside this directory. If this
> is the reason for the problems I've been facing, is there a way to recreate
> the file without having to lose my current data? (perhaps by copying the
> above files somewhere, execute sa-learn --clear and some time later restore
> the above files?)
Don't know if this would be of any help. As I said, I suspect you are using at
least two different bayes dbs. At least when you do it from the command line.
Run an "updatedb" and then "locate bayes" (this may not locate all files, f.i.
not in /var !).
MS, of course, can only use one and doesn't have a chance of confusing that, so
when it uses SA, it learns to and checks the same db. And so far that part seems
to be okay (except for the bigger size of bayes_seen, but as I said, this may
be normal for your setup, I really don't know). But you burn your tokens too
fast. At least that's what I think.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
RE: 2 pops
Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:53 PM 3/14/2005, S M.C Butler wrote:
> >1) what does this have to do with the thread "Re: Bayes DB does not grow
> >anymore"?
> >
>
>oops I replied to that mail to get the mailing list address and forgot to
>delete the inline text, sorry about that.
Even if you did remove the inline text, it's still going to show up as a
reply to that thread... The "In-Reply-To:" header will give you away.. From
your original post:
>In-Reply-To: <BA...@phx.gbl>
Based on this, any threading mail readers and list archives will bury your
post as a reply, rather than showing it as a new thread.
Take a look at the GMANE archives for an example of threading:
http://news.gmane.org/gmane.mail.spam.spamassassin.general
The big difference in a mail client is that threading mail clients
generally allow you to collapse threads and you don't see the posts under
them when collapsed.
When posting a new thread it's really in your best interest to just create
a new message, and not try to hijack a reply into being something it's not.
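The effect of the In-Reply-To header can be seen in a small sketch using Python's standard email package. The Message-IDs below are generated on the spot under the placeholder domain example.org, and parent_of is a hypothetical helper, not how any particular mail reader or archive does its threading.

```python
from email.message import EmailMessage
from email.utils import make_msgid

# Placeholder Message-IDs; real ones are assigned by the sending mail system.
original_id = make_msgid(domain="example.org")
hijacked_id = make_msgid(domain="example.org")
fresh_id = make_msgid(domain="example.org")

original = EmailMessage()
original["Subject"] = "Bayes DB does not grow anymore"
original["Message-ID"] = original_id

# Hitting "reply" records the parent's Message-ID in In-Reply-To. Changing
# the subject and deleting the quoted text afterwards does not remove it.
hijacked = EmailMessage()
hijacked["Subject"] = "2 pops"
hijacked["Message-ID"] = hijacked_id
hijacked["In-Reply-To"] = original_id

# A message composed from scratch carries no In-Reply-To header at all.
fresh = EmailMessage()
fresh["Subject"] = "2 pops"
fresh["Message-ID"] = fresh_id

by_id = {original_id: original, hijacked_id: hijacked, fresh_id: fresh}


def parent_of(msg):
    """Return the message this one claims to answer, or None."""
    parent_id = msg.get("In-Reply-To")
    if parent_id is None:
        return None
    return by_id.get(str(parent_id))


for msg in (hijacked, fresh):
    parent = parent_of(msg)
    where = f"under '{parent['Subject']}'" if parent is not None else "as a new thread"
    print(f"'{msg['Subject']}' is filed {where}")
```

Both test messages have the same subject; only the one created by replying ends up filed under the Bayes thread.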
RE: 2 pops
Posted by "S M.C Butler" <si...@icmethods.com>.
>
>1) what does this have to do with the thread "Re: Bayes DB does not grow
>anymore"?
>
oops I replied to that mail to get the mailing list address and forgot to
delete the inline text, sorry about that.
>2) man fetchmail
thx, I'll check it out.
Re: 2 pops
Posted by Matt Kettler <mk...@evi-inc.com>.
At 02:20 PM 3/14/2005, S M.C Butler wrote:
>Hi, I would like to have my mail forwarded to my ISP's account and then
>popped to my server where I can run spam assassin and finally popped a
>second time to my PC. How do I get this 2-level pop mechanism going? How can
>I pop from my ISP account to my server in a way that will allow me to do a
>second pop from /var/mail/username to my pc
>
> Thx in advance.
1) what does this have to do with the thread "Re: Bayes DB does not grow
anymore"?
2) man fetchmail
2 pops
Posted by "S M.C Butler" <si...@icmethods.com>.
Hi, I would like to have my mail forwarded to my ISP's account and then
popped to my server where I can run spam assassin and finally popped a
second time to my PC. How do I get this 2-level pop mechanism going? How can
I pop from my ISP account to my server in a way that will allow me to do a
second pop from /var/mail/username to my pc
Thx in advance.
>-----Original Message-----
>From: GRP Productions [mailto:grpprod@hotmail.com]
>Sent: Sunday, March 13, 2005 2:33 PM
>To: users@spamassassin.apache.org
>Subject: Re: Bayes DB does not grow anymore
>
>>That is the output of --dump magic? I haven't ever seen it formatted that
>>nicely. I assume you skipped the first line, but there's also missing the
>>expire atime delta. So, where do you got this from? Not directly from
>>sa-learn
>>--dump magic I'd say. You are running SA thru some interface? You should
>>have
>>said something about the whereabouts of your installation.
>
>You are right, I am using MailWatch. I just posted this output to be easy
>for one to see the actual dates without having to convert. Here is the
>actual output:
>
># /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump
>magic
>0.000 0 3 0 non-token data: bayes db version
>0.000 0 49740 0 non-token data: nspam
>0.000 0 47167 0 non-token data: nham
>0.000 0 123325 0 non-token data: ntokens
>0.000 0 1107319073 0 non-token data: oldest atime
>0.000 0 1110636450 0 non-token data: newest atime
>0.000 0 1108137790 0 non-token data: last journal sync
>atime
>0.000 0 1108129534 0 non-token data: last expiry atime
>0.000 0 804361 0 non-token data: last expire atime
>delta
>0.000 0 3475 0 non-token data: last expire
>reduction count
>
>>Ok. Get the values. Then learn a message to it. Make sure it says that it
>>actually learned, then check the values again. Is either the spam or ham
>>count
>>increased by one or not?
>
>No it isn't. This is exactly the point I mentioned. But as I said earlier,
>sa-learn claims it has learned, even from the web interface:
>>SA Learn: Learned from 1 message(s) (1 message(s) examined).
>
>>Ok, this finally looks a bit suspicious. No sync and no expire for a
>month.
>>If
>>it doesn't sync you don't get new tokens. Check in your bayes directory
>how
>>big
>>your bayes_journal is. I'd think it's quite big. Do a sync now. (Please
>>don't
>>do it via an interface, do it on the command line.) What's the output? Is
>>the
>>journal gone and the number of tokens increased now? If so, you need to
>>investigate why it doesn't sync anymore. Also do an expire then.
>
>This is getting more suspicious: there is no bayes_journal file!
>
># ll /var/spool/MailScanner/bayes/
>total 11780
>drwxrwxrwx 2 root nobody 4096 Mar 14 00:22 .
>drwxr-xr-x 4 root nobody 4096 Mar 13 11:55 ..
>-rw-rw-rw- 1 root nobody 1236 Mar 14 00:22 bayes.mutex
>-rw-rw-rw- 1 root nobody 10452992 Mar 14 00:22 bayes_seen
>-rw-rw-rw- 1 root nobody 5509120 Mar 14 00:02 bayes_toks
>
>I can assure you noone has touched anything inside this directory. If this
>is the reason for the problems I've been facing, is there a way to recreate
>the file without having to lose my current data? (perhaps by copying the
>above files somewhere, execute sa-learn --clear and some time later restore
>the above files?)
>
>Thanks for your help
>
Re: Bayes DB does not grow anymore
Posted by GRP Productions <gr...@hotmail.com>.
>That is the output of --dump magic? I haven't ever seen it formatted that
>nicely. I assume you skipped the first line, but there's also missing the
>expire atime delta. So, where do you got this from? Not directly from
>sa-learn --dump magic I'd say. You are running SA thru some interface? You
>should have said something about the whereabouts of your installation.
You are right, I am using MailWatch. I just posted this output to be easy
for one to see the actual dates without having to convert. Here is the
actual output:
# /usr/bin/sa-learn -p /opt/MailScanner/etc/spam.assassin.prefs.conf --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 49740 0 non-token data: nspam
0.000 0 47167 0 non-token data: nham
0.000 0 123325 0 non-token data: ntokens
0.000 0 1107319073 0 non-token data: oldest atime
0.000 0 1110636450 0 non-token data: newest atime
0.000 0 1108137790 0 non-token data: last journal sync atime
0.000 0 1108129534 0 non-token data: last expiry atime
0.000 0 804361 0 non-token data: last expire atime delta
0.000 0 3475 0 non-token data: last expire reduction count
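The atime values in this output are Unix timestamps (seconds since 1970-01-01 UTC), so they can be converted to readable dates without a web interface. A minimal Python sketch follows; parse_magic and to_utc are illustrative names, and the input is the output above with the wrapped lines rejoined.

```python
from datetime import datetime, timezone

DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 49740 0 non-token data: nspam
0.000 0 47167 0 non-token data: nham
0.000 0 123325 0 non-token data: ntokens
0.000 0 1107319073 0 non-token data: oldest atime
0.000 0 1110636450 0 non-token data: newest atime
0.000 0 1108137790 0 non-token data: last journal sync atime
0.000 0 1108129534 0 non-token data: last expiry atime
0.000 0 804361 0 non-token data: last expire atime delta
0.000 0 3475 0 non-token data: last expire reduction count
"""


def parse_magic(text):
    """Map each 'non-token data' label to the integer in the third column."""
    values = {}
    for line in text.splitlines():
        fields, _, label = line.partition("non-token data:")
        if not label:
            continue
        values[label.strip()] = int(fields.split()[2])
    return values


def to_utc(epoch):
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc)


magic = parse_magic(DUMP)
for key in ("oldest atime", "newest atime",
            "last journal sync atime", "last expiry atime"):
    print(f"{key:24} {to_utc(magic[key]):%a, %d %b %Y %H:%M:%S} UTC")
```

The dates are printed in UTC, so they come out two hours behind the +0200 figures shown by MailWatch.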
>Ok. Get the values. Then learn a message to it. Make sure it says that it
>actually learned, then check the values again. Is either the spam or ham
>count increased by one or not?
No it isn't. This is exactly the point I mentioned. But as I said earlier,
sa-learn claims it has learned, even from the web interface:
>SA Learn: Learned from 1 message(s) (1 message(s) examined).
>Ok, this finally looks a bit suspicious. No sync and no expire for a month.
>If it doesn't sync you don't get new tokens. Check in your bayes directory
>how big your bayes_journal is. I'd think it's quite big. Do a sync now.
>(Please don't do it via an interface, do it on the command line.) What's the
>output? Is the journal gone and the number of tokens increased now? If so,
>you need to investigate why it doesn't sync anymore. Also do an expire then.
This is getting more suspicious: there is no bayes_journal file!
# ll /var/spool/MailScanner/bayes/
total 11780
drwxrwxrwx 2 root nobody 4096 Mar 14 00:22 .
drwxr-xr-x 4 root nobody 4096 Mar 13 11:55 ..
-rw-rw-rw- 1 root nobody 1236 Mar 14 00:22 bayes.mutex
-rw-rw-rw- 1 root nobody 10452992 Mar 14 00:22 bayes_seen
-rw-rw-rw- 1 root nobody 5509120 Mar 14 00:02 bayes_toks
I can assure you no one has touched anything inside this directory. If this
is the reason for the problems I've been facing, is there a way to recreate
the file without having to lose my current data? (perhaps by copying the
above files somewhere, executing sa-learn --clear and some time later restoring
the above files?)
Thanks for your help
Re: Bayes DB does not grow anymore
Posted by Kai Schaetzl <ma...@conactive.com>.
GRP Productions wrote on Sun, 13 Mar 2005 22:54:22 +0200:
> Perhaps I have not been clear enough. It's not only that the files' size is
> constant. I am pasting the output of dump magic,
That is the output of --dump magic? I haven't ever seen it formatted that
nicely. I assume you skipped the first line, but the expire atime delta is also
missing. So, where did you get this from? Not directly from sa-learn
--dump magic, I'd say. Are you running SA through some interface? You should
have said something about the details of your installation.
> and I have to explain that
> the nham and nspam values are the same for many days now.
Ok. Get the values. Then learn a message to it. Make sure it says that it
actually learned, then check the values again. Is either the spam or ham count
increased by one or not?
> work fine. If I send to myself a message from Yahoo, with subject 'Viagra
> sex teen ........" and other nice words, I certainly do not want it to pass.
> Bayes classifies it as 50% spam. I tried to sa-learn --forget, and then
> re-learn, still is BAYES_50.
Again, this is NOT how Bayes works. You can't teach it one message and then
expect it to flag that message as spam next time. Bayes does not work like
this!
And the fact that it classifies that message as 50%, which means it cannot
determine whether it's ham or spam, just says that the tokens in the db are not
good enough for that message. Or maybe it contains enough hammy tokens, whatever.
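A rough way to see why one learned message need not move the result much: in a simplified per-token estimate, each token's spam probability depends on how often it has been seen in spam and in ham, relative to the totals. The Python sketch below is a toy formula, not SpamAssassin's actual computation, and the per-token counts of 500 and 500 are invented for illustration; only the nspam and nham totals come from this thread.

```python
def spam_probability(spam_hits, ham_hits, nspam, nham):
    """Toy per-token estimate: spam rate relative to spam rate plus ham rate."""
    spam_rate = spam_hits / nspam
    ham_rate = ham_hits / nham
    return spam_rate / (spam_rate + ham_rate)


NSPAM, NHAM = 49740, 47167  # totals reported by --dump magic in this thread

# Hypothetical token already seen in 500 spam and 500 ham messages.
before = spam_probability(500, 500, NSPAM, NHAM)

# Learn one more spam message containing that token.
after = spam_probability(501, 500, NSPAM + 1, NHAM)

print(f"before: {before:.4f}")
print(f"after : {after:.4f}")
print(f"shift : {after - before:.4f}")
```

With this much history behind a token, one more spam sighting shifts the estimate by well under a tenth of a percentage point, so a message made up of such tokens stays near the middle.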
> Number of Spam Messages: 49,740
> Number of Ham Messages: 47,167
> Number of Tokens: 123,325
> Oldest Token: Wed, 2 Feb 2005 06:37:53 +0200
> Newest Token: Sat, 12 Mar 2005 16:07:30 +0200
That says it added a token, or changed a token's time, yesterday.
> Last Journal Sync: Fri, 11 Feb 2005 18:03:10 +0200
> Last Expiry: Fri, 11 Feb 2005 15:45:34 +0200
> Last Expiry Reduction Count: 3,475 tokens
Ok, this finally looks a bit suspicious. No sync and no expire for a month. If
it doesn't sync you don't get new tokens. Check in your bayes directory how big
your bayes_journal is. I'd think it's quite big. Do a sync now. (Please don't
do it via an interface, do it on the command line.) What's the output? Is the
journal gone and the number of tokens increased now? If so, you need to
investigate why it doesn't sync anymore. Also do an expire then.
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org
Re: Bayes DB does not grow anymore
Posted by GRP Productions <gr...@hotmail.com>.
>This doesn't prove anything. sa-learn --dump magic shows you what's inside.
>Also, Bayes is not a checksum system like Razor, that's its strength. If you
>learn something to it that means that it extracts tokens (short pieces) from
>the message and adjusts its internal probability for them being ham or spam
>by a certain factor. Or if it doesn't know that token yet it adds it.
>That the size doesn't grow can have several reasons, f.i. expiry or the fact
>that the db format seems to have some "air" in it, so that it grows in jumps
>and not continually.
Perhaps I have not been clear enough. It's not only that the files' size is
constant. I am pasting the output of dump magic, and I have to explain that
the nham and nspam values have been the same for many days now. This is not
normal, since we are talking about a very busy server (more than 4,000
messages per day). This behaviour has not always been the case; it used to
work fine. If I send myself a message from Yahoo with the subject "Viagra
sex teen ........" and other nice words, I certainly do not want it to pass.
Bayes classifies it as 50% spam. I tried to sa-learn --forget, and then
re-learn, and it is still BAYES_50. The nham and nspam values used to increase
very rapidly (sometimes by 200-300 per day). No errors are produced. I
wouldn't have noticed this particular problem, but during the last few days
more spam than usual has been passing through. Also, I
tried to force an expiration many times, but as you can see the expiration
did not take place. It's definitely not a file permission issue.
Thanks
Number of Spam Messages: 49,740
Number of Ham Messages: 47,167
Number of Tokens: 123,325
Oldest Token: Wed, 2 Feb 2005 06:37:53 +0200
Newest Token: Sat, 12 Mar 2005 16:07:30 +0200
Last Journal Sync: Fri, 11 Feb 2005 18:03:10 +0200
Last Expiry: Fri, 11 Feb 2005 15:45:34 +0200
Last Expiry Reduction Count: 3,475 tokens
Re: Bayes DB does not grow anymore
Posted by Kai Schaetzl <ma...@conactive.com>.
GRP Productions wrote on Sun, 13 Mar 2005 11:21:12 +0200:
> for some days now my bayesian DB does not seem to grow. Its size remains
> stable. It is read with no problems by SA 3.0.2, but nothing new is written.
> I send an email to me, it is classified as BAYES_50. I sa-learn it as spam,
> send it again, and it is still BAYES_50 (I expected to see it as BAYES_99).
>
This doesn't prove anything. sa-learn --dump magic shows you what's inside.
Also, Bayes is not a checksum system like Razor; that's its strength. If you
teach it something, it extracts tokens (short pieces) from the message and
adjusts its internal probability of them being ham or spam by a certain
factor. Or, if it doesn't know a token yet, it adds it.
There can be several reasons why the size doesn't grow, for instance expiry or
the fact that the db format seems to have some "air" in it, so that it grows in
jumps and not continuously.
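The token bookkeeping described above can be sketched with a toy database. Everything here is an assumption made for illustration: ToyBayesDB, its fields and the crude tokenize function are invented and are much simpler than SpamAssassin's real tokenizer and storage.

```python
import re
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: the set of lowercase word-like pieces."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


class ToyBayesDB:
    """Keeps message totals plus per-token spam and ham counts."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def learn(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.nspam += 1
            self.spam_counts.update(tokens)
        else:
            self.nham += 1
            self.ham_counts.update(tokens)

    @property
    def ntokens(self):
        """Number of distinct tokens known from either spam or ham."""
        return len(set(self.spam_counts) | set(self.ham_counts))


db = ToyBayesDB()
db.learn("cheap pills online", is_spam=True)
first = db.ntokens   # all three tokens are new
db.learn("cheap pills online now", is_spam=True)
second = db.ntokens  # only "now" is new; the others just get higher counts

print(first, second, db.nspam, db.spam_counts["cheap"])
```

In this toy model the message counter goes up with every learned message, while the token count only goes up when a message brings tokens that were not known before.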
Kai
--
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
IE-Center: http://ie5.de & http://msie.winware.org