Posted to users@spamassassin.apache.org by Linda Walsh <sa...@tlinx.org> on 2009/04/01 04:57:30 UTC

Re: user-db size, content confusions (how many toks?)

Matt Kettler wrote:
>> I see 3 DB's in my user directory (.spamassassin).
>>    auto-whitelist (~80MB),      bayes_seen (~40MB),     bayes_toks (~20MB)
>> Was trying to find relation of 'bayes_expiry_max_db_size' to the physical
>> size of the above files.
---

> expiry will only affect bayes_toks. Currently neither auto-whitelist nor
> bayes_seen have any expiry mechanism at all.
---
So they just grow without limit?  How often are they loaded?
Does only "spamd" access the auto-whitelist?

Optimally, I would assume spamd opens it upon start, but it needs to update
the disk file periodically (sync the db) for reliability.  How often does
it 'sync'?


> bayes_seen can safely be deleted if you need to. It keeps track of what
> messages have already been learned to prevent relearning them. However,
> unless you're likely to re-feed messages to SA, bayes_seen isn't strictly
> necessary.
---
	The only refeeding would usually be 'ham', because I might rerun over
an "Inbox" that might have old messages in it.  I don't rerun "ham" training
often -- except to "despam" a message (one that was marked spam and shouldn't
have been).
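[Editorial sketch of acting on Matt's suggestion above; the path assumes the
default per-user location, and spamd should be stopped first so nothing is
holding the file open:]

```shell
# bayes_seen is only a "have I already learned this message?" record;
# deleting it does not touch the learned tokens in bayes_toks.
# Path assumes the default per-user layout (~/.spamassassin).
rm -f ~/.spamassassin/bayes_seen
```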



>> I'm finding some answers, but I've run into some seeming "contradictions".
>> ...
>> ---
>> First prob (contradiction): dbg above says "token count: 0".  (This is
>> with a combined bayes db size of 60MB: _seen + _toks.)
> Are you sure your sa-learn was using the same DB path?
---
	Sure??  It listed the same filename (default location
/home/<user>/.spamassassin/<bayes...>).  Other than that, I haven't
tried to trace perl running spamassassin, to see if it is really accessing
the same file.  I'm only going off the 'debug' messages (which correspond
to the settings in "user_prefs" in the default location dir).


> From the sounds of it, sa-learn is using a directory with an empty DB.
----
	Yeah...Doesn't make sense to me -- how would "sa-learn --dump magic"
use a different location?  I.e. it showed ~500K tokens...


>> I.e. doesn't 'ntokens' = 491743 mean slightly under 500K tokens?
> Yep, looks like you have 491,743 tokens to me.

>> It's like the sa-learn magic shows a 'db' corresponding to my old limit
>> (that I think is still being 'auto-expired', so might not show the pruned
>> figure, as it runs about once per 24 hours, if I understand normal spamd
>> workings).
> Approximately. Also, be aware that in order for spamd to use new
> settings it needs to be restarted.
----
	Having changed the user_prefs file back to the default
setting (i.e. deleted my previous addition) 2 days ago, and with the
system rebooted 1 day 14 hours ago, I'm certain spamd has been restarted.
YET: all db sizes are the same as before (no reduction in size
corresponding to going 'back' to the default 150K limit), though sa-learn
run with dbg and --force-expire indicated 0 tokens -- while sa-learn with
--dump magic indicates 500K tokens.  How can "expire" say 0 toks but
dump-magic say 500K?

	File timestamps show all 3 db files have been updated today
(presumably by spamd processing email as it comes in).  But file sizes
are still at the sizes indicated at the top of this message: 80/40/20 MB.


>> So is the --magic output, maybe what is seen and being
>> 'size-controlled' by auto-expire?
> Yes, at least, it should be.


>> Why isn't 'sa-learn --force-expire' seeing the TOKENs indicated in
>> sa-learn --dump magic?
> That is particularly strange to me, and it sounds like there's some
> problems there.
---
*sigh*

> 
> Can you give a bit of detail, ie: what paths are you looking at for the
> files, what version of SA,
---
	SA = an old version, 3.1.7.
	Which at the very least points to an upgrade possibly solving the
problem, BUT this was working at one point, and I don't know why it
'stopped'.  I'm generally uncomfortable with fixing things that were working
just because they have randomly stopped working, without knowing *why*
(though that discomfort is something I've increasingly had to live with as
the Microsoft SW maintenance method becomes the norm: update and see if the
bug is gone... yes?  OK, bug gone -- unclear if fixed or merely hidden, and
unclear about the effects of other changes in a new version...).


>> Am I misinterpreting the debug output?
> No, you don't seem to be.
---
	Thanks for the confirmation of my 'reality'.  Really, the most logical
and time-efficient way to proceed is probably to upgrade to a newer version
at some point soon (and ignore my discontent about 'not knowing' why or what
caused the break).

*sigh*
Linda


Re: user-db size, excess growth...limits ignored

Posted by Matt Kettler <mk...@verizon.net>.
RW wrote:
> On Wed, 01 Apr 2009 12:27:22 -0700
> Linda Walsh <sa...@tlinx.org> wrote:
>
>   
>> Matt Kettler wrote:
>>     
>
>   
>>>>  How often does the whitelist get sync'd to disk?
>>>>         
>>> In the case of the whitelist, it's per-message.
>>>       
>> -----
>> 	*ouch* -- you mean each message writes out an 80MB white-list
>> file? That's a lot of I/O per message, no wonder spamd seems to be
>> slowing down...
>>     
>
> I think it's fairly safe to assume that the Berkeley DB libraries were
> not written by people who dropped out in the second week of
> C-programming 101, and never learned any more sophisticated way of
> accessing a database file than reading it in and then writing it out. 
>
> http://en.wikipedia.org/wiki/Berkeley_DB
>
> http://en.wikipedia.org/wiki/Mmap
>
>   
True, I did not mean to imply that the entire file is written per message.
I meant that *a* write occurs on a per-message basis.
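[Editorial sketch of the point RW and Matt are making: a dbm-style hash file
is updated in place, key by key; one per-message update does not rewrite the
other records.  Python's stdlib `dbm` stands in here for the Berkeley DB
files SA actually uses, and the key/value layout is invented for
illustration, not SA's real AWL format:]

```python
import dbm
import os
import tempfile

# Create a throwaway keyed hash file standing in for auto-whitelist.
path = os.path.join(tempfile.mkdtemp(), "awl")
with dbm.open(path, "c") as db:
    for i in range(1000):
        db[f"user{i}@example.com|ip=1.2".encode()] = b"0.5|3"
    # A "per-message write": one entry updated in place, the other
    # 999 records and the file as a whole are not rewritten.
    db[b"user42@example.com|ip=1.2"] = b"7.5|4"

with dbm.open(path, "r") as db:
    print(db[b"user42@example.com|ip=1.2"])  # b'7.5|4'
```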





Re: user-db size, excess growth...limits ignored

Posted by RW <rw...@googlemail.com>.
On Wed, 01 Apr 2009 12:27:22 -0700
Linda Walsh <sa...@tlinx.org> wrote:

> Matt Kettler wrote:

> >>  How often does the whitelist get sync'd to disk?
> > In the case of the whitelist, it's per-message.
> -----
> 	*ouch* -- you mean each message writes out an 80MB white-list
> file? That's a lot of I/O per message, no wonder spamd seems to be
> slowing down...

I think it's fairly safe to assume that the Berkeley DB libraries were
not written by people who dropped out in the second week of
C-programming 101, and never learned any more sophisticated way of
accessing a database file than reading it in and then writing it out. 

http://en.wikipedia.org/wiki/Berkeley_DB

http://en.wikipedia.org/wiki/Mmap

Re: user-db size, excess growth...limits ignored

Posted by LuKreme <kr...@kreme.com>.
On 2-Apr-2009, at 14:10, Linda Walsh wrote:
> LuKreme wrote:
>> On 1-Apr-2009, at 13:27, Linda Walsh wrote:
>>> *ouch* -- you mean each message writes out an 80MB white-list  
>>> file? That's a lot of I/O per message, no wonder spamd seems to be
>>> slowing down...
>> Nooooo.... these are DB files.  Data is added to them, this does  
>> not necessitate rewriting the entire file.
> ---
>
> Yeah -- then this refers back to the bug about there being no way  
> to  prune
> that file -- it just slowly grows and needs to be read in when spamd  
> starts(?)

Erm... You are familiar with how DB files work?  Data is looked up in  
them, the entire database is not read into memory.  The entire point  
of a DB file is to have a structure that it is relatively easy to look  
up against.

The size of the database is largely irrelevant and I bet you would be  
hard pressed to see much difference running with an 8MB, 80MB or 800MB  
database file.

-- 
Hudd: 'I've just done this radio show where I never met any of the
	other actors and I didn't understand what any of it was about'
Moore: 'Ah, yes I expect that's the thing I'm in.'


Re: user-db size, excess growth...limits ignored

Posted by Jonas Eckerman <jo...@frukt.org>.
Linda Walsh wrote:

> Yeah -- then this refers back to the bug about there being no way to  prune
> that file -- it just slowly grows and needs to be read in when spamd 
> starts(?)

No.

The AWL is stored in a database, and spamd does not read the whole 
database into memory. It just looks up and updates the address pairs as 
needed.

The same principle is true for the bayes database.

> So the only real harm is the increased read-initialization and the run-time
> AWL length?

I don't know what you mean by "run-time AWL length", but I don't think
the time to open a Berkeley DB grows much as the file grows.

What will become slower as the file grows are the database updates and,
to a lesser degree, the lookups.

If the AWL or bayes database grows enough for this to actually do harm, 
I'd suggest moving to a SQL database (where expiration of old address 
pairs is pretty easy to implement).
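[Editorial sketch of the expiration Jonas describes, using SQLite and a
hypothetical schema.  SA's stock SQL AWL schema differs -- notably it has no
timestamp column, so a `last_hit`-style column would have to be added for
this to work:]

```python
import sqlite3
import time

# Hypothetical AWL table; "last_hit" is an added column, not part of
# SpamAssassin's stock SQL schema.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE awl (
    username TEXT, email TEXT, ip TEXT,
    msgcount INTEGER, totscore REAL, last_hit INTEGER)""")

now = int(time.time())
con.executemany(
    "INSERT INTO awl VALUES (?,?,?,?,?,?)",
    [("linda", "old@example.com", "1.2", 5, 12.5, now - 90 * 86400),
     ("linda", "new@example.com", "3.4", 2, -1.0, now)])

# Expire address pairs not seen for 60 days.
con.execute("DELETE FROM awl WHERE last_hit < ?", (now - 60 * 86400,))
remaining = [row[1] for row in con.execute("SELECT * FROM awl")]
print(remaining)  # ['new@example.com']
```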


Regards
/Jonas

Re: user-db size, excess growth...limits ignored

Posted by Linda Walsh <sa...@tlinx.org>.
  LuKreme wrote:
> On 1-Apr-2009, at 13:27, Linda Walsh wrote:
>> *ouch* -- you mean each message writes out an 80MB white-list file? 
>> That's a lot of I/O per message, no wonder spamd seems to be slowing
>> down...
> 
> Nooooo.... these are DB files.  Data is added to them, this does not 
> necessitate rewriting the entire file.
---

Yeah -- then this refers back to the bug about there being no way to prune
that file -- it just slowly grows and needs to be read in when spamd starts(?),
and spamd needs to keep that info around as the basis for its AWL scoring, no?
So the only real harm is the increased read-initialization and the run-time
AWL length?


Re: user-db size, excess growth...limits ignored

Posted by LuKreme <kr...@kreme.com>.
On 1-Apr-2009, at 13:27, Linda Walsh wrote:
> *ouch* -- you mean each message writes out an 80MB white-list file?  
> That's a lot of I/O per message, no wonder spamd seems to be slowing
> down...

Nooooo.... these are DB files.  Data is added to them, this does not  
necessitate rewriting the entire file.


-- 
I have a love child who sends me hate mail


Re: user-db size, excess growth...limits ignored

Posted by Linda Walsh <sa...@tlinx.org>.
Matt Kettler wrote:
> Linda Walsh wrote:
>> Matt Kettler wrote:
>>>> I see 3 DB's in my user directory (.spamassassin).
>>>>    auto-whitelist (~80MB),   bayes_seen (~40MB),   bayes_toks (~20MB)

>>> expiry will only affect bayes_toks. Currently neither auto-whitelist nor
>>> bayes_seen have any expiry mechanism at all.
>> ---
>> So they just grow without limit?
> Yep. Not ideal, and there's bugs open on both.

>>  How often does the whitelist get sync'd to disk?
> In the case of the whitelist, it's per-message.
-----
	*ouch* -- you mean each message writes out an 80MB white-list file?
That's a lot of I/O per message, no wonder spamd seems to be slowing down...


>>     Having changed the user_prefs files back to the default
>> setting (i.e. deleted my previous addition) -- 2 days ago, and system was
>> rebooted 1day14hours ago, I'm certain spamd has been restarted.
> Hmm, can you set bayes_expiry_max_db_size in a user_prefs file? That
> seems like an option that might be privileged and only honored at the
> site-wide level. An absurdly large value can bog the whole server down
> when processing mail, so an end user could DoS your machine if allowed
> to set this.
----
	I *thought* I could set it -- certainly, the only place I
*increased* the tokens beyond the *default* was in user_prefs.  That
*seems* to have worked in bumping the toks up to 500K, but now
lowering it is being ignored.  Perhaps the user_prefs handling of the
token setting changed: an old version allowed it and raised it to 500K,
but the newer version disallows it, so I can't 're-lower' it (though I'd
think the global 150K limit would then have been re-applied).
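[Editorial sketch: if Matt's suspicion is right and the option is honored
only site-wide, the place to set it would be the system-wide config file
rather than user_prefs.  The path below varies by install; the values shown
are the stock defaults:]

```
# /etc/mail/spamassassin/local.cf -- site-wide settings (path varies
# by install; this is NOT a per-user user_prefs file).

# Token ceiling for bayes_toks; 150000 is the stock default.
bayes_expiry_max_db_size 150000

# Let the opportunistic expiry run (on by default).
bayes_auto_expire 1
```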



> That said, 3.1.7 is vulnerable to CVE-2007-0451 and CVE-2007-2873.
> 
> You should seriously consider upgrading for the first one.

-----
	While I was supporting multiple local users at one point, I'm now
the only local user, so local-user escalation to create local denial of
service isn't my top-most concern.  That doesn't mean I shouldn't upgrade
for other reasons.


I'm still *greatly* concerned about an 80MB file being written to disk,
potentially on every incoming email message.  That seems like high
overhead -- or are there mitigating factors that reduce that amount in
99% of cases?

Tnx,
Linda

Re: user-db size, content confusions (how many toks?)

Posted by Matt Kettler <mk...@verizon.net>.
Linda Walsh wrote:
> Matt Kettler wrote:
>>> I see 3 DB's in my user directory (.spamassassin).
>>>    auto-whitelist (~80MB),      bayes_seen (~40MB),     bayes_toks
>>> (~20MB)
>>> Was trying to find relation of 'bayes_expiry_max_db_size' to the
>>> physical
>>> size of the above files.
> ---
>
>> expiry will only affect bayes_toks. Currently neither auto-whitelist nor
>> bayes_seen have any expiry mechanism at all.
> ---
> So they just grow without limit?
Yep. Not ideal, and there's bugs open on both.
>   How often are they loaded?
IIRC, at the creation of a Mail::SpamAssassin instance, but I'm not well
versed in that aspect of the code.
> Does only "spamd" access the auto-whitelist?
Well, any Mail::SpamAssassin instance. (spamd, the "spamassassin"
script, etc). spamc, on the other hand, is not a Mail::SpamAssassin
instance, and doesn't access *any* of the SA config files or databases.

>
> Optimally, I would assume spamd opens it upon start, but it needs to
> update
> the disk file periodically (sync the db) for reliability.  How often does
> it 'sync'?
In the case of the whitelist, it's per-message.

In the case of the bayes_seen, every time a message is learned.
>
>> bayes_seen can safely be deleted if you need to. It keeps track of what
>> messages have already been learned to prevent relearning them. However,
>> unless you're likely to re-feed messages to SA, bayes_seen isn't strictly
>> necessary.
> ---
>     Only refeeding would usually be 'ham', because I might rerun over
> an "Inbox", that might have old messages in it.  I don't rerun "ham"
> training
> often -- except to "despam" a message (one that was marked spam and
> shouldn't
> have been).
>
>
>
>>> I'm finding some answers, but I've run into some seeming
>>> "contradictions".  ...
>>> ---
>>> First prob (contradiction): dbg above says "token count: 0".  (This is
>>> with a combined bayes db size of 60MB: _seen + _toks.)
>> Are you sure your sa-learn was using the same DB path?
> ---
>     Sure??  It listed the same filename (default location
> /home/<user>/.spamassassin/<bayes...>).  Other than that, I haven't
> tried to trace perl running spamassassin, to see if it is really
> accessing
> the same file.  Only going off the 'debug' messages (which correspond
> to the
> settings in "user_prefs" that's in the default location dir.
>
>
>> From the sounds of it, sa-learn is using a directory with an empty DB.
> ----
>     Yeah...Doesn't make sense to me -- how would "sa-learn --dump magic"
> use a different location?  I.e. it showed ~500K tokens...
>
>
>>> I.e. doesn't 'ntokens' = 491743 mean slightly under 500K tokens?
>> Yep, looks like you have 491,743 tokens to me.
>
>>> It's like the sa-learn magic shows a 'db' corresponding to my old limit
>>> (that I think is still being 'auto-expired', so might not have pruned
>>> figure as it runs about once per 24 hours, if I understand normal spamd
>>> workings).
>> Approximately. Also, be aware that in order for spamd to use new
>> settings it needs to be restarted.
> ----
>     Having changed the user_prefs files back to the default
> setting (i.e. deleted my previous addition) -- 2 days ago, and system was
> rebooted 1day14hours ago, I'm certain spamd has been restarted.
Hmm, can you set bayes_expiry_max_db_size in a user_prefs file? That
seems like an option that might be privileged and only honored at the
site-wide level. An absurdly large value can bog the whole server down
when processing mail, so an end user could DoS your machine if allowed
to set this.



>
> YET: all db sizes are the same as before (no reduction in size
> corresponding to going 'back' to a default 150K limit), though sa-learn
> run with dbg and force-expire indicated 0 tokens -- but sa-learn
> w/dump magic
> indicates 500K tokens.  How can "expire" say 0 toks but dump-magic say
> 500K?
That's a big mystery to me. Doesn't make sense.
>
>     File timestamps show all 3 db files have been updated today.
> (Presumably by spamd processing email as it comes in).  But file sizes
> still are @ sizes indicated at top of this message: 80/40/20-MB.
>
>
>>> So is the --magic output, maybe what is seen and being
>>> 'size-controlled' by auto-expire?
>> Yes, at least, it should be.
>
>
>>> Why isn't 'sa-learn --force-expire' seeing the TOKENs indicated in
>>> sa-learn --dump magic?
>> That is particularly strange to me, and it sounds like there's some
>> problems there.
> ---
> *sigh*
>
>>
>> Can you give a bit of detail, ie: what paths are you looking at for the
>> files, what version of SA,
> ---
>     SA = old version of 3.1.7.
>     Which at very least points to an upgrade possibly solving the
> problem,
> BUT, this was working at one point, and don't know why it 'stopped'.  I'm
> generally uncomfortable with fixing things that were working just
> because they
> have randomly stopped working without knowing *why*, (though that
> discomfort has
> become something I've just more had to deal with as the Microsoft SW
> maintenance method becomes the norm (update and see if bug is
> gone...yes?  ok,
> bug gone; (unclear if fixed or hidden, unclear about effects of other
> changes in
> a new version...)
Understood.

That said, 3.1.7 is vulnerable to CVE-2007-0451 and CVE-2007-2873.

You should seriously consider upgrading for the first one.

http://wiki.apache.org/spamassassin/Security
<http://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2007-2873>   
>
>
>>> Am I misinterpreting the debug output?
>> No, you don't seem to be.
> ---
>     Thanks for the confirmation of my 'reality'.  Really, the most
> logical
> and time-efficient way to proceed is likely to upgrade to newer
> version at some
> point soon (and ignore my discontent regarding 'not knowing' why or
> what caused
> the break).
>
> *sigh*
> Linda
>