Posted to dev@httpd.apache.org by Marc Slemko <ma...@worldgate.com> on 1997/10/18 07:25:18 UTC
on large numbers of virtual hosts and memory use

Below is pulled from a thread on comp.unix.solaris.

The basic issue is that each virtual host configured into Apache
takes up some memory; say 4k per host as a rough figure.  Stronghold
appears to, for whatever reason, take a _lot_ more.  If you have
thousands and thousands of virtual hosts, this memory use adds up.

This memory is set up in the parent, and the child processes shouldn't
play with it afterwards.
Since the child processes don't play with it, any modern system will
not allocate pages for it but will simply flag it COW.  So far so
good; the memory isn't used, so it doesn't add to physical memory
overhead.

The trick is that many or most systems reserve swap space for pages
flagged as COW so that they don't risk running out of swap when
a process decides it wants to actually write to these pages.
Therein lies the problem; you gobble huge amounts of swap.

The workaround, since we know the children should never be writing to
these pages (so COW is simply standing in for real shared memory), is
to actually implement that via shared memory.  Then you get rid
of the private pages mapped in each child, and are much happier.  This
doesn't look to be _that_ major an undertaking to me.  Comments?
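
A rough sketch of the shared-memory idea, not taken from the Apache
source.  struct vhost_rec, build_shared_table() and child_sees_entry()
are hypothetical names, and MAP_ANONYMOUS is assumed to exist (where it
does not, mapping /dev/zero with MAP_SHARED is the usual substitute).
The parent fills the table once in a MAP_SHARED mapping and marks it
read-only; forked children then read the same pages, so there are no
private COW pages to back with swap.

```c
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical per-vhost record; a real server keeps far more state. */
struct vhost_rec {
    char name[64];
    unsigned short port;
};

/* Build the table once, in the parent, inside a shared anonymous mapping.
 * Returns NULL on failure. */
struct vhost_rec *build_shared_table(size_t nhosts)
{
    size_t len = nhosts * sizeof(struct vhost_rec);
    size_t i;
    struct vhost_rec *tab = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (tab == MAP_FAILED)
        return NULL;
    for (i = 0; i < nhosts; i++) {
        snprintf(tab[i].name, sizeof tab[i].name, "host%zu.example.com", i);
        tab[i].port = 80;
    }
    /* Freeze the table so a stray write faults instead of going unnoticed. */
    if (mprotect(tab, len, PROT_READ) != 0) {
        munmap(tab, len);
        return NULL;
    }
    return tab;
}

/* Fork one child that reads entry idx through the inherited mapping.
 * Returns 1 if the child saw the name the parent wrote, 0 if not,
 * -1 on error. */
int child_sees_entry(const struct vhost_rec *tab, size_t idx)
{
    char expect[64];
    int status;
    pid_t pid;

    snprintf(expect, sizeof expect, "host%zu.example.com", idx);
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(strcmp(tab[idx].name, expect) == 0 ? 0 : 1);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A caller would build the table before forking, for example
tab = build_shared_table(5000), and each child would treat tab as
read-only configuration.  Whether the swap reservation really shrinks
depends on how the operating system accounts for shared mappings.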


---------- Forwarded message ----------
>Path: scanner.worldgate.com!news.he.net!newsfeed.direct.ca!newsfeed.internetmci.com!207.69.200.61!mindspring!news.mindspring.com!demon.mindspring.com!news
>From: news@demon.mindspring.com (News Reader)
>Newsgroups: comp.unix.solaris
>Subject: Re: reserving swap (was: Solaris 2.6 fd limits ? 256)
>Date: 18 Oct 1997 04:34:58 GMT
>Organization: MindSpring Enterprises Inc.
>Lines: 91
>Distribution: inet
>Message-ID: <62...@camel12.mindspring.com>
>References: <34...@risq.qc.ca> <62...@griffin.itc.gu.edu.au> <62...@camel18.mindspring.com> <62...@griffin.itc.gu.edu.au>
>NNTP-Posting-Host: aslan.mindspring.net
>Keywords: mmap(), MAP_NORESERVE, fork(), reserved swap
>Xref: scanner.worldgate.com comp.unix.solaris:120551    

In article <62...@griffin.itc.gu.edu.au>,
Sean Vickery  <S....@its.gu.BLOODY.VIKINGS.edu.au> wrote:
>
>Solaris malloc() (always?) allocates heap memory using sbrk(), not mmap().

If you link against libmapmalloc, then you use versions of malloc() that
use mmap() instead of sbrk().  I discovered later from the mmap() man
page (which I've read a hundred times and overlooked this) that

     "...MAP_NORESERVE mappings are inherited
     across  fork(2);  at  the  time of the fork(2) swap space is
     reserved in the child for all private pages  that  currently
     exist  in the parent; thereafter the child's mapping behaves
     as described above."

which states that swap space is reserved across a fork() anyway, even if it
is mmapped MAP_NORESERVE.
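
A small sketch of the case the man page describes, assuming a system
that provides MAP_ANONYMOUS and MAP_NORESERVE.  map_noreserve() and
touch_pages() are hypothetical helper names.  The mapping itself
reserves nothing, but every page the parent has written is a private
page that "currently exists", so on Solaris swap would be reserved for
it in each child at fork() time.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map len bytes of private, zero-filled memory without reserving swap
 * up front.  Returns NULL on failure. */
void *map_noreserve(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write one byte into each of the first npages pages so that they become
 * real private pages.  Returns the number of pages touched. */
size_t touch_pages(char *base, size_t npages, size_t pagesz)
{
    size_t i;
    for (i = 0; i < npages; i++)
        base[i * pagesz] = 1;
    return i;
}
```

Under this reading, only the touched pages count against swap after a
fork(); the untouched remainder of the mapping stays unreserved.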

>I'm pretty sure that sbrk() would always reserve swap space, in parent and
>child.  I'm guessing too now, but it makes sense.  One wouldn't want to be
>writing into some malloc()ed pages when suddenly one gets a SIGBUS, like one
>does in the case where one's using MAP_NORESERVE mmap() pages and there's
>not enough swap when a page actually gets written to for the first time.
>If malloc did this it would be ridiculous:  `Malloc() told me when it returned
>successfully that the system had enough memory to give me some; now it's
>changed its mind.'

This can only happen if you actually run out of physical swap and totally
exhaust RAM.  Good system planning and no serious memory leaks can make this
unlikely.  With copy on write via fork(), the system can waste enormous
amounts of resources because so much is reserved, but so little is used.

[snipped]
>I wonder if you've really got a problem:  you have to configure up heaps
>of swap space on the system, sure, but it's not actually being used or slowing
>down Apache in any way, so why worry about that?

We worry because we need to allocate massive amounts of swap, potentially
over 30 GB and up, (combined on all servers) that will never get used.  That
can cost a lot of money and cause some headaches.

>> 
>> Here is a line from the Apache source code (http_main.c, about line 686):
>> 
>> m = mmap((caddr_t)0, SCOREBOARD_SIZE, PROT_READ | PROT_WRITE, MAP_ANON |
>> MAP_SHARED | MAP_NORESERVE, -1, 0);
>
>This looks to be explicitly allocating a block of shared memory, which isn't
>what I thought we were talking about.  MAP_ANON is non-standard, a Linuxism I
>think, and isn't defined by Solaris' include files.  When Apache is being
>compiled for Solaris, does it define MAP_ANON to be zero?

I copied and pasted the wrong line.  This line would not be #def'd and would
not be compiled, but it demonstrates the change anyway.

>
>Clearly Apache isn't using mmap() to allocate the large block of memory that
>you are concerned about.

No, I don't even think they thought about the problem.  We only first
noticed it about 8 months ago when a system was having problems with
"could not grow stack", or "could not fork(), no space available."
It seemed to be out of memory, but had actually used very little physical
swap.  The problem only becomes significant when we have a large number of 
virtual hosts in Apache's config, or we are using Stronghold.  We could
shrink the daemon size down by splitting it up, but we would have to use
the LISTEN directive.  Using that caused us to run into the stdio FILE
struct problem as well as a number of other log splitting difficulties.  The
whole thing just got too ugly.

>
>If what Apache wants to do is to share some data in memory between a parent
>and children processes, perhaps it should do so explicitly: call shm_open(),
>ftruncate() and mmap(..., MAP_SHARED, ...) in the parent, then fork() and
>in the child, to insure against it writing to the memory, call mprotect(...
>PROT_READ).  Would this work, or would the child have to call shm_open, etc
>itself too?  That would certainly work, though would require a few extra lines
>of code.  From your description, implementing things this way would appear to
>give the desired functionality, without the need to reserve large amounts of
>swap or rely on copy-on-write.

True, but that would require a good rewrite of Apache.  Oracle does use
shared memory this way.  We have about 6-7 oracle daemons using 26MB of
memory each but only about 40 MB of reserved swap is "used".  The
no-overcommit feature, while useful to keep machines from thrashing when
they really do run out of memory, can seriously waste resources under some
circumstances and should have an off switch.


-- mikeh AT mindspring.net
MindSpring Web Hosting Engineering



---------- Forwarded message ----------
>Path: scanner.worldgate.com!news.maxwell.syr.edu!newsfeed.internetmci.com!192.48.96.124!in4.uu.net!ozemail!news.mel.aone.net.au!newsfeed-in.aone.net.au!news.mel.connect.com.au!munnari.OZ.AU!bunyip.cc.uq.edu.au!newshost.gu.edu.au!usenet
>From: Sean Vickery <S....@its.gu.BLOODY.VIKINGS.edu.au>
>Newsgroups: comp.unix.solaris
>Subject: Re: reserving swap (was: Solaris 2.6 fd limits ? 256)
>Date: 17 Oct 1997 09:14:40 GMT
>Organization: Griffith University, Queensland, Australia
>Lines: 64
>Distribution: inet
>Message-ID: <62...@griffin.itc.gu.edu.au>
>References: <34...@risq.qc.ca> <62...@camel20.mindspring.com> <62...@griffin.itc.gu.edu.au> <62...@camel18.mindspring.com>
>NNTP-Posting-Host: centaur.itc.gu.edu.au
>Keywords: mmap(), MAP_NORESERVE, fork(), reserved swap
>Xref: scanner.worldgate.com comp.unix.solaris:120433    

On 16 Oct 1997, News Reader <ne...@demon.mindspring.com>
wrote in comp.unix.solaris:
> The parent writes the data to malloc'd memory, not mmap'd memory (I assume).
> It then forks and its children read and access that data but never change or
> write to it (so they never get their own private copies).  Unless you rewrite
> malloc() to use MAP_NORESERVE in its calls to mmap() (if that is what it
> does), then I see no easy solution to this problem.  But I'm still guessing.

Mike,

Solaris malloc() (always?) allocates heap memory using sbrk(), not mmap().
I'm pretty sure that sbrk() would always reserve swap space, in parent and
child.  I'm guessing too now, but it makes sense.  One wouldn't want to be
writing into some malloc()ed pages when suddenly one gets a SIGBUS, like one
does in the case where one's using MAP_NORESERVE mmap() pages and there's
not enough swap when a page actually gets written to for the first time.
If malloc did this it would be ridiculous:  `Malloc() told me when it returned
successfully that the system had enough memory to give me some; now it's
changed its mind.'

Now, gnumalloc does use mmap() when you ask it for a large (>120k or so)
block.  You could easily patch the gnumalloc source to include MAP_NORESERVE,
but then Apache would have to be aware that it may have more memory mapped
than can be backed in swap, the scenario I described in the previous
paragraph.  If you have plenty of swap, you may as well ignore this.

I wonder if you've really got a problem:  you have to configure up heaps
of swap space on the system, sure, but it's not actually being used or slowing
down Apache in any way, so why worry about that?

> >Some details, code even, from your application may prove helpful.
> 
> Here is a line from the Apache source code (http_main.c, about line 686):
> 
> m = mmap((caddr_t)0, SCOREBOARD_SIZE, PROT_READ | PROT_WRITE, MAP_ANON |
> MAP_SHARED | MAP_NORESERVE, -1, 0);

This looks to be explicitly allocating a block of shared memory, which isn't
what I thought we were talking about.  MAP_ANON is non-standard, a Linuxism I
think, and isn't defined by Solaris' include files.  When Apache is being
compiled for Solaris, does it define MAP_ANON to be zero?

> I added MAP_NORESERVE to this call and every other mmap() call in the apache
> source.  It made no difference.

Clearly Apache isn't using mmap() to allocate the large block of memory that
you are concerned about.

> [top, pmap and swap -s output snipped]

If what Apache wants to do is to share some data in memory between a parent
and children processes, perhaps it should do so explicitly: call shm_open(),
ftruncate() and mmap(..., MAP_SHARED, ...) in the parent, then fork() and
in the child, to insure against it writing to the memory, call mprotect(...
PROT_READ).  Would this work, or would the child have to call shm_open, etc
itself too?  That would certainly work, though would require a few extra lines
of code.  From your description, implementing things this way would appear to
give the desired functionality, without the need to reserve large amounts of
swap or rely on copy-on-write.
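
A sketch of the sequence described above, under the assumption that a
mapping created before fork() is inherited by the child, so the child
does not need to call shm_open() again.  create_shared_region() and
child_reads_byte() are hypothetical names, the object name passed in is
an arbitrary example, and some older systems need -lrt to link
shm_open().

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parent side: create a POSIX shared memory object of len bytes and map
 * it read/write.  The name is unlinked at once; the mapping stays valid.
 * Returns NULL on failure. */
void *create_shared_region(const char *name, size_t len)
{
    void *p;
    int fd = shm_open(name, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return NULL;
    shm_unlink(name);
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return NULL;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}

/* Child side: drop write access with mprotect(), then read the first
 * byte through the inherited mapping.  Returns 1 if it equals expect,
 * 0 if not, -1 on error. */
int child_reads_byte(void *region, size_t len, unsigned char expect)
{
    int status;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (mprotect(region, len, PROT_READ) != 0)
            _exit(2);
        _exit(((unsigned char *)region)[0] == expect ? 0 : 1);
    }
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

The parent would fill the region before forking; the mprotect() call in
the child only changes that child's view and leaves the parent's
mapping writable.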

Sean.
--
Sean Vickery <S....@its.gu.BLOODY.VIKINGS.edu.au> Ph: +61 (0)7 3875 6410
Systems Programmer         Information Services          Griffith University
Copyright (C) 1997 All rights reserved.  Remove the smeared Nordics to email.



---------- Forwarded message ----------
>Path: scanner.worldgate.com!rover.ucs.ualberta.ca!news.bc.net!logbridge.uoregon.edu!newsfeed.internetmci.com!207.69.200.61!mindspring!news.mindspring.com!demon.mindspring.com!news
>From: news@demon.mindspring.com (News Reader)
>Newsgroups: comp.unix.solaris
>Subject: Re: reserving swap (was: Solaris 2.6 fd limits ? 256)
>Date: 16 Oct 1997 17:34:08 GMT
>Organization: MindSpring Enterprises Inc.
>Lines: 171
>Distribution: inet
>Message-ID: <62...@camel18.mindspring.com>
>References: <34...@risq.qc.ca> <62...@griffin.itc.gu.edu.au> <62...@camel20.mindspring.com> <62...@griffin.itc.gu.edu.au>
>NNTP-Posting-Host: aslan.mindspring.net
>Keywords: mmap(), MAP_NORESERVE, fork(), reserved swap
>Xref: scanner.worldgate.com comp.unix.solaris:120339    

In article <62...@griffin.itc.gu.edu.au>,
Sean Vickery  <S....@its.gu.BLOODY.VIKINGS.edu.au> wrote:
>On 15 Oct 1997, News Reader <ne...@demon.mindspring.com> wrote
>in comp.unix.solaris:
>> In article <62...@griffin.itc.gu.edu.au>,
>> Sean Vickery  <S....@its.gu.BLOODY.VIKINGS.edu.au> wrote:
>> >
>> >Solaris is perfectly capable of mapping pages without reserving swap.  Simply
>> >pass the MAP_NORESERVE flag to mmap(2).  [snip]
>> 
>> I don't think that will work in this situation.  mmap(2) doesn't even come
>> into play here.  The problem is that the daemons fork() to handle new
>> requests and the reserve memory count increases appropriately for that
>> child's address space.  Since fork() does copy on write (correct for Solaris?)
>> and the large blocks of memory never get written to, huge amounts of
>> virtual memory (reserved swap) get used while the actual memory usage doesn't
>> increase much.  I'm just guessing here so I may be out in left field.  I
>> tried adding MAP_NORESERVE to mmap() calls in several programs but it made no
>> difference.
>
>Mike,
>
>The mmap() system call is often at work behind the scenes, especially whenever
>shared libraries are involved.  It's one of the basic ways to have more pages
>mapped into a process's virtual address space.  (Others are sbrk() and exec.)
>Certainly fork(2) does copy-on-write.
>
>I don't understand what you mean by `the large blocks of memory never get
>written to', so I'll continue to attempt to answer your question in the most
>general case.  What are these large blocks of memory that never get written
>to?  And wouldn't that be a bit inefficient?

The parent writes the data to malloc'd memory, not mmap'd memory (I assume).
It then forks and its children read and access that data but never change or
write to it (so they never get their own private copies).  Unless you rewrite
malloc() to use MAP_NORESERVE in its calls to mmap() (if that is what it
does), then I see no easy solution to this problem.  But I'm still guessing.

>Some details, code even, from your application may prove helpful.

Here is a line from the Apache source code (http_main.c, about line 686):

m = mmap((caddr_t)0, SCOREBOARD_SIZE, PROT_READ | PROT_WRITE, MAP_ANON |
MAP_SHARED | MAP_NORESERVE, -1, 0);

<END>

I added MAP_NORESERVE to this call and every other mmap() call in the apache
source.  It made no difference.

>
>From reading this, it seems pretty clear to me that if one had mmap()ed
>a large chunk of memory with the MAP_NORESERVE flag, didn't write to any
>of it (thus no private pages are required to be created), then fork()ed,
>then no additional swap space would be consequently reserved for the
>large chunk.

Theoretically, yes.

>If you'd like us to have a bash at a better answer, give us some details
>about the daemon you're writing.

It's actually Apache and Stronghold (which uses the Apache source code).
Apache, with about 6 Class C's on an Ultra2, uses nearly 6MB per daemon.
Stronghold with two Class C's uses over 30MB per daemon; with one Class C it
is around 15MB.  I compiled these daemons with MAP_NORESERVE in all mmap() calls:

from top:
  PID USERNAME THR PRI NICE  SIZE   RES STATE   TIME    CPU COMMAND
 1997 nobody     1  35    0   15M 1932K sleep   0:00  0.13% httpsd

The parent's RES size is over 9 MB, but the children only around 2 MB.


/usr/proc/bin/pmap 1997
1997:   ./httpsd -d /var/stronghold -f conf/httpsd_6.conf
00010000  860K read/exec          dev: 162,2   ino: 2778085
000F6000   60K read/write/exec    dev: 162,2   ino: 2778085
00105000 12536K read/write/exec
00117000 12464K     [ heap ]
EF580000   64K read/write/shared
EF5A0000   16K read/exec          /usr/lib/nss_files.so.1
EF5B3000    4K read/write/exec    /usr/lib/nss_files.so.1
EF5C0000   28K read/exec          /usr/lib/libw.so.1
EF5D6000    4K read/write/exec    /usr/lib/libw.so.1
EF5E0000   12K read/exec          /usr/lib/libmp.so.1
EF5F2000    4K read/write/exec    /usr/lib/libmp.so.1
EF600000  508K read/exec          /usr/lib/libc.so.1
EF68E000   32K read/write/exec    /usr/lib/libc.so.1
EF696000    8K read/write/exec
EF6A0000   12K read/exec          /usr/lib/libintl.so.1
EF6B2000    4K read/write/exec    /usr/lib/libintl.so.1
EF6C0000  388K read/exec          /usr/lib/libnsl.so.1
EF730000   36K read/write/exec    /usr/lib/libnsl.so.1
EF739000   32K read/write/exec
EF760000   28K read/exec          /usr/lib/libsocket.so.1
EF776000    8K read/write/exec    /usr/lib/libsocket.so.1
EF780000   84K read/exec          /usr/lib/libm.so.1
EF7A4000    8K read/write/exec    /usr/lib/libm.so.1
EF7B0000    4K read/exec/shared   /usr/lib/libdl.so.1
EF7C0000    4K read/write/exec
EF7D0000  104K read/exec          /usr/lib/ld.so.1
EF7F9000    8K read/write/exec    /usr/lib/ld.so.1
EFFF6000   40K read/write/exec
EFFF6000   40K     [ stack ]


This is on Solaris 2.5.1 with the following in /etc/system:
set rlim_fd_cur=512
set rlim_fd_max=1024
set shmsys:ism_off = 1
set shmsys:shminfo_shmmax=8388608
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=100
set shmsys:shminfo_shmseg=10
set semsys:seminfo_semmns=200
set semsys:seminfo_semmni=70


Here is another from Apache (with 6 class C's in its conf file) and with
MAP_NORESERVE added to all mmap() calls (this is on a Solaris 2.6 box):

  PID USERNAME PRI NICE  SIZE   RES STATE   TIME   WCPU    CPU COMMAND
 18813  nobody  23    0 5612K  620K sleep   0:00  0.00%  0.00% httpd.test

/usr/proc/bin/pmap 18813
18813:  ./httpd.test -f conf/httpd.conf.test
00010000    268K read/exec         dev:32,8 ino:265016
00062000     12K read/write/exec   dev:32,8 ino:265016
00065000   3972K read/write/exec     [ heap ]
EF600000     16K read/exec         /usr/lib/nss_files.so.1
EF613000      4K read/write/exec   /usr/lib/nss_files.so.1
EF620000     12K read/exec         /usr/lib/libmp.so.2
EF632000      4K read/write/exec   /usr/lib/libmp.so.2
EF640000    588K read/exec         /usr/lib/libc.so.1
EF6E2000     24K read/write/exec   /usr/lib/libc.so.1
EF6E8000      8K read/write/exec     [ anon ]
EF6F0000      4K read/write/exec     [ anon ]
EF700000    444K read/exec         /usr/lib/libnsl.so.1
EF77E000     32K read/write/exec   /usr/lib/libnsl.so.1
EF786000     24K read/write/exec     [ anon ]
EF790000      4K read/write/shared   [ anon ]
EF7A0000     32K read/exec         /usr/lib/libsocket.so.1
EF7B7000      4K read/write/exec   /usr/lib/libsocket.so.1
EF7B8000      4K read/write/exec     [ anon ]
EF7C0000      4K read/exec/shared  /usr/lib/libdl.so.1
EF7D0000    112K read/exec         /usr/lib/ld.so.1
EF7FB000      8K read/write/exec   /usr/lib/ld.so.1
EF7FD000      4K read/write/exec     [ anon ]
EFFF9000     28K read/write/exec     [ stack ]
 total     5612K

With 100 of these running, "swap" space dropped from 2.1 GB to around 1.5 GB,
according to top 3.4 on Solaris 2.6.

vmstat
 procs     memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s1 s3 s6 --   in   sy   cs us sy id
 0 0 0   5840  5504   0   7  0  0  0 60  0  0  0  0  0   11   39   25  1  1 98


swap -l
swapfile             dev  swaplo blocks   free
/dev/dsk/c0t1d0s1   32,9       8 1475912 1475912
/dev/dsk/c0t3d0s4   32,28      8 2511032 2511032

swap -s
total: 39328k bytes allocated + 411544k reserved = 450872k used, 1635400k available
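
A back-of-envelope check of the figures above, written as two tiny
helpers with hypothetical names.  The estimate treats the whole SIZE
column (5612K) as reserved in each of the 100 children, which likely
overstates it, since read-only mappings of the executable and libraries
should not need swap.  It gives 561200K, about 548 MB, roughly in line
with the reported drop from 2.1 GB to around 1.5 GB.  The second helper
reproduces the swap -s sum: 39328k + 411544k = 450872k.

```c
/* Upper bound on swap reserved by nchildren forked daemons, in KB,
 * assuming each child reserves its full virtual size. */
unsigned long estimate_reserved_kb(unsigned long size_kb,
                                   unsigned long nchildren)
{
    return size_kb * nchildren;
}

/* The "used" figure printed by swap -s is allocated plus reserved. */
unsigned long swap_used_kb(unsigned long allocated_kb,
                           unsigned long reserved_kb)
{
    return allocated_kb + reserved_kb;
}
```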

-- mikeh AT mindspring.net