Posted to dev@httpd.apache.org by Neil Gunton <ne...@nilspace.com> on 2009/01/04 23:09:05 UTC

Issues with mod_disk_cache and htcacheclean

I posted this on the users list, but was advised to post it to dev as 
well, since it seemed relevant to developers. Hope that's ok...

I am using Apache 2.2.9 on Linux AMD64, built from source. There is one
server running two builds of Apache - a lightweight front-end caching
reverse proxy configuration using mod_disk_cache, and a heavyweight
mod_perl back end. I use caching to relieve load on the server when many
people request the same page at once. The website is dynamic and
contains millions of page permutations. Thus the cache has a tendency to
get fairly large, unless it is pruned. So I have been trying to use
htcacheclean to achieve this. There have been some issues, which I will
outline below.

First, I found that htcacheclean was not able to keep up with pruning
the cache. It just kept growing. I initially ran htcacheclean in daemon
mode, thus:

htcacheclean -i -t -n -d60 -p/var/cache/www -l1000M

CacheDirLevels was 3 and CacheDirLength 1.

The cache would just keep getting bigger, to multiple GB. Additionally,
even doing a du on the cache could take hours to complete.

I also noticed that iowait would spike when I tried running htcacheclean
in non-daemon mode. It would not keep up at all using the -n ("nice")
option; when I took that off, the iowait would go through the roof and
the process would take hours to complete. This was on a quad core AMD64
server with 4 x 10k SCSI drives in hardware RAID0.

Upon investigation, I discovered that the cache was a lot deeper than I
expected. In addition to the three levels specified in CacheDirLevels,
there were then additional levels of subdirectories beneath ".vary"
subdirs. For each .header file, there was a .vary subdir with three
levels of directory below that. Simply traversing this tree with du
could take a long time - hours sometimes, depending on how long the
server had been running without a cache clear.
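
To make the layout concrete, here is a minimal sketch of how a hashed cache key could be split into nested directories. It assumes the first CacheDirLevels x CacheDirLength characters of the hash become directory names and the remainder becomes the file name. The function name split_cache_path and the sample hash are made up for illustration; this is not the code from mod_disk_cache.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Build a relative cache path from a hashed key by peeling off `levels`
 * directory names of `length` characters each; the rest of the hash becomes
 * the file name. Returns 0 on success, -1 if the hash is too short or the
 * output buffer is too small.
 * (Hypothetical helper, for illustration only.)
 */
int split_cache_path(const char *hash, size_t levels, size_t length,
                     char *out, size_t outlen)
{
    size_t hlen, used, need, i, pos = 0;

    assert(hash != NULL && out != NULL);

    hlen = strlen(hash);
    used = levels * length;         /* characters consumed by directories */
    if (length == 0 || used >= hlen) {
        return -1;                  /* nothing would be left for the file name */
    }
    need = hlen + levels + 1;       /* hash + one '/' per level + NUL */
    if (outlen < need) {
        return -1;
    }
    for (i = 0; i < levels; i++) {
        memcpy(out + pos, hash + i * length, length);
        pos += length;
        out[pos++] = '/';
    }
    strcpy(out + pos, hash + used); /* remainder is the file name */
    return 0;
}
```

With levels 3 and length 1, the made-up hash "abcdefgh" becomes "a/b/c/defgh". If the same scheme is applied again below each .vary directory, every variant adds another three directory levels, which matches the depth described above.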

I discovered that the .vary subdirs were caused by my configuration,
which was introducing a Vary http header. This came from two sources:
First, mod_deflate. I found this out from this helpful page:

http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html

So I disabled mod_deflate, since it seemed to be producing a huge number
of cache entries for each file - a different one for every browser. But
after disabling mod_deflate, the .vary subdirs were still there. I also
had this line in my config:

Header add Vary "Cookie"

This is necessary because users on my site set options for how the site
is displayed. When I tried disabling this cookie Vary header, the number
of directories went down substantially, to the expected three levels.
The cache structure was much simpler, and it seemed that htcacheclean
could keep up with this. However, the site was broken - since the same
page for different users with different options would be cached only
once. So someone who had "no ads" or "no pics" would request a page that
someone else had recently requested (with different options), and they
would get that other person's options. Not good. So I had to switch the
vary header for cookies back on, so that pages would get differentiated
in the cache based on cookie. But now I was back to square one - six
effective levels of subdirectory, which htcacheclean could not keep up with.

After some thought, I ended up changing CacheDirLevels to 2, to try to
reduce the depth of the tree. Now I had fewer subdirs, but more files in
each one.

Also, the size of the cache, via du, always seems to be much higher than
specified for htcacheclean. I lowered the limit to 100M, but still the
cache is regularly up at 180MB or 200MB. This seems counter-intuitive,
since htcacheclean doesn't appear to be taking the true size of the
cache into account (i.e. including all the subdirs, which also take up
space and presumably are what cause the discrepancy).

I also noticed something else: htcacheclean was leaving behind .header
files. When it cleaned the .vary subdirs, it seemed to leave behind the
corresponding .header files. These would accumulate, causing the iowait
to gradually increase, presumably due to the size of the directories. I
would rotate (clear) the cache manually at midnight. The behavior I
would see (via munin monitoring tool) was that iowait would then remain
at zero for about 12 hours, but then would gradually become visible as
the number of .header files would accumulate.

So I wrote a perl script which could go through the cache, and look for
.header files, and for each one found, see if a corresponding .vary
subdir exists for it. If not, then the .header file is deleted. I then
run another script to prune empty subdirectories. Currently I run this
combination every 10 minutes - first a non-daemon invocation of
htcacheclean, followed by the header prune script, followed by the empty
subdirs pruning script. This seems to keep the cache small, and iowait
is not noticeable any more, since the "junk" .header files are now
disposed of regularly.
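
The check at the heart of the header prune script can be sketched roughly as follows. This is a C illustration, not the perl script itself. The suffix strings and the function names sibling_path and header_is_orphan are assumptions made for the example. It also assumes that a .header file with a sibling .data file is a live non-Vary entry and must be kept, which goes slightly beyond the description above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

#define HEADER_SUFFIX ".header"  /* assumed on-disk suffixes */
#define VARY_SUFFIX   ".vary"
#define DATA_SUFFIX   ".data"

/*
 * Replace the trailing ".header" of `header_path` with `suffix`.
 * Returns 0 on success, -1 if the path does not end in ".header" or the
 * result does not fit in `out`.
 */
int sibling_path(const char *header_path, const char *suffix,
                 char *out, size_t outlen)
{
    size_t plen, hlen = strlen(HEADER_SUFFIX);
    int n;

    assert(header_path != NULL && suffix != NULL && out != NULL);

    plen = strlen(header_path);
    if (plen <= hlen || strcmp(header_path + plen - hlen, HEADER_SUFFIX) != 0) {
        return -1;
    }
    n = snprintf(out, outlen, "%.*s%s", (int)(plen - hlen), header_path, suffix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/*
 * Returns 1 if `header_path` looks orphaned: no sibling .vary directory and
 * no sibling .data file. Returns 0 otherwise, or if the path is not a
 * .header path at all.
 */
int header_is_orphan(const char *header_path)
{
    char vary[4096], data[4096];
    struct stat st;

    if (sibling_path(header_path, VARY_SUFFIX, vary, sizeof vary) != 0 ||
        sibling_path(header_path, DATA_SUFFIX, data, sizeof data) != 0) {
        return 0;
    }
    if (stat(vary, &st) == 0 && S_ISDIR(st.st_mode)) {
        return 0;   /* variants still exist under the .vary directory */
    }
    if (stat(data, &st) == 0) {
        return 0;   /* plain (non-Vary) entry with its body still present */
    }
    return 1;
}
```

A real pruner would walk the cache tree and unlink every file for which header_is_orphan() returns 1. It may also want to skip very recently written files, since a header could be on disk before its body or .vary directory has been created.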

However, I'm not sure why I need to run this kind of hacked up bespoke
version of cache management, when htcacheclean should surely be capable
of doing the job itself.

All of this brings up a few questions:

1. Why does mod_disk_cache generate six levels of subdirectory when
CacheDirLevels is clearly set to 3? I realize what it's trying to do,
(each page might have many variations and so those variations must be
differentiated by subdir) but the additional levels cause an exponential
increase in the number of directories that must be traversed. It seems
absurd when this causes trouble for a relatively well-specced server.
Since starting this investigation, I have moved to a completely new
server, a 4 core Xeon 2.33GHz, with 8 x 10k Raptor SATA drives in
hardware RAID10 configuration. The performance is excellent, but when I
tried using mod_disk_cache with CacheDirLevels at 3 and cookie Vary
headers on, it still could not keep up with pruning. Even simply
traversing this kind of structure with du is clearly not scalable. Could
we not have the three main levels of directory, but then have a
different setting for the number of subdirs below the .vary dirs?
Usually there is just one file at the leaf of the .vary subdirs, so
having three additional levels seems like a bit of overkill. We should
be able to tune the subdir levels to minimize the depth of the cache as
makes sense.

2. Why does htcacheclean not keep the cache at the stated size limit? If
you say -l100M and then do a du and it says 200M, then that is
counter-intuitive, and actually wrong in real terms. It gets worse with
the larger caches - when I had 3 levels and cookie Vary headers on, the
limit for htcacheclean was 1000M, but the cache would grow to 3GB and up.

3. Why are .header files left over by htcacheclean when it has deleted
the .vary subdirectory? Is this something like a memory leak, but with
files? I would have thought that if the cached content (.data) file has
gone away, then why bother keeping the .header file around. It clogs up
the cache directory and makes traversing the tree more work. If it's
kept for 304 "unchanged" responses then I can understand that, but then
why do these files still seem to pile up even after the related page
would have clearly expired anyway? Surely better to just delete them
when the .vary subdir is deleted. In any case, I didn't notice the
.header files being left over when the Vary header was disabled, so I
think this might be a straightforward "leak" when using Vary.

4. Will I be causing any potential problems for Apache by my deleting
the leftover .header files myself (ones which have no corresponding
.vary subdir)? Does that cause apache or htcacheclean to have potential
issues if you do this while they are running? If they are junk then I
can't see it being a problem, but it's unclear currently if they are
actually used or not.

I wasn't sure if I should post this on the dev list, since it seems to
be more directed at the developers than other users. But the list
guidelines said that "Configuration and support questions should be
addressed to a user support group", and this seems to be that, so I'll
post it here first.

Thanks for any insights or feedback.

Neil

Re: Issues with mod_disk_cache and htcacheclean

Posted by Neil Gunton <ne...@nilspace.com>.
Plüm, Rüdiger, VF-Group wrote:
> Can you try with the following additional patch and a clean cache?
> Afterwards there should be only very few orphaned header files
> left:
> 
> Index: modules/cache/mod_disk_cache.c
> ===================================================================
> --- modules/cache/mod_disk_cache.c      (revision 732705)
> +++ modules/cache/mod_disk_cache.c      (working copy)
> @@ -558,6 +558,8 @@
>          str_to_copy = dobj->hdrsfile ? dobj->hdrsfile : dobj->datafile;
>          if (str_to_copy) {
>              char *dir, *slash, *q;
> +            char *dot;
> +            char *hdrs_file;
> 
>              dir = apr_pstrdup(p, str_to_copy);
> 
> @@ -586,6 +588,18 @@
>                   }
>                   slash = strrchr(q, '/');
>                   *slash = '\0';
> +                 /*
> +                  * Check if we just deleted a vary directory. If we did, the
> +                  * corresponding header file is of no use anymore. So delete
> +                  * it.
> +                  */
> +                 dot = strrchr(slash + 1, '.');
> +                 if (dot && (strcmp(dot + 1, CACHE_VDIR_SUFFIX) == 0)) {
> +                     *dot = '\0';
> +                     hdrs_file = apr_pstrcat(p, dir, "/", slash + 1,
> +                                             CACHE_HEADER_SUFFIX, NULL);
> +                     apr_file_remove(hdrs_file, p);
> +                 }
>              }
>          }
>      }
> 
> Regards
> 
> Rüdiger
> 

Ok, I applied this patch, rebuilt httpd_proxy and am now running it. I 
disabled my scripts for cleaning up the orphaned .header files, but 
still run htcacheclean in non-daemon mode every 10 minutes. After a few 
hours, I can now see that there are still .header files being left 
without any .vary directory.

Something about the above patch is confusing to me - this applies to 
mod_disk_cache, but I didn't think that mod_disk_cache actually did any 
cleaning up of the cache. I thought that was all done in htcacheclean. 
Am I mistaken there? In any case, it seems that the orphaned .header 
files are being produced by the runs of htcacheclean, so surely any 
prospective fix should be for htcacheclean.c?

Thanks again for your time and effort, much appreciated.

Neil

Re: Issues with mod_disk_cache and htcacheclean

Posted by "Plüm, Rüdiger, VF-Group" <ru...@vodafone.com>.
 

> -----Original Message-----
> From: Neil Gunton 
> Sent: Monday, 5 January 2009 23:17
> To: dev@httpd.apache.org
> Subject: Re: Issues with mod_disk_cache and htcacheclean
> 
> Ruediger Pluem wrote:

> > This seems to be a bug. Can you please try if the following
> > patch fixes this?
> 
> I applied the patch and rebuilt httpd_proxy successfully. The new 
> htcacheclean runs ok, but still seems to leave behind the orphan
> .header files. At least, I tried running htcacheclean in single run 
> mode, thus:
> 
> htcacheclean -t -p/var/cache/www -l100M
> 
> Then I run my prune_cache_headers perl script, and it seems to still 
> find a bunch of orphaned .header files to delete. So it doesn't appear
> to have fixed the issue. I did confirm that the patch was applied.

Can you try with the following additional patch and a clean cache?
Afterwards there should be only very few orphaned header files
left:

Index: modules/cache/mod_disk_cache.c
===================================================================
--- modules/cache/mod_disk_cache.c      (revision 732705)
+++ modules/cache/mod_disk_cache.c      (working copy)
@@ -558,6 +558,8 @@
         str_to_copy = dobj->hdrsfile ? dobj->hdrsfile : dobj->datafile;
         if (str_to_copy) {
             char *dir, *slash, *q;
+            char *dot;
+            char *hdrs_file;

             dir = apr_pstrdup(p, str_to_copy);

@@ -586,6 +588,18 @@
                  }
                  slash = strrchr(q, '/');
                  *slash = '\0';
+                 /*
+                  * Check if we just deleted a vary directory. If we did, the
+                  * corresponding header file is of no use anymore. So delete
+                  * it.
+                  */
+                 dot = strrchr(slash + 1, '.');
+                 if (dot && (strcmp(dot + 1, CACHE_VDIR_SUFFIX) == 0)) {
+                     *dot = '\0';
+                     hdrs_file = apr_pstrcat(p, dir, "/", slash + 1,
+                                             CACHE_HEADER_SUFFIX, NULL);
+                     apr_file_remove(hdrs_file, p);
+                 }
             }
         }
     }

Regards

Rüdiger

Re: Issues with mod_disk_cache and htcacheclean

Posted by Neil Gunton <ne...@nilspace.com>.
Ruediger Pluem wrote:
> What information do your cookies contain? Are these session cookies that
> are individual to each client? In this case the usage of mod_disk_cache
> with Vary Cookies set would be bad. As these responses would be individual
> you couldn't reuse the results anyway for other clients, so it would be
> the best to leave caching to the individual client caches (e.g. browser caches).
> If your cookies are like BACKGROUND=blue for some users and BACKGROUND=red
> for other users you should think of incorporating these differences into
> the URLs instead of into varying responses.

I use two cookies currently - one for user logins and one for options. 
They are independent - people browsing the site may have either, or 
both, or neither set.

I need to cache all dynamically generated content so that the server can 
cope with slashdottings and links from other popular sites where lots of 
people all click on the same link at the same time ("click storms"). 
Such links could go to any page on the site, and so I really need to 
cache almost everything from mod_perl - with the exception of areas of 
the site which are obviously user-specific, such as edit forms, users' 
personal pages and so on. Those are no-cache.

I am very careful about setting expiration times, since with it being a 
dynamic site and all, you don't want too many stale pages. So many of 
the indexes (e.g. list of latest journal updates) have an expiration of 
only 1-3 minutes, while other journal pages have expiration of 12 hours 
or more.

I keep a 'version' field as part of the database records for most 
content on the site, which is incremented whenever an object is edited. 
Then when someone edits a journal, I include a special 'v=xxx' parameter 
in subsequent links to pages on that journal, to differentiate it from 
earlier versions. So the links from the (fast expiring) index pages such 
as forums or journals index will quickly have the new link with the new 
version. This allows me to have extensively cached content while still 
having people see new edits quickly. Thus the cache is fairly high turnover.
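
As a rough illustration of the version parameter, here is a small C sketch that appends v=N to a link. The site itself runs on mod_perl, so this is not the actual code; the function name versioned_link and the example paths are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Append a version parameter to `path`, using '&' if the path already has a
 * query string and '?' otherwise. Returns 0 on success, -1 on truncation.
 * (Hypothetical helper, for illustration only.)
 */
int versioned_link(const char *path, unsigned long version,
                   char *out, size_t outlen)
{
    int n;

    assert(path != NULL && out != NULL);

    n = snprintf(out, outlen, "%s%cv=%lu", path,
                 strchr(path, '?') ? '&' : '?', version);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

Because the version is part of the URL, an edited page gets a new cache key. The old entry is never requested again and simply ages out, which is one reason the cache turns over quickly.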

mod_disk_cache works very well; the only issue is keeping the 
cache size under control without iowait becoming noticeable as a 
result. I have found that keeping the limit down to 100M rather 
than 1000M, setting CacheDirLevels to 2 rather than 3, clearing out 
the orphaned .header files, and running htcacheclean and my header 
pruning script every 10 minutes seems to make the server very 
comfortable - the iowait drops to unnoticeable levels.

All the app level code here was developed by me. This is a community 
website for bicycle touring journals - www.crazyguyonabike.com. It 
currently sees somewhere north of 100,000 page requests per day, 
according to analog (and that's not including googlebot, which is on 
there constantly). I am very interested in configuring the site to be 
able to run efficiently on one reasonably well-spec'd server. Caching 
dynamic content is a major part of being able to scale well to cope with 
click storms.

> Regarding the performance you should take a look at the following:
> 
> 1. Use a separate filesystem for the cache.
> 2. Ensure that it is mounted with noatime option.
> 3. Check if you are using the right type of filesystem for this job. If the
>    size of the individual cache files is rather small reiserfs can be much
>    faster than ext3 if I remember correctly.

I currently use ext2 with noatime for the main filesystem (including 
cache). I went to ext2 from ext3 because ext3 has extra overhead related 
to keeping the journal (I believe that is the big difference between the 
two these days). Though I do not have numbers, I do seem to have seen 
disk performance increase since going back to ext2. I'm not sure if you 
can install dir_index with ext2 without turning it into ext3 in the 
process, but in any case I don't have dir_index enabled currently.

I was aware of the potential for using other filesystems for the cache, 
and had thought about reiserfs as a possibility. However after I wrote 
to the httpd users list a few weeks back asking about this very issue, I 
got zero responses. I then went to the squid group and asked there too, 
and similarly got zero useful responses. I agree that reiserfs might 
handle many small files better, but I am wary of using that since the 
trial of Hans Reiser - it kind of calls the future of his tool into 
question, unfortunately.

>> 2. Why does htcacheclean not keep the cache at the stated size limit? If
>> you say -l100M and then do a du and it says 200M, then that is
>> counter-intuitive, and actually wrong in real terms. It gets worse with
>> the larger caches - when I had 3 levels and cookie Vary headers on, the
>> limit for htcacheclean was 1000M, but the cache would grow to 3GB and up.
> 
> Again, this is an issue with the documentation. In fact htcacheclean does
> not limit the size of the cache at all. It can grow indefinitely.
> It only ensures that the size of the cache is being reduced back at least
> to the given limit after it ran. The size of the cache is defined as the
> sum of all filesizes in the cache. It does not consider the disk usage of
> these files which can be larger and it also doesn't take the sizes of the
> directories into account. I am not sure if a du like measurement of the
> cache size would be implementable in a platform independent way, but I
> may be wrong here.

Ok, that's fine. You're right, it sounds like a documentation issue.

> This seems to be a bug. Can you please try if the following patch fixes this?

I applied the patch and rebuilt httpd_proxy successfully. The new 
htcacheclean runs ok, but still seems to leave behind the orphan .header 
files. At least, I tried running htcacheclean in single run mode, thus:

htcacheclean -t -p/var/cache/www -l100M

Then I run my prune_cache_headers perl script, and it seems to still 
find a bunch of orphaned .header files to delete. So it doesn't appear 
to have fixed the issue. I did confirm that the patch was applied.

>> 4. Will I be causing any potential problems for Apache by my deleting
>> the leftover .header files myself (ones which have no corresponding
>> .vary subdir)? Does that cause apache or htcacheclean to have potential
>> issues if you do this while they are running? If they are junk then I
>> can't see it being a problem, but it's unclear currently if they are
>> actually used or not.
> 
> IMHO not. The patch above does the same.

Great, thanks - good to know.

Thanks for your help!

Neil

Re: Issues with mod_disk_cache and htcacheclean

Posted by Ruediger Pluem <rp...@apache.org>.

On 01/05/2009 03:50 PM, Ruediger Pluem wrote:
> 
> Regarding the performance you should take a look at the following:
> 
> 1. Use a separate filesystem for the cache.
> 2. Ensure that it is mounted with noatime option.
> 3. Check if you are using the right type of filesystem for this job. If the
>    size of the individual cache files is rather small reiserfs can be much
>    faster than ext3 if I remember correctly.

Forget about reiserfs. It seems to be dead (a long time has passed since I
dealt with the details of filesystems on Linux). Nevertheless, I guess it is
still worth tuning your filesystem or checking whether another FS type
is more suitable for your purpose.

Regards

Rüdiger

Re: Issues with mod_disk_cache and htcacheclean

Posted by Ruediger Pluem <rp...@apache.org>.

On 01/04/2009 11:09 PM, Neil Gunton wrote:

> 
> All of this brings up a few questions:
> 
> 1. Why does mod_disk_cache generate six levels of subdirectory when
> CacheDirLevels is clearly set to 3? I realize what it's trying to do,

This is more of a documentation bug than a code bug. The documentation
should clearly state that in the Vary case the directory depth can be twice
the value of CacheDirLevels.

> (each page might have many variations and so those variations must be
> differentiated by subdir) but the additional levels cause an exponential
> increase in the number of directories that must be traversed. It seems
> absurd when this causes trouble for a relatively well-specced server.
> Since starting this investigation, I have moved to a completely new
> server, a 4 core Xeon 2.33GHz, with 8 x 10k Raptor SATA drives in
> hardware RAID10 configuration. The performance is excellent, but when I
> tried using mod_disk_cache with CacheDirLevels at 3 and cookie Vary
> headers on, it still could not keep up with pruning. Even simply
> traversing this kind of structure with du is clearly not scalable. Could
> we not have the three main levels of directory, but then have a
> different setting for the number of subdirs below the .vary dirs?
> Usually there is just one file at the leaf of the .vary subdirs, so
> having three additional levels seems like a bit of overkill. We should
> be able to tune the subdir levels to minimize the depth of the cache as
> makes sense.

What information do your cookies contain? Are these session cookies that
are individual to each client? In this case the usage of mod_disk_cache
with Vary Cookies set would be bad. As these responses would be individual
you couldn't reuse the results anyway for other clients, so it would be
the best to leave caching to the individual client caches (e.g. browser caches).
If your cookies are like BACKGROUND=blue for some users and BACKGROUND=red
for other users you should think of incorporating these differences into
the URLs instead of into varying responses.

Regarding the performance you should take a look at the following:

1. Use a separate filesystem for the cache.
2. Ensure that it is mounted with noatime option.
3. Check if you are using the right type of filesystem for this job. If the
   size of the individual cache files is rather small reiserfs can be much
   faster than ext3 if I remember correctly.

> 2. Why does htcacheclean not keep the cache at the stated size limit? If
> you say -l100M and then do a du and it says 200M, then that is
> counter-intuitive, and actually wrong in real terms. It gets worse with
> the larger caches - when I had 3 levels and cookie Vary headers on, the
> limit for htcacheclean was 1000M, but the cache would grow to 3GB and up.

Again, this is an issue with the documentation. In fact htcacheclean does
not limit the size of the cache at all. It can grow indefinitely.
It only ensures that, after each run, the size of the cache has been reduced
to no more than the given limit. The size of the cache is defined as the
sum of all file sizes in the cache. It does not consider the disk usage of
these files, which can be larger, and it also doesn't take the sizes of the
directories into account. I am not sure whether a du-like measurement of the
cache size could be implemented in a platform-independent way, but I
may be wrong here.
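
The gap between the sum of file sizes and what du reports can be illustrated with a small sketch. It assumes a filesystem that allocates space in whole blocks. The function name allocated_bytes and the block size used below are assumptions for the example. Directories, inodes and sparse files are ignored, so this is only a rough model.

```c
#include <assert.h>

/*
 * Bytes a file of `size` bytes occupies when space is allocated in whole
 * blocks of `block` bytes. (Simplified model, for illustration only.)
 */
unsigned long long allocated_bytes(unsigned long long size,
                                   unsigned long long block)
{
    assert(block > 0);
    /* round up to the next multiple of the block size */
    return ((size + block - 1) / block) * block;
}
```

Under this model a 400-byte header file on a filesystem with 4096-byte blocks counts as 400 bytes towards the -l limit but occupies 4096 bytes on disk. A cache made up mostly of small files can therefore look several times larger to du than the limit given to htcacheclean, even before directories are counted.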

> 3. Why are .header files left over by htcacheclean when it has deleted
> the .vary subdirectory? Is this something like a memory leak, but with
> files? I would have thought that if the cached content (.data) file has
> gone away, then why bother keeping the .header file around. It clogs up
> the cache directory and makes traversing the tree more work. If it's
> kept for 304 "unchanged" responses then I can understand that, but then
> why do these files still seem to pile up even after the related page
> would have clearly expired anyway? Surely better to just delete them
> when the .vary subdir is deleted. In any case, I didn't notice the
> .header files being left over when the Vary header was disabled, so I
> think this might be a straightforward "leak" when using Vary.

This seems to be a bug. Can you please try if the following patch fixes this?

Index: support/htcacheclean.c
===================================================================
--- support/htcacheclean.c      (revision 731535)
+++ support/htcacheclean.c      (working copy)
@@ -248,6 +248,7 @@
 {
     char *nextpath;
     apr_pool_t *p;
+    char *cache_root_path;

     if (dryrun) {
         return;
@@ -262,6 +263,49 @@
     nextpath = apr_pstrcat(p, path, "/", basename, CACHE_DATA_SUFFIX, NULL);
     apr_file_remove(nextpath, p);

+    if (deldirs && (apr_filepath_get(&cache_root_path, 0, p) == APR_SUCCESS)) {
+        apr_status_t rc;
+        char *q;
+        char *dir;
+        char *slash;
+        char *dot;
+
+        dir = apr_pstrdup(p, path);
+
+        /*
+         * now walk our way back to the cache root, delete everything
+         * in the way as far as possible
+         *
+         * Note: due to the way we constructed the file names in
+         * process_dir, we are guaranteed that the
+         * cache_root_path is suffixed by at least one '/' which will be
+         * turned into a terminating null by this loop.  Therefore,
+         * we won't either delete or go above our cache root.
+         */
+        for (q = dir + strlen(cache_root_path); *q ; ) {
+            rc = apr_dir_remove(dir, p);
+            delcount++;
+            if (rc != APR_SUCCESS && !APR_STATUS_IS_ENOENT(rc)) {
+                break;
+            }
+            slash = strrchr(q, '/');
+            *slash = '\0';
+            /*
+             * Check if we just deleted a vary directory. If we did, the
+             * corresponding header file is of no use anymore. So delete
+             * it.
+             */
+            dot = strrchr(slash + 1, '.');
+            if (dot && (strcmp(dot + 1, CACHE_VDIR_SUFFIX) == 0)) {
+                *dot = '\0';
+                nextpath = apr_pstrcat(p, dir, "/", slash + 1,
+                                       CACHE_HEADER_SUFFIX, NULL);
+                apr_file_remove(nextpath, p);
+                delcount++;
+            }
+        }
+    }
+
     apr_pool_destroy(p);

     if (benice) {


> 4. Will I be causing any potential problems for Apache by my deleting
> the leftover .header files myself (ones which have no corresponding
> .vary subdir)? Does that cause apache or htcacheclean to have potential
> issues if you do this while they are running? If they are junk then I
> can't see it being a problem, but it's unclear currently if they are
> actually used or not.

IMHO not. The patch above does the same.

Regards

Rüdiger