Posted to modperl@perl.apache.org by Neil Gunton <ne...@nilspace.com> on 2008/11/24 18:47:01 UTC

Best filesystem type for mod_cache in reverse proxy?

Hi all,

I posted this to the Apache httpd users list, but no reply there, so I'm 
posting here in the hopes that someone else who uses mod_perl with 
mod_cache in a reverse proxy setup might have insight.

I am using Apache 2.2.9 (built from source) on Debian Lenny to run a 
fairly large community LAMP (Perl, MySQL) site. I use the proxy and 
cache of Apache to improve site performance - I have a front end proxy 
build and a back-end mod_perl build, both on the same server currently. 
I have been using this setup for years successfully, but most of that 
time was using Apache 1.3, with mod_accel and mod_deflate from Igor 
Sysoev. Since moving to Apache 2.2, I am using the stock caching.

The cache and front-end proxy help to serve images without bogging down 
the heavy mod_perl processes, while also obviously caching the mod_perl 
content. The site gets around 100,000 page requests or more per day. The 
cache is set to 1000MB, with htcacheclean running in daemon mode, 
interval 60 minutes (but looking at the performance charts, it seems to 
be running constantly).

I am finding that the cache directories that mod_cache builds are very 
large, and take a long time to traverse under ext2. There is currently 
about 10 GB under the cache according to du, and it took 162 minutes 
just to tell me that. Basically, htcacheclean is not keeping up. I'm 
using three levels of directory. Htcacheclean also takes a long time to 
process this if I try running it from cron nightly, during which time I 
would see a huge spike in iowait on the server, and it would take upward 
of 3 hours to complete. If I run htcacheclean in daemon mode, using the 
-n (nice) option, then it doesn't seem to be able to keep up, the cache 
just creeps up in size. If I take off the nice option, then it takes up 
a lot more resources, to the point where I'm concerned it'll be 
impacting the server performance by monopolising the disks.

So what I'm observing is that at least part of the problem appears to be 
that the directory structure is just very, very big and wide and takes a 
long time to traverse, even for basic system functions like du.

This leads to my main question, which is this: Would a different 
filesystem, perhaps reiserfs, be better for this type of cache? I have 
never used reiser before, but from reputation it seems to be designed 
for handling many small files efficiently. I wonder if it would be any 
easier for my system to traverse the directory and maintain the cache if 
it was under reiser rather than ext.

If not that, then are there other filesystems which make it very 
efficient to traverse wide directory structures?

I have a quad core server (AMD Opteron 265), with four 10k SCSI drives 
set up in RAID0 (yeah I know it's risky, but everything is backed up 
immediately via mysql replication, and I need the space and performance).

Thanks!

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Perrin Harkins <pe...@elem.com>.
On Mon, Nov 24, 2008 at 4:15 PM, Neil Gunton <ne...@nilspace.com> wrote:
> Perrin Harkins wrote:
>>
>> A ton of RAM in the server might help too.
>
> I've already got 4GB in there.

Some desktop machines ship with that much these days.  You could bump
it up to 16 or 32 (assuming it's 64-bit) pretty inexpensively and let
the VM system help you out.

A software change could be cheaper if it's simple, but if it requires
you to do a lot of rewriting you might save money by buying some RAM.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Perrin Harkins <pe...@elem.com>.
On Tue, Nov 25, 2008 at 1:30 PM, Neil Gunton <ne...@nilspace.com> wrote:
> The only downside is that people on extremely slow dialup connections might
> notice longer download times for page text... but I have to wonder if that's
> really an issue today. Back in 1998 perhaps you might care about something
> being 20KB rather than 80KB, but surely not today. In any case, don't dialup
> ISPs often implement their own compression now?

Compressing is pretty important:
http://developer.yahoo.net/blog/archives/2007/07/high_performanc_3.html

I wonder if there's a way to make the mod_deflate Vary header a bit
saner, so it just reflects compressed or not, rather than every
possible User-Agent.

There are also alternative ways to cache pages, like pre-publishing
them as static files or doing page caching with mod_perl handlers that
intercept the request before the response phase and serve a cached
copy.  It's very convenient to use mod_cache though.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by André Warnier <aw...@ice-sa.com>.
Neil Gunton wrote:
[...]
At the risk of stating the obvious: since you are talking about 
mod_perl (and thus, I suppose, Perl), the core module File::Find is a 
good starting point for collecting all kinds of statistics about a file 
hierarchy, such as the maximum and average depth, the number of files 
per directory or per depth, file sizes, and so on.
You can easily build a script that runs regularly against your file 
structure and takes snapshots over time.
Real numbers are generally a better basis for optimisation than mere 
impressions.
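
A survey of this kind can be sketched in a few lines. The example below
is only an illustration: it uses Python's os.walk as a stand-in for
File::Find (a substitution, not what was suggested above), and the
function name tree_stats and the keys of the returned dictionary are
made up for this sketch.

```python
import os
from collections import Counter


def tree_stats(root):
    """Summarise the shape of the tree under `root`: directory and file
    counts, total bytes, maximum depth, and directories per depth."""
    dirs_per_depth = Counter()
    n_dirs = n_files = total_bytes = max_depth = 0

    for dirpath, dirnames, filenames in os.walk(root):
        # Depth 0 is the root itself; each path separator adds a level.
        rel = os.path.relpath(dirpath, root)
        depth = 0 if rel == "." else rel.count(os.sep) + 1
        n_dirs += 1
        dirs_per_depth[depth] += 1
        max_depth = max(max_depth, depth)

        for name in filenames:
            try:
                # lstat so that symlinks are not followed.
                total_bytes += os.lstat(os.path.join(dirpath, name)).st_size
                n_files += 1
            except OSError:
                # A cache file may vanish while the walk is in progress.
                pass

    return {
        "dirs": n_dirs,
        "files": n_files,
        "bytes": total_bytes,
        "max_depth": max_depth,
        "dirs_per_depth": dict(dirs_per_depth),
        "files_per_dir": n_files / n_dirs if n_dirs else 0.0,
    }
```

Calling tree_stats on the cache root (for example the /var/cache/www
path that appears in the du listing elsewhere in this thread) from cron
and logging the result would give the snapshots over time. Note that the
walk itself has to touch every directory, so on a tree this size it will
cost about as much I/O as du does.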


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Raymond Wan <rw...@kuicr.kyoto-u.ac.jp>.
Hi Michael,


Michael Peters wrote:
> Raymond Wan wrote:
>> I had looked at the effect compression has on web pages a while ago.  
>> Though not relevant to modperl, there is obviously a cost to 
>> compression and since most HTML pages are small, sometimes it is hard 
>> to justify. 
>
> Not to discredit the work you did researching this, but a lot of 
> people are studying the same thing and coming to different conclusions:
>
> http://developer.yahoo.com/performance/rules.html
>
> Yes, backend performance matters, but more and more we realize that 
> the front end tweaks we can make  give a better performance for users.
>
> Take google as an example. The overhead of compressing their content 
> and decompressing it on the browser takes less time than sending the 
> same content uncompressed over the network. I'd say the same is true 
> for most other applications too.


It's ok; I don't consider another opinion as discrediting my work.  :-)  
Actually, it was a while ago and it was only one aspect of my work and 
in a smaller test bed.  My fault for handwaving in my reply, though.

The point is actually the "sometimes"...  My research was more in 
general compression and web compression was only one aspect.  My point 
is if you take a one byte file and run gzip -9 on it (again, the same 
algorithm as deflate), you get a 24 byte file.  As you increase that 
file size, you will reach a point where it becomes more beneficial to 
compress.  Though my example is both silly and pathological, it just 
shows that there are cases when compression may not be beneficial.  And 
one can imagine the average file size of a web site to be some kind of 
knob and as it turns (average file size increases as you go from site to 
site), the benefits become more and more evident.
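
The break-even idea is easy to try out. The sketch below uses Python's
gzip module rather than the gzip command line, so the exact sizes will
not match the 24 bytes quoted above (the command-line tool also stores
the file name in the header); the helper name gzip_overhead and the
sample payloads are invented for illustration.

```python
import gzip


def gzip_overhead(payload: bytes) -> int:
    """Compressed size minus original size.

    A positive result means gzip made the payload bigger."""
    return len(gzip.compress(payload, compresslevel=9)) - len(payload)


# A one-byte payload versus a repetitive HTML-like payload.
one_byte = b"a"
page = b"<p>mod_perl behind a reverse proxy with mod_cache</p>\n" * 200

for label, data in (("1 byte", one_byte), ("repetitive HTML", page)):
    compressed_size = len(data) + gzip_overhead(data)
    print(f"{label}: {len(data)} bytes -> {compressed_size} bytes")
```

The fixed header and trailer dominate for tiny inputs, so the one-byte
case grows, while the repetitive page shrinks considerably. Real pages
fall somewhere in between, which is the "knob" described above.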

For example, compressing an already compressed file is generally 
pointless (if it was done right the first time).  MP3, JPEG, GIF, etc. 
are all file formats that have or may have compression incorporated.  
PDFs can be compressed too if someone selected that option when creating 
it.  English text compresses well (25%, in general?) but two-byte 
encodings such as Chinese and Japanese (I think) get around 40-50% 
[handwaving again :-) there are more updated numbers out there].  Also, 
compression works if it is a uniform file; if a web page has a mix of 
text, images, etc., then each one has to be compressed individually.

As for Google, you are right -- I can imagine why it would work well for 
Google.  However, I can also hypothesize that it might be a special 
case.  I presume you mean the results of a query.  The result we get is 
a list of results which all are related to each other.  i.e., if you 
searched for "apache2 modperl", we can expect those two words to be in 
every result and the type of words to be similar from result to result 
[they would all be computer-oriented].  As compression aims to reduce 
redundancy, their results are perfect for it.  Especially if

Anyway, what I wanted to say is that there ought to be instances when 
compression is beneficial and when it isn't.  I think it is fine to do 
what the Yahoo site says and have it "on" by default; but if someone 
examines the traffic and data and realizes it should be "off", that 
isn't beyond reason.


>> As for dialup, if I remember from those dark modem days :-)
>
> Even non dialup customers can benefit. Many "broadband" connections 
> aren't very fast, especially in rural places (I'm thinking large 
> portions of the US).
>
> But all this talk is really useless in the abstract. Take a tool like 
> YSlow for a spin and see how your sites perform with and without 
> compression. Especially looking at the waterfall display.
>

Well, one good thing about deflate is that it is *fast*.  Very fast.  
So, while my silly one-byte file example shows there are exceptions, the 
break-even size might be close to one byte.  :-)

One cost savings might be to pre-compress files since it is more 
time-consuming to compress than decompress using deflate.  i.e., have 
them reside on the server in compressed form.  Of course, that offers 
many problems and is one reason why things like Stacker didn't really 
catch on (much)...
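
As a rough illustration of pre-compressing, the sketch below writes a
.gz copy next to each original file so the compression cost is paid once
rather than per request. The function name precompress and the ".gz"
naming convention are assumptions made for this example; how the web
server would then be told to serve the compressed copy is outside the
scope of the sketch.

```python
import gzip
import shutil


def precompress(path: str) -> str:
    """Write a gzip copy of `path` as `path + '.gz'` and return its name.

    The original file is left untouched."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb", compresslevel=9) as dst:
        # Stream in chunks so large files are not read into memory at once.
        shutil.copyfileobj(src, dst)
    return gz_path
```

One of the problems alluded to above is keeping the two copies in sync:
any change to the original means the .gz copy has to be regenerated.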

Ray





Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Michael Peters <mp...@plusthree.com>.
Raymond Wan wrote:

> I had looked at the effect compression has on web pages a while ago.  
> Though not relevant to modperl, there is obviously a cost to compression 
> and since most HTML pages are small, sometimes it is hard to justify. 

Not to discredit the work you did researching this, but a lot of people are studying the same thing 
and coming to different conclusions:

http://developer.yahoo.com/performance/rules.html

Yes, backend performance matters, but more and more we realize that the front end tweaks we can make 
give better performance for users.

Take google as an example. The overhead of compressing their content and decompressing it on the 
browser takes less time than sending the same content uncompressed over the network. I'd say the 
same is true for most other applications too.

> As for dialup, if I remember from those dark modem days :-)

Even non dialup customers can benefit. Many "broadband" connections aren't very fast, especially in 
rural places (I'm thinking large portions of the US).

But all this talk is really useless in the abstract. Take a tool like YSlow for a spin and see how 
your sites perform with and without compression. Especially looking at the waterfall display.

-- 
Michael Peters
Plus Three, LP


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Raymond Wan <rw...@kuicr.kyoto-u.ac.jp>.
Hi

Neil Gunton wrote:
> Well, that seemed to do the trick! So the caveat seems to be: Be 
> careful using both mod_deflate and mod_cache (mod_disk_cache 
> specifically) together if you have a large dynamic website that can 
> generate a large number of distinct pages. Mod_deflate produces a


This is probably a digression from your discussion, but I'm not sure if 
any of you have used gzip + md5sum together before.  I have, and it can 
be annoying especially if you are playing with large data files like I 
do.  This is because gzip seems to (not 100% sure) store some time 
information in the archive.  So, if you create two archives of the same 
files, they aren't identical...their md5sums do not match.

As deflate is essentially the same algorithm as gzip, it is somewhat the 
same annoyance...
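
The gzip container does have a modification-time field in its header
(MTIME in RFC 1952), which is enough to change the checksum. A small
sketch with Python's gzip module, where the helper name gzip_md5 is
invented for this example:

```python
import gzip
import hashlib


def gzip_md5(data: bytes, mtime: int) -> str:
    """MD5 of `data` after gzip compression with a given header timestamp."""
    compressed = gzip.compress(data, compresslevel=9, mtime=mtime)
    return hashlib.md5(compressed).hexdigest()


data = b"the same input every time\n" * 100

# Different timestamps in the header: the archives differ.
print(gzip_md5(data, 1) == gzip_md5(data, 2))

# Timestamp pinned to a fixed value: the archives are identical.
print(gzip_md5(data, 0) == gzip_md5(data, 0))
```

On the command line, gzip -n (--no-name) leaves the original name and
timestamp out of the header, which has the same effect.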


> Web pages seem to render a little faster in the browser too. That may 
> be my imagination and/or placebo effect, but it might make sense if 
> there isn't that additional compression/decompression going on both ends.
>
> The only downside is that people on extremely slow dialup connections 
> might notice longer download times for page text... but I have to 
> wonder if that's really an issue today. Back in 1998 perhaps you might 
> care about something being 20KB rather than 80KB, but surely not 
> today. In any case, don't dialup ISPs often implement their own 
> compression now?


I had looked at the effect compression has on web pages a while ago.  
Though not relevant to modperl, there is obviously a cost to compression 
and since most HTML pages are small, sometimes it is hard to justify.  
If users are downloading XML files of data, though, then that is of 
course worth it...but one could argue that if you are making XML files 
available for download, then wouldn't it be better to compress it 
yourself rather than asking Apache to compress on-the-fly.

As for dialup, if I remember from those dark modem days :-), even many 
of them had compression built in.  In fact, I think they had some form 
of the deflate/gzip/sliding window algorithm.  And for those of us who 
have tried gzipping an already-gzipped file, adding compression to 
something that is already compressed is generally counter-productive...

Anyway, I don't think it is much of an issue...might be more helpful to 
educate web page creators to not put MBs of images on a single page.  :-)

Ray




>
> Anyway, hope that's helpful to anybody running large dynamic websites 
> behind a reverse proxy. Keep mod_cache, maybe think about ditching 
> mod_deflate. The combination does technically work, but for large 
> numbers of pages, it can make your cache size (and your iowait) explode.


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Neil Gunton wrote:
> Neil Gunton wrote:
>> Neil Gunton wrote:
>> It seems like this might have something to do with mod_deflate, which 
>> I am using in combination with mod_disk_cache. This page gives a clue 
>> that there might be a problem with the way files are cached when these 
>> modules are both enabled:
>>
>> http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html 
> 
> I have just been doing some experimentation on my development 
> workstation. It seems that with mod_deflate enabled, mod_cache doesn't 
> cache properly, or at least not as I would expect: I tested with two 
> browsers (Mozilla and Opera), both with no cookies related to the site, and 
> loading the same page from each. Both requests were passed through to 
> the back-end, i.e. were cached separately. This is with mod_deflate 
> enabled for html pages. So I disabled mod_deflate (just commented out 
> that one line), restarted the servers, cleared the caches of both 
> browsers and mod_cache, and tried again. This time, the first request 
> was passed through to the backend (as expected), but the second request, 
> from the other browser for the same page, was this time retrieved from 
> mod_cache. Also, the cache directories on the server end look a lot 
> simpler, I guess because the Vary header is no longer being set by 
> mod_deflate. This is very interesting, I'm going to do some more testing 
> on the production server, by clearing the mod_disk_cache cache and 
> disabling mod_deflate for a while to see how things run. I know the 
> content transmitted will be larger and thus slower for people on slow 
> connections, but right now I'm interested in seeing how this affects the 
> performance of htcacheclean, and even du - see if times for traversing 
> the directories gets much better without all those extra Vary subdirs. 
> In any case, it would seem that the cache wasn't really working after 
> all, which might explain the large number of cache directories - 
> multiple versions of the same content. Yikes.

Well, that seemed to do the trick! So the caveat seems to be: Be careful 
using both mod_deflate and mod_cache (mod_disk_cache specifically) 
together if you have a large dynamic website that can generate a large 
number of distinct pages. Mod_deflate produces a Vary header, which 
forces mod_cache to store multiple versions of the same content. To 
compound this, every version involves additional subdirs in the cache, 
which makes traversing it in any fashion very, very time consuming, 
producing high iowait even for a fast 4 disk SCSI RAID0 setup.
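
To see why a Vary header multiplies cache entries, here is a toy model
in Python. It is not mod_cache's actual algorithm, and which request
headers were named in Vary on this server is an assumption (the config
is not shown in the thread); the function names and the sample header
values are made up.

```python
def cache_key(url, request_headers, vary):
    """Toy cache key: the URL plus the value of every request header
    named in the response's Vary list."""
    parts = [url]
    for name in vary:
        parts.append(f"{name.lower()}={request_headers.get(name, '')}")
    return "|".join(parts)


def normalise_encoding(request_headers):
    """Collapse Accept-Encoding to just 'gzip' or 'identity'.

    Simplified: q-values such as 'gzip;q=0' are ignored here."""
    accepted = request_headers.get("Accept-Encoding", "").lower()
    return {"Accept-Encoding": "gzip" if "gzip" in accepted else "identity"}


# Two browsers asking for the same page (header values are hypothetical).
requests = [
    {"User-Agent": "Mozilla/5.0 (hypothetical)", "Accept-Encoding": "gzip,deflate"},
    {"User-Agent": "Opera/9.6 (hypothetical)", "Accept-Encoding": "deflate, gzip, x-gzip"},
]
url = "/page/1"

# Keyed on the raw header values: one entry per distinct browser.
raw = {cache_key(url, r, ["User-Agent", "Accept-Encoding"]) for r in requests}

# Keyed on "compressed or not": both browsers share one entry.
norm = {cache_key(url, normalise_encoding(r), ["Accept-Encoding"]) for r in requests}

print(len(raw), "entries with raw headers;", len(norm), "after normalising")
```

With raw header values every distinct browser string produces its own
copy of the page, which matches the two-browser experiment described
earlier; reducing the varying header to a yes/no value keeps a single
copy per page.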

It took more than three hours just to delete the old cache.

Once I disabled mod_deflate, the new cache looks a lot cleaner - just 
the three levels of directory that I specified in the config via 
CacheDirLevels, and none of the extra .vary sub-levels.

Additionally, du now just takes a few seconds to traverse the cache, 
which currently is set at 1GB. Htcacheclean seems to be keeping up well 
in daemon mode, with -i -n options. The large, ongoing iowait on the 
server has disappeared completely.

Web pages seem to render a little faster in the browser too. That may be 
my imagination and/or placebo effect, but it might make sense if there 
isn't that additional compression/decompression going on both ends.

The only downside is that people on extremely slow dialup connections 
might notice longer download times for page text... but I have to wonder 
if that's really an issue today. Back in 1998 perhaps you might care 
about something being 20KB rather than 80KB, but surely not today. In 
any case, don't dialup ISPs often implement their own compression now?

Anyway, hope that's helpful to anybody running large dynamic websites 
behind a reverse proxy. Keep mod_cache, maybe think about ditching 
mod_deflate. The combination does technically work, but for large 
numbers of pages, it can make your cache size (and your iowait) explode.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Neil Gunton wrote:
> Neil Gunton wrote:
> It seems like this might have something to do with mod_deflate, which I 
> am using in combination with mod_disk_cache. This page gives a clue that 
> there might be a problem with the way files are cached when these 
> modules are both enabled:
> 
> http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html 

I have just been doing some experimentation on my development 
workstation. It seems that with mod_deflate enabled, mod_cache doesn't 
cache properly, or at least not as I would expect: I tested with two 
browsers (Mozilla and Opera), both with no cookies related to the site, and 
loading the same page from each. Both requests were passed through to 
the back-end, i.e. were cached separately. This is with mod_deflate 
enabled for html pages. So I disabled mod_deflate (just commented out 
that one line), restarted the servers, cleared the caches of both 
browsers and mod_cache, and tried again. This time, the first request 
was passed through to the backend (as expected), but the second request, 
from the other browser for the same page, was this time retrieved from 
mod_cache. Also, the cache directories on the server end look a lot 
simpler, I guess because the Vary header is no longer being set by 
mod_deflate. This is very interesting, I'm going to do some more testing 
on the production server, by clearing the mod_disk_cache cache and 
disabling mod_deflate for a while to see how things run. I know the 
content transmitted will be larger and thus slower for people on slow 
connections, but right now I'm interested in seeing how this affects the 
performance of htcacheclean, and even du - see if times for traversing 
the directories get much better without all those extra Vary subdirs. 
In any case, it would seem that the cache wasn't really working after 
all, which might explain the large number of cache directories - 
multiple versions of the same content. Yikes.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Michael Peters <mp...@plusthree.com>.
Adam Prime wrote:

> That does look like a big deal, if i were in your situation, I'd try 
> running with only mod_deflate, then only mod_cache, and see what 
> happens.  There are benefits to running the reverse proxy alone (without 
> mod_cache), so that'd be the first scenario i'd try.

Or split them up. If you have any static assets that can benefit from mod_deflate (Javascript, CSS, 
etc) then put mod_deflate on the proxies and mod_perl, mod_cache on the backend.

-- 
Michael Peters
Plus Three, LP


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Adam Prime <ad...@utoronto.ca>.
Neil Gunton wrote:
>
> It seems like this might have something to do with mod_deflate, which 
> I am using in combination with mod_disk_cache. This page gives a clue 
> that there might be a problem with the way files are cached when these 
> modules are both enabled:
>
> http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html 
>
>
> Seems like a very recent post (Nov 18th).
>
> Any ideas? Seems like a big problem, if you're trying to use a reverse 
> proxy on a large dynamic site, and also optimize bandwidth by using 
> mod_deflate too.
>
> Neil

That does look like a big deal. If I were in your situation, I'd try 
running with only mod_deflate, then with only mod_cache, and see what 
happens.  There are benefits to running the reverse proxy alone (without 
mod_cache), so that'd be the first scenario I'd try.

Adam

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Neil Gunton wrote:
> Well, the du just finished, it took 214 minutes to complete. I just took
> a look at one of the directories in the cache. Now, I have it set for a
> depth of 3, so I looked at d/d/d just randomly selected. Then I did a du
> there. Here's the output:
> 
> server:/var/cache/www/d/d/d# du -h
> 4.0K    ./2BykLs49Xm7cnV6MrWA.header.vary/Y/z/m
> 8.0K    ./2BykLs49Xm7cnV6MrWA.header.vary/Y/z
> 12K    ./2BykLs49Xm7cnV6MrWA.header.vary/Y
> 16K    ./2BykLs49Xm7cnV6MrWA.header.vary
> 4.0K    ./YFPZLpyo_NRtEUoJQQA.header.vary/k/a/y
> 8.0K    ./YFPZLpyo_NRtEUoJQQA.header.vary/k/a
> 12K    ./YFPZLpyo_NRtEUoJQQA.header.vary/k
> 16K    ./YFPZLpyo_NRtEUoJQQA.header.vary
> 16K    ./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O/b
> 20K    ./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O
> 24K    ./UM@uZ0AwL5n@QqLWnrA.header.vary/F
> 28K    ./UM@uZ0AwL5n@QqLWnrA.header.vary
> 4.0K    ./FrakgI6EKDUjb4dgMXQ.header.vary/G/N/n
> 8.0K    ./FrakgI6EKDUjb4dgMXQ.header.vary/G/N
> 12K    ./FrakgI6EKDUjb4dgMXQ.header.vary/G
> 16K    ./FrakgI6EKDUjb4dgMXQ.header.vary
> 80K    .
> 
> So you see, there are actually a lot more directories there than you
> might assume based on a 3-level tree! I didn't know it was doing all
> this as well, it makes more sense now that it would take a long time to
> traverse - we're talking about a huge number of directories after you do
> 3 levels, one for each letter (large and small case) at each level, then
> throw in those additional sub-levels... for EVERY leaf of the 3-level
> tree, that's staggering. I need to look into the documentation for
> mod_cache to see if there is something I need to tweak with this "vary"
> stuff - maybe it's doing more than it has to, but I just don't know.

It seems like this might have something to do with mod_deflate, which I 
am using in combination with mod_disk_cache. This page gives a clue that 
there might be a problem with the way files are cached when these 
modules are both enabled:

http://www.digitalsanctuary.com/tech-blog/general/apache-mod_deflate-and-mod_cache-issues.html

Seems like a very recent post (Nov 18th).

Any ideas? Seems like a big problem, if you're trying to use a reverse 
proxy on a large dynamic site, and also optimize bandwidth by using 
mod_deflate too.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Perrin Harkins wrote:
> A ton of RAM in the server might help too.

I've already got 4GB in there.

Well, the du just finished, it took 214 minutes to complete. I just took
a look at one of the directories in the cache. Now, I have it set for a
depth of 3, so I looked at d/d/d just randomly selected. Then I did a du
there. Here's the output:

server:/var/cache/www/d/d/d# du -h
4.0K	./2BykLs49Xm7cnV6MrWA.header.vary/Y/z/m
8.0K	./2BykLs49Xm7cnV6MrWA.header.vary/Y/z
12K	./2BykLs49Xm7cnV6MrWA.header.vary/Y
16K	./2BykLs49Xm7cnV6MrWA.header.vary
4.0K	./YFPZLpyo_NRtEUoJQQA.header.vary/k/a/y
8.0K	./YFPZLpyo_NRtEUoJQQA.header.vary/k/a
12K	./YFPZLpyo_NRtEUoJQQA.header.vary/k
16K	./YFPZLpyo_NRtEUoJQQA.header.vary
16K	./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O/b
20K	./UM@uZ0AwL5n@QqLWnrA.header.vary/F/O
24K	./UM@uZ0AwL5n@QqLWnrA.header.vary/F
28K	./UM@uZ0AwL5n@QqLWnrA.header.vary
4.0K	./FrakgI6EKDUjb4dgMXQ.header.vary/G/N/n
8.0K	./FrakgI6EKDUjb4dgMXQ.header.vary/G/N
12K	./FrakgI6EKDUjb4dgMXQ.header.vary/G
16K	./FrakgI6EKDUjb4dgMXQ.header.vary
80K	.

So you see, there are actually a lot more directories there than you
might assume based on a 3-level tree! I didn't know it was doing all
this as well, it makes more sense now that it would take a long time to
traverse - we're talking about a huge number of directories after you do
3 levels, one for each letter (large and small case) at each level, then
throw in those additional sub-levels... for EVERY leaf of the 3-level
tree, that's staggering. I need to look into the documentation for
mod_cache to see if there is something I need to tweak with this "vary"
stuff - maybe it's doing more than it has to, but I just don't know.
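
The size of the tree can be estimated with a little arithmetic. The
sketch below is a back-of-envelope calculation, not a description of
mod_disk_cache internals: it assumes one character per directory level
(consistent with the d/d/d path above), and it tries both a 52-letter
alphabet (upper and lower case) and a 64-character one (the names in the
listing also contain digits, "_" and "@", so 64 is a guess). The
function names are invented.

```python
def leaf_directories(alphabet_size: int, levels: int) -> int:
    """Upper bound on directories at the deepest level of the tree."""
    return alphabet_size ** levels


def total_directories(alphabet_size: int, levels: int) -> int:
    """Upper bound on directories at all levels below the cache root."""
    return sum(alphabet_size ** k for k in range(1, levels + 1))


def vary_directories(cached_urls: int, levels: int) -> int:
    """Extra directories from the layout in the du listing: one
    '.header.vary' directory plus `levels` nested ones per cached URL."""
    return cached_urls * (1 + levels)


for alphabet in (52, 64):
    print(alphabet, "chars:",
          leaf_directories(alphabet, 3), "leaf dirs,",
          total_directories(alphabet, 3), "dirs in total")

# The four entries in the listing above account for 16 directories.
print(vary_directories(4, 3))
```

So the plain 3-level tree alone can hold on the order of 140,000 to
260,000 leaf directories, and every cached URL that carries a Vary
header adds four more directories underneath one of them.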

Neil


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Perrin Harkins <pe...@elem.com>.
On Mon, Nov 24, 2008 at 3:46 PM, Michael Peters <mp...@plusthree.com> wrote:
> He's already using RAID0, which should be the best performance of RAID since
> it doesn't have to use any parity blocks/disks right?

Yes, I missed that.  He could still improve the throughput by adding more disks.

> And from what I've
> seen about SSD (can't find a link now) filesystems haven't caught up to it
> to make a real difference with one over the other. They do have much lower
> power usage though (which is why they find their way into laptops).

We're talking high-end SSD, not the stuff they put in laptops.  It's
fast, and you can make a RAID array of them, and it's within a
reasonable price range now.

A ton of RAM in the server might help too.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Michael Peters <mp...@plusthree.com>.
Perrin Harkins wrote:
> On Mon, Nov 24, 2008 at 3:16 PM, Michael Peters <mp...@plusthree.com> wrote:
>> Well except for getting 15K disks you probably won't be able to get much
>> more improvement from just the hardware.
> 
> You don't think so?  RAID and SSD can both improve your write
> throughput pretty significantly.

He's already using RAID0, which should be the best performance of RAID since it doesn't have to use 
any parity blocks/disks right? And from what I've seen about SSD (can't find a link now) filesystems 
haven't caught up to it to make a real difference with one over the other. They do have much lower 
power usage though (which is why they find their way into laptops).

-- 
Michael Peters
Plus Three, LP


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Holger Kipp <hk...@alogis.com>.
On Mon, Nov 24, 2008 at 03:37:29PM -0500, Perrin Harkins wrote:
> On Mon, Nov 24, 2008 at 3:16 PM, Michael Peters <mp...@plusthree.com> wrote:
> > Well except for getting 15K disks you probably won't be able to get much
> > more improvement from just the hardware.
> 
> You don't think so?  RAID and SSD can both improve your write
> throughput pretty significantly.

Using squid he could define one cache directory for every disk,
so striping won't increase the performance of the disks that much.
More important might be how the OS caches writes to mitigate the
limited I/O bandwidth of the disks.

With ReiserFS I have seen some benchmarks that are not really in
favour, like

http://linuxgazette.net/122/TWDT.html#piszcz

and my experience with UFS2 (albeit on FreeBSD) was much better
than with Linux/ReiserFS on the same machine. Neither were tuned, though,
so ymmv.

Regards,
Holger Kipp

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Perrin Harkins <pe...@elem.com>.
On Mon, Nov 24, 2008 at 3:16 PM, Michael Peters <mp...@plusthree.com> wrote:
> Well except for getting 15K disks you probably won't be able to get much
> more improvement from just the hardware.

You don't think so?  RAID and SSD can both improve your write
throughput pretty significantly.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Michael Peters wrote:
> Michael Peters wrote:
> 
> But these benchmarks (http://www.debian-administration.org/articles/388) 
> say the following:
> 
>   For quick operations on large file tree, choose Ext3 or XFS. 
> Benchmarks from other authors have
>   supported the use of ReiserFS for operations on large number of small 
> files. However, the present
>   results on a tree comprising thousands of files of various size (10KB 
> to 5MB) suggest that Ext3 or
>   XFS may be more appropriate for real-world file server operations
> 
> But they both say don't use ext2 :)

This may be a tangent, but my understanding is that the only real 
difference between ext2 and ext3 is the journaling, which is related to 
safety in the event of unclean shutdown rather than everyday 
performance. If anything, in fact, ext3 performs a little worse than 
ext2 because of the requirement to keep the journal (which means more 
writes to the disk for updates). Otherwise, all the optimization 
features such as dir_index are, I think, available for ext2 as well as 
ext3. I have noticed that for SSD drives (e.g. the Asus Eee PC, which I 
have), people recommend using ext2, since it's less likely to result in 
the write fatigue that those drives experience over time (you only get 
so many writes). And for laptops, ext2 results in fewer io writes. 
Finally, I have noticed my iowait times go down since I moved from using 
ext3 to ext2 on the server (previously I always used ext3, but for a 
recent rebuild I switched to ext2 to see how it did).

Of course I may be wrong about all this, but my experience seems to 
favor ext2 over ext3, at least for performance. Since I back everything 
up on the server anyway (using RAID0, a necessity), I am more concerned 
with performance than unclean shutdowns. In any case the server is in a 
datacenter with UPS, so that is not so likely, though it did happen once 
and I didn't lose any data even then.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Michael Peters <mp...@plusthree.com>.
Michael Peters wrote:

> According to these benchmarks 
> (http://fsbench.netnation.com/new_hardware/2.6.0-test9/scsi/bonnie.html) 
> ReiserFS handles deletes much better than ext2 (10,015/sec vs 729/sec)

But these benchmarks (http://www.debian-administration.org/articles/388) say the following:

   For quick operations on large file tree, choose Ext3 or XFS. Benchmarks from other authors have
   supported the use of ReiserFS for operations on large number of small files. However, the present
   results on a tree comprising thousands of files of various size (10KB to 5MB) suggest that Ext3 or
   XFS may be more appropriate for real-world file server operations

But they both say don't use ext2 :)

-- 
Michael Peters
Plus Three, LP


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Michael Peters <mp...@plusthree.com>.

Neil Gunton wrote:
> Perrin Harkins wrote:
>> On Mon, Nov 24, 2008 at 2:42 PM, Neil Gunton <ne...@nilspace.com> wrote:
>>> The section on "Maintaining the Disk Cache" says you should use
>>> htcacheclean, which is what I've been doing, and it doesn't seem to 
>>> be up to
>>> the job.
>>
>> I can't speak to your filesystem question but you might consider
>> getting better disks.  Either a RAID system or a SSD would help your
>> write speed and both are pretty cheap these days.
> 
> I'm using 4x10k SCSI drives in RAID0 configuration currently, on an 
> Adaptec zero channel SmartRaid V controller. Filesystem is ext2.

Well, except for getting 15K disks, you probably won't be able to get much more improvement from just 
the hardware.

According to these benchmarks 
(http://fsbench.netnation.com/new_hardware/2.6.0-test9/scsi/bonnie.html) ReiserFS handles deletes 
much better than ext2 (10,015/sec vs 729/sec)

-- 
Michael Peters
Plus Three, LP


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by John Hallam <jo...@mmmi.sdu.dk>.
On Mon, 24 Nov 2008, Neil Gunton wrote:

> I think the issue here is the large size of the directory tree itself - 
> simply traversing this seems to be a problem. I started off a du this 
> morning on that tree, at around 9am, and it's now after 12 midday and 
> the command is still not done yet. Meanwhile my iowait has doubled on 
> the server as a result.

 	Just a random thought...  The O(n) directory search/traversal in 
filesystems only hits you if you have directories with many many files in. 
If your directories are like the one you sampled, with few items in, then 
maybe you are thrashing one of the filesystem caches -- inodes, vnodes or 
such -- while traversing the tree.  I don't recall off-hand how you check 
this, though looking at the output of iostat and vmstat would give you 
some idea of where the traffic is in the VM and block IO subsystems.
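
One possible way to watch those caches, sketched here under assumptions: on Linux, /proc/sys/fs/dentry-state and /proc/sys/fs/inode-nr expose counters for the dentry and inode caches as whitespace-separated integers. The function names below are hypothetical, and only the first two fields of each file are read. Sampling the numbers before and during a tree walk would show whether the caches are being turned over.

```python
def parse_counters(text, names):
    """Pair whitespace-separated integers from a /proc file with field names.

    Extra trailing fields in the file are ignored.
    """
    values = [int(field) for field in text.split()]
    return dict(zip(names, values))


def read_vfs_cache_state(dentry_path="/proc/sys/fs/dentry-state",
                         inode_path="/proc/sys/fs/inode-nr"):
    """Read dentry and inode cache counters (assumes a Linux /proc layout)."""
    with open(dentry_path) as fh:
        dentry = parse_counters(fh.read(), ["nr_dentry", "nr_unused"])
    with open(inode_path) as fh:
        inode = parse_counters(fh.read(), ["nr_inodes", "nr_free_inodes"])
    return {"dentry": dentry, "inode": inode}
```

Calling read_vfs_cache_state() every few seconds while du runs, and comparing the counters with the iostat and vmstat output, is one way to see where the traffic goes.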

Best wishes,

 	John

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
André Warnier wrote:
> Neil Gunton wrote:
> [...]
> Hi.
> I am not really an expert on large websites, caches and so on, but in 
> our applications we are managing a large number of files.
> One of the things we have learned over the years, is that even on modern 
> operating systems, having large numbers of entries in each directory is 
> an absolute performance killer.
> This may or may not be relevant to your particular problem, but what is 
> the average number of entries you have *per directory* ?

I'm not sure what the average number of files per directory is 
currently. Is there a Linux tool which gives that kind of statistic?

Looking at one random bucket, there were only 2 files in there.
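
A minimal sketch of one way to collect that statistic, assuming Python is available on the server. The function name directory_stats is made up for illustration. It walks the tree once and counts the entries (files plus subdirectories) in each directory, so on a tree this size it will take about as long as du does.

```python
import os


def directory_stats(root):
    """Walk root once and summarise how many entries each directory holds."""
    counts = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Entries in this directory: subdirectories plus files.
        counts.append(len(dirnames) + len(filenames))
    if not counts:
        return {"directories": 0, "average": 0.0, "largest": 0}
    return {
        "directories": len(counts),
        "average": sum(counts) / len(counts),
        "largest": max(counts),
    }
```

Pointing directory_stats at the cache root returns the number of directories visited, the average number of entries per directory, and the size of the largest directory.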

I think the issue here is the large size of the directory tree itself - 
simply traversing this seems to be a problem. I started off a du this 
morning on that tree, at around 9am, and it's now after 12 midday and 
the command is still not done yet. Meanwhile my iowait has doubled on 
the server as a result. Obviously it's a lot of work just traversing 
this tree, since du is not even doing any pruning, just walking the 
directory tree. It makes me wonder if there's something wrong with my 
system, though it seems ok in all other respects. I think this is just a 
not-very-efficient datastructure, at least with respect to this 
filesystem, hence my original question about reiserfs. I think I need 
either a filesystem better suited to traversing large directory trees, 
or else a different tool that keeps track of the cache in a different 
manner.
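
To make the "different tool" idea concrete, here is a rough sketch of a cleaner that scans the tree once, then decides in memory which files to drop. This is an illustration only, not how htcacheclean works internally. The names scan_cache and plan_prune are hypothetical, and the sketch only plans deletions (oldest modification time first) without removing anything.

```python
import os


def scan_cache(root):
    """Return (mtime, size, path) for every regular file under root."""
    entries = []
    stack = [root]
    while stack:
        current = stack.pop()
        with os.scandir(current) as it:
            for entry in it:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file(follow_symlinks=False):
                    info = entry.stat(follow_symlinks=False)
                    entries.append((info.st_mtime, info.st_size, entry.path))
    return entries


def plan_prune(entries, limit_bytes):
    """Pick the oldest files to remove until the total fits in limit_bytes.

    Returns the list of paths to remove and the total size that would remain.
    """
    total = sum(size for _, size, _ in entries)
    victims = []
    for mtime, size, path in sorted(entries):
        if total <= limit_bytes:
            break
        victims.append(path)
        total -= size
    return victims, total
```

A call such as plan_prune(scan_cache(cache_root), 1000 * 1024 * 1024) would list what has to go to get back under a 1000MB limit. The scan still has to touch every directory, so this does not avoid the traversal cost; it only shows that the pruning decision itself is cheap once the file list is in memory.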

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by André Warnier <aw...@ice-sa.com>.
Neil Gunton wrote:
[...]
Hi.
I am not really an expert on large websites, caches and so on, but in 
our applications we are managing a large number of files.
One of the things we have learned over the years, is that even on modern 
operating systems, having large numbers of entries in each directory is 
an absolute performance killer.
This may or may not be relevant to your particular problem, but what is 
the average number of entries you have *per directory* ?


Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Perrin Harkins wrote:
> On Mon, Nov 24, 2008 at 2:42 PM, Neil Gunton <ne...@nilspace.com> wrote:
>> The section on "Maintaining the Disk Cache" says you should use
>> htcacheclean, which is what I've been doing, and it doesn't seem to be up to
>> the job.
> 
> I can't speak to your filesystem question but you might consider
> getting better disks.  Either a RAID system or a SSD would help your
> write speed and both are pretty cheap these days.

I'm using 4x10k SCSI drives in RAID0 configuration currently, on an 
Adaptec zero channel SmartRaid V controller. Filesystem is ext2.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Perrin Harkins <pe...@elem.com>.
On Mon, Nov 24, 2008 at 2:42 PM, Neil Gunton <ne...@nilspace.com> wrote:
> The section on "Maintaining the Disk Cache" says you should use
> htcacheclean, which is what I've been doing, and it doesn't seem to be up to
> the job.

I can't speak to your filesystem question but you might consider
getting better disks.  Either a RAID system or a SSD would help your
write speed and both are pretty cheap these days.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Neil Gunton wrote:
> http://httpd.apache.org/docs/2.0/mod/mod_disk_cache.html#cachegcinterval

Oops - sorry, I seem to have been looking at the 2.0 docs, rather than 
the 2.2. In 2.2, it appears that CacheGcInterval has disappeared...

Now, looking at the 2.2 caching guide:

http://httpd.apache.org/docs/2.2/caching.html

The section on "Maintaining the Disk Cache" says you should use 
htcacheclean, which is what I've been doing, and it doesn't seem to be 
up to the job.

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Perrin Harkins wrote:
> One thing you didn't mention is why you're using mod_cache at all for
> things not generated by mod_perl.  Why don't you serve the static
> files directly from your front-end server?  That's the most common
> setup I've seen, with proxying only for mod_perl requests.

Yes, I am only caching mod_perl content. I exclude things like the 
static files and images. I cache mod_perl output for performance in 
cases like slashdottings (or, these days, links from digg or reddit 
etc). The problem is, the site gets so many page requests, that 
htcacheclean just seems to be a little overwhelmed.

I'm looking at Squid right now, and have sent a message to their list to 
see what they think. At first glance, Squid does seem to have a fairly 
big list of configuration directives, so it's possible it might be able 
to handle what I need. I'm open to switching, if it turns out that Squid 
uses a more scalable cache pruning methodology. I'm a little sad to see 
that Apache's mod_cache doesn't seem to even be complete yet - e.g. 
directives like CacheGcInterval aren't implemented:

http://httpd.apache.org/docs/2.0/mod/mod_disk_cache.html#cachegcinterval

Maybe Squid is more mature in the caching department... dunno, but worth 
a look. I'd appreciate any more experienced people here educating me if 
this is wrong.

Thanks again,

Neil

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Perrin Harkins <pe...@elem.com>.
On Mon, Nov 24, 2008 at 1:56 PM, Neil Gunton <ne...@nilspace.com> wrote:
> Someone replied to me off-list suggesting using Squid instead of httpd for
> the front-end caching reverse proxy. I guess that is a good question - I use
> Apache for proxying mainly because I know apache quite well, and like being
> able to use mod_rewrite and other neat features that httpd gives. I've never
> used Squid. Does anyone have opinions there?

I think you hit the main issue right there: squid is not apache and
you can't use the same tools with it.  I also haven't seen any recent
benchmark suggesting squid performs better, but I'd like to run a set
of benchmarks on all the recent proxy servers to really sort this out.

> Does anyone run a 3-layer combination of Squid for cache, and then an Apache
> front end proxy (no mod_cache) for its mod_rewrite capabilities, and then
> the back-end mod_perl server?

That's a bad idea.  Too much overhead.

> I need mod_rewrite at some point for stuff like stopping image hotlinking
> from other websites (people stealing my bandwidth by making my server act as
> an image server for their forums, auctions etc), and other access control
> stuff. I'll have to look into whether squid can do all that.

Squid can do a lot, but you have to learn it, and it's not as
comprehensive as apache.

One thing you didn't mention is why you're using mod_cache at all for
things not generated by mod_perl.  Why don't you serve the static
files directly from your front-end server?  That's the most common
setup I've seen, with proxying only for mod_perl requests.

- Perrin

Re: Best filesystem type for mod_cache in reverse proxy?

Posted by Neil Gunton <ne...@nilspace.com>.
Neil Gunton wrote:
> The cache and front-end proxy help to serve images without bogging down 
> the heavy mod_perl processes, while also obviously caching the mod_perl 
> content. The site gets around 100,000 page requests or more per day. The 
> cache is set to 1000MB, with htcacheclean running in daemon mode, 
> interval 60 minutes (but looking at the performance charts, it seems to 
> be running constantly).
> 
> I am finding that the cache directories that mod_cache builds are very 
> large, and take a long time to traverse under ext2. There is currently 
> about 10 GB under the cache according to du, and it took 162 minutes 
> just to tell me that. Basically, htcacheclean is not keeping up. I'm 
> using three levels of directory. Htcacheclean also takes a long time to 
> process this if I try running it from cron nightly, during which time I 
> would see a huge spike in iowait on the server, and it would take upward 
> of 3 hours to complete. If I run htcacheclean in daemon mode, using the 
> -n (nice) option, then it doesn't seem to be able to keep up, the cache 
> just creeps up in size. If I take off the nice option, then it takes up 
> a lot more resources, to the point where I'm concerned it'll be 
> impacting the server performance by monopolising the disks.
> 
> So what I'm observing is that at least part of the problem appears to be 
> that the directory structure is just very, very big and wide and takes a 
> long time to traverse, even for basic system functions like du.
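
Some back-of-the-envelope arithmetic on why a hashed, fixed-depth layout gets so wide. The sketch assumes every directory name is a fixed number of characters drawn from a 64-character alphabet; the name length and alphabet size are assumptions for illustration, and only the three levels come from the setup described above.

```python
def max_leaf_directories(levels, name_length, alphabet_size=64):
    """Upper bound on leaf directories for a hashed, fixed-depth layout."""
    return alphabet_size ** (levels * name_length)


def expected_files_per_leaf(file_count, levels, name_length, alphabet_size=64):
    """Average files per leaf if hashes spread evenly over every possible leaf."""
    return file_count / max_leaf_directories(levels, name_length, alphabet_size)
```

With three levels and two-character names this bound is 64**6, far more leaves than there are cached files, so nearly every file would end up in a directory chain of its own. That would fit a tree that is slow to walk even though each bucket holds only a handful of files.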

Someone replied to me off-list suggesting using Squid instead of httpd 
for the front-end caching reverse proxy. I guess that is a good question 
- I use Apache for proxying mainly because I know apache quite well, and 
like being able to use mod_rewrite and other neat features that httpd 
gives. I've never used Squid. Does anyone have opinions there? Is Squid 
better at managing its cache files in a sane (and efficient, i.e. no 
100% iowait) fashion?

Does anyone run a 3-layer combination of Squid for cache, and then an 
Apache front end proxy (no mod_cache) for its mod_rewrite capabilities, 
and then the back-end mod_perl server?

I need mod_rewrite at some point for stuff like stopping image 
hotlinking from other websites (people stealing my bandwidth by making 
my server act as an image server for their forums, auctions etc), and 
other access control stuff. I'll have to look into whether squid can do 
all that.

I'm open to alternatives, if it turns out that Apache's mod_cache simply 
isn't mature enough yet. I notice that some of the features of mod_cache 
have not even been implemented yet, so maybe this module isn't really 
ready for prime time yet? Opinions? Surely most people using mod_perl in 
a production environment must be using some form of reverse proxy, since 
it just makes so much sense from a server utilization point of view.

Thanks again,

Neil