Posted to asp@perl.apache.org by Brat Wizard <br...@naxs.com> on 2002/01/22 16:48:54 UTC

OT HELP/ADVICE: Experiences with db's & serving lots of pages


Howdy--

This is an off-topic post-- feel free to ignore it.

I know the folks on this list are active web developers and generally
smart people, and this is a problem that has been vexing me for some time
now-- I am at the point of having to do _something_ and I could use some
advice...

I am working on a large application built on top of PostgreSQL.
The system is currently spread out over several hosts and will
(relatively soon) grow to handling a great many requests per short
interval. (As an idea of what the site is/does, think "listings".)
Overall everything works fine. One issue to contend with is that portions
of the data expire fairly frequently as old listings fall off and new
listings are added (in hundreds of categories). Another minor wrinkle is
that the listings need to be accessible in any of about five sort orders.

Currently the lists of listings are statically cached on the system.
I have a back-end process that runs periodically to rebuild the static
caches; it basically preprocesses the listings and prepares static HTML
"chunks" that can then be included at the appropriate time. This
I have found to be pretty much the fastest approach from the user's
point of view, and probably, I'm thinking, the least of the evils as far
as system load goes. But I'm concerned about several things and so
I keep mulling over some other ideas and this is what I need advice
on...

At the moment, rebuilding the static caches is not a big issue- it
doesn't take long and it's possible to get it all rebuilt in less than
half an hour. I'm concerned though about what the next step will be when the
number of listings grows to the point that half an hour (or an hour)
isn't enough. One ameliorating factor, I suspect, will be that the
number of hosts needed to serve users will continue to grow to the point
that the sensible thing will be to break up the site to serve up only a
subset of categories per host (or host cluster) and thus the load will
be spread out that way. This will somewhat reduce the amount of overhead
spent rebuilding category listings for a bit-- but ultimately, in my
reasoning-- the system will reach a point where it is pretty much
continuously rebuilding category listings and putting a constant strain
on the main "live" database. So somewhere down the road another solution
will have to be found. What other solutions are there?

I have considered an alternative approach to maintaining category
listings-- instead of pre-generating entire pages as static chunks- what
about using something like gdbm (or whatever) to make tiny little
category databases that store pre-generated __rows__ of each page, and
then pull "x" of them at a time to make a page... has anyone tried
something like this? What is the performance like? Is there much
overhead associated with constantly opening and closing these types of
database files (I'm not really familiar with non-Postgres databases)?
Would this make it easier to deal with sorting? Ideally it would be nice
to maintain only one body of category data and just re-sort it on the fly
(or maintain indexes) as needed. How much efficiency would be lost
versus time to serve up pages? I know there has to be a trade-off
someplace but how bad will it be?
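
To make the idea concrete, here is roughly what I am picturing-- one GDBM
file per category, holding pre-rendered HTML rows keyed by a made-up
"<sort_order>:<rank>" scheme. This is purely illustrative, not working code:

use strict;
use warnings;
use GDBM_File;

my $category = 'antiques';
my %rows;
tie %rows, 'GDBM_File', "/var/cache/listings/$category.gdbm",
    &GDBM_WRCREAT, 0644
    or die "cannot tie cache for $category: $!";

# The back-end rebuild process would store one pre-rendered HTML row
# per key, e.g. $rows{"price_asc:42"} = '<tr><td>...</td></tr>';

# At request time, pull "x" rows for a given page and sort order:
my ($sort, $page, $per_page) = ('price_asc', 2, 25);
my $start = ($page - 1) * $per_page;
my @chunk = grep { defined }
            map  { $rows{"$sort:$_"} }
            ($start .. $start + $per_page - 1);
print join("\n", @chunk), "\n";

untie %rows;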

Is there some other solution I'm missing? Is it faster to access
something pre-generated from the disk (doing asynch system calls vs.
lighter-weight network db access calls) to get data? I'm cool with the
idea that the list of listings does not have to be accurate up to the
absolute minute- reasonably current (say within a half-hour, an hour, or
two hours) is close enough...

What are your experiences with this type of situation?

Thanks for any help or guidance anyone can offer-- there don't seem to
be a lot of places "out there" that one can go to get this type of
information.

Regards,

John Whitten
brat@naxs.com
Wizard.Org, Inc.



---------------------------------------------------------------------
To unsubscribe, e-mail: asp-unsubscribe@perl.apache.org
For additional commands, e-mail: asp-help@perl.apache.org


Re: OT HELP/ADVICE: Experiences with db's & serving lots of pages

Posted by Joshua Chamas <jo...@chamas.com>.
Joshua Chamas wrote:
> 
> As your data set grows, its better & easier to cache on the fly than
> to prebuild your cached data.  For this purpose, you could use
> the $Response->Include() cache API to do component based caching,
> which is supported in Apache::ASP 2.29.
> 
> Check out http://www.apache-asp.org/objects.html#%24Response-%3EI3a1a8b77
> 

BTW, if you do decide to use include caching, please read:

  http://www.apache-asp.org/config.html#Caching

as caching will not work if you do not follow directions there.

Also consider a mod_proxy front end to cache whole static HTML
pages too!

--Josh
_________________________________________________________________
Joshua Chamas                           Chamas Enterprises Inc.
NodeWorks Founder                       Huntington Beach, CA  USA 
http://www.nodeworks.com                1-714-625-4051



Re: OT HELP/ADVICE: Experiences with db's & serving lots of pages

Posted by Joshua Chamas <jo...@chamas.com>.
Brat Wizard wrote:
> 
> Howdy--
> 
> This is an off-topic post-- feel free to ignore it.
> 
> I know the folks on this list are active web developers and generally
> smart people and this is a problem that has been vexing me for some time
> now-- I am at the point of having to do _something_ and I could use some
> advice...
> 
> I am working on a large application which is built on top of postgresql.
> The system is currently spread out over several hosts and will
> (relatively soon) grow to handling many many requests per short
> interval. (As an idea of what the site is/does, think "listings").
> Overall everything works fine. An issue to contend with is fairly
> frequently portions of the data expire as old listings fall off, and new
> listings are added (in hundreds of categories). Another minor wrinkle is
> that the listings need to be accessible from one of about 5 sort orders.
> 

As your data set grows, it's better and easier to cache on the fly than
to prebuild your cached data.  For this purpose, you could use
the $Response->Include() cache API to do component-based caching,
which is supported in Apache::ASP 2.29.

Check out http://www.apache-asp.org/objects.html#%24Response-%3EI3a1a8b77

Let's say you have a listings page that does an expensive SQL query
to generate 50 listings, or a listings page that does 50
inexpensive SQL queries.  Chances are that if you cache the results
to disk locally, you will see great performance gains: a calculation
that might normally take .25 seconds can be reduced to a .03-second
cache fetch.  If you use a modern OS that handles file buffering
well, the disk-based cache will really be served from local RAM
anyway, so it can be really fast.

The trick to using Apache::ASP include caching is to create the largest
include that can be reused between your web site visitors.  So if this
were a real estate listing search, you could do:

$Response->Include({
        File    => 'realestate_search.inc',
        Cache   => 1,    # to activate cache layer
        Expires => 300,  # to expire in 5 minutes
    },
    $Request->Form('search_zip')
);

The search_zip value will be passed into the include as $_[0], or shift(@_),
and the component cache key will be based in part on search_zip, so
the caching will be per query.  The output from this component
would be fully cached, so no time would even be spent generating
the HTML from the SQL.

If you didn't mind spending time generating the HTML, but just
wanted to cache the SQL results, you could do something like:

my $rows = $Response->Include({
        File    => 'realestate_search_results.inc',
        Cache   => 1,
        Expires => 300,
    },
    $Request->Form('search_zip')
);


and then the include realestate_search_results.inc could end with something like:

<%
 ...
 return $rows;
%>

The returned result rows will also be cached, as the include caching layer
caches both the output and return value of a cached include.

What I would not do is cache the results of a simple one-row select, as the
cache fetch may be even slower than the time the database would spend doing
the select itself.  Deciding which includes to cache must be done carefully.
Now if one needed to scale to a large cluster, I take it back: then it can
be worthwhile to cache even simple selects just so you can front your
database with more web servers.  Caching static SQL table data is a
no-brainer, but caching dynamic data needs to be done with great care.

--Josh

_________________________________________________________________
Joshua Chamas                           Chamas Enterprises Inc.
NodeWorks Founder                       Huntington Beach, CA  USA 
http://www.nodeworks.com                1-714-625-4051



Re: OT HELP/ADVICE: Experiences with db's & serving lots of pages

Posted by Joshua Chamas <jo...@chamas.com>.
> The searching has me more concerned. That one is a lot more difficult to anticipate and control. Initially I think I'm just going to have to take
> the hit on servers and let it bang on the database directly. I just don't see any other way around it. As the site gets bigger, I think it may work
> to move searches to another machine and handle them out of a "nearly-live" database
> -- perhaps even served directly out of ram (...?) and implemented with a lighter-weight faster responding database platform (mysql or dbm perhaps?)
> 

One common technique I use for handling searches against a database is to 
acquire an exclusive lock, so that no more than one search can run at
a time on the system.  For a system that does more than searching, this
ensures that no matter how many searches have queued up, the rest of the 
site will still perform reasonably, and for a multiprocessor database it will
further ensure that the searching does not use more than one CPU at a time.

In MySQL, I typically do this with GET_LOCK() user locks.  In Oracle I
tend to do this by acquiring a row level lock somewhere relevant with
a select ... for update, and then commit to allow another search to begin.
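
In DBI terms, the MySQL variant might look roughly like this (the lock name,
timeout, and the search query itself are just placeholders):

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=listings', 'user', 'pass',
                       { RaiseError => 1 });

# Wait up to 30 seconds for the single site-wide search slot.
my ($got) = $dbh->selectrow_array(q{SELECT GET_LOCK('site_search', 30)});
die "search queue timed out\n" unless $got;

# Run the expensive search while holding the lock.
my $results = $dbh->selectall_arrayref(
    q{SELECT id, title FROM listings WHERE title LIKE ?}, {}, '%foo%');

# Release the slot so the next queued search can begin.
$dbh->do(q{SELECT RELEASE_LOCK('site_search')});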

With this kind of queue forming, the next thing to worry about
is having enough mod_perl servers to wait in the queue.  You
cannot get away with just 5 backend mod_perl servers in a dual
mod_proxy/mod_perl type of config; you would probably need 20+
mod_perl servers at the very minimum so that when the queue
builds up it does not starve your entire site.

--Josh

_________________________________________________________________
Joshua Chamas                           Chamas Enterprises Inc.
NodeWorks Founder                       Huntington Beach, CA  USA 
http://www.nodeworks.com                1-714-625-4051



Re: OT HELP/ADVICE: Experiences with db's & serving lots of pages

Posted by Brat Wizard <br...@naxs.com>.
>    Hope my comments have been of some small help.
>    Ray
>

Ray-

Actually your comments have been greatly helpful. It is very useful to
have a dialog with other folks who have actually implemented large
database + web solutions. I know that each one is different, or most of
them anyway- and that is one of the things that makes all this
challenging.

In our scenario, we are working on model data (and pre-loaded "live"
data) but do not yet have public users so we do not have "live" usage
data (history) to look at yet (which always makes the problem even more
fun to try to solve ;) So I am attempting to anticipate demand and
figure out what usage requirements will be based on the nature of the
site (olm listings). Your point regarding seasonal demand is interesting
and one that I had not considered. Offhand, I cannot think of many
categories of listings that are particularly seasonal in our setup but
there probably are some. On the other hand, there probably are areas of
the site that are (will be) considerably less trafficked than others. I
definitely need to rebuild all pages every couple of hours or so but you
are absolutely right in pointing out that a scheduled and/or slewed
approach might well reduce the loading factor considerably. (Thanks!) In
rebuilding listings, I don't get the luxury of regenerating one page or
another, rather an entire category at a time.

From your input, and from that of Josh, Thanos and everyone else who's
helped out- it would seem that I'm already on the right track in
generating static chunks and what's left to figure out is the best
scheduling mechanism (which may in large part not be known until real
history can be developed).

The searching has me more concerned. That one is a lot more difficult to
anticipate and control. Initially I think I'm just going to have to take
the hit on servers and let it bang on the database directly. I just
don't see any other way around it. As the site gets bigger, I think it
may work to move searches to another machine and handle them out of a
"nearly-live" databas
-- perhaps even served directly out of ram (...?) and implemented with a
lighter-weight faster responding database platform (mysql or dbm
perhaps?)

Since I'm already certain that the system will grow wider and the most
obvious split point is across categories, this isn't as unwieldy as it
may seem at first glance and all that's left would be to implement some
back-channel inter-machine communication layer to search across
categories that are maintained on other machines and to collect the
search results and spool them out to the browser.

Does this sound feasible/practical?
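
Roughly, the kind of thing I am picturing (host names, the /search URL, and
the pipe-delimited response format are all invented for illustration):

use strict;
use warnings;
use LWP::UserAgent;
use URI::Escape qw(uri_escape);

my @category_hosts = qw(listings1.example.com listings2.example.com);
my $ua    = LWP::UserAgent->new(timeout => 10);
my $query = uri_escape('antique oak dresser');

my @results;
for my $host (@category_hosts) {
    my $res = $ua->get("http://$host/search?q=$query");
    next unless $res->is_success;
    # Each line of the (hypothetical) response: "score|listing_id|title"
    push @results, map { [ split /\|/, $_, 3 ] } split /\n/, $res->content;
}

# Merge by score and spool the top hits out to the browser.
for my $row (sort { $b->[0] <=> $a->[0] } @results) {
    print "$row->[1]: $row->[2]\n";
}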

John





Ray Cote wrote:

> Hi Brat:
>
> >What are your experiences with squid? Have you used it before? What
> sort
> >of improvement have you typically seen by using it?
>
> Short-hand, Squid delivers as quickly as Apache, just from a
> different machine, thus freeing up your primary server. Plus, since
> Squid actually delivers from RAM, it should be quicker than Apache
> delivering from disk.
>
> >I've never seen a well-crafted study of
> >this scenario and no opinions I trust on the subject. One of these
> days
> >I reckon I'll have to model it and find out- but it would be nice to
> find
> >someone who already has an opionion.
>
> I do note that one of the other answers on the list suggests letting
> ASP deliver direct out of the database and then cache that. How well
> it works vs static? Expect this to be highly system dependent.
> At present, we have a fairly small search engine (food related,
> http://www.digitalhearth.com/) that generates static pages at night
> for the links. However, one of the things we find is that usage goes
> in cycles. a) seasonal, b) depending on articles other places
> publish, etc.
>
> Assume you have 1,000 pages (just for discussion). Now ask how many
> of those 1,000 pages are going to be delivered today? If all 1,000
> will be hit, then static may work well-- unless, of course, you can
> maintain all 1,000 in your cache, in which case we're back to thinking
> dynamic delivery is fine.
>
> If, however, you find that only 100 of the 1,000 are going to be
> delivered, then you start to lean towards the dynamic delivery
> method. Why generate those other 900 pages if you don't deliver them?
>
>  From your earlier message, you stated that you rebuilt the static
> pages many times during the day. I'd expect you cannot get more than
> 20% usage (talking off the top of my head) of those pages. Thus, by
> building static, you're wasting the build time for 80% of your pages.
> Wow!
>
> >So the next question all this leads to is what is the best "search"
> >scenario for a similar setup?
>
> Again my question relates to how many different searches do you get?
> We've been discussing similar problems here (not implementing any of
> these, so take these as off-the-cuff suggestions). May be interesting
> to disconnect these caches from the session and just cache them
> globally. Scan back through your search logs and determine what, if
> any, clusters you get. Do 50% of the people search for foo? Then a
> cached foo search will save a lot of work. You can just keep this
> tucked away somewhere and, when you want to 'refresh' the database,
> let your systems know that the search results need to be updated.
> Perhaps some structure with two simple tables (obviously
> implementation would be a bit more robust):
>
> |SearchID|SearchString |
> |   1    | foo         |
> |   2    | amazing foo |
>
> |SearchID| Rank | IndexIntoDatabase |
> |   1    |   1  |     80988         |
> |   1    |   2  |     5233          | etc.
>
> This keeps your search results hanging about in rank order for each
> specific search string. May also want to maintain a third table:
>
> |SearchID|Frequency|
> |   1    |   8342  |
> |   2    |     5   |
>
> which tells you how often each of the searches was requested. Then
> when you decide to refresh the queries, you could always prime this
> table with the top 100 searches so they respond quickest.
>
> I have to say I really enjoy these types of problems because there is
> simply no correct answer. There is only the ability to tune your
> solution against field demand. A never ending problem.
>
> Hope my comments have been of some small help.
> Ray
>
> --
> -----------------------------------------------------------------
> Raymond Cote, President                 Appropriate Solutions, Inc.
> www.AppropriateSolutions.com       rgacote@AppropriateSolutions.com
> 603.924.6079(v)  POB 458, Peterborough, NH 03458    603.924.8668(f)
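
For concreteness, a rough DBI sketch of the cached-search tables Ray
describes above; the table and column names are illustrative only:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=listings', 'user', 'pass',
                       { RaiseError => 1 });

# Look up a previously cached search; returns listing ids in rank order,
# or an empty list if this search string has never been cached.
sub cached_results {
    my ($string) = @_;
    my ($search_id) = $dbh->selectrow_array(
        q{SELECT search_id FROM cached_search WHERE search_string = ?},
        {}, $string);
    return () unless defined $search_id;

    # Bump the frequency counter so popular searches can be primed first.
    $dbh->do(q{UPDATE cached_search SET frequency = frequency + 1
               WHERE search_id = ?}, {}, $search_id);

    my $ids = $dbh->selectcol_arrayref(
        q{SELECT listing_id FROM cached_search_result
          WHERE search_id = ? ORDER BY result_rank}, {}, $search_id);
    return @$ids;
}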



Re: OT HELP/ADVICE: Experiences with db's & serving lots of pages

Posted by Brat Wizard <br...@naxs.com>.
Thanks for taking the time to reply, Ray- I like using up simple ideas
first too. I'm just worried that this project is gonna get big quickly and I
need to have the next several levels of "studliness" rolling around in the
back of my head now, I think ;)

As you rightly guessed- after many trials using a number of techniques,
static pages were the approach that moved all of the processing overhead
"someplace else", leaving pure HTML "chunks" in its place. I'm not
quite to the point, though, of being able to serve it all up statically. I
have some other processing that needs to be done, but Apache::ASP handles
that admirably.

(BTW- just in case anybody here wasn't already sure...
Apache::ASP  _ROCKS_ !! :)

What are your experiences with squid? Have you used it before? What sort
of improvement have you typically seen by using it?

As far as cache timeouts go- I have a "clever" way of handling that. Each
category's listings are stored in its own subdir. I simply regen the
listings in a "new" dir and, when they are done, slide the current dir to an
"old" dir and move the "new" dir into place as the "cur" dir. That way any
file handles that are open will stay open and will certainly be concluded by
the next round of processing. The only load is inodes and I think I have
enough of those.
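
In rough terms the swap is just directory renames (paths here are
illustrative):

use strict;
use warnings;
use File::Path qw(rmtree);

my $base = "/var/www/cache/antiques";   # the "cur" dir for one category

# The rebuild process has just finished writing "$base.new".
# Reap last round's "old" dir, then rotate: cur becomes old, new becomes cur.
# rename() is atomic on the same filesystem, so readers always see a
# complete directory, and handles already open under the old "cur" keep
# working until they are closed.
rmtree("$base.old") if -d "$base.old";
rename "$base",     "$base.old" or die "rotate cur->old failed: $!";
rename "$base.new", "$base"     or die "rotate new->cur failed: $!";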

My real concern with this method is the constant pounding the drives take
in the process. It's all asynch system calls (read: can block) and I have
really wanted to know, all things being equal- which is faster, assembling
data from a file on disk or retrieving data via a networked database
(assuming a fast enough backbone). I've never seen a well-crafted study of
this scenario and no opinions I trust on the subject. One of these days
I reckon I'll have to model it and find out- but it would be nice to find
someone who already has an opinion.


So the next question all this leads to is what is the best "search"
scenario for a similar setup? Currently I'm just searching the database
raw. I already know this will grind to a halt quickly. My first plan is to
implement a quick and dirty inverted index. That should take me a ways
down the road- how far I'm not sure since this will be my first foray into
such things. I'm wondering how/where to improve efficiencies here and I
find myself scratching my head quite a bit. Obviously there has to be some
limit on returned tuples. However, all the tuples are selected even if
they are culled out by a 'limit' later. So this part is hard to avoid.
Building caches on the fly (say, using the user's session id as a cache
identifier) is my next thought, but this can get out of hand quickly too-
overall, though, it seems to be the best way to go as it only has to be done
once and can be reaped later by the OnSessionEnd() event.
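
Roughly what I have in mind for a first cut-- a DB_File hash mapping each
word to a space-separated list of listing ids (everything here is
illustrative only):

use strict;
use warnings;
use Fcntl;
use DB_File;
use DBI;

my %index;
tie %index, 'DB_File', '/var/cache/search/index.db', O_RDWR|O_CREAT, 0644,
    $DB_HASH or die "cannot tie index: $!";

my $dbh = DBI->connect('dbi:Pg:dbname=listings', 'user', 'pass',
                       { RaiseError => 1 });

# Build: split each title into lowercased words and append the listing id.
my $sth = $dbh->prepare(q{SELECT id, title FROM listings});
$sth->execute;
while (my ($id, $title) = $sth->fetchrow_array) {
    my %words = map { lc($_) => 1 } $title =~ /(\w+)/g;
    for my $word (keys %words) {
        my $ids = $index{$word} || '';
        next if " $ids " =~ / \Q$id\E /;
        $index{$word} = $ids ? "$ids $id" : $id;
    }
}

# Query: intersect the id lists for each search term.
sub search_ids {
    my @terms = map { lc } @_;
    my %hits;
    for my $term (@terms) {
        $hits{$_}++ for split ' ', ($index{$term} || '');
    }
    return grep { $hits{$_} == @terms } keys %hits;
}

print join(', ', search_ids('antique', 'oak')), "\n";
untie %index;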

Any ideas or suggestions on searching? Is this a variation of the same
problem or worse because it's asynchronous?

John

Ray Cote wrote:

> >At the moment, rebuilding the static caches is not a big issue- it
> >doesn't take long and its possible to get it all rebuilt in less than
> >half an hour. I'm concerned though what the next step will be when the
> >number of listings grows to the point that half and hour (or an hour)
>
> Well, nothing is going to beat static caches -- particularly if you
> put a caching server (such as Squid) in front of it.
>
> One thing you may want to consider is setting your cache timeouts
> fairly long (longer than it takes to generate a single category).
> Then, update a category and clear the cache. While you're generating,
> you will not be serving any of the new pages since the cache won't be
> looking yet.
>
> Also, simply moving the cache to a separate box may be sufficient to
> get the next bump of speed. Again, the cache is what is actually
> 'serving' and not your web server with the static pages. Would be
> interested to see how much of the processor you get to use during
> page generation vs what your Web server is taking.
>
> Along similar lines, you should be able to free up significant disk
> time (and head seeking) by offloading the cache.
>
> Just some initial thoughts on your request. Obviously, there's some
> more complicated approaches you could take, but I like to run out of
> simple ideas first. :}
>
> Ray
> --
> -----------------------------------------------------------------
> Raymond Cote, President                 Appropriate Solutions, Inc.
> www.AppropriateSolutions.com       rgacote@AppropriateSolutions.com
> 603.924.6079(v)  POB 458, Peterborough, NH 03458    603.924.8668(f)

