Posted to modperl@perl.apache.org by Bill Moseley <mo...@hank.org> on 2001/01/08 18:25:33 UTC

Caching search results

I've got a mod_perl application that's using swish-e.  A query from swish
may return hundreds of results, but I only display them 20 at a time.  

There's currently no session control on this application, and so when the
client asks for the next page (or to jump to page number 12, for example),
I have to run the original query again, and then extract just the
results for the page the client wants to see.

Seems like some basic design problems there.

Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
in the short term, I was thinking about caching search results (which is
just a sorted list of file names) using a simple file-system db -- that is,
(carefully) building file names out of the queries and writing them to some
directory tree.  Then I'd use cron to purge LRU files every so often.  I
think this approach will work fine, instead of a dbm or rdbms approach.


So I'm asking for some advice:

- Is there a better way to do this?

- There was some discussion in the past about performance and how many
files to put in each directory.  Are there commonly accepted numbers for
this?

- For file names, does it make sense to use an MD5 hash of the query string?
It would be nice to get an even distribution of files across directories.
(See the sketch below.)

- Can someone offer any help with the locking issues?  I was hoping to
avoid shared locking during reading -- but maybe I'm worrying too much
about the time it takes to acquire a shared lock when reading.  I could
wait a second for the shared lock and if I don't get it I'll run the query
again.

But it seems that if one process creates the file and begins writing
without LOCK_EX and then gets blocked, other processes might not see
the entire file when reading.

Would it be better to avoid the locks and instead use a temp file when
creating and then do an (atomic?) rename?
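
Something like this untested sketch is what I have in mind (the cache
root, the two-level directory split, and the helper names are all just
illustrative):

use Digest::MD5 qw(md5_hex);
use File::Path;
use File::Temp qw(tempfile);

my $cache_root = '/var/cache/search';   # arbitrary location

sub cache_path {
    my ($query) = @_;
    my $digest = md5_hex($query);
    # a two-level split keeps any single directory small and, since
    # MD5 output is uniform, evenly distributed
    return join '/', $cache_root, substr($digest, 0, 2),
                     substr($digest, 2, 2), $digest;
}

sub cache_write {
    my ($query, $results) = @_;   # $results: sorted file names, one per line
    my $path = cache_path($query);
    (my $dir = $path) =~ s{/[^/]+$}{};
    mkpath($dir) unless -d $dir;
    # write to a temp file in the same directory, then rename().
    # rename(2) is atomic within a filesystem, so readers see either the
    # old file, no file, or the complete new file -- no read locks needed
    my ($fh, $tmp) = tempfile(DIR => $dir);
    print $fh $results;
    close $fh or die "close: $!";
    rename $tmp, $path or die "rename: $!";
}

sub cache_read {
    my ($query) = @_;
    open my $fh, '<', cache_path($query) or return undef;  # miss: re-run query
    local $/;
    return <$fh>;
}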

Thanks very much,

Bill Moseley
mailto:moseley@hank.org

Re: Caching search results

Posted by DeWitt Clinton <dc...@avacet.com>.
On Mon, Jan 08, 2001 at 10:10:25AM -0800, Perrin Harkins wrote:

> Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
> starters. A dbm would be fine too, but more trouble to purge old
> entries from.

If you find that File::Cache works for you, then you may also want to
check out the simplified and improved version in the Avacet code,
which additionally offers a unified service model for mod_perl
applications.  Services are available for templates (either Embperl or
Template Toolkit), XML-based configuration, object caching, connecting
to the Avacet application engine, standardized error handling,
dynamically dispatching requests to modules, and many other things.

-DeWitt


Re: Caching search results

Posted by Sander van Zoest <sa...@covalent.net>.
On Mon, 8 Jan 2001, G.W. Haywood wrote:

> At the risk of getting shot down in flames again,
> do you think you could take this off-list guys?

I guess this could be moved to the scalable list (scalable-subscribe@artic.org),
or taken private, since this isn't really on the topic of modperl anymore.

Cheers,

--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/



Re: Caching search results

Posted by "G.W. Haywood" <ge...@www.jubileegroup.co.uk>.
Hi Guys,

On Mon, 8 Jan 2001, Sander van Zoest wrote:

> On Mon, 8 Jan 2001, Perrin Harkins wrote:
> 
> > On Mon, 8 Jan 2001, Sander van Zoest wrote:

At the risk of getting shot down in flames again,
do you think you could take this off-list guys?
I can't seem to delete the messages as fast as
they're coming in... :)

73,
Ged.


Re: Caching search results

Posted by Sander van Zoest <sa...@covalent.net>.
On Mon, 8 Jan 2001, Perrin Harkins wrote:

> On Mon, 8 Jan 2001, Sander van Zoest wrote:
> > > starters. A dbm would be fine too, but more trouble to purge old entries
> > > from.
> > You could always have a second dbm file to keep track of TTLs for your
> > data keys, so purging would simply be a series of delete calls.
> > Granted, you would have another DBM file to maintain.
> I find it kind of painful to trim dbm files, because most implementations
> don't relinquish disk space when you delete entries.  You end up having to
> actually make a new dbm file with the "good" contents copied over to it in
> order to slim it down.

Yeah, this is true. Some DBMs have special routines to fix these issues.
For example, you could use the gdbm_reorganize call to clean them up
(if you are using gdbm, that is).

Just some quick pseudo-code (I don't have a tested example ready here):

use GDBM_File;

# tie, then compact the file in place
my $gdbm = tie my %hash, 'GDBM_File', 'file.gdbm', &GDBM_WRCREAT|&GDBM_FAST, 0640
	   or die "$!";

$gdbm->reorganize;

That definitely helps a lot.

--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/


Re: Caching search results

Posted by Perrin Harkins <pe...@primenet.com>.
On Mon, 8 Jan 2001, Sander van Zoest wrote:
> > starters. A dbm would be fine too, but more trouble to purge old entries
> > from.
> 
> You could always have a second dbm file to keep track of TTLs for your
> data keys, so purging would simply be a series of delete calls.
> Granted, you would have another DBM file to maintain.

I find it kind of painful to trim dbm files, because most implementations
don't relinquish disk space when you delete entries.  You end up having to
actually make a new dbm file with the "good" contents copied over to it in
order to slim it down.
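
Roughly what that copy looks like -- a sketch using DB_File, where
is_stale() is a placeholder for whatever expiry rule you use:

use Fcntl;
use DB_File;

sub is_stale { 0 }   # placeholder: plug in your real expiry test

tie my %old, 'DB_File', 'cache.db',     O_RDONLY,       0640 or die "$!";
tie my %new, 'DB_File', 'cache.db.new', O_RDWR|O_CREAT, 0640 or die "$!";

while (my ($key, $value) = each %old) {
    $new{$key} = $value unless is_stale($key, $value);  # keep the "good" entries
}

untie %old;
untie %new;
rename 'cache.db.new', 'cache.db' or die "$!";  # swap in the slimmed-down file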

- Perrin

Re: Caching search results

Posted by Sander van Zoest <sa...@covalent.net>.
On Mon, 8 Jan 2001, Perrin Harkins wrote:

> Bill Moseley wrote:
> > Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
> > in the short term, I was thinking about caching search results (which is
> > just a sorted list of file names) using a simple file-system db -- that is,
> > (carefully) building file names out of the queries and writing them to some
> > directory tree.  Then I'd use cron to purge LRU files every so often.  I
> > think this approach will work fine, instead of a dbm or rdbms approach.
> Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
> starters. A dbm would be fine too, but more trouble to purge old entries
> from.

You could always have a second dbm file to keep track of TTLs for your
data keys, so purging would simply be a series of delete calls.
Granted, you would have another DBM file to maintain.
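
Sketched out (DB_File and all the names are just for illustration;
expire_cache() would run from cron or similar):

use Fcntl;
use DB_File;

tie my %cache, 'DB_File', 'cache.db', O_RDWR|O_CREAT, 0640 or die "$!";
tie my %ttl,   'DB_File', 'ttl.db',   O_RDWR|O_CREAT, 0640 or die "$!";

sub cache_set {
    my ($key, $value, $ttl_secs) = @_;
    $cache{$key} = $value;
    $ttl{$key}   = time() + $ttl_secs;  # expiry timestamp in the second dbm
}

sub expire_cache {
    my $now = time();
    my @expired = grep { $ttl{$_} <= $now } keys %ttl;
    for my $key (@expired) {            # simply a series of delete calls
        delete $cache{$key};
        delete $ttl{$key};
    }
}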

--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/


Re: Caching search results

Posted by Sander van Zoest <sa...@covalent.net>.
On Mon, 8 Jan 2001, Simon Rosenthal wrote:

>   I couldn't see writing a daemon, as you suggested, offering us any
> benefits under those circumstances, given that RDBMS access is built into
> Apache::Session.

No, in your case I do not see a reason behind it either. ;-)
Again, this shows that it all depends on the requirements and the things you
are willing to sacrifice.

Cheers,

--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/


Re: Caching search results

Posted by Simon Rosenthal <sr...@northernlight.com>.
At 02:02 PM 1/8/01 -0800, Sander van Zoest wrote:
>On Mon, 8 Jan 2001, Simon Rosenthal wrote:
>
> > An RDBMS is not much more trouble to purge, if you have a
> > time-of-last-update field. And if you're ever going to access your cache
> > from multiple servers, you definitely don't want to deal with locking
> > issues for DBM and filesystem based solutions ;=(
>
>RDBMS does bring replication and backup issues. The DBM and FS solutions
>definitely have their advantages. It would not be too difficult to write
>a serialized daemon that makes requests over the net to a DBM file.
>
>What in your experience makes you pick the overhead of an RDBMS for a simple
>cache over DBM or FS solutions?

We cache user session state (basically using Apache::Session) in a small
(maybe 500K records) mysql database, which is accessed by multiple web
servers. We made an explicit decision NOT to replicate or back up this
database - it's very dynamic, and the only user-visible consequence of a
loss of the database would be an unexpected login screen - we felt this was
a tradeoff we could live with.  We have a hot spare mysql instance which
can be brought into service immediately, if required.

  I couldn't see writing a daemon, as you suggested, offering us any
benefits under those circumstances, given that RDBMS access is built into
Apache::Session.
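
For reference, the tie is about this simple (connection details made up;
check the Apache::Session::MySQL docs for the exact arguments):

use Apache::Session::MySQL;

my $id = undef;   # or the session id from the browser's cookie;
                  # undef creates a new session

tie my %session, 'Apache::Session::MySQL', $id, {
    DataSource     => 'dbi:mysql:sessions',   # illustrative DSN
    UserName       => 'web',
    Password       => 'secret',
    LockDataSource => 'dbi:mysql:sessions',
    LockUserName   => 'web',
    LockPassword   => 'secret',
};

$session{last_seen} = time();   # written back to mysql through the tie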

I would not be as cavalier as this if we were doing anything more than
using the RDBMS as a fast cache. With decent hardware (which we have - Sun
Enterprise servers with nice fast disks and enough memory) the typical
record retrieval time is around 10ms, which, even if slow compared to a
local FS access, is plenty fast enough in the context of the processing we
do for dynamic pages.

Hope this answers your question.

-Simon



>
>--
>Sander van Zoest                                         [sander@covalent.net]
>Covalent Technologies, Inc.                           http://www.covalent.net/
>(415) 536-5218                                 http://www.vanzoest.com/sander/

-----------------------------------------------------
Simon Rosenthal	(srosenthal@northernlight.com)    	
Web Systems Architect
Northern Light Technology
One Athenaeum Street, Suite 1700, Cambridge, MA  02142
Phone:  (617)621-5296  :       URL:  http://www.northernlight.com
"Northern Light - Just what you've been searching for"


Re: Caching search results

Posted by Sander van Zoest <sa...@covalent.net>.
On Mon, 8 Jan 2001, Simon Rosenthal wrote:

> An RDBMS is not much more trouble to purge, if you have a
> time-of-last-update field. And if you're ever going to access your cache
> from multiple servers, you definitely don't want to deal with locking
> issues for DBM and filesystem based solutions ;=(

RDBMS does bring replication and backup issues. The DBM and FS solutions
definitely have their advantages. It would not be too difficult to write
a serialized daemon that makes requests over the net to a DBM file.
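
A sketch of what I mean -- a single process, so access to the dbm is
serialized by construction; the one-line GET/SET protocol and the port
are invented for illustration:

use IO::Socket::INET;
use Fcntl;
use DB_File;

tie my %cache, 'DB_File', 'cache.db', O_RDWR|O_CREAT, 0640 or die "$!";

my $server = IO::Socket::INET->new(
    LocalPort => 4000,   # arbitrary port
    Listen    => 5,
    Reuse     => 1,
) or die "$!";

# one process handling one client at a time: every request is serialized
while (my $client = $server->accept) {
    while (my $line = <$client>) {
        chomp $line;
        if ($line =~ /^GET (\S+)$/) {
            my $value = defined $cache{$1} ? $cache{$1} : '';
            print $client "$value\n";
        }
        elsif ($line =~ /^SET (\S+) (.*)$/) {
            $cache{$1} = $2;
            print $client "OK\n";
        }
    }
    close $client;
}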

What in your experience makes you pick the overhead of an RDBMS for a simple
cache over DBM or FS solutions?
  
--
Sander van Zoest                                         [sander@covalent.net]
Covalent Technologies, Inc.                           http://www.covalent.net/
(415) 536-5218                                 http://www.vanzoest.com/sander/


Re: Caching search results

Posted by Simon Rosenthal <sr...@northernlight.com>.
At 10:10 AM 1/8/01 -0800, you wrote:
>Bill Moseley wrote:
> > Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
> > in the short term, I was thinking about caching search results (which is
> > just a sorted list of file names) using a simple file-system db -- that is,
> > (carefully) building file names out of the queries and writing them to some
> > directory tree.  Then I'd use cron to purge LRU files every so often.  I
> > think this approach will work fine, instead of a dbm or rdbms approach.
>
>Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
>starters. A dbm would be fine too, but more trouble to purge old entries
>from.

An RDBMS is not much more trouble to purge, if you have a
time-of-last-update field. And if you're ever going to access your cache
from multiple servers, you definitely don't want to deal with locking
issues for DBM and filesystem based solutions ;=(
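
With that field the purge reduces to one statement; e.g. with DBI (the
table and column names here are made up):

use DBI;

my $dbh = DBI->connect('dbi:mysql:cache', 'web', 'secret',
                       { RaiseError => 1 });

# anything not touched in the last hour goes away
$dbh->do(q{
    DELETE FROM search_cache
    WHERE  last_update < NOW() - INTERVAL 1 HOUR
});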

-Simon

-----------------------------------------------------
Simon Rosenthal	(srosenthal@northernlight.com)    	
Web Systems Architect
Northern Light Technology
One Athenaeum Street, Suite 1700, Cambridge, MA  02142
Phone:  (617)621-5296  :       URL:  http://www.northernlight.com
"Northern Light - Just what you've been searching for"


Re: Caching search results

Posted by Perrin Harkins <pe...@primenet.com>.
Bill Moseley wrote:
> Anyway, I'd like to avoid the repeated queries in mod_perl, of course.  So,
> in the short term, I was thinking about caching search results (which is
> just a sorted list of file names) using a simple file-system db -- that is,
> (carefully) building file names out of the queries and writing them to some
> directory tree.  Then I'd use cron to purge LRU files every so often.  I
> think this approach will work fine, instead of a dbm or rdbms approach.

Always start with CPAN.  Try Tie::FileLRUCache or File::Cache for
starters. A dbm would be fine too, but more trouble to purge old entries
from.
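
e.g., from memory (check the File::Cache docs for the exact interface):

use File::Cache;

my $cache = File::Cache->new({
    namespace  => 'search_results',   # illustrative
    expires_in => 3600,               # seconds
});

my $query          = 'some normalized query string';
my $sorted_results = "doc1.html\ndoc2.html\n";

$cache->set($query, $sorted_results);
my $results = $cache->get($query);   # undef on a miss or after expiry
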
- Perrin