Posted to modperl@perl.apache.org by Will Fould <wi...@gmail.com> on 2006/03/08 06:05:26 UTC

changing global data strategy

an old issue:
   "a dream solution would be if all child processes could *update* a large
global structure."

we have a tool that loads a huge store of data (25-50MB+) from a
database into many perl hashes at start-up: each session needs access to all
of this data, but it would be prohibitive to use mysql or another database for
multiple, large lookups (and builds) at each session: there are quite a
few structures, and each is very big.

if the data never changed, it would be trivial; load/build just at
start-up.

but since the data changes often, we use a semaphore strategy to determine
when children should reload/rebuild the structures (after updates have been
made).
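
roughly, each child does something like this on every request (just an
illustrative sketch - the version table and the builder are made-up names,
not our real code):

    use strict;
    use warnings;

    my %lookup;              # one of the big per-child hashes
    my $loaded_version = 0;  # version we last built from

    # hypothetical check: compare a version counter in the database and
    # rebuild the whole structure whenever it has moved
    sub maybe_reload {
        my ($dbh) = @_;
        my ($current) = $dbh->selectrow_array(
            'SELECT version FROM data_version'
        );
        if ($current != $loaded_version) {
            %lookup = build_lookup_from_db($dbh);   # the expensive part
            $loaded_version = $current;
        }
    }

    sub build_lookup_from_db {
        my ($dbh) = @_;
        die "stand-in for the real rebuild (many large queries)";
    }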

this is painful. there has got to be a better way of doing this - I've seen
posts on memcache and other, more exotic animals.

can someone point me in the right direction: a reference/read, or stable
modules that exist for our situation?


thanks in advance,
william

Re: changing global data strategy

Posted by Perrin Harkins <pe...@elem.com>.
On Tue, 2006-03-07 at 21:05 -0800, Will Fould wrote:

> we have a tool that loads a huge store of data (25-50MB+) from a
> database into many perl hashes at start-up: each session needs
> access to all of this data, but it would be prohibitive to use mysql or
> another database for multiple, large lookups (and builds) at each
> session: there are quite a few structures, and each is very big.

This gets asked about a lot, so I'm going to just dump all my ideas
about it here and then I'll have something to refer back to later.

I can think of four possible approaches for this kind of situation (in
order of difficulty):

1) Find a way to organize it into a database.  I know, everyone thinks
their data is special and it can't be done, but a database can be used
for most data storage if you think creatively about it.  Databases can
be very flexible if you are willing to denormalize things a bit.
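
For example, the whole thing might reduce to a denormalized key/value
table that you hit with one indexed query per lookup.  Something like
this (the table and column names are just for illustration):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('DBI:mysql:database=app;host=dbhost',
                           'user', 'password', { RaiseError => 1 });

    # one indexed lookup per key instead of holding a 50MB hash per child
    my $sth = $dbh->prepare(
        'SELECT value FROM lookup_data WHERE structure = ? AND item_key = ?'
    );

    sub db_lookup {
        my ($structure, $key) = @_;
        $sth->execute($structure, $key);
        my ($value) = $sth->fetchrow_array;
        $sth->finish;
        return $value;
    }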

2) Use an external cache server like memcached.  This will require you
to figure out how to split your data access into hash-like patterns.  It
will not be anywhere near as fast as in-memory lookups that you use now
though.  That's the price you pay for scaling across machines.  You also
need to be aware that memcached is a cache, not a database, so it can't
be the final destination for data changes.  It also can drop data when
it gets full or when a server goes down.
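
A rough sketch of that hash-like access pattern with Cache::Memcached
(the server names, key layout, and load_from_db() fallback are all made
up here):

    use strict;
    use warnings;
    use Cache::Memcached;

    my $memd = Cache::Memcached->new({
        servers => [ 'cache1:11211', 'cache2:11211' ],
    });

    sub get_record {
        my ($id) = @_;
        my $record = $memd->get("record:$id");
        return $record if defined $record;

        # cache miss (or a node dropped the data): rebuild from the
        # real database, then repopulate the cache
        $record = load_from_db($id);
        $memd->set("record:$id", $record, 3600);   # expire after an hour
        return $record;
    }

    sub load_from_db {
        my ($id) = @_;
        die "stand-in for the real database query";
    }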

3) Use local caches with networked updates.  You can use something like
BerkeleyDB which performs really well on a local machine (significantly
better than any networked daemon) to store the data.  If you have enough
RAM to use a big cache with it, the data will all be in memory anyway.
You would still need to organize your data access into hashes.  The
other part of this is handling updates, which you can do by running a
simple daemon on each machine that listens for updates, and having
writers send changes to a master daemon that pushes them out to all the
others.  Alternatively, you could
use something like Spread, which does reliable multicast messaging, to
distribute the updates.
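
The local-cache half might look roughly like this with the BerkeleyDB
module (the paths, cache size, and keys are invented, and the update
daemon isn't shown):

    use strict;
    use warnings;
    use BerkeleyDB;

    # shared environment so all Apache children on the box use one
    # memory pool; with a big enough cache the data stays in RAM
    my $env = BerkeleyDB::Env->new(
        -Home      => '/var/cache/mydata',
        -Cachesize => 64 * 1024 * 1024,
        -Flags     => DB_CREATE | DB_INIT_MPOOL | DB_INIT_CDB,
    ) or die "can't open env: $BerkeleyDB::Error";

    my $db = BerkeleyDB::Hash->new(
        -Filename => 'lookups.db',
        -Env      => $env,
        -Flags    => DB_CREATE,
    ) or die "can't open db: $BerkeleyDB::Error";

    # the update daemon writes; the Apache children just read
    $db->db_put('some_key', 'some_value');

    my $value;
    $db->db_get('some_key', $value) == 0
        or warn "key not found";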

4) Write your own custom query daemon.  Most likely this would be a
multi-threaded server written in C/C++ which loads all the data and has
a protocol for querying and changing it.  This will be lots of work and
you'll have to do a very good job of it to make it do better than
existing fast database servers like MySQL.  You do get to create the
exact data structures and indexes that you need though.  I have seen a
search engine written this way, with great success, but it took a lot of
work by an expert C++ programmer to do it.  You might be able to
simplify a little by writing it as a custom data type for PostgreSQL,
but I have no idea how hard that is or how it performs.

One thing that you can't do to solve it is rewrite in a language with
better thread support like Java, since you would still be stuck when you
try to run it on multiple servers.

Hope that helps outline your options a little.

- Perrin


Re: changing global data strategy

Posted by Jonathan Vanasco <mo...@2xlp.com>.
how big are these data structures?

200KB?  2MB?  20MB?

if they're not too big, you could just use memcached.

	http://danga.com:80/memcached/
	http://search.cpan.org/~bradfitz/Cache-Memcached-1.15/Memcached.pm

it's ridiculously painless to implement. i found it easier than a lot
of other approaches.

but if you have 50MB of data, i'd rethink what you're doing.

you're just going to keep getting screwed when your cached db data updates
(because the updates will only be per-child, not per parent
process).  so you've got the potential for a 50MB parent process
whose children each read in another 50MB of data?  (say 20 children at
50MB apiece: that's a gigabyte of duplicated data.)  that's a cascading
nightmare.

if you need to precache such giant data structures, i'd do something
like a 2-tiered server (rough sketch of the apache-to-daemon hop below):

	apache a - talks to web users / load balancer; sends data /
	           whatever needs specific processing to
	daemon b - either apache or some custom server, which handles
	           precaching of the db and parsing requests from apache
	db       - datastore
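
the hop from apache a to daemon b can be as dumb as a line-based protocol
over a socket. a rough sketch (host, port and protocol are invented, just
to show the shape):

    use strict;
    use warnings;
    use IO::Socket::INET;

    # apache a side: ask daemon b for one precached value per request
    sub lookup_remote {
        my ($key) = @_;
        my $sock = IO::Socket::INET->new(
            PeerAddr => 'daemon-b.internal',
            PeerPort => 9999,
            Proto    => 'tcp',
        ) or die "can't reach daemon b: $!";
        print $sock "GET $key\n";
        chomp(my $value = <$sock>);
        close $sock;
        return $value;
    }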

having all of that data in modperl would be a nightmare though.  even
with memcached, you'll update fast and everyone can access it, but
you're going to keep eating memory.  if every session is going to
toss through 20MB hashes of info, i'd keep that info out of apache
entirely.




On Mar 8, 2006, at 12:16 AM, Will Fould wrote:

> at this point, the application is on a single machine, but I'm
> being tasked with moving our database onto another machine and
> implementing load balancing between 2 webservers.
>
> william

Re: changing global data strategy

Posted by Will Fould <wi...@gmail.com>.
at this point, the application is on a single machine, but I'm being tasked
with moving our database onto another machine and implementing load balancing
between 2 webservers.

william

