Posted to modperl@perl.apache.org by André Warnier <aw...@ice-sa.com> on 2007/09/26 13:33:35 UTC

Sharing data between many requests

For quite a while now, I have been mulling over the following issue, 
without ever seeming to arrive at the final, best and greatest 
solution: how best to share a data structure across many Apache2 
requests, in a multi-platform context.
My doubts concern what exactly can be shared, and how, between Apache 
processes, threads and/or requests.

Suppose the following scenario:

On an Apache2/mp2/perl 5.8.x server host, there are 100,000 files, split 
over 1000 disk directories, at 100 files per directory.  Each file in a 
directory is called file1, file2, ..., file100.
For a number of reasons, the real locations of the files are encoded as 
follows:
- a simple text file on disk contains a list of the 1000 directories, 
each one identified by some "obscure" key, like this:
   key000=directorypath0
   key001=directorypath1
   key002=directorypath2
   ...
   key999=directorypath999

and the URL provided to web users to access each object is something like:
http://myserver.com/getobj/key500-file7

- the mod_perl module which handles accesses to /getobj/ decodes 
"key500" into "directorypath500" and retrieves the corresponding object 
to send back to the browser (a minimal sketch of this step follows the 
list below).

- the text file on disk (containing the list of directories) 
occasionally changes: whenever another independent process adds a new 
file and finds that the current directory already holds 100 objects, it 
adds a new key and directory path, and files the new object there.  It 
also updates the text file to reflect the newly added directory.
The frequency of these changes is however much, much lower than the 
frequency of "read" accesses by the browsers.

- this whole Apache2 webserver setup can be running on any platform, so 
that the particular MPM used cannot really be predicted in advance (it 
could be threaded, preforking, whatever).

- for portability and ease-of-installation reasons, I would like to 
avoid using an external DBMS.
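
For illustration only, the decoding step might look something like this 
(the %dir_for hash holding the key=>directory table and the URI parsing 
are assumptions for the sketch, not actual code from the module):

   # hypothetical decode of a URI like /getobj/key500-file7
   my ($key, $file) = ($r->uri =~ m{/getobj/(key\d+)-(.+)\z});
   my $path = "$dir_for{$key}/$file";   # e.g. directorypath500/file7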

What I would like to achieve is that successive requests to /getobj/ do 
not require the mp2 handler module to re-read and re-parse the directory 
text file at each request.  To decode the obscure ids into real paths, 
I would thus like to be able to use something like a simple hashtable in 
memory.
But any "instance" of the module should still be able to notice when the 
basic authoritative directory file has changed, and reload it when 
needed before serving a new object.  Of course this should not happen 
more often than necessary.

My basic idea would be to create a Perl package encapsulating the 
decoding of the obscure paths and the updates to the in-memory 
hashtable, and use this package in the /getobj/ handler.

A PerlChildInitHandler would initially read in the text file and build 
the internal table, prior to any read handler access.  (Alternatively, 
this could be done the first time a response handler needs to access the 
structure.)
The module would contain a method "decode_path(obscure_id)", made 
available to the response handler, which would take care of checking 
that the table is still up-to-date, and if not, re-read and re-parse the 
file into the internal hashtable.
I imagine that each child process could (and probably would) have its 
own copy of the table, but that each request handler, while processing 
one request, could have access to that same child-level hashtable.
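
To make this concrete, a minimal sketch of such a package might look as 
follows (the file location, package name and handler wiring are 
illustrative assumptions, not a definitive implementation):

   package My::DirMap;

   use strict;
   use warnings;
   use Apache2::Const -compile => qw(OK);

   # Per-process state: after fork, each child carries its own copy.
   my %map;                # obscure key => real directory path
   my $loaded_mtime = 0;   # mtime of the map file when last parsed

   my $MAP_FILE = '/path/to/directories.txt';   # illustrative path

   # PerlChildInitHandler: load the table once, before any request.
   sub child_init {
       _load();
       return Apache2::Const::OK;
   }

   # Called by the response handler; reloads only if the file changed.
   sub decode_path {
       my ($class, $obscure_id) = @_;
       my $mtime = (stat $MAP_FILE)[9];
       _load() if $mtime && $mtime > $loaded_mtime;
       return $map{$obscure_id};
   }

   sub _load {
       open my $fh, '<', $MAP_FILE or die "cannot open $MAP_FILE: $!";
       %map = ();
       while (my $line = <$fh>) {
           chomp $line;
           my ($key, $dir) = split /=/, $line, 2;
           $map{$key} = $dir if defined $dir;
       }
       close $fh;
       $loaded_mtime = (stat $MAP_FILE)[9];
   }

   1;

In httpd.conf this would then be wired up with something like 
"PerlChildInitHandler My::DirMap::child_init".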

My doubts focus (mainly) on the following issues:
- whether or not I *can* declare and initialise some object, e.g. in the 
PerlChildInitHandler, and later access that same object in the request 
handlers.
- also, if I later call, from the request handler, a method of this 
object that updates the object's content, whether this updated object 
would still be "shared" by all subsequent instances of request handlers.
- supposing that this architecture is running within a threaded 
environment, are there special guidelines to follow regarding the 
possibility that 2 threads in the same child would access the object at 
the same time and try to update the internal table?
- and if I follow such guidelines, does the same code also work if it 
happens to run in a non-threaded environment?
- if there is a mandatory difference between threaded/non-threaded mp2 
perl code, can I check at run-time under which environment I'm running, 
and choose which code is executed accordingly?

Thanks for your patience reading this, and thanks in advance for any 
comments, answers or suggestions.


Re: Sharing data between many requests

Posted by Perrin Harkins <pe...@elem.com>.
On 9/26/07, André Warnier <aw...@ice-sa.com> wrote:
> - for portability and ease-of-installation reasons, I would like to
> avoid using an external DBMS.

I think you're making a mistake there.  An RDBMS is the easiest way to
achieve what you want.

There is no simple way to share a Perl data structure between
processes.  Solutions like IPC::Shareable scale badly for large data
structures.  If you really won't use an RDBMS, I suggest you use
MLDBM::Sync.  If performance is very important, you can use BerkeleyDB
directly, but it will be more complicated.
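
A minimal sketch of the MLDBM::Sync variant, in case it helps (the dbm 
file name is made up, and Storable is only needed if the values ever 
become references rather than plain path strings):

   use MLDBM::Sync;                  # wraps each access in a lock
   use MLDBM qw(DB_File Storable);   # DB_File backend, Storable serializer
   use Fcntl qw(:DEFAULT);

   my %map;
   tie %map, 'MLDBM::Sync', '/var/myapp/dirmap.dbm', O_CREAT|O_RDWR, 0640;

   $map{key500} = 'directorypath500';   # writer: the process adding dirs
   my $dir = $map{key500};              # reader: the mod_perl handler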

- Perrin

Re: Sharing data between many requests

Posted by Jim Brandt <cb...@buffalo.edu>.
André Warnier wrote:

> - if there is a mandatory difference between threaded/non-threaded mp2 
> perl code, can I check at run-time under which environment I'm running, 
> and choose which code is executed accordingly?

To answer one question, you can use the Apache2::MPM module to find out 
what MPM you're running and execute different code based on the result.

use Apache2::MPM ();

my $mpm = lc Apache2::MPM->show;
if ($mpm eq 'prefork') {
    # prefork-specific code
}
elsif ($mpm eq 'worker' || $mpm eq 'winnt') {
    # threaded-MPM code
}
etc...


-- 
Jim Brandt
Administrative Computing Services
University at Buffalo


Re: Sharing data between many requests

Posted by André Warnier <aw...@ice-sa.com>.
Michael Peters wrote:
> André Warnier wrote:
> 
>> My doubts focus (mainly) on the following issues:
>> - whether or not I *can* declare and initialise some object, e.g. in the
>> PerlChildInitHandler, and later access that same object in the request
>> handlers.
> 
> Yes.
> 
>> - also, if I later call, from the request handler, a method of this
>> object that updates the object's content, whether this updated object
>> would still be "shared" by all subsequent instances of request handlers.
> 
> That depends on whether or not you have a threaded MPM. Even OSes with COW won't
> share updated structures.

Ok, I'm already a bit out of my depth here.
Tell me if the following is correct:

1) Apache forking model (non-threaded):

a) Apache starts and forks a number of children.
b) Each child executes its PerlChildInitHandler.
c) within each of these children, I thus now have an object containing 
the initial version of the hashtable I loaded.
Let's say that this "hashtable" object is accessible via the global 
identifier $OBJ. Each child has its own $OBJ.
d) each request is processed by one child. Several requests can be in 
flight simultaneously across children, but only one request (and one 
mp2 content handler) is being executed at any one time within each 
child process. This handler "instance" has access to the $OBJ object 
of its own child process.
e) if one handler instance somehow modifies the content of the $OBJ 
object, then it is (only) this Apache child's version of $OBJ that is 
modified.
f) if a new request happens to be processed again by this same child, 
the content handler instance that runs then, will see the $OBJ as 
(possibly) modified by the previous request processed in that child.
g) there is no real need to "lock" the $OBJ while it's being modified, 
but it doesn't hurt either.


2) Apache running under a threaded MPM:
(basically all of this below is post-fixed with question marks, as I am 
really very naive about threads. Be kind...)

a) Apache pre-forks at least one child (I think).
b) Each child executes its PerlChildInitHandler.
c) within each of these children, I thus now have an object containing 
the initial version of the hashtable I loaded.
Let's say that this "hashtable" object is accessible via the global 
identifier $OBJ.
d) each child (or the one child) launches a number of threads??? 
(or are threads launched as requests come in?)
e) each of these threads sees the *same* $OBJ (meaning at the same 
memory location, not just a copy)???
f) when a request comes in, it is handed off to a thread, which runs 
the mp2 handler.
g) if this handler modifies $OBJ by calling one of its methods, what 
happens?  Does this particular thread now have its own private copy of 
$OBJ?
h) assuming that when a thread modifies $OBJ, it does so under the 
protection of some lock mechanism (I guess by defining the modifying 
method as "locking"), does it matter? (see the sketch after this list)
i) does a thread "stay alive" and possibly handle other subsequent 
requests, or does one thread handle only one request and then die?
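
(For reference, the threads::shared way of protecting such an update 
looks roughly like this. A sketch only: under a threaded MPM the hash 
would have to be shared before the Perl interpreters are cloned for all 
threads to see one copy, and when threads are not in use at all, lock() 
degrades to a no-op, as noted elsewhere in this thread.)

   use threads;
   use threads::shared;

   # Shared before clone, so all threads see the same single copy.
   my %map :shared;

   sub reload_map {
       lock(%map);   # other threads block here until the update is done
       %map = ();    # then repopulate from the directories file
       # ... parse the file into %map, plain string values only ...
   }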


Maybe I'll finally end up understanding threads.
Some hope.

3) (just looking for a yes or no here): if I want to have only one 
single copy of the hashtable in memory at any time, then the only 
multi-platform and portable way is to offload the table and its updates 
to a separate daemon, to which the various Apache processes would 
address their translation requests (e.g. via TCP).  But then that one 
process becomes the bottleneck, unless it is itself forking or 
multi-threaded.
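
The client side of such a daemon could be as small as this (address, 
port and line protocol are made up for illustration):

   use IO::Socket::INET;

   sub remote_decode {
       my ($obscure_id) = @_;
       my $sock = IO::Socket::INET->new(
           PeerAddr => '127.0.0.1',
           PeerPort => 8765,         # illustrative port
           Proto    => 'tcp',
       ) or die "connect: $!";
       print $sock "$obscure_id\n";  # ask the daemon for a translation
       chomp(my $dir = <$sock>);     # daemon replies with the real path
       close $sock;
       return $dir;
   }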



[...]

> 'lock' should be a no-op in recent versions of Perl that are non-threaded. So if
> you have a non-threaded Perl running in a non-threaded MPM it should be ok. I've
> never done it so you might want to check with others or ask on PerlMonks.
> 
Am I right to assume that, if I have Apache2, mod_perl2 and perl 5.8.x 
installed (and apparently working well) on a given host, and Apache2 is 
running with a threaded MPM, then the perl version must also be 
thread-enabled? (or else I have a configuration problem)

And that if Apache is running a non-threaded MPM, the perl that is 
installed may well be a thread-enabled one, but the mp2 modules running 
under it will never actually use threads?
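
(A quick run-time check of the perl build itself, for what it's worth; 
this only reports how perl was compiled, not which MPM Apache runs:)

   use Config;

   if ($Config{useithreads}) {
       # this perl was built with ithreads support
   }
   else {
       # a threaded MPM could not work with this non-threaded perl
   }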


Later, I'll come back with the "then how do I create this global 
object" question...

Re: Sharing data between many requests

Posted by Michael Peters <mp...@plusthree.com>.
André Warnier wrote:

> My doubts focus (mainly) on the following issues:
> - whether or not I *can* declare and initialise some object, e.g. in the
> PerlChildInitHandler, and later access that same object in the request
> handlers.

Yes.

> - also, if I later call, from the request handler, a method of this
> object that updates the object's content, whether this updated object
> would still be "shared" by all subsequent instances of request handlers.

That depends on whether or not you have a threaded MPM. Even OSes with COW won't
share updated structures.

> - supposing that this architecture is running within a threaded
> environment, are there special guidelines to follow regarding the
> possibility that 2 threads in the same child would access the object at
> the same time and try to update the internal table?

Yes. It's a shared memory structure, so you will need to take care to make sure
it doesn't get corrupted. See the 'lock' keyword.

> - and if I follow such guidelines, does the same code also work if it
> happens to run in a non-threaded environment?

'lock' should be a no-op in recent versions of Perl that are non-threaded. So if
you have a non-threaded Perl running in a non-threaded MPM it should be ok. I've
never done it so you might want to check with others or ask on PerlMonks.

> - if there is a mandatory difference between threaded/non-threaded mp2
> perl code, can I check at run-time under which environment I'm running,
> and choose which code is executed accordingly?

Jim already gave you a good answer here.

-- 
Michael Peters
Developer
Plus Three, LP