Posted to modperl@perl.apache.org by André Warnier <aw...@ice-sa.com> on 2007/09/26 13:33:35 UTC
Sharing data between many requests
For quite a while now, I have been mulling the following issue, without
ever seeming to get the final, bestest and greatest solution : how to
best share a data structure over many Apache2 requests, in a
multi-platform context.
My doubts have to do with what exactly can be shared and how between
Apache processes, threads and/or requests.
Suppose the following scenario :
On an Apache2/mp2/perl 5.8.x server host, there are 100,000 files, split
over 1000 disk directories, at 100 files per directory. Each file in a
directory is called file1, file2, .... file100.
For a number of reasons, the real locations of the files are encoded as
such :
- a simple text file on disk contains a list of the 1000 directories,
each one identified by some "obscure" key, like this :
key000=directorypath0
key001=directorypath1
key002=directorypath2
...
key999=directorypath999
and the URL provided to web users to access each object is something like :
http://myserver.com/getobj/key500-file7
- the mod_perl module which handles accesses to /getobj/ decodes
"key500" into "directorypath500" and retrieves the corresponding object
to send it back to the browser.
- the text file on disk (containing the list of directories)
occasionally changes : whenever another independent process adds a new
file and finds that the current directory has already 100 objects in
it, it adds a new key and directory path, and files the new file there.
It also updates the text file to reflect the newly added directory.
The frequency of these changes is however much much lower than the
number of "read" accesses by the browsers.
- This whole Apache2 webserver setup can be running on any platform, so
that the particular MPM used cannot really be predicted in advance (it
could be threaded, preforking, whatever)
- For portability and ease-of-installation reasons, I would like to
avoid the usage of an external DBMS.
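The key-to-directory scheme above can be sketched in a few lines of plain
Perl (the keys and directory names are taken from the example; the exact
regex for splitting the id is an assumption):

```perl
use strict;
use warnings;

# The map file contents, as in the example above (inlined here so the
# sketch is self-contained).
my $map_text = <<'END';
key000=directorypath0
key001=directorypath1
key500=directorypath500
END

# Parse "keyNNN=directorypathNNN" lines into a hash.
my %dir_for;
for my $line (split /\n/, $map_text) {
    my ($key, $dir) = split /=/, $line, 2;
    $dir_for{$key} = $dir if defined $dir;
}

# Decode an obscure id from a URL like /getobj/key500-file7.
my ($key, $file) = 'key500-file7' =~ /^(key\d+)-(.+)$/
    or die "malformed id";
my $path = "$dir_for{$key}/$file";
print "$path\n";   # directorypath500/file7
```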
What I would like to achieve is that successive requests to /getobj/ do
not require the mp2 handler module to re-read and reparse the directory
text file at each request. To decode the obscure id's into real paths,
I would thus like to be able to use something like a simple hashtable in
memory.
But any "instance" of the module should still be able to notice when the
basic authoritative directory file has changed, and reload it when
needed before serving a new object. Of course this should not happen
more often than necessary.
My basic idea would be to create a perl package encapsulating the
decoding of the obscure paths and the updates to the memory hashtable,
and use this package in the /getobj/ handler.
A PerlChildInitHandler would initially read-in the text file and build
the internal table, prior to any read handler access. (Alternatively,
this could be done the first time a response handler needs to access the
structure).
The module would contain a method "decode_path(obscure_id)", made
available to the response handler, which would take care of checking
that the table is still up-to-date, and if not, re-read and re-parse it
into the internal hashtable.
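The decode_path idea described above could be sketched like this (the
package name, the mtime-based staleness check, and the file format are
assumptions based on the description; note this caches per process, it
does not share the table between processes):

```perl
package PathDecoder;
use strict;
use warnings;

sub new {
    my ($class, $map_file) = @_;
    my $self = bless { file => $map_file, mtime => -1, table => {} }, $class;
    $self->_reload_if_stale;
    return $self;
}

# Re-read and re-parse the map only when the file's mtime has changed
# since the last parse.
sub _reload_if_stale {
    my ($self) = @_;
    my $mtime = (stat $self->{file})[9];
    die "cannot stat $self->{file}: $!" unless defined $mtime;
    return if $mtime == $self->{mtime};    # cached table is still current
    open my $fh, '<', $self->{file} or die "open $self->{file}: $!";
    my %table;
    while (my $line = <$fh>) {
        chomp $line;
        my ($key, $dir) = split /=/, $line, 2;
        $table{$key} = $dir if defined $dir;
    }
    close $fh;
    $self->{table} = \%table;
    $self->{mtime} = $mtime;
}

# decode_path("key500-file7") -> "directorypath500/file7"
sub decode_path {
    my ($self, $obscure_id) = @_;
    $self->_reload_if_stale;
    my ($key, $file) = split /-/, $obscure_id, 2;
    my $dir = $self->{table}{$key};
    return defined $dir ? "$dir/$file" : undef;
}

package main;
use File::Temp qw(tempfile);

my ($fh, $map_file) = tempfile(UNLINK => 1);
print {$fh} "key000=directorypath0\nkey500=directorypath500\n";
close $fh;

my $decoder = PathDecoder->new($map_file);
print $decoder->decode_path('key500-file7'), "\n";
```

One stat() per request is cheap; the main caveat is that mtime has
one-second granularity on many filesystems, so two updates within the
same second could be missed.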
I imagine that each child process could (and probably would) have its
own copy of the table, but that each request handler, while processing
one request, could have access to that same child-level hashtable.
My doubts focus (mainly) on the following issues
- whether or not I *can* declare and initialise some object e.g. in the
PerlChildInitHandler, and later access that same object in the request
handlers.
- also, if later from the request handler, I call a method of this
object that updates the object content, whether this updated object would
still be "shared" by all subsequent instances of request handlers.
- supposing that this architecture is running within a threaded
environment, are there special guidelines to follow regarding the
possibility that 2 threads in the same child would access the object at
the same time and try to update the internal table ?
- and if I follow such guidelines, does the same code also work if it
happens to run in a non-threaded environment ?
- if there is a mandatory difference between threaded/non-threaded mp2
perl code, can I check at run-time under which environment I'm running,
and condition which code is executed accordingly ?
Thanks for your patience reading this, and thanks in advance for any
comments, answers or suggestions.
Re: Sharing data between many requests
Posted by Perrin Harkins <pe...@elem.com>.
On 9/26/07, André Warnier <aw...@ice-sa.com> wrote:
> - For portability and ease-of-installation reasons, I would like to
> avoid the usage of an external DBMS.
I think you're making a mistake there. An RDBMS is the easiest way to
achieve what you want.
There is no simple way to share a Perl data structure between
processes. Solutions like IPC::Shareable scale badly for large data
structures. If you really won't use an RDBMS, I suggest you use
MLDBM::Sync. If performance is very important, you can use BerkeleyDB
directly, but it will be more complicated.
- Perrin
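The dbm-file approach suggested here could look like the sketch below.
SDBM_File (which ships with core Perl) stands in for MLDBM::Sync so the
sketch runs anywhere; MLDBM::Sync adds the flock-based locking and the
serialization of nested structures that SDBM_File lacks, so for real use
it is the safer choice:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# The writer process ties the dbm file and updates the map...
tie my %writer, 'SDBM_File', "$dir/dirmap", O_RDWR | O_CREAT, 0644
    or die "tie (writer): $!";
$writer{key500} = 'directorypath500';
untie %writer;

# ...and an Apache child ties the same file and sees the update without
# re-parsing any text file.
tie my %reader, 'SDBM_File', "$dir/dirmap", O_RDWR | O_CREAT, 0644
    or die "tie (reader): $!";
print "key500 -> $reader{key500}\n";
```

Note that SDBM_File stores only flat strings with a size limit of
roughly 1 KB per record, which is another reason MLDBM::Sync or
BerkeleyDB is preferable beyond a quick prototype.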
Re: Sharing data between many requests
Posted by Jim Brandt <cb...@buffalo.edu>.
André Warnier wrote:
> - if there is a mandatory difference between threaded/non-threaded mp2
> perl code, can I check at run-time under which environment I'm running,
> and condition which code is executed accordingly ?
To answer one question, you can use the Apache2::MPM module to find out
what MPM you're running and execute different code based on the result.
use Apache2::MPM ();
my $mpm = lc Apache2::MPM->show;
if ($mpm eq 'prefork') {
    # prefork-specific code
}
etc...
--
Jim Brandt
Administrative Computing Services
University at Buffalo
Re: Sharing data between many requests
Posted by André Warnier <aw...@ice-sa.com>.
Michael Peters wrote:
> André Warnier wrote:
>
>> My doubts focus (mainly) on the following issues
>> - whether or not I *can* declare and initialise some object e.g. in the
>> PerlChildInitHandler, and later access that same object in the request
>> handlers.
>
> Yes.
>
>> - also, if later from the request handler, I call a method of this
>> object that updates the object content, whether this updated object would
>> still be "shared" by all subsequent instances of request handlers.
>
> That depends on whether or not you have a threaded MPM. Even OSes with COW
> won't share updated structures.
Ok, I'm already a bit out of my depth here.
Tell me if the following is correct :
1) Apache forking model (non-threaded) :
a) Apache starts and forks a number of children.
b) Each child executes its PerlChildInitHandler.
c) within each of these children, I thus now have an object containing
the initial version of the hashtable I loaded.
Let's say that this "hashtable object" is accessible via the global
identifier $OBJ. Each child has its own $OBJ.
d) requests are processed each by a child. There can be several
simultaneous requests being processed, but there is only one request
(and one mp2 content handler) being executed at any one time within each
child process. This handler "instance" has access to the $OBJ object of
its own child process.
e) if one handler instance somehow modifies the content of the $OBJ
object, then it is (only) this Apache child's version of $OBJ that is
modified.
f) if a new request happens to be processed again by this same child,
the content handler instance that runs then will see $OBJ as (possibly)
modified by the previous request processed in that child.
g) there is no real need to "lock" the $OBJ while it's being modified,
but it doesn't hurt either.
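The per-child-copy behaviour described in (e) and (f) can be demonstrated
with a plain fork (a minimal sketch; the table contents are invented):

```perl
use strict;
use warnings;

my %TABLE = (key000 => 'directorypath0');

pipe(my $rd, my $wr) or die "pipe: $!";
my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid == 0) {
    # child: modify its own copy of the table and report what it sees
    close $rd;
    $TABLE{key000} = 'CHANGED';
    print {$wr} "$TABLE{key000}\n";
    close $wr;
    exit 0;
}

close $wr;
chomp(my $child_view = <$rd>);
close $rd;
waitpid($pid, 0);

# Copy-on-write: the child's update never reaches the parent's copy.
print "child saw: $child_view; parent still has: $TABLE{key000}\n";
```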
2) Apache running under a threaded MPM :
(basically all of this below is post-fixed with question marks, as I am
really very naive about threads. Be kind..)
a) Apache pre-forks at least one child (I think).
b) Each child executes its PerlChildInitHandler.
c) within each of these children, I thus now have an object containing
the initial version of the hashtable I loaded.
Let's say that this "hashtable object" is accessible via the global
identifier $OBJ.
d) each child (or the one child) launches a number of threads ???
(or are threads launched as requests come in ?)
e) each of these threads sees the *same* $OBJ (meaning at the same
memory location, not just a copy) ???
f) when a request comes in, it is handed off to a thread, which runs the
mp2 handler.
g) if this handler, by calling a method of $OBJ modifies $OBJ, what
happens ? does this particular thread now have its own private copy of
$OBJ ?
h) assuming that when a thread modifies $OBJ, it does so under
protection of some lock mechanism (I guess by defining the modifying
method as "locking"), does it matter ?
i) does a thread "stay alive" and handle possibly other subsequent
requests, or does one thread only handle one request and then die ?
Maybe I'll finally end up understanding threads.
Some hope.
3) (just looking for a yes or no here) : If I want to have only one
single copy of the hashtable in memory at any time, then the only
multi-platform and portable way is to offload the table and its updates
to a separate daemon, to which the various Apache processes would
address their translation requests (e.g. via TCP). But then that one
process becomes the bottleneck, unless it is itself forking or
multi-threaded.
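Such a lookup daemon could be sketched as below (all names and the
one-request protocol are invented for illustration; a real daemon would
loop over accept() and fork or thread per connection rather than serve a
single request):

```perl
use strict;
use warnings;
use IO::Socket::INET;

my %TABLE = (key500 => 'directorypath500');

# Create the listening socket before forking so both sides know the
# OS-assigned port.
my $listen = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,            # let the OS pick a free port
    Listen    => 5,
    Proto     => 'tcp',
) or die "listen: $!";
my $port = $listen->sockport;

my $pid = fork();
die "fork: $!" unless defined $pid;

if ($pid == 0) {
    # "daemon": serve exactly one translation request, then exit
    my $conn = $listen->accept or exit 1;
    chomp(my $key = <$conn>);
    my $dir = defined $TABLE{$key} ? $TABLE{$key} : 'NOTFOUND';
    print {$conn} "$dir\n";
    close $conn;
    exit 0;
}

# "Apache child": ask the daemon to decode a key
close $listen;
my $client = IO::Socket::INET->new(
    PeerAddr => '127.0.0.1',
    PeerPort => $port,
    Proto    => 'tcp',
) or die "connect: $!";
print {$client} "key500\n";
chomp(my $path = <$client>);
close $client;
waitpid($pid, 0);
print "key500 -> $path\n";
```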
[...]
> 'lock' should be a no-op in recent versions of Perl that are non-threaded. So if
> you have a non-threaded Perl running in a non-threaded mpm it should be ok. I've
> never done it so you might want to check with others or ask on Perl monks.
>
Am I right to assume that if I have Apache2 and mod_perl2 and perl 5.8.x
installed (and apparently working well) on a given host, if Apache2 is
running with a threaded MPM, the perl version will also be
thread-enabled ? (or else I have a configuration problem)
And that if the Apache is running a non-threaded MPM, the perl that is
installed may well be a thread-enabled one, but the mp2 modules running
under it will never actually use threads ?
Later, I'll come with the "then how do I create this global object"
question..
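Whether a given perl binary was built with ithreads can be checked
directly from the core Config module, independently of which MPM Apache
happens to be running:

```perl
use strict;
use warnings;
use Config;

# $Config{useithreads} is true when this perl was compiled with
# interpreter-thread support (a prerequisite for running mod_perl2
# under a threaded MPM).
my $threaded = $Config{useithreads} ? 1 : 0;
print $threaded ? "thread-enabled perl\n" : "unthreaded perl\n";
```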
Re: Sharing data between many requests
Posted by Michael Peters <mp...@plusthree.com>.
André Warnier wrote:
> My doubts focus (mainly) on the following issues
> - whether or not I *can* declare and initialise some object e.g. in the
> PerlChildInitHandler, and later access that same object in the request
> handlers.
Yes.
> - also, if later from the request handler, I call a method of this
> object that updates the object content, whether this updated object would
> still be "shared" by all subsequent instances of request handlers.
That depends on whether or not you have a threaded MPM. Even OSes with COW
won't share updated structures.
> - supposing that this architecture is running within a threaded
> environment, are there special guidelines to follow regarding the
> possibility that 2 threads in the same child would access the object at
> the same time and try to update the internal table ?
Yes. It's a shared memory structure so you will need to use care to make sure
it's not corrupted. See the 'lock' keyword.
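One way to follow this guideline while keeping the same source working on
both threaded and unthreaded perls is sketched below (the share/lock
placement and the table contents are assumptions for illustration; as
noted later in this thread, lock() is documented as a no-op on a
non-threaded perl):

```perl
use strict;
use warnings;
use Config;

our %TABLE;
if ($Config{useithreads}) {
    # On a thread-enabled perl, make the table shared across threads.
    # String eval so this still compiles on an unthreaded perl.
    eval 'use threads; use threads::shared; share(%TABLE); 1' or die $@;
}
%TABLE = (key000 => 'directorypath0');

sub update_table {
    my %new = @_;
    # Advisory lock under ithreads; a no-op on an unthreaded perl, so
    # the same code runs unchanged in both environments.
    lock(%TABLE);
    %TABLE = (%TABLE, %new);
}

update_table(key001 => 'directorypath1');
print "$TABLE{key001}\n";
```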
> - and if I follow such guidelines, does the same code also work if it
> happens to run in a non-threaded environment ?
'lock' should be a no-op in recent versions of Perl that are non-threaded. So if
you have a non-threaded Perl running in a non-threaded mpm it should be ok. I've
never done it so you might want to check with others or ask on Perl monks.
> - if there is a mandatory difference between threaded/non-threaded mp2
> perl code, can I check at run-time under which environment I'm running,
> and condition which code is executed accordingly ?
Jim already gave you a good answer here.
--
Michael Peters
Developer
Plus Three, LP