Posted to modperl@perl.apache.org by Will Fould <wi...@gmail.com> on 2007/05/05 02:50:01 UTC

Global question

Can lists and other global objects created at apache startup be altered as
an *indirect* result of child processes (i.e., via some type of
semaphore/listener scheme)?

Thanks

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/4/07, Will Fould <wi...@gmail.com> wrote:
> Can lists and other global objects created at apache startup be altered as
> an *indirect* result of child processes (i.e., via some type of
> semaphore/listener scheme)?

Are you asking if changing a perl data structure in one process will
affect it in another process?  No, it won't.  If you want that, you
need to use a database or other shared data storage.

- Perrin

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Adam Prime x443 <ap...@brunico.com> wrote:
> If they change really rarely couldn't you just have the children
> automatically die off when the stuff needs to change and reload it?
> You'd have to create the datastructure using a ChildInit handler i
> assume, but couldn't a setup like that potentially work?

That doesn't have any advantage over simply reloading the data in each
child when it changes.  Nothing is shared unless it was set up in the
parent process.

- Perrin

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/19/07, Will Fould <wi...@gmail.com> wrote:
> I'm afraid that:
>    1. hashes get really big (greater than a few MB's each)
>    2. re-caching entire hash just b/c 1 key updated (waste).
>    3. latency for pulling cache data from remote DB.
>    4. doing this for all children.

The most common way to improve speed is to cache things after you
fetch them from the db, rather than pre-fetching as you are now.  You
give them a reasonable timeout value, and always check the cache for
data first, falling back to the db if it's not there.  For
applications  that can tolerate a little stale data and have a
relatively small set of hot data, this works great.  It also assumes
that you can make your code fetch from the db (when the result is not
cached yet) in a slow but reasonable amount of time.
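[Editor's sketch] The check-cache-first pattern described above can be
shown in a few lines of Perl; the per-child %cache, the $ttl value, and
the $fetch coderef are illustrative assumptions, not code from this
thread:

```perl
# A minimal sketch of the check-cache-first pattern: look in the cache,
# fall back to the database on a miss or after the timeout expires.
use strict;
use warnings;

my %cache;       # per-child cache: id => [ value, cached_at ]
my $ttl = 300;   # a "reasonable timeout value", in seconds

sub get_record {
    my ($id, $fetch) = @_;
    my $slot = $cache{$id};
    # fresh cache hit: skip the database entirely
    return $slot->[0] if $slot && time() - $slot->[1] < $ttl;
    # miss or stale: fall back to the db (the slow but tolerable path)
    my $row = $fetch->($id);
    $cache{$id} = [ $row, time() ];
    return $row;
}
```

Each child pays the db cost only on a miss or after the timeout; a
little staleness is the trade-off.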

If you want to stick with pre-fetching, you have a few options.  One
is to use memcached.  It will be much slower than your current method.
 However, you can update values whenever you like and they will be
visible to all processes on all servers immediately.  You can't count
on data to be there though -- you have to structure your application
so it can fetch from the db if memcached drops some data.  It is not a
database.

Another is to build local shared caches with BerkeleyDB, MySQL on the
local machine, or Cache::FastMmap.  All of these will be faster than a
remote memcached.  You can update them with a cron job on each server
and all children will see the results immediately.  The same caveats
about surviving missing data apply for Cache::FastMmap -- it's not a
database either.

In both cases, you are going to sacrifice performance.  What you'll
get for your trouble is memory -- no more duplicating MBs of data in
every process.

> For now, what seems like the 'holy grail' (*) is to cache last_modified for
> each type (available to the cluster, say through memcached), in a way that
> indicates which parts of the cache (which keys of each hash) the children
> need to update/delete, such that a child rarely, if ever, needs to query --
> and then only for those keys -- directly modifying its own hashes to keep
> current.

That actually sounds pretty easy -- put a timestamp on your rows and
only fetch the data that changed since last time you asked.
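[Editor's sketch] The timestamp idea can be roughed out like this; the
column names (id, name, last_modified) and the in-memory shape are
assumptions, with the SQL shown only as a comment:

```perl
# Incremental refresh via a last_modified high-water mark.  In the real
# app the rows would come from something like:
#   SELECT id, name, last_modified FROM lookup WHERE last_modified > ?
use strict;
use warnings;

my %cache;
my $last_sync = 0;

sub merge_changed_rows {
    my ($rows) = @_;    # arrayref of hashrefs from the SELECT above
    for my $r (@$rows) {
        $cache{ $r->{id} } = $r->{name};          # update just this key
        $last_sync = $r->{last_modified}
            if $r->{last_modified} > $last_sync;  # advance the mark
    }
    return $last_sync;
}
```

Only the changed keys are touched, so the cost of a refresh scales with
the amount of churn, not with the size of the hash.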

- Perrin

Re: Global question

Posted by Jonathan Vanasco <jv...@2xlp.com>.
my .02¢

	•  ldap would be silly unless you're clustering -- most implementations
	   use bdb as their backend
	•  bdb and cache::fastmmap would make more sense if you're on 1 machine

also

i think your hash system may be better off rethought....

you have:
	$CACHE_1{id}='foo'
	$CACHE_2{ida}{idb}='bar'
	
which limits you to thinking in terms of perl hash structures...

if you emulate that with flattened keys:
	cache_1_[\d+]
	cache_2_[\w+]_[\w+]

then you have a lot more options.  in a clustered system, you have
memcached or a dedicated mysql / whatever daemon.
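[Editor's sketch] A tiny illustration of what the key flattening looks
like in practice; the key format is invented for illustration:

```perl
# Flattening nested perl-hash addressing into single string keys, so
# any key/value store (memcached, bdb, mysql) can serve the lookup.
use strict;
use warnings;

sub cache1_key { my ($id)        = @_; return "cache_1_$id" }
sub cache2_key { my ($ida, $idb) = @_; return "cache_2_${ida}_${idb}" }

# with Cache::Memcached, $CACHE_2{ida}{idb} = 'bar' would then become
# something like:
#   $memd->set( cache2_key($ida, $idb), 'bar' );
#   my $val = $memd->get( cache2_key($ida, $idb) );
```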
	
one of my projects, RoadSound, is a collaborative content management
system where each 'view' into a set of data is from the independent
perspective of each relevant entity and content manager.  loosely
translated -- to display the most basic details of a concert, i need to
do 4 15-20-table joins in postgres -- and i need to do & store that
separately for each artist / venue / whatever involved.  in order to
offload the db, i store everything in memcached as i generate it, with a
key like: """show_%(id)s_%(perspective_type_id)s_%(perspective_owner_id)s""".
it doesn't perform nearly as fast as using shared memory, but it
offloads A TON of work from my db and works across multiple machines.

the only issue with this approach would be clearing out the 'y' level in
this model: $CACHE_2{$y}{$z}.  i don't know if that is a concern for you
or not, but it could create issues.

also- depending on your current performance, you might be able to  
just use mysql as well.  you could conceivably do something that  
takes advantage of the speed of memory or myisam tables and select  
query caching.  while that wouldn't be as fast as using memory  
alone , it clusters.



On May 19, 2007, at 6:13 PM, Will Fould wrote:

> Thanks a lot Perrin -
>
> I really like the current method (if it were to stay on 1 machine  
> and not grow). Caching per child has not really been a problem once  
> I got beyond the emotional hangup of what seemed to be duplicative,  
> waste of memory.  I am totally amazed how fast and efficient using  
> modperl in this way has been. The hash building queries issued by  
> the children are very simple selects but the data provided by (and  
> cached within) them is used in many ways throughout the session  
> such that not having them would require extra joins in multiple  
> places and queries in other places that are currently not needed at  
> all. -- ( i.e. collaborative environment ACL's etc.).  To be clear,  
> the hashes are not only for quick de-normalizing, but they serve a  
> vital caching function.
>
> The problem is that I am now moving the database off localhost and
> configuring a second web node.
>
> > what it is that you don't like about your current method.
>
> I'm afraid that:
>    1. hashes get really big (greater than a few MB's each)
>    2. re-caching entire hash just b/c 1 key updated (waste).
>    3. latency for pulling cache data from remote DB.
>    4. doing this for all children.
>
> For now, what seems like the 'holy-grail' (*) is to cache  
> last_modified for each type, (available to the cluster, say through  
> memcached), in a way that indicates only which parts of the cache  
> (which keys of each hash) the children need to update/delete such  
> that a child rarely, if ever, will only need to query for just  
> those keys and directly modify their own hashes accordingly to keep  
> current.
>
> (*) I'm not too clear about this, but it seems like the real 'holy- 
> grail' would be to do this within apache in a scoreboard like way.
>
> -w
>
>
> On 5/19/07, Perrin Harkins <pe...@elem.com> wrote: On 5/19/07,  
> Will Fould <wi...@gmail.com> wrote:
> > Here's the situation:  We have a fully normalized relational  
> database
> > (mysql) now being accessed by a web application and to save a lot  
> of complex
> > joins each time we grab rows from the database, I currently load  
> and cache a
> > few simple hashes (1-10MB) in each apache processes with the  
> corresponding
> > lookup data
>
> Are you certain this is saving you all that much, compared to just
> doing the joins?  With proper indexes, joins are fast.  It could be a
> win to do them yourself, but it depends greatly on how much of the
> data you end up displaying before the lookup tables change and have to
> be re-fetched.
>
> > Is anyone doing something similar? I'm wondering if implementing  
> a BerkleyDB
> > or another slave store on each web node with a tied hash (or  
> something
> > similar) is feasible and if not, what a better solution might be.
>
> Well, first of all, I wouldn't feed a tied hash to my neighbor's dog.
> It's slower than method calls, and more confusing.
>
> There are lots of things you could do here, but it's not clear to me
> what it is that you don't like about your current method.  Is it that
> when the database changes you have to do heavy queries from every
> child process?  That also kills any sharing of the data.  Do you have
> more than one server, or expect to soon?
>
> - Perrin
>

// Jonathan Vanasco

| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -
| SyndiClick.com
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -
|      FindMeOn.com - The cure for Multiple Web Personality Disorder
|      Web Identity Management and 3D Social Networking
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -
|      RoadSound.com - Tools For Bands, Stuff For Fans
|      Collaborative Online Management And Syndication Tools
| - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
- - - - - - - - - - - - - - - - - - -



Re: Global question

Posted by Will Fould <wi...@gmail.com>.
Thanks a lot Perrin -

I really like the current method (if it were to stay on 1 machine and not
grow). Caching per child has not really been a problem once I got beyond the
emotional hangup of what seemed to be a duplicative waste of memory.  I am
totally amazed how fast and efficient using mod_perl in this way has been.
The hash-building queries issued by the children are very simple selects,
but the data provided by (and cached within) them is used in many ways
throughout the session, such that not having it would require extra joins in
multiple places and queries in other places that are currently not needed at
all (i.e. collaborative environment ACLs, etc.).  To be clear, the hashes
are not only for quick de-normalizing; they serve a vital caching function.

The problem is that I am now moving the database off localhost and
configuring a second web node.

> what it is that you don't like about your current method.

I'm afraid that:
   1. the hashes get really big (greater than a few MBs each);
   2. re-caching an entire hash just because one key updated is wasteful;
   3. there is latency pulling cache data from the remote DB;
   4. and all of this happens in every child.

For now, what seems like the 'holy grail' (*) is to cache last_modified for
each type (available to the cluster, say through memcached), in a way that
indicates which parts of the cache (which keys of each hash) the children
need to update/delete, such that a child rarely, if ever, needs to query --
and then only for those keys -- directly modifying its own hashes to keep
current.

(*) I'm not too clear about this, but it seems like the real 'holy-grail'
would be to do this within apache in a scoreboard like way.

-w


On 5/19/07, Perrin Harkins <pe...@elem.com> wrote:
>
> On 5/19/07, Will Fould <wi...@gmail.com> wrote:
> > Here's the situation:  We have a fully normalized relational database
> > (mysql) now being accessed by a web application and to save a lot of
> complex
> > joins each time we grab rows from the database, I currently load and
> cache a
> > few simple hashes (1-10MB) in each apache processes with the
> corresponding
> > lookup data
>
> Are you certain this is saving you all that much, compared to just
> doing the joins?  With proper indexes, joins are fast.  It could be a
> win to do them yourself, but it depends greatly on how much of the
> data you end up displaying before the lookup tables change and have to
> be re-fetched.
>
> > Is anyone doing something similar? I'm wondering if implementing a
> BerkleyDB
> > or another slave store on each web node with a tied hash (or something
> > similar) is feasible and if not, what a better solution might be.
>
> Well, first of all, I wouldn't feed a tied hash to my neighbor's dog.
> It's slower than method calls, and more confusing.
>
> There are lots of things you could do here, but it's not clear to me
> what it is that you don't like about your current method.  Is it that
> when the database changes you have to do heavy queries from every
> child process?  That also kills any sharing of the data.  Do you have
> more than one server, or expect to soon?
>
> - Perrin
>

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Jonathan Vanasco <jv...@2xlp.com> wrote:
> I thought the same as you , until that lighttpd posting ( March
> 2006 ).    But according to Jan , who was at MySQL at the time ,
> MySQL only had a synchronous API.  An async was on a todo list, but
> selects were blocking on the whole server.

There are two different concepts here.  The first is how the RDBMS
handles locking.  This is pretty simple.  MyISAM tables use shared
read locks and exclusive write locks.  InnoDB tables use an MVCC
approach like Postgres and Oracle.  Readers do not block readers with
either one.

Then there's the network transport layer.  Lighttpd uses the same
approach as memcached or POE, i.e. a single process that multiplexes
between active connections using some kind of polling.  This is
usually called asynchronous or non-blocking I/O.

There are no databases I'm aware of that provide an asynchronous
network library.  They all assume you will send a query and then wait
for the response, blocking within that thread/process until the
response comes back.  A system like mod_perl which dedicates a process
or thread to each request has no problem with this, since one waiting
thread doesn't block any others.  It's a disaster for Lighttpd because
any blocking I/O means the entire server has to sit and wait.  (Note
that this is only for database calls made from directly within the
server.  Most Lighttpd users write their applications in separate
processes like FastCGI that have the same multi-process model mod_perl
does.)

This is why there are a bunch of plugins for POE to let you do DBI
queries.  They do things like fork before sending the query, since if
they did it from within POE it would make the whole thing sit and wait
for the results.

This is also why I'm always skeptical when people come along saying
they are going to switch to a non-blocking process model and beat the
pants off apache.  Using that model complicates everything, and makes
the majority of network libraries useless unless you do tricks like
forking before you use them.

If you're interested in using non-blocking I/O from perl, Stas Bekman
wrote an article about it here:
http://www.onlamp.com/pub/a/onlamp/2006/10/12/asynchronous_events.html

- Perrin

Re: Global question

Posted by Jonathan Vanasco <jv...@2xlp.com>.
On May 7, 2007, at 2:01 PM, Perrin Harkins wrote:
> It does when you shut down the BDB "environment", but there's no
> reason to do that unless your processes are exiting.

ah, that makes sense.  so long as one process has bdb running, there's a
shared bdb memory section.

> Blocking?  You mean readers blocking writers?  If you have frequent
> updates, the MVCC model used by InnoDB tables avoids that.  In
> general, I've found the read performance of InnoDB to be better than
> MyISAM in my application.

readers blocking readers.

>> they
>> eventually realized that when the system wasn't making use of the
>> mysql query caching, the requests were blocking with all the other
>> mysql traffic.  i think it was because all the selects happen in one
>> synchronous process.
>
> Not sure what you're talking about there.  MySQL is a multi-threaded
> daemon and readers don't block each other, even with the simple MyISAM
> locking scheme.
>
> In any case, the scenario I had in mind for Will Fould's situation is
> a dedicated MySQL on the local box that does nothing but handle this
> shared data.

Is this new?

I thought the same as you, until that lighttpd posting (March 2006).  But
according to Jan, who was at MySQL at the time, MySQL only had a synchronous
API.  An async API was on the todo list, but selects were blocking on the
whole server.

I didn't really look into it much then, because I was already  
converted to Postgres for just about everything at the time.

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Jonathan Vanasco <jv...@2xlp.com> wrote:
> that's interesting.  since it's not persistent, i figured it just
> released everything once you closed the file.

It does when you shut down the BDB "environment", but there's no
reason to do that unless your processes are exiting.

> if you've got a dedicated system on this from mysql that works.  but
> the blocking nature of mysql selects means that you're bringing in
> the overhead from other mysql apps onto your mp application -- which
> you might not want on a per-request basis.

Blocking?  You mean readers blocking writers?  If you have frequent
updates, the MVCC model used by InnoDB tables avoids that.  In
general, I've found the read performance of InnoDB to be better than
MyISAM in my application.

> they
> eventually realized that when the system wasn't making use of the
> mysql query caching, the requests were blocking with all the other
> mysql traffic.  i think it was because all the selects happen in one
> synchronous process.

Not sure what you're talking about there.  MySQL is a multi-threaded
daemon and readers don't block each other, even with the simple MyISAM
locking scheme.

In any case, the scenario I had in mind for Will Fould's situation is
a dedicated MySQL on the local box that does nothing but handle this
shared data.

- Perrin

Re: Global question

Posted by Jonathan Vanasco <jv...@2xlp.com>.
On May 7, 2007, at 1:12 PM, Perrin Harkins wrote:
>> I didn't know that BDB does shared memory caching.
>
> And no socket overhead too.  All the calls are in-process.

that's interesting.  since it's not persistent, i figured it just
released everything once you closed the file.

>> > Primary key lookups in MySQL over local sockets are very fast --
>> > faster than memcached.
>>
>> Really ?  I had read that they were about the same, but that mysql
>> selects are blocking & FIFO , while memcached is threaded and
>> supports concurrent access.
>
> Memcached is single-threaded and uses non-blocking I/O.  It's a very
> different approach from a multi-threaded daemon like MySQL, and should
> scale better ultimately.  However, for simple lookups, the network
> overhead from the TCP socket that memcached requires seems to outweigh
> any advantages.  MySQL can use a pipe instead of a TCP socket, which
> it does automatically when you connect to a server on localhost.

if you've got a dedicated mysql system for this, that works.  but the
blocking nature of mysql selects means that you're bringing the overhead
from other mysql apps onto your mp application -- which you might not
want on a per-request basis.

before i switched to nginx, i was using lighttpd as my port80 proxy
server.  some people started posting to the list because of odd behavior
from their servers configured with virtual hosts.  they eventually
realized that when the system wasn't making use of the mysql query
caching, the requests were blocking with all the other mysql traffic.  i
think it was because all the selects happen in one synchronous process.
the end result, though, was that requests per second went from 3k to 30.
it's definitely an edge case for this functionality -- but it's
something that has made me wary of mysql for anything that is needed on
a per-page basis unless i'm already using sessions on that page.



// Jonathan Vanasco




Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/19/07, Will Fould <wi...@gmail.com> wrote:
> Here's the situation:  We have a fully normalized relational database
> (mysql) now being accessed by a web application and to save a lot of complex
> joins each time we grab rows from the database, I currently load and cache a
> few simple hashes (1-10MB) in each apache processes with the corresponding
> lookup data

Are you certain this is saving you all that much, compared to just
doing the joins?  With proper indexes, joins are fast.  It could be a
win to do them yourself, but it depends greatly on how much of the
data you end up displaying before the lookup tables change and have to
be re-fetched.

> Is anyone doing something similar? I'm wondering if implementing a BerkeleyDB
> or another slave store on each web node with a tied hash (or something
> similar) is feasible and if not, what a better solution might be.

Well, first of all, I wouldn't feed a tied hash to my neighbor's dog.
It's slower than method calls, and more confusing.

There are lots of things you could do here, but it's not clear to me
what it is that you don't like about your current method.  Is it that
when the database changes you have to do heavy queries from every
child process?  That also kills any sharing of the data.  Do you have
more than one server, or expect to soon?

- Perrin

Re: Global question

Posted by Will Fould <wi...@gmail.com>.
Maybe I should restate this question -- I'm wondering if BerkeleyDB, LDAP,
or something like IPC::MM would help me with this, but I have little
experience using these in heavy practice.

Here's the situation:  We have a fully normalized relational database
(mysql) now being accessed by a web application, and to save a lot of
complex joins each time we grab rows from the database, I currently load and
cache a few simple hashes (1-10MB) in each apache process with the
corresponding lookup data:

    $CACHE_1{id}='foo'

    and

    $CACHE_2{ida}{idb}='bar'

Basically, this lets me just grab and loop through the normalized
(non-joined) DB rows and print something like:

    "This row belongs to $CACHE_1{$a}" and is about $CACHE_2{$y}{$z}, please
call $CACHE_1{$b};
    "This row belongs to $CACHE_1{$a}" and is about $CACHE_2{$y}{$z}, please
call $CACHE_1{$b};
    "This row belongs to $CACHE_1{$a}" and is about $CACHE_2{$y}{$z}, please
call $CACHE_1{$b};

More importantly, if the value of $a, $b, $y or $z ever changes, the rows
in all the tables do not need to be updated.

For large datasets (100-1000 rows), this works great, but it would be
prohibitively expensive to query each value in the database separately,
forcing me to rethink a more complex data-joining strategy.

The lookup hashes are very simple name=value pairs and rarely change (if
ever) during the lifetime of any child process, but they'll continue to
grow and change over time.  For now, when they do change, the child
processes know to reload them from the database.

Is anyone doing something similar? I'm wondering if implementing a BerkeleyDB
or another slave store on each web node with a tied hash (or something
similar) is feasible, and if not, what a better solution might be.




On 5/7/07, Perrin Harkins <pe...@elem.com> wrote:
>
> On 5/7/07, Will Fould <wi...@gmail.com> wrote:
> > C/Would anyone recommend any of the IPC::*** shared memory packages for
> what
> > I'm doing?
>
> No, they have terrible performance for any significant amount of data.
> Much worse than a simple shared file approach.
>
> If you can break up your data into a hash-like form, you might be able
> to use Cache::FastMmap.  It's a cache though, and will drop data when
> it gets full, so you have to keep the database as the master source
> and fall back to it for data not found in the cache.
>
> - Perrin
>

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Will Fould <wi...@gmail.com> wrote:
> C/Would anyone recommend any of the IPC::*** shared memory packages for what
> I'm doing?

No, they have terrible performance for any significant amount of data.
 Much worse than a simple shared file approach.

If you can break up your data into a hash-like form, you might be able
to use Cache::FastMmap.  It's a cache though, and will drop data when
it gets full, so you have to keep the database as the master source
and fall back to it for data not found in the cache.

- Perrin

Re: Global question

Posted by Will Fould <wi...@gmail.com>.
> simpler by checking last mod time on a shared file
Good idea.

C/Would anyone recommend any of the IPC::*** shared memory packages for what
I'm doing?


On 5/7/07, Perrin Harkins <pe...@elem.com> wrote:
>
> On 5/7/07, Will Fould <wi...@gmail.com> wrote:
> > Can apache processes meaningfully access any external ( i.e. shell,
> > other) structures?
>
> It's the same as any other process, i.e. the usual IPC methods are
> available.  If you want to update the whole data structure at once, I
> think what you're doing sounds fine, although you could make it
> simpler by checking last mod time on a shared file and doing away with
> semaphores.
>
> - Perrin
>

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Will Fould <wi...@gmail.com> wrote:
> Can apache processes meaningfully access any external ( i.e. shell,
> other) structures?

It's the same as any other process, i.e. the usual IPC methods are
available.  If you want to update the whole data structure at once, I
think what you're doing sounds fine, although you could make it
simpler by checking last mod time on a shared file and doing away with
semaphores.
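[Editor's sketch] One way that mtime check might look; the stamp-file
path and the reload coderef are made up for illustration:

```perl
# Each child compares the mtime of a shared stamp file against the time
# it last loaded its hashes, and reloads only when a writer has touched
# the file.  No semaphores needed.
use strict;
use warnings;

my $stamp_file = '/tmp/lookup.stamp';   # hypothetical shared file
my $loaded_at  = 0;

sub maybe_reload {
    my ($reload) = @_;                  # coderef that rebuilds the hashes
    my $mtime = (stat $stamp_file)[9] || 0;
    if ( $mtime > $loaded_at ) {
        $reload->();                    # data changed: rebuild
        $loaded_at = $mtime;
        return 1;
    }
    return 0;                           # still current: do nothing
}

# a writer signals "data changed" by touching the stamp file:
#   utime undef, undef, $stamp_file;
```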

- Perrin

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Carl Johnstone <mo...@fadetoblack.me.uk> wrote:
> You can use shared memory between apache processes. Check:
>
> *Apache::SharedMem*
> <http://search.cpan.org/author/RSOLIV/Apache-SharedMem-0.09/lib/Apache/SharedMem.pm>
>
> *Tie::ShareLite*
> <http://search.cpan.org/author/NSHAFER/Tie-ShareLite-0.03/ShareLite.pm>
> *Cache::SharedMemoryCache*
> <http://search.cpan.org/author/DCLINTON/Cache-Cache-1.05/lib/Cache/SharedMemoryCache.pm>
>
>
> all based on
>
> *IPC::ShareLite*
> <http://search.cpan.org/author/MAURICE/IPC-ShareLite-0.09/ShareLite.pm>

These have terrible performance unless you are just sharing a few
simple scalars.  They serialize the entire data structure with
Storable on every write and deserialize on every read.  You can
imagine how bad that gets if you try to share a 10MB hash.
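[Editor's sketch] The cost is visible in the access pattern itself; a
sketch using Storable directly, with the hash size shrunk for
illustration:

```perl
# Why ShareLite-style sharing gets slow: the ENTIRE structure is frozen
# on every write and thawed on every read, even to touch a single key.
use strict;
use warnings;
use Storable qw(freeze thaw);

my %big = map { $_ => "value-$_" } 1 .. 1000;

# a read pays for thawing all 1000 keys just to fetch one:
my $frozen = freeze(\%big);
my $one    = thaw($frozen)->{42};

# a write pays for re-freezing all 1000 keys to change one:
my $copy = thaw($frozen);
$copy->{42} = 'updated';
$frozen = freeze($copy);
```

Scale the hash up to 10MB and every read and write of one value drags
the whole structure through the serializer.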

- Perrin

Re: Global question

Posted by Carl Johnstone <mo...@fadetoblack.me.uk>.
You can use shared memory between apache processes. Check:

*Apache::SharedMem* 
<http://search.cpan.org/author/RSOLIV/Apache-SharedMem-0.09/lib/Apache/SharedMem.pm> 

*Tie::ShareLite* 
<http://search.cpan.org/author/NSHAFER/Tie-ShareLite-0.03/ShareLite.pm>
*Cache::SharedMemoryCache* 
<http://search.cpan.org/author/DCLINTON/Cache-Cache-1.05/lib/Cache/SharedMemoryCache.pm> 


all based on

*IPC::ShareLite* 
<http://search.cpan.org/author/MAURICE/IPC-ShareLite-0.09/ShareLite.pm>

The Cache::SharedMemoryCache docs say, though, that a decent OS will keep a
frequently accessed *disk* cache in memory anyway through buffers etc., so a
disk-based cache can frequently be as fast as shared memory.

So you'll probably find that Perrin's BDB suggestion is the quickest,
easiest-to-implement solution.

Carl


Re: Global question

Posted by Will Fould <wi...@gmail.com>.
Thanks guys.  (I'm sure Perrin is tired of answering these same old question
in all of it's forms.)

The lists are functionally similar to Unix security lists (group id=name,
etc.).  With thousands of users, these key lists are getting larger, and the
time to re-build them will continue to grow, but they are invaluable in
cutting down the time to do all sorts of good stuff.  For example, to keep
mysql indexing and joins simple (and relational updates simple), we simply
pull de-normalized datasets from mysql, and the lists are used to provide
'real' values on screen where IDs are present.  So the ideal situation would
be to load the lists once and just edit them in memory directly (after
updating mysql), never rebuilding except on restart.  Loading and reloading
the lists seems exhausting and counter-productive, and running a query for
the denormalized values in each session seems like another big waste.  Can
apache processes meaningfully access any external (i.e. shell, other)
structures?



On 5/7/07, Perrin Harkins <pe...@elem.com> wrote:
>
> On 5/7/07, Jonathan Vanasco <jv...@2xlp.com> wrote:
> > Ah, I reread the post.  I saw "large lists" and thought "complex data
> > structure", not simple text.
>
> I think we were talking about different things, actually.  For reading
> and writing a large and complex data structure in its entirety, a
> Storable file is as good as it gets, unless you can rig something with
> an mmap'ed file.  I was talking about reading/writing pieces of it in
> BDB or MySQL, which should not need Storable.
>
> > I didn't know that BDB does shared memory caching.
>
> And no socket overhead too.  All the calls are in-process.
>
> > > Primary key lookups in MySQL over local sockets are very fast --
> > > faster than memcached.
> >
> > Really ?  I had read that they were about the same, but that mysql
> > selects are blocking & FIFO , while memcached is threaded and
> > supports concurrent access.
>
> Memcached is single-threaded and uses non-blocking I/O.  It's a very
> different approach from a multi-threaded daemon like MySQL, and should
> scale better ultimately.  However, for simple lookups, the network
> overhead from the TCP socket that memcached requires seems to outweigh
> any advantages.  MySQL can use a pipe instead of a TCP socket, which
> it does automatically when you connect to a server on localhost.
>
> - Perrin
>

Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Jonathan Vanasco <jv...@2xlp.com> wrote:
> Ah, I reread the post.  I saw "large lists" and thought "complex data
> structure", not simple text.

I think we were talking about different things, actually.  For reading
and writing a large and complex data structure in its entirety, a
Storable file is as good as it gets, unless you can rig something with
an mmap'ed file.  I was talking about reading/writing pieces of it in
BDB or MySQL, which should not need Storable.

> I didn't know that BDB does shared memory caching.

And no socket overhead too.  All the calls are in-process.

> > Primary key lookups in MySQL over local sockets are very fast --
> > faster than memcached.
>
> Really ?  I had read that they were about the same, but that mysql
> selects are blocking & FIFO , while memcached is threaded and
> supports concurrent access.

Memcached is single-threaded and uses non-blocking I/O.  It's a very
different approach from a multi-threaded daemon like MySQL, and should
scale better ultimately.  However, for simple lookups, the network
overhead from the TCP socket that memcached requires seems to outweigh
any advantages.  MySQL can use a pipe instead of a TCP socket, which
it does automatically when you connect to a server on localhost.

- Perrin

Re: Global question

Posted by Jonathan Vanasco <jv...@2xlp.com>.
On May 7, 2007, at 11:59 AM, Perrin Harkins wrote:

> Storable is fast, but not using it is considerably faster.  There's no
> need to use it for storing simple strings.  BerkeleyDB does shared
> memory caching, so commonly accessed data doesn't need to go to disk.

Ah, I reread the post.  I saw "large lists" and thought "complex data  
structure", not simple text.

I didn't know that BDB does shared memory caching.  I'll have to read  
up on it.


>>         Unless you're already using mysql in your app ,  I  
>> wouldn't add it
>> in -- you'll introduce a new potential performance bottleneck.
> Primary key lookups in MySQL over local sockets are very fast --
> faster than memcached.

Really?  I had read that they were about the same, but that mysql
selects are blocking & FIFO, while memcached is threaded and
supports concurrent access.




Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/7/07, Jonathan Vanasco <jv...@2xlp.com> wrote:
> IIRC, Wouldn't the fastest method possible be using Storable and a
> simple local file on the server ?
>
>         Building a perl var from a storable has almost no overhead.

Storable is fast, but not using it is considerably faster.  There's no
need to use it for storing simple strings.  BerkeleyDB does shared
memory caching, so commonly accessed data doesn't need to go to disk.
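A rough sketch of such an in-process lookup, here using the simpler DB_File tie interface (BerkeleyDB.pm is the module that exposes the shared-memory environment features; the file name and keys are invented):

```perl
use strict;
use warnings;
use Fcntl;
use DB_File;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Every fetch/store is an in-process library call:
# no socket, no server round trip.
tie my %lists, 'DB_File', "$dir/lists.db", O_RDWR|O_CREAT, 0644, $DB_HASH
    or die "can't tie: $!";

# Simple strings go in directly -- no Storable layer needed.
$lists{'color:1'} = 'red';
$lists{'color:2'} = 'green';

my $value = $lists{'color:2'};
```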

>         Unless you're already using mysql in your app ,  I wouldn't add it
> in -- you'll introduce a new potential performance bottleneck.

Primary key lookups in MySQL over local sockets are very fast --
faster than memcached.

- Perrin

Re: Global question

Posted by Jonathan Vanasco <jv...@2xlp.com>.
On May 7, 2007, at 11:26 AM, Perrin Harkins wrote:

>> Of course, the problem with using a database to get
>> the lists (besides the lists being the result of a munge), is that  
>> they are
>> rather large.
>
> Ideally you would load only the part you need, rather than the whole
> thing.  A local shared storage method, like BerkeleyDB or MySQL over
> Unix-domain sockets, can be very speedy.

IIRC, wouldn't the fastest method possible be using Storable and a
simple local file on the server?

	Building a perl var from a storable has almost no overhead.

	If you keep a record of a system file's last modified time, and
just check that for changes periodically (or even every request), the
check should happen all within the kernel's inode cache and never
actually hit the disk.

	That should offer a slight improvement over BDB, since you'll never
hit the disk.

	This wouldn't work in a clustered environment, but you could do
some sort of cron job to rsync everything if needed.

	Unless you're already using mysql in your app, I wouldn't add it
in -- you'll introduce a new potential performance bottleneck.
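The mtime check described above might look something like this (the file layout and data are invented for illustration; under mod_perl, $cached and $cached_mtime would be package globals that persist across requests in each child):

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

my $dir        = tempdir(CLEANUP => 1);
my $cache_file = "$dir/lists.storable";   # hypothetical shared cache file
my ($cached, $cached_mtime);

# Some other process has written the munged lists:
store({ regions => [ 'us', 'eu' ] }, $cache_file);

# stat() is normally answered from the kernel's inode cache,
# so checking freshness on every request rarely hits the disk.
sub get_lists {
    my $mtime = (stat $cache_file)[9]
        or die "can't stat $cache_file: $!";
    if ( !$cached || $mtime > $cached_mtime ) {
        $cached       = retrieve($cache_file);
        $cached_mtime = $mtime;
    }
    return $cached;
}

my $lists = get_lists();
```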



Re: Global question

Posted by Perrin Harkins <pe...@elem.com>.
On 5/5/07, Will Fould <wi...@gmail.com> wrote:
> But I'd like to do
> something similar: have a separate process that can alter parent
> data, receiving signals and re-caching accordingly.

There's no way to alter data in the parent process without restarting
the server.

> Of course, the problem with using a database to get
> the lists (besides the lists being the result of a munge), is that they are
> rather large.

Ideally you would load only the part you need, rather than the whole
thing.  A local shared storage method, like BerkeleyDB or MySQL over
Unix-domain sockets, can be very speedy.

If you absolutely need all of it in memory in each process and it's
going to change, I don't know any way to beat what you're already
doing.

- Perrin

RE: Global question

Posted by Adam Prime x443 <ap...@brunico.com>.
If they change really rarely, couldn't you just have the children
automatically die off when the stuff needs to change, and reload it?
You'd have to create the data structure using a ChildInit handler I
assume, but couldn't a setup like that potentially work?
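Sketched as configuration, that setup might look like this (PerlChildInitHandler and MaxRequestsPerChild are real mod_perl/Apache directives; the My::App module and its load_lists() function are hypothetical):

```apache
# httpd.conf -- each new child loads the data once, at spawn time
PerlChildInitHandler My::App::load_lists

# Recycle children periodically so a reload eventually reaches all of them
MaxRequestsPerChild 1000
```

For an on-demand reload rather than a time-based one, the children would still need some signal (a semaphore or generation check) telling them to exit early.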

Adam
 

-----Original Message-----
From: Will Fould [mailto:willfould@gmail.com] 
Sent: Saturday, May 05, 2007 11:44 AM
To: Jonathan Vanasco
Cc: modperl
Subject: Re: Global question

Yes.

I currently use a semaphore scheme to cache large, rarely changing
lists within child processes. It works quite well.  If the semaphore is
set, the child knows to re-cache; children set the semaphore when they
do something that would require other children to re-cache. But I'd
like to do something similar: have a separate process that can alter
parent data, receiving signals and re-caching accordingly.  Maybe this
is a really bad idea?  Would existing child processes see the new data,
or would they only have a copy of the stale data? Of course, the
problem with using a database to get the lists (besides the lists being
the result of a munge) is that they are rather large.



On 5/4/07, Jonathan Vanasco <jv...@2xlp.com> wrote:


	On May 4, 2007, at 8:50 PM, Will Fould wrote:
	
	> Can lists and other global objects created at apache startup be
	> altered as an *indirect* result of child processes (i.e. some type
	> of semaphore/listener scheme?).
	
	do you mean somehow using an external process to modify vars in the
	apache parent, and avoid the copy-on-write behavior ?
	
	
	
	// Jonathan Vanasco
	
	
	
	



Re: Global question

Posted by Will Fould <wi...@gmail.com>.
Yes.

I currently use a semaphore scheme to cache large, rarely changing lists
within child processes. It works quite well.  If the semaphore is set,
the child knows to re-cache; children set the semaphore when they do
something that would require other children to re-cache. But I'd like to
do something similar: have a separate process that can alter parent
data, receiving signals and re-caching accordingly.  Maybe this is a
really bad idea?  Would existing child processes see the new data, or
would they only have a copy of the stale data? Of course, the problem
with using a database to get the lists (besides the lists being the
result of a munge) is that they are rather large.
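The semaphore scheme described above can be sketched with a shared generation counter (the file layout and data are invented; in mod_perl the cached copy and generation would live in package globals per child):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# A shared "generation" file plays the semaphore's role:
# bumping it tells every child that its cached copy is stale.
my $dir      = tempdir(CLEANUP => 1);
my $gen_file = "$dir/lists.generation";

sub read_gen {
    open my $fh, '<', $gen_file or return 0;
    chomp( my $g = <$fh> );
    return $g;
}

# A child calls this after changing the underlying data.
sub bump_gen {
    my $next = read_gen() + 1;
    open my $fh, '>', $gen_file or die "can't write $gen_file: $!";
    print {$fh} $next;
    close $fh;
}

# Per-child cache: rebuild only when the generation moves.
my ( $cached, $seen_gen ) = ( undef, -1 );
sub get_lists {
    my $g = read_gen();
    if ( !defined $cached || $g != $seen_gen ) {
        $cached   = { built_at_gen => $g };    # stand-in for the real munge
        $seen_gen = $g;
    }
    return $cached;
}

my $first = get_lists();    # builds at generation 0
bump_gen();                 # some child signals a change
my $second = get_lists();   # sees the bump and rebuilds
```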


On 5/4/07, Jonathan Vanasco <jv...@2xlp.com> wrote:
>
>
> On May 4, 2007, at 8:50 PM, Will Fould wrote:
>
> > Can lists and other global objects created at apache startup be
> > altered as an *indirect* result of child processes (i.e. some type
> > of semaphore/listener scheme?).
>
> do you mean somehow using an external process to modify vars in the
> apache parent, and avoid the copy-on-write behavior ?
>
>
>
> // Jonathan Vanasco
>
> | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - -
> | SyndiClick.com
> | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - -
> |      FindMeOn.com - The cure for Multiple Web Personality Disorder
> |      Web Identity Management and 3D Social Networking
> | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - -
> |      RoadSound.com - Tools For Bands, Stuff For Fans
> |      Collaborative Online Management And Syndication Tools
> | - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
> - - - - - - - - - - - - - - - - - - -
>
>
>