Posted to mod_python-dev@quetz.apache.org by Jim Gallacher <ji...@jgassociates.ca> on 2005/04/12 06:48:09 UTC

FileSession - a couple of small fixes

I've made a couple of small changes to FileSession.

1. _fast_timeout now uses Session.DFT_TIMEOUT when timeout=0 (see the sketch after this list)

2. Fixed stupid indent error in do_cleanup method.

3. Added comments to do_cleanup method to highlight possible 
file/session locking issues.

4. Removed some log_error calls that slipped in from my debugging version.
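
For context, change 1 amounts to something like the following sketch
(DFT_TIMEOUT is the default defined in Session.py; the exact patch may
differ slightly):

     from mod_python.Session import DFT_TIMEOUT

     # inside FileSession.__init__
     if timeout:
         self._fast_timeout = timeout
     else:
         self._fast_timeout = DFT_TIMEOUT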

Patch is attached. Is there any sort of convention for patch file naming?

Jim

Re: FileSession - a couple of small fixes

Posted by Nicolas Lehuen <ni...@gmail.com>.
Hi Jim,

This seems to do the trick, nice idea !

Regards,

Nicolas

On 4/20/05, Jim Gallacher <jg...@sympatico.ca> wrote:
> Hi Nicolas,
> 
> I've had a chance to review your analysis and I think there is a really
> simple solution. Is there an emoticon for a light bulb going on?
> 
> Nicolas Lehuen wrote:
> >
> > The sequence of events that we don't want is :
> >
> > - process/thread A loads an old session S1 (older than the session
> > timeout) which has not been cleaned up yet. It begins processing the
> > request.
> > - process/thread B loads another session S2, the random test on
> > session cleanup passes, so a session cleanup is triggered. S1 is seen as
> > too old so it...
> > - process/thread A saves S1 to disk, with a brand new access time
> > - process/thread B ... erases the session file for S1 since it is too
> > old. Oops, we've lost a perfectly valid session.
> >
> > There is another scenario which seems OK to me :
> >
> > - A loads an old session S1
> > - B loads another session S2, triggers cleanup, deletes the old session
> > file for S1
> > - A goes on processing the request then eventually saves S1 to disk
> >
> > No problem, we've just resurrected the session. Well it's not exactly
> > true : what if a third process/thread C tries to access the session S1
> > between the delete and the save ? Fortunately, this is not possible
> > due to the individual session locking.
> 
> True unless the application code creates the session without locking.
> Another instance of the need for the documentation to address the
> importance of session locking.
> 
> > Let's try to do this methodically. There are 2 concurrent sequences of
> > events : in process A, we have L,P,S as Loading session, Processing
> > request and Saving session. In process B, we have C,D as Checking the
> > session timeout and Deleting the session because it's too old. Thus,
> > we have 10 possible schedulings of these events :
> >
> > 01) CDLPS => OK, the old session is lost and a new session is built and saved
> > 02) CLDPS => the session is resurrected ; individual session locking is needed
> > 03) CLPDS => the session is resurrected ; individual session locking is needed
> > 04) CLPSD => this is the problem mentioned above
> > 05) LCDPS => the session is resurrected ; individual session locking is needed
> > 06) LCPDS => the session is resurrected ; individual session locking is needed
> > 07) LCPSD => this is the problem mentioned above
> > 08) LPCDS => the session is resurrected ; individual session locking is needed
> > 09) LPCSD => this is the problem mentioned above
> > 10) LPSCD => this won't happen since when the session is saved, its
> > timeout goes down to 0
> >
> > First, let's notice that the problem of deleting a valid session must
> > be quite rare. You'll have to send a request with a session that
> > should have timed out, just as a randomly occurring session cleanup takes
> > place, and you have to fall into 3 of the 10 possible schedulings
> > mentioned above.
> 
> Yes, rare but possible.
> 
> > Anyway, if we really want to make this rock-solid, we'll have to find
> > a locking scheme that prevents the scheduling sequences 4,7 and 9 from
> > happening. I can see two solutions :
> >
> > 1) individually acquire the session lock before checking for the
> > session and deleting it.
> > 2) have a reader / writer lock (a kind of lock which can be shared for
> > reading but is exclusive for writing), so that sessions get a read
> > lock and the cleanup code tries to get a write lock.
> >
> > The problem in both solutions is that the lock can be held for quite a
> > long time, so you'll penalize either the cleanup (which runs in a
> > normal request processing thread/process, remember !) or the other
> > requests. I really don't think that preventing the quite improbable
> > cases 4,7 and 9 is worth the price of slowing everyone down.
> 
> So let's avoid all session locking. See below.
> 
> > So I suggest this : why not refuse to load a timed-out session ? If
> > the session is timed-out, then drop it AT LOAD TIME and create a new
> > one. This way, there won't be any race condition between the request
> > processing and the cleaning of sessions.
> 
> BaseSession.load(), which calls FileSession.do_load() already checks if
> the session has expired. Expired sessions will never get loaded. We
> still have the problem of cases 4,7 and 9 where the session save falls
> between the cleanup check and the cleanup delete.
> 
> And then the light came on...
> 
> The important question that needs to be asked - What is the purpose of
> running the cleanup?
> 
> Answer: We don't want to consume all of our disk space with expired
> session files.
> 
> WHEN the cleanup runs is really not all that relevant. As long as there
> is enough free disk space we could run it only once a year if we wanted.
> So why the obsession with the cleanup interfering with a session that is
> on the brink of expiring? Give the session a grace period before it is
> cleaned up and all will be well.
> 
> Our cleanup code becomes something like this:
> 
>      # Add a grace period of 120 seconds to the timeout
>      # in case a request for a session occurs at approximately
>      # the time it is about to expire
> 
>      grace_period = 120  # seconds
>      mtime = os.stat(filename).st_mtime
>      if time.time() - mtime < (fast_timeout + grace_period):
>          continue
>      else:
>          # do the rest of the cleanup
> 
> As long as grace_period is longer than the longest request the race
> condition expressed in cases 4,7 and 9 will never occur. At this point I
> think we can completely avoid session locking in the cleanup, and my
> concern about global lock collisions also disappears.
> 
> I'll modify the code and do some testing tonight. If all goes well you
> should see the new FileSession tomorrow morning.
> 
> Regards,
> Jim
> 
>

Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
David Fraser wrote:
> Jim Gallacher wrote:
> 
>> David Fraser wrote:
>>
>>> Jim Gallacher wrote:
>>>
>>>> Hi Nicolas,
>>>>
>>>> I've had a chance to review your analysis and I think there is a 
>>>> really simple solution. Is there an emoticon for a light bulb going on?
>>>>
>>>> Nicolas Lehuen wrote:
>>>>
>>>>>
>>>>> The sequence of events that we don't want is :
>>>>>
>>>>> - process/thread A loads an old session S1 (older than the session
>>>>> timeout) which has not been cleaned up yet. It begins processing the
>>>>> request.
>>>>> - process/thread B loads another session S2, the random test on
>>>>> session cleanup passes, so a session cleanup is triggered. S1 is seen as
>>>>> too old so it...
>>>>> - process/thread A saves S1 to disk, with a brand new access time
>>>>> - process/thread B ... erases the session file for S1 since it is too
>>>>> old. Oops, we've lost a perfectly valid session.
>>>>>
>>>>> There is another scenario which seems OK to me :
>>>>>
>>>>> - A loads an old session S1
>>>>> - B loads another session S2, triggers cleanup, deletes the old session
>>>>> file for S1
>>>>> - A goes on processing the request then eventually saves S1 to disk
>>>>>
>>>>> No problem, we've just resurrected the session. Well it's not exactly
>>>>> true : what if a third process/thread C tries to access the session S1
>>>>> between the delete and the save ? Fortunately, this is not possible
>>>>> due to the individual session locking.
>>>>
>>>> True unless the application code creates the session without 
>>>> locking. Another instance of the need for the documentation to 
>>>> address the importance of session locking.
>>>>
>>>>> Let's try to do this methodically. There are 2 concurrent sequences of
>>>>> events : in process A, we have L,P,S as Loading session, Processing
>>>>> request and Saving session. In process B, we have C,D as Checking the
>>>>> session timeout and Deleting the session because it's too old. Thus,
>>>>> we have 10 possible schedulings of these events :
>>>>>
>>>>> 01) CDLPS => OK, the old session is lost and a new session is built 
>>>>> and saved
>>>>> 02) CLDPS => the session is resurrected ; individual session 
>>>>> locking is needed
>>>>> 03) CLPDS => the session is resurrected ; individual session 
>>>>> locking is needed
>>>>> 04) CLPSD => this is the problem mentioned above
>>>>> 05) LCDPS => the session is resurrected ; individual session 
>>>>> locking is needed
>>>>> 06) LCPDS => the session is resurrected ; individual session 
>>>>> locking is needed
>>>>> 07) LCPSD => this is the problem mentioned above
>>>>> 08) LPCDS => the session is resurrected ; individual session 
>>>>> locking is needed
>>>>> 09) LPCSD => this is the problem mentioned above
>>>>> 10) LPSCD => this won't happen since when the session is saved, its
>>>>> timeout goes down to 0
>>>>>
>>>>> First, let's notice that the problem of deleting a valid session must
>>>>> be quite rare. You'll have to send a request with a session that
>>>>> should have timed out, just as a randomly occurring session cleanup takes
>>>>> place, and you have to fall into 3 of the 10 possible schedulings
>>>>> mentioned above.
>>>>
>>>> Yes, rare but possible.
>>>>
>>>>> Anyway, if we really want to make this rock-solid, we'll have to find
>>>>> a locking scheme that prevents the scheduling sequences 4,7 and 9 from
>>>>> happening. I can see two solutions :
>>>>>
>>>>> 1) individually acquire the session lock before checking for the
>>>>> session and deleting it.
>>>>> 2) have a reader / writer lock (a kind of lock which can be shared for
>>>>> reading but is exclusive for writing), so that sessions get a read
>>>>> lock and the cleanup code tries to get a write lock.
>>>>>
>>>>> The problem in both solutions is that the lock can be held for quite a
>>>>> long time, so you'll penalize either the cleanup (which runs in a
>>>>> normal request processing thread/process, remember !) or the other
>>>>> requests. I really don't think that preventing the quite improbable
>>>>> cases 4,7 and 9 is worth the price of slowing everyone down.
>>>>
>>>> So let's avoid all session locking. See below.
>>>>
>>>>> So I suggest this : why not refuse to load a timed-out session ? If
>>>>> the session is timed-out, then drop it AT LOAD TIME and create a new
>>>>> one. This way, there won't be any race condition between the request
>>>>> processing and the cleaning of sessions.
>>>>
>>>> BaseSession.load(), which calls FileSession.do_load() already checks 
>>>> if the session has expired. Expired sessions will never get loaded. 
>>>> We still have the problem of cases 4,7 and 9 where the session save 
>>>> falls between the cleanup check and the cleanup delete.
>>>>
>>>> And then the light came on...
>>>>
>>>> The important question that needs to be asked - What is the purpose of
>>>> running the cleanup?
>>>>
>>>> Answer: We don't want to consume all of our disk space with expired 
>>>> session files.
>>>>
>>>> WHEN the cleanup runs is really not all that relevant. As long as 
>>>> there is enough free disk space we could run it only once a year if 
>>>> we wanted. So why the obsession with the cleanup interfering with a 
>>>> session that is on the brink of expiring? Give the session a grace 
>>>> period before it is cleaned up and all will be well.
>>>>
>>>> Our cleanup code becomes something like this:
>>>>
>>>>     # Add a grace period of 120 seconds to the timeout
>>>>     # in case a request for a session occurs at approximately
>>>>     # the time it is about to expire
>>>>
>>>>     grace_period = 120  # seconds
>>>>     mtime = os.stat(filename).st_mtime
>>>>     if time.time() - mtime < (fast_timeout + grace_period):
>>>>         continue
>>>>     else:
>>>>         # do the rest of the cleanup
>>>>
>>>> As long as grace_period is longer than the longest request the race 
>>>> condition expressed in cases 4,7 and 9 will never occur. At this 
>>>> point I think we can completely avoid session locking in the 
>>>> cleanup, and my concern about global lock collisions also disappears.
>>>>
>>>> I'll modify the code and do some testing tonight. If all goes well 
>>>> you should see the new FileSession tomorrow morning.
>>>
>>> I'm not so sure if this is always going to work. I quite often have 
>>> HTTP requests that stay open for several minutes.
>>> So case 9 above could cause a problem: A Loads, A Processes, [> 120 
>>> seconds elapses], B Checks, A Saves, B Deletes
>>>
>>> Or am I misreading this?
>>>
>>> David
>>>
>>
>> No, you are reading it correctly. Just realize that 120 seconds is an 
>> arbitrary number. In the code I actually submitted to Nicolas I 
>> decided to increase it to 240 seconds. Maybe it should be a parameter 
>> in the FileSession constructor? That way the application code can 
>> compensate for unusual cases such as yours. For example:
>>
>> sess = FileSession(timeout=3600, grace_period=3600)
> 
> 
> That would improve it. But I'm sure there should be a technically 
> correct way to avoid the deadlock rather than using this grace period.
> But I'm not suggesting I know what that is, or that it's important enough
> to spend lots of time on...
> 
> David
> 

It's important enough. We all want the code to be solid... and predictable.

We have examined some different locking schemes. Using the grace_period
rather than locking should fix the race condition, and we avoid any
potential deadlocks as a side benefit. I'm certainly not suggesting that
using this kind of grace period is a general solution to fixing race
conditions, but in this case it should work. We also avoid acquiring and
releasing a large number of locks, which should have a positive impact
on performance.

Any race condition will be avoided when request_processing_time <
grace_period. Since request_processing_time can never be known in advance,
there will always be some uncertainty on the correct value for
grace_period. I had considered just using grace_period = timeout, but
very large values of timeout would mean the proliferation of expired
session files on the system. The application coder should have a good
idea what the request_processing_time is and so in unusual corner cases
such as the one you suggest, the coder can adjust grace_period accordingly.
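
To make that concrete, the constructor could accept it along these
lines (a minimal sketch; the argument handling is illustrative, not
the final patch):

     class FileSession(BaseSession):

         def __init__(self, req, timeout=0, grace_period=120, **kwargs):
             # seconds added to the timeout before a session file
             # becomes a candidate for deletion by do_cleanup
             self._grace_period = grace_period
             BaseSession.__init__(self, req, timeout=timeout, **kwargs)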

There is a slight modification which may help to avoid surprises.
FileSession (by default) uses the modification time of the session file
to determine if the file is a candidate for deletion. What if the
modification time of the file was updated when a request reads its
session file? Since this will be at the beginning of the request, the
file will be skipped by the cleanup even if the request is long running.

do_load() would now look something like this simplified code:

     sess_file = os.path.join(self._sessdir, 'mp_sess_%s' % self._sid)
     fp = file(sess_file)
     data = cPickle.load(fp)
     fp.close()
     # Set the access and modification times of the session file
     # to the current time (utime needs None as its second argument)
     os.utime(sess_file, None)
     return data


This also suggests a possible optimization where a session is only saved 
if it is modified. In the current implementation, the session must 
always be saved to avoid it expiring, even when none of its data has
changed. More on that later.
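
A sketch of what that might look like, assuming a _dirty flag set by
the dictionary mutators and a _sess_file attribute (both speculative,
not part of the current patch):

     def save(self):
         if self._dirty:
             # the data changed, so write the pickle as usual
             self.do_save(dict(self))
         else:
             # untouched session: refreshing the mtime is enough
             # to keep the cleanup from expiring it
             os.utime(self._sess_file, None)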

Regards,
Jim

Re: FileSession - a couple of small fixes

Posted by David Fraser <da...@sjsoft.com>.
Jim Gallacher wrote:

> David Fraser wrote:
>
>> Jim Gallacher wrote:
>>
>>> Hi Nicolas,
>>>
>>> I've had a chance to review your analysis and I think there is a 
>>> really simple solution. Is there an emoticon for a light bulb going on?
>>>
>>> Nicolas Lehuen wrote:
>>>
>>>>
>>>> The sequence of events that we don't want is :
>>>>
>>>> - process/thread A loads an old session S1 (older than the session
>>>> timeout) which has not been cleaned up yet. It begins processing the
>>>> request.
>>>> - process/thread B loads another session S2, the random test on
>>>> session cleanup passes, so a session cleanup is triggered. S1 is seen as
>>>> too old so it...
>>>> - process/thread A saves S1 to disk, with a brand new access time
>>>> - process/thread B ... erases the session file for S1 since it is too
>>>> old. Oops, we've lost a perfectly valid session.
>>>>
>>>> There is another scenario which seems OK to me :
>>>>
>>>> - A loads an old session S1
>>>> - B loads another session S2, triggers cleanup, deletes the old session
>>>> file for S1
>>>> - A goes on processing the request then eventually saves S1 to disk
>>>>
>>>> No problem, we've just resurrected the session. Well it's not exactly
>>>> true : what if a third process/thread C tries to access the session S1
>>>> between the delete and the save ? Fortunately, this is not possible
>>>> due to the individual session locking.
>>>
>>> True unless the application code creates the session without 
>>> locking. Another instance of the need for the documentation to 
>>> address the importance of session locking.
>>>
>>>> Let's try to do this methodically. There are 2 concurrent sequences of
>>>> events : in process A, we have L,P,S as Loading session, Processing
>>>> request and Saving session. In process B, we have C,D as Checking the
>>>> session timeout and Deleting the session because it's too old. Thus,
>>>> we have 10 possible schedulings of these events :
>>>>
>>>> 01) CDLPS => OK, the old session is lost and a new session is built 
>>>> and saved
>>>> 02) CLDPS => the session is resurrected ; individual session 
>>>> locking is needed
>>>> 03) CLPDS => the session is resurrected ; individual session 
>>>> locking is needed
>>>> 04) CLPSD => this is the problem mentioned above
>>>> 05) LCDPS => the session is resurrected ; individual session 
>>>> locking is needed
>>>> 06) LCPDS => the session is resurrected ; individual session 
>>>> locking is needed
>>>> 07) LCPSD => this is the problem mentioned above
>>>> 08) LPCDS => the session is resurrected ; individual session 
>>>> locking is needed
>>>> 09) LPCSD => this is the problem mentioned above
>>>> 10) LPSCD => this won't happen since when the session is saved, its
>>>> timeout goes down to 0
>>>>
>>>> First, let's notice that the problem of deleting a valid session must
>>>> be quite rare. You'll have to send a request with a session that
>>>> should have timed out, just as a randomly occurring session cleanup takes
>>>> place, and you have to fall into 3 of the 10 possible schedulings
>>>> mentioned above.
>>>
>>> Yes, rare but possible.
>>>
>>>> Anyway, if we really want to make this rock-solid, we'll have to find
>>>> a locking scheme that prevents the scheduling sequences 4,7 and 9 from
>>>> happening. I can see two solutions :
>>>>
>>>> 1) individually acquire the session lock before checking for the
>>>> session and deleting it.
>>>> 2) have a reader / writer lock (a kind of lock which can be shared for
>>>> reading but is exclusive for writing), so that sessions get a read
>>>> lock and the cleanup code tries to get a write lock.
>>>>
>>>> The problem in both solutions is that the lock can be held for quite a
>>>> long time, so you'll penalize either the cleanup (which runs in a
>>>> normal request processing thread/process, remember !) or the other
>>>> requests. I really don't think that preventing the quite improbable
>>>> cases 4,7 and 9 is worth the price of slowing everyone down.
>>>
>>> So let's avoid all session locking. See below.
>>>
>>>> So I suggest this : why not refuse to load a timed-out session ? If
>>>> the session is timed-out, then drop it AT LOAD TIME and create a new
>>>> one. This way, there won't be any race condition between the request
>>>> processing and the cleaning of sessions.
>>>
>>> BaseSession.load(), which calls FileSession.do_load() already checks 
>>> if the session has expired. Expired sessions will never get loaded. 
>>> We still have the problem of cases 4,7 and 9 where the session save 
>>> falls between the cleanup check and the cleanup delete.
>>>
>>> And then the light came on...
>>>
>>> The important question that needs to be asked - What is the purpose of
>>> running the cleanup?
>>>
>>> Answer: We don't want to consume all of our disk space with expired 
>>> session files.
>>>
>>> WHEN the cleanup runs is really not all that relevant. As long as 
>>> there is enough free disk space we could run it only once a year if 
>>> we wanted. So why the obsession with the cleanup interfering with a 
>>> session that is on the brink of expiring? Give the session a grace 
>>> period before it is cleaned up and all will be well.
>>>
>>> Our cleanup code becomes something like this:
>>>
>>>     # Add a grace period of 120 seconds to the timeout
>>>     # in case a request for a session occurs at approximately
>>>     # the time it is about to expire
>>>
>>>     grace_period = 120  # seconds
>>>     mtime = os.stat(filename).st_mtime
>>>     if time.time() - mtime < (fast_timeout + grace_period):
>>>         continue
>>>     else:
>>>         # do the rest of the cleanup
>>>
>>> As long as grace_period is longer than the longest request the race 
>>> condition expressed in cases 4,7 and 9 will never occur. At this 
>>> point I think we can completely avoid session locking in the 
>>> cleanup, and my concern about global lock collisions also disappears.
>>>
>>> I'll modify the code and do some testing tonight. If all goes well 
>>> you should see the new FileSession tomorrow morning.
>>
>> I'm not so sure if this is always going to work. I quite often have 
>> HTTP requests that stay open for several minutes.
>> So case 9 above could cause a problem: A Loads, A Processes, [> 120 
>> seconds elapses], B Checks, A Saves, B Deletes
>>
>> Or am I misreading this?
>>
>> David
>>
>
> No, you are reading it correctly. Just realize that 120 seconds is an 
> arbitrary number. In the code I actually submitted to Nicolas I 
> decided to increase it to 240 seconds. Maybe it should be a parameter 
> in the FileSession constructor? That way the application code can 
> compensate for unusual cases such as yours. For example:
>
> sess = FileSession(timeout=3600, grace_period=3600)

That would improve it. But I'm sure there should be a technically 
correct way to avoid the deadlock rather than using this grace period.
But I'm not suggesting I know what that is, or that it's important enough
to spend lots of time on...

David

Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
David Fraser wrote:
> Jim Gallacher wrote:
> 
>> Hi Nicolas,
>>
>> I've had a chance to review your analysis and I think there is a 
>> really simple solution. Is there an emoticon for a light bulb going on?
>>
>> Nicolas Lehuen wrote:
>>
>>>
>>> The sequence of events that we don't want is :
>>>
>>> - process/thread A loads an old session S1 (older than the session
>>> timeout) which has not been cleaned up yet. It begins processing the
>>> request.
>>> - process/thread B loads another session S2, the random test on
>>> session cleanup passes, so a session cleanup is triggered. S1 is seen as
>>> too old so it...
>>> - process/thread A saves S1 to disk, with a brand new access time
>>> - process/thread B ... erases the session file for S1 since it is too
>>> old. Oops, we've lost a perfectly valid session.
>>>
>>> There is another scenario which seems OK to me :
>>>
>>> - A loads an old session S1
>>> - B loads another session S2, triggers cleanup, deletes the old session
>>> file for S1
>>> - A goes on processing the request then eventually saves S1 to disk
>>>
>>> No problem, we've just resurrected the session. Well it's not exactly
>>> true : what if a third process/thread C tries to access the session S1
>>> between the delete and the save ? Fortunately, this is not possible
>>> due to the individual session locking.
>>
>> True unless the application code creates the session without locking. 
>> Another instance of the need for the documentation to address the 
>> importance of session locking.
>>
>>> Let's try to do this methodically. There are 2 concurrent sequences of
>>> events : in process A, we have L,P,S as Loading session, Processing
>>> request and Saving session. In process B, we have C,D as Checking the
>>> session timeout and Deleting the session because it's too old. Thus,
>>> we have 10 possible schedulings of these events :
>>>
>>> 01) CDLPS => OK, the old session is lost and a new session is built 
>>> and saved
>>> 02) CLDPS => the session is resurrected ; individual session locking 
>>> is needed
>>> 03) CLPDS => the session is resurrected ; individual session locking 
>>> is needed
>>> 04) CLPSD => this is the problem mentioned above
>>> 05) LCDPS => the session is resurrected ; individual session locking 
>>> is needed
>>> 06) LCPDS => the session is resurrected ; individual session locking 
>>> is needed
>>> 07) LCPSD => this is the problem mentioned above
>>> 08) LPCDS => the session is resurrected ; individual session locking 
>>> is needed
>>> 09) LPCSD => this is the problem mentioned above
>>> 10) LPSCD => this won't happen since when the session is saved, its
>>> timeout goes down to 0
>>>
>>> First, let's notice that the problem of deleting a valid session must
>>> be quite rare. You'll have to send a request with a session that
>>> should have timed out, just as a randomly occurring session cleanup takes
>>> place, and you have to fall into 3 of the 10 possible schedulings
>>> mentioned above.
>>
>> Yes, rare but possible.
>>
>>> Anyway, if we really want to make this rock-solid, we'll have to find
>>> a locking scheme that prevents the scheduling sequences 4,7 and 9 from
>>> happening. I can see two solutions :
>>>
>>> 1) individually acquire the session lock before checking for the
>>> session and deleting it.
>>> 2) have a reader / writer lock (a kind of lock which can be shared for
>>> reading but is exclusive for writing), so that sessions get a read
>>> lock and the cleanup code tries to get a write lock.
>>>
>>> The problem in both solutions is that the lock can be held for quite a
>>> long time, so you'll penalize either the cleanup (which runs in a
>>> normal request processing thread/process, remember !) or the other
>>> requests. I really don't think that preventing the quite improbable
>>> cases 4,7 and 9 is worth the price of slowing everyone down.
>>
>> So let's avoid all session locking. See below.
>>
>>> So I suggest this : why not refuse to load a timed-out session ? If
>>> the session is timed-out, then drop it AT LOAD TIME and create a new
>>> one. This way, there won't be any race condition between the request
>>> processing and the cleaning of sessions.
>>
>> BaseSession.load(), which calls FileSession.do_load() already checks 
>> if the session has expired. Expired sessions will never get loaded. We 
>> still have the problem of cases 4,7 and 9 where the session save falls 
>> between the cleanup check and the cleanup delete.
>>
>> And then the light came on...
>>
>> The important question that needs to be asked - What is the purpose of
>> running the cleanup?
>>
>> Answer: We don't want to consume all of our disk space with expired 
>> session files.
>>
>> WHEN the cleanup runs is really not all that relevant. As long as 
>> there is enough free disk space we could run it only once a year if we 
>> wanted. So why the obsession with the cleanup interfering with a 
>> session that is on the brink of expiring? Give the session a grace 
>> period before it is cleaned up and all will be well.
>>
>> Our cleanup code becomes something like this:
>>
>>     # Add a grace period of 120 seconds to the timeout
>>     # in case a request for a session occurs at approximately
>>     # the time it is about to expire
>>
>>     grace_period = 120  # seconds
>>     mtime = os.stat(filename).st_mtime
>>     if time.time() - mtime < (fast_timeout + grace_period):
>>         continue
>>     else:
>>         # do the rest of the cleanup
>>
>> As long as grace_period is longer than the longest request the race 
>> condition expressed in cases 4,7 and 9 will never occur. At this point 
>> I think we can completely avoid session locking in the cleanup, and my 
>> concern about global lock collisions also disappears.
>>
>> I'll modify the code and do some testing tonight. If all goes well you 
>> should see the new FileSession tomorrow morning.
> 
> 
> I'm not so sure if this is always going to work. I quite often have HTTP 
> requests that stay open for several minutes.
> So case 9 above could cause a problem: A Loads, A Processes, [> 120 
> seconds elapses], B Checks, A Saves, B Deletes
> 
> Or am I misreading this?
> 
> David
> 

No, you are reading it correctly. Just realize that 120 seconds is an 
arbitrary number. In the code I actually submitted to Nicolas I decided 
to increase it to 240 seconds. Maybe it should be a parameter in the 
FileSession constructor? That way the application code can compensate 
for unusual cases such as yours. For example:

sess = FileSession(timeout=3600, grace_period=3600)

Regards,
Jim



Re: FileSession - a couple of small fixes

Posted by David Fraser <da...@sjsoft.com>.
Hi Nicolas

120 seconds to process is *exactly* what I meant, sorry if that was not 
clear.
I often process lots of files etc in response to a request. I know this 
is a corner case and I'm not actually using mod_python's handling for 
these sessions, but I just thought it was important to point out. 
Usually with locking bugs there's no easy way to get around them :-)
Of course you may feel that this doesn't justify support, which is
fine, as long as it's documented as such.

David

Nicolas Lehuen wrote:

>Hi
>
>You are describing a sequence of events that would happen if the
>request took longer than 120 seconds to *process*. A 120-second processing
>time is not the same thing as a 120-second pause between two
>consecutive requests.
>
>Jim's solution does handle the 120-second pause between requests ; for
>a processing time of 120 seconds which happens at the end of a session,
>it will unduly erase the session (unless the session is saved before
>the long processing time), but I guess it is safe to forget about this
>possibility. That was not what you meant anyway, was it ?
>
>Regards,
>Nicolas
>
>On 4/21/05, David Fraser <da...@sjsoft.com> wrote:
>  
>
>>Jim Gallacher wrote:
>>
>>    
>>
>>>Hi Nicolas,
>>>
>>>I've had a chance to review your analysis and I think there is a
>>>really simple solution. Is there an emoticon for a light bulb going on?
>>>
>>>Nicolas Lehuen wrote:
>>>
>>>      
>>>
>>>>The sequence of events that we don't want is :
>>>>
>>>>- process/thread A loads an old session S1 (older than the session
>>>>timeout) which has not been cleaned up yet. It begins processing the
>>>>request.
>>>>- process/thread B loads another session S2, the random test on
>>>>session cleanup passes, so a session cleanup is triggered. S1 is seen as
>>>>too old so it...
>>>>- process/thread A saves S1 to disk, with a brand new access time
>>>>- process/thread B ... erases the session file for S1 since it is too
>>>>old. Oops, we've lost a perfectly valid session.
>>>>
>>>>There is another scenario which seems OK to me :
>>>>
>>>>- A loads an old session S1
>>>>- B loads another session S2, triggers cleanup, deletes the old session
>>>>file for S1
>>>>- A goes on processing the request then eventually saves S1 to disk
>>>>
>>>>No problem, we've just resurrected the session. Well it's not exactly
>>>>true : what if a third process/thread C tries to access the session S1
>>>>between the delete and the save ? Fortunately, this is not possible
>>>>due to the individual session locking.
>>>>        
>>>>
>>>True unless the application code creates the session without locking.
>>>Another instance of the need for the documentation to address the
>>>importance of session locking.
>>>
>>>>Let's try to do this methodically. There are 2 concurrent sequences of
>>>>events : in process A, we have L,P,S as Loading session, Processing
>>>>request and Saving session. In process B, we have C,D as Checking the
>>>>session timeout and Deleting the session because it's too old. Thus,
>>>>we have 10 possible schedulings of these events :
>>>>
>>>>01) CDLPS => OK, the old session is lost and a new session is built
>>>>and saved
>>>>02) CLDPS => the session is resurrected ; individual session locking
>>>>is needed
>>>>03) CLPDS => the session is resurrected ; individual session locking
>>>>is needed
>>>>04) CLPSD => this is the problem mentioned above
>>>>05) LCDPS => the session is resurrected ; individual session locking
>>>>is needed
>>>>06) LCPDS => the session is resurrected ; individual session locking
>>>>is needed
>>>>07) LCPSD => this is the problem mentioned above
>>>>08) LPCDS => the session is resurrected ; individual session locking
>>>>is needed
>>>>09) LPCSD => this is the problem mentioned above
>>>>10) LPSCD => this won't happen since when the session is saved, its
>>>>timeout goes down to 0
>>>>
>>>>First, let's notice that the problem of deleting a valid session must
>>>>be quite rare. You'll have to send a request with a session that
>>>>should have timed out, just as a randomly occurring session cleanup takes
>>>>place, and you have to fall into 3 of the 10 possible schedulings
>>>>mentioned above.
>>>>        
>>>>
>>>Yes, rare but possible.
>>>
>>>>Anyway, if we really want to make this rock-solid, we'll have to find
>>>>a locking scheme that prevents the scheduling sequences 4,7 and 9 from
>>>>happening. I can see two solutions :
>>>>
>>>>1) individually acquire the session lock before checking for the
>>>>session and deleting it.
>>>>2) have a reader / writer lock (a kind of lock which can be shared for
>>>>reading but is exclusive for writing), so that sessions get a read
>>>>lock and the cleanup code tries to get a write lock.
>>>>
>>>>The problem in both solutions is that the lock can be held for quite a
>>>>long time, so you'll penalize either the cleanup (which runs in a
>>>>normal request processing thread/process, remember !) or the other
>>>>requests. I really don't think that preventing the quite improbable
>>>>cases 4,7 and 9 is worth the price of slowing everyone down.
>>>>        
>>>>
>>>So let's avoid all session locking. See below.
>>>
>>>>So I suggest this : why not refuse to load a timed-out session ? If
>>>>the session is timed-out, then drop it AT LOAD TIME and create a new
>>>>one. This way, there won't be any race condition between the request
>>>>processing and the cleaning of sessions.
>>>>        
>>>>
>>>BaseSession.load(), which calls FileSession.do_load() already checks
>>>if the session has expired. Expired sessions will never get loaded. We
>>>still have the problem of cases 4,7 and 9 where the session save falls
>>>between the cleanup check and the cleanup delete.
>>>
>>>And then the light came on...
>>>
>>>The important question that needs to be asked - What is the purpose of
>>>running the cleanup?
>>>
>>>Answer: We don't want to consume all of our disk space with expired
>>>session files.
>>>
>>>WHEN the cleanup runs is really not all that relevant. As long as
>>>there is enough free disk space we could run it only once a year if we
>>>wanted. So why the obsession with the cleanup interfering with a
>>>session that is on the brink of expiring? Give the session a grace
>>>period before it is cleaned up and all will be well.
>>>
>>>Our cleanup code becomes something like this:
>>>
>>>    # Add a grace period of 120 seconds to the timeout
>>>    # in case a request for a session occurs at approximately
>>>    # the time it is about to expire
>>>
>>>    grace_period = 120  # seconds
>>>    mtime = os.stat(filename).st_mtime
>>>    if time.time() - mtime < (fast_timeout + grace_period):
>>>        continue
>>>    else:
>>>        # do the rest of the cleanup
>>>
>>>As long as grace_period is longer than the longest request the race
>>>condition expressed in cases 4,7 and 9 will never occur. At this point
>>>I think we can completely avoid session locking in the cleanup, and my
>>>concern about global lock collisions also disappears.
>>>
>>>I'll modify the code and do some testing tonight. If all goes well you
>>>should see the new FileSession tomorrow morning.
>>>      
>>>
>>I'm not so sure if this is always going to work. I quite often have HTTP
>>requests that stay open for several minutes.
>>So case 9 above could cause a problem: A Loads, A Processes, [> 120
>>seconds elapses], B Checks, A Saves, B Deletes
>>
>>Or am I misreading this?
>>
>>David
>>
>


Re: FileSession - a couple of small fixes

Posted by Nicolas Lehuen <ni...@gmail.com>.
Hi

You are describing a sequence of events that would happen if the
request took longer than 120 seconds to *process*. A 120-second processing
time is not the same thing as a 120-second pause between two
consecutive requests.

Jim's solution does handle the 120-second pause between requests ; for
a processing time of 120 seconds which happens at the end of a session,
it will unduly erase the session (unless the session is saved before
the long processing time), but I guess it is safe to forget about this
possibility. That was not what you meant anyway, was it ?
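
To put numbers on it (the figures are purely illustrative): with
timeout=3600 and grace_period=120, take a session last saved at t=0. A
request arriving at t=3550 loads it while it is still valid. If
processing then takes 200 seconds, the save lands at t=3750, so a
cleanup whose age check passed between t=3720 (timeout + grace_period)
and t=3750 can perform its delete just after that save and erase the
fresh session - the long-processing case above. A plain 120-second
pause between two requests, by contrast, leaves the mtime at the
previous save, so the file only becomes deletable once the session is
genuinely dead.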

Regards,
Nicolas

On 4/21/05, David Fraser <da...@sjsoft.com> wrote:
> Jim Gallacher wrote:
> 
> > Hi Nicolas,
> >
> > I've had a chance to review your analysis and I think there is a
> > really simple solution. Is there an emoticon for a light bulb going on?
> >
> > Nicolas Lehuen wrote:
> >
> >>
> >> The sequence of events that we don't want is :
> >>
> >> - process/thread A loads an old session S1 (older than the session
> >> timeout) which has not been cleaned up yet. It begins processing the
> >> request.
> >> - process/thread B loads another session S2, the random test on
> >> session cleanup passes, so a session cleanup is triggered. S1 is seen as
> >> too old so it...
> >> - process/thread A saves S1 to disk, with a brand new access time
> >> - process/thread B ... erases the session file for S1 since it is too
> >> old. Oops, we've lost a perfectly valid session.
> >>
> >> There is another scenario which seems OK to me :
> >>
> >> - A loads an old session S1
> >> - B loads another session S2, triggers cleanup, deletes the old session
> >> file for S1
> >> - A goes on processing the request then eventually saves S1 to disk
> >>
> >> No problem, we've just resurrected the session. Well it's not exactly
> >> true : what if a third process/thread C tries to access the session S1
> >> between the delete and the save ? Fortunately, this is not possible
> >> due to the individual session locking.
> >
> >
> > True unless the application code creates the session without locking.
> > Another instance of the need for the documentation to address the
> > importance of session locking.
> >
> >> Let's try to do this methodically. There are 2 concurrent sequences of
> >> events : in process A, we have L,P,S as Loading session, Processing
> >> request and Saving session. In process B, we have C,D as Checking the
> >> session timeout and Deleting the session because it's too old. Thus,
> >> we have 10 possible schedulings of these events :
> >>
> >> 01) CDLPS => OK, the old session is lost and a new session is built
> >> and saved
> >> 02) CLDPS => the session is resurrected ; individual session locking
> >> is needed
> >> 03) CLPDS => the session is resurrected ; individual session locking
> >> is needed
> >> 04) CLPSD => this is the problem mentioned above
> >> 05) LCDPS => the session is resurrected ; individual session locking
> >> is needed
> >> 06) LCPDS => the session is resurrected ; individual session locking
> >> is needed
> >> 07) LCPSD => this is the problem mentioned above
> >> 08) LPCDS => the session is resurrected ; individual session locking
> >> is needed
> >> 09) LPCSD => this is the problem mentioned above
> >> 10) LPSCD => this won't happen since when the session is saved, its
> >> timeout goes down to 0
> >>
> >> First, let's notice that the problem of deleting a valid session must
> >> be quite rare. You'll have to send a request with a session that
> >> should have timed out, just as a randomly occurring session cleanup takes
> >> place, and you have to fall into 3 of the 10 possible schedulings
> >> mentioned above.
> >
> >
> > Yes, rare but possible.
> >
> >> Anyway, if we really want to make this rock-solid, we'll have to find
> >> a locking scheme that prevents the scheduling sequences 4,7 and 9 from
> >> happening. I can see two solutions :
> >>
> >> 1) individually acquire the session lock before checking for the
> >> session and deleting it.
> >> 2) have a reader / writer lock (a kind of lock which can be shared for
> >> reading but is exclusive for writing), so that sessions get a read
> >> lock and the cleanup code tries to get a write lock.
> >>
> >> The problem in both solutions is that the lock can be held for quite a
> >> long time, so you'll penalize either the cleanup (which runs in a
> >> normal request processing thread/process, remember !) or the other
> >> requests. I really don't think that preventing the quite improbable
> >> cases 4,7 and 9 is worth the price of slowing everyone down.
> >
> >
> > So let's avoid all session locking. See below.
> >
> >> So I suggest this : why not refuse to load a timed-out session ? If
> >> the session is timed-out, then drop it AT LOAD TIME and create a new
> >> one. This way, there won't be any race condition between the request
> >> processing and the cleaning of sessions.
> >
> >
> > BaseSession.load(), which calls FileSession.do_load() already checks
> > if the session has expired. Expired sessions will never get loaded. We
> > still have the problem of cases 4,7 and 9 where the session save falls
> > between the cleanup check and the cleanup delete.
> >
> > And then the light came on...
> >
> > The important question that needs to be asked - What is the purpose of
> > running the cleanup?
> >
> > Answer: We don't want to consume all of our disk space with expired
> > session files.
> >
> > WHEN the cleanup runs is really not all that relevant. As long as
> > there is enough free disk space we could run it only once a year if we
> > wanted. So why the obsession with the cleanup interfering with a
> > session that is on the brink of expiring? Give the session a grace
> > period before it is cleaned up and all will be well.
> >
> > Our cleanup code becomes something like this:
> >
> >     # Add a grace period of 120 seconds to the timeout
> >     # in case a request for a session occurs at approximately
> >     # the time it is about to expire
> >
> >     grace_period = 120  # seconds
> >     mtime = os.stat(filename).st_mtime
> >     if time.time() - mtime < (fast_timeout + grace_period):
> >         continue
> >     else:
> >         # do the rest of the cleanup
> >
> > As long as grace_period is longer than the longest request the race
> > condition expressed in cases 4,7 and 9 will never occur. At this point
> > I think we can completely avoid session locking in the cleanup, and my
> > concern about global lock collisions also disappears.
> >
> > I'll modify the code and do some testing tonight. If all goes well you
> > should see the new FileSession tomorrow morning.
> 
> I'm not so sure if this is always going to work. I quite often have HTTP
> requests that stay open for several minutes.
> So case 9 above could cause a problem: A Loads, A Processes, [> 120
> seconds elapses], B Checks, A Saves, B Deletes
> 
> Or am I misreading this?
> 
> David
>

Re: FileSession - a couple of small fixes

Posted by David Fraser <da...@sjsoft.com>.
Jim Gallacher wrote:

> Hi Nicolas,
>
> I've had a chance to review your analysis and I think there is a 
> really simple solution. Is there an emoticon for a light bulb going on?
>
> Nicolas Lehuen wrote:
>
>>
>> The sequence of events that we don't want is :
>>
>> - process/thread A loads an old session S1 (older than the session
>> timeout) which has not been cleaned up yet. It begins processing the
>> request.
>> - process/thread B loads another session S2, the random test on
>> session cleanup passes, so a session cleanup is triggered. S1 is seen as
>> too old so it...
>> - process/thread A saves S1 to disk, with a brand new access time
>> - process/thread B ... erases the session file for S1 since it is too
>> old. Oops, we've lost a perfectly valid session.
>>
>> There is another scenario which seems OK to me :
>>
>> - A loads an old session S1
>> - B loads another session S2, triggers cleanup, deletes the old session
>> file for S1
>> - A goes on processing the request then eventually saves S1 to disk
>>
>> No problem, we've just resurrected the session. Well it's not exactly
>> true : what if a third process/thread C tries to access the session S1
>> between the delete and the save ? Fortunately, this is not possible
>> due to the individual session locking.
>
>
> True unless the application code creates the session without locking. 
> Another instance of the need for the documentation to address the 
> importance of session locking.
>
>> Let's try to do this methodically. There are 2 concurrent sequences of
>> events : in process A, we have L,P,S as Loading session, Processing
>> request and Saving session. In process B, we have C,D as Checking the
>> session timeout and Deleting the session because it's too old. Thus,
>> we have 10 possible schedulings of these events :
>>
>> 01) CDLPS => OK, the old session is lost and a new session is built 
>> and saved
>> 02) CLDPS => the session is resurrected ; individual session locking 
>> is needed
>> 03) CLPDS => the session is resurrected ; individual session locking 
>> is needed
>> 04) CLPSD => this is the problem mentioned above
>> 05) LCDPS => the session is resurrected ; individual session locking 
>> is needed
>> 06) LCPDS => the session is resurrected ; individual session locking 
>> is needed
>> 07) LCPSD => this is the problem mentioned above
>> 08) LPCDS => the session is resurrected ; individual session locking 
>> is needed
>> 09) LPCSD => this is the problem mentioned above
>> 10) LPSCD => this won't happen since when the session is saved, its
>> timeout goes down to 0
>>
>> First, let's notice that the problem of deleting a valid session must
>> be quite rare. You'll have to send a request with a session that
>> should have timed out, just as a randomly occurring session cleanup takes
>> place, and you have to fall into 3 of the 10 possible schedulings
>> mentioned above.
>
>
> Yes, rare but possible.
>
>> Anyway, if we really want to make this rock-solid, we'll have to find
>> a locking scheme that prevents the scheduling sequences 4,7 and 9 from
>> happening. I can see two solutions :
>>
>> 1) individually acquire the session lock before checking for the
>> session and deleting it.
>> 2) have a reader / writer lock (a kind of lock which can be shared for
>> reading but is exclusive for writing), so that sessions get a read
>> lock and the cleanup code tries to get a write lock.
>>
>> The problem in both solutions is that the lock can be held for quite a
>> long time, so you'll penalize either the cleanup (which runs in a
>> normal request processing thread/process, remember !) or the other
>> requests. I really don't think that preventing the quite improbable
>> cases 4,7 and 9 is worth the price of slowing everyone down.
>
>
> So let's avoid all session locking. See below.
>
>> So I suggest this : why not refuse to load a timed-out session ? If
>> the session is timed-out, then drop it AT LOAD TIME and create a new
>> one. This way, there won't be any race condition between the request
>> processing and the cleaning of sessions.
>
>
> BaseSession.load(), which calls FileSession.do_load() already checks 
> if the session has expired. Expired sessions will never get loaded. We 
> still have the problem of cases 4,7 and 9 where the session save falls 
> between the cleanup check and the cleanup delete.
>
> And then the light came on...
>
> The important question that needs to be asked - What is the purpose of
> running the cleanup?
>
> Answer: We don't want to consume all of our disk space with expired 
> session files.
>
> WHEN the cleanup runs is really not all that relevant. As long as 
> there is enough free disk space we could run it only once a year if we 
> wanted. So why the obsession with the cleanup interfering with a 
> session that is on the brink of expiring? Give the session a grace 
> period before it is cleaned up and all will be well.
>
> Our cleanup code becomes something like this:
>
>     # Add a grace period of 120 seconds to the timeout
>     # in case a request for a session occurs at approximately
>     # the time it is about to expire
>
>     grace_period = 120  # seconds
>     mtime = os.stat(filename).st_mtime
>     if time.time() - mtime < (fast_timeout + grace_period):
>         continue
>     else:
>         # do the rest of the cleanup
>
> As long as grace_period is longer than the longest request the race 
> condition expressed in cases 4,7 and 9 will never occur. At this point 
> I think we can completely avoid session locking in the cleanup, and my 
> concern about global lock collisions also disappears.
>
> I'll modify the code and do some testing tonight. If all goes well you 
> should see the new FileSession tomorrow morning.

I'm not so sure if this is always going to work. I quite often have HTTP 
requests that stay open for several minutes.
So case 9 above could cause a problem: A Loads, A Processes, [> 120 
seconds elapses], B Checks, A Saves, B Deletes

Or am I misreading this?

David

Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
Hi Nicolas,

I've had a chance to review your analysis and I think there is a really 
simple solution. Is there an emoticon for a light bulb going on?

Nicolas Lehuen wrote:
> 
> The sequence of events that we don't want is :
> 
> - process/thread A loads an old session S1 (older than the session
> timeout) which has not been cleaned up yet. It begins processing the
> request.
> - process/thread B loads another session S2, the random test on
> session cleanup passes, so a session cleanup is triggered. S1 is seen as
> too old so it...
> - process/thread A saves S1 to disk, with a brand new access time
> - process/thread B ... erases the session file for S1 since it is too
> old. Oops, we've lost a perfectly valid session.
> 
> There is another scenario which seems OK to me :
> 
> - A loads an old session S1
> - B loads another session S2, triggers cleanup, deletes the old session
> file for S1
> - A goes on processing the request then eventually saves S1 to disk
> 
> No problem, we've just resurrected the session. Well it's not exactly
> true : what if a third process/thread C tries to access the session S1
> between the delete and the save ? Fortunately, this is not possible
> due to the individual session locking.

True unless the application code creates the session without locking. 
Another instance of the need for the documentation to address the 
importance of session locking.
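
For illustration, the risky pattern is creating the session with the
per-session lock disabled; the lock keyword below is an assumption
about the session constructor, so treat this as a sketch:

     from mod_python import apache, Session

     def handler(req):
         # default: the session is locked for the whole request
         sess = Session.Session(req)
         sess.save()

         # risky variant: opting out of the per-session lock
         # reopens the race described above
         # sess = Session.Session(req, lock=0)

         return apache.OK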

> Let's try to do this methodically. There are 2 concurrent sequences of
> events : in process A, we have L,P,S as Loading session, Processing
> request and Saving session. In process B, we have C,D as Checking the
> session timeout and Deleting the session because it's too old. Thus,
> we have 10 possible schedulings of these events :
> 
> 01) CDLPS => OK, the old session is lost and a new session is built and saved
> 02) CLDPS => the session is resurrected ; individual session locking is needed
> 03) CLPDS => the session is resurrected ; individual session locking is needed
> 04) CLPSD => this is the problem mentioned above
> 05) LCDPS => the session is resurrected ; individual session locking is needed
> 06) LCPDS => the session is resurrected ; individual session locking is needed
> 07) LCPSD => this is the problem mentioned above
> 08) LPCDS => the session is resurrected ; individual session locking is needed
> 09) LPCSD => this is the problem mentioned above
> 10) LPSCD => this won't happen since when the session is saved, its
> timeout goes down to 0
> 
> First, let's notice that the problem of deleting a valid session must
> be quite rare. You'll have to send a request with a session that
> should have timed out, just as a randomly occurring session cleanup takes
> place, and you have to fall into 3 of the 10 possible schedulings
> mentioned above.

Yes, rare but possible.

> Anyway, if we really want to make this rock-solid, we'll have to find
> a locking scheme that prevents the scheduling sequences 4,7 and 9 from
> happening. I can see two solutions :
> 
> 1) individually acquire the session lock before checking for the
> session and deleting it.
> 2) have a reader / writer lock (a kind of lock which can be shared for
> reading but is exclusive for writing), so that sessions get a read
> lock and the cleanup code tries to get a write lock.
> 
> The problem in both solutions is that the lock can be held for quite a
> long time, so you'll penalize either the cleanup (which runs in a
> normal request processing thread/process, remember !) or the other
> requests. I really don't think that preventing the quite improbable
> cases 4,7 and 9 is worth the price of slowing everyone down.
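
For reference, option 2 could be sketched in pure Python along these
lines (a toy, thread-only version with no writer priority; a real one
would also have to work across processes, which these primitives do
not):

     import threading

     class ReadWriteLock:

         def __init__(self):
             self._cond = threading.Condition()
             self._readers = 0

         def acquire_read(self):
             # any number of readers may hold the lock together
             self._cond.acquire()
             self._readers += 1
             self._cond.release()

         def release_read(self):
             self._cond.acquire()
             self._readers -= 1
             if self._readers == 0:
                 # last reader out wakes a waiting writer
                 self._cond.notifyAll()
             self._cond.release()

         def acquire_write(self):
             # exclusive: wait until no readers remain, then keep
             # the underlying lock held until release_write()
             self._cond.acquire()
             while self._readers > 0:
                 self._cond.wait()

         def release_write(self):
             self._cond.release()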

So let's avoid all session locking. See below.

> So I suggest this : why not refuse to load a timed-out session ? If
> the session is timed-out, then drop it AT LOAD TIME and create a new
> one. This way, there won't be any race condition between the request
> processing and the cleaning of sessions.

BaseSession.load(), which calls FileSession.do_load() already checks if 
the session has expired. Expired sessions will never get loaded. We 
still have the problem of cases 4,7 and 9 where the session save falls 
between the cleanup check and the cleanup delete.

And then the light came on...

The important question that needs to be asked - What is the purpose of
running the cleanup?

Answer: We don't want to consume all of our disk space with expired 
session files.

WHEN the cleanup runs is really not all that relevant. As long as there 
is enough free disk space we could run it only once a year if we wanted. 
So why the obsession with the cleanup interfering with a session that is 
on the brink of expiring? Give the session a grace period before it is 
cleaned up and all will be well.

Our cleanup code becomes something like this:

     # Add a grace period of 120 seconds to the timeout
     # in case a request for a session occurs at approximately
     # the time it is about to expire

     grace_period = 120  # seconds
     mtime = os.stat(filename).st_mtime
     if time.time() - mtime < (fast_timeout + grace_period):
         continue
     else:
         # do the rest of the cleanup

As long as grace_period is longer than the longest request the race 
condition expressed in cases 4,7 and 9 will never occur. At this point I 
think we can completely avoid session locking in the cleanup, and my 
concern about global lock collisions also disappears.
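
For what it's worth, the whole cleanup pass then reduces to a simple
unlocked loop, roughly like this (a sketch only; do_filesession_cleanup
and the mp_sess_ prefix are borrowed from other snippets in this
thread):

     import os
     import time

     def do_filesession_cleanup(sessdir, fast_timeout, grace_period=120):
         now = time.time()
         for fname in os.listdir(sessdir):
             if not fname.startswith('mp_sess_'):
                 continue
             filename = os.path.join(sessdir, fname)
             mtime = os.stat(filename).st_mtime
             if now - mtime < (fast_timeout + grace_period):
                 # still live, or within its grace period: keep it
                 continue
             os.unlink(filename)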

I'll modify the code and do some testing tonight. If all goes well you 
should see the new FileSession tomorrow morning.

Regards,
Jim




Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
Nicolas Lehuen wrote:
> On 4/14/05, Jim Gallacher <jg...@sympatico.ca> wrote:
> 
>>>Anyway, I've made some tests like you, and it seems that unless I've
>>>done something stupid in the _global_trylock function, the
>>>implementation of trylock in the APR library is flaky on Win32.  With
>>>one single client, everything is OK, but as soon as I test with more
>>>clients (using ab -c 30 for example), I randomly get strange errors in
>>>the error log, once in a while :
>>>
>>
>>   < snipped error log msg >
>>
>>>On top of that, it seems that trylock is not implemented on your
>>>platform (what is it ?),
>>
>>Linux, Debian Unstable, apache 2.0.53 mpm-prefork. I need the prefork
>>model rather than the threaded model as the server is also running php4,
>>which I understand is not thread safe.
>>
>>
>>>so I guess this is the wrong way to go...
>>
>>So it would seem. How about this code snippet to ensure only one cleanup
>>will run at a time?:
>>
>>lockfile = os.path.join(sessdir, '.lck')
>>
>># grab the global lock
>># This will block if another request is already running do_cleanup,
>># but since the lock is only held for a short time it will not impact
>># performance.
>>_apache._global_lock(req.server, 0)
>>lock_file_exists = os.path.exists(lockfile)
>>if not lock_file_exists:
>>    fp = file(lockfile, 'w')
>>    fp.write('lock it')
>>    fp.close()
>>_apache._global_unlock(req.server, 0)
>>
>>if lock_file_exists:
>>    # another do_cleanup is already running;
>>    # exit without running this one
>>    return
>>else:
>>    do_filesession_cleanup()
>>
>>os.unlink(lockfile)
>>
>># end of code.
>>
>>I haven't tested this code yet, but I think it should be ok.
> 
> 
> Yes, although you are kind of reimplementing what should be done in the
> APR. The first file existence test could also be skipped with an
> exclusive file opening (using the O_EXCL flag) for which there was a
> claim of atomicity on this mailing list.

This was sloshing around in the back of my brain but didn't make it to 
the keyboard when I wrote the above code. Borrowing from Barry's mail on 
file locking, how about the following, assuming that we can't get 
global_trylock to work:

lockfile = os.path.join(sessdir, '.mp_sess.lck')
try:
    lock_fd = os.open(lockfile,
                      os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0660)
except OSError:
    req.log_error('another cleanup is already running.')
    return

try:
    do_filesession_cleanup()
finally:
    # the lock file is removed even if the cleanup raises
    os.close(lock_fd)
    os.unlink(lockfile)


>>We still need to deal with locking individual session files. Am I being
>>overly concerned about the performance impact of locking/unlocking a
>>large number of sessions? If it's not that big of a deal, we can just
>>lock each session and be done with it.
> 

I must be off to work now, and will digest the rest of your reply later.

Regards,
Jim

> The sequence of events that we don't want is :
> 
> - process/thread A loads an old session S1 (older than the session
> timeout) which has not been cleaned up yet. It begins processing the
> request.
> - process/thread B loads another session S2, the random test on
> session cleanup passes, so a session cleanup is triggered. S1 is seen as
> too old so it...
> - process/thread A saves S1 to disk, with a brand new access time
> - process/thread B ... erases the session file for S1 since it is too
> old. Ooops, we've lost a perfectly valid session.
> 
> There is another scenario which seems OK to me :
> 
> - A loads an old session S1
> - B loads another session S2, triggers cleanup, deletes the old session
> file for S1
> - A goes on processing the request then eventually saves S1 to disk
> 
> No problem, we've just resurrected the session. Well it's not exactly
> true : what if a third process/thread C tries to access the session S1
> between the delete and the save ? Fortunately, this is not possible
> due to the individual session locking.
>  
> Let's try to do this methodically. There are 2 concurrent sequences of
> events : in process A, we have L,P,S as Loading session, Processing
> request and Saving session. In process B, we have C,D as Checking the
> session timeout and Deleting the session because it's too old. Thus,
> we have 10 possible schedulings of these events :
> 
> 01) CDLPS => OK, the old session is lost and a new session is built and saved
> 02) CLDPS => the session is resurrected ; individual session locking is needed
> 03) CLPDS => the session is resurrected ; individual session locking is needed
> 04) CLPSD => this is the problem mentioned above
> 05) LCDPS => the session is resurrected ; individual session locking is needed
> 06) LCPDS => the session is resurrected ; individual session locking is needed
> 07) LCPSD => this is the problem mentioned above
> 08) LPCDS => the session is resurrected ; individual session locking is needed
> 09) LPCSD => this is the problem mentioned above
> 10) LPSCD => this won't happen since when the session is saved, its
> timeout goes down to 0
> 
> First, let's notice that the problem of deleting a valid session must
> be quite rare. You'll have to send a request with a session that
> should have timed out, just as a randomly occurring session cleanup takes
> place, and you have to fall into 3 of the 10 possible schedulings
> mentioned above.
> 
> Anyway, if we really want to make this rock-solid, we'll have to find
> a locking scheme that prevents the scheduling sequences 4, 7 and 9
> from happening. I can see two solutions :
> 
> 1) individually acquire the session lock before checking for the
> session and deleting it.
> 2) have a reader / writer lock (a kind of lock which can be shared for
> reading but is exclusive for writing), so that sessions get a read
> lock and the cleanup code tries to get a write lock.
> 
> The problem in both solutions is that the lock can be held for quite a
> long time, so you'll penalize either the cleanup (which runs in a
> normal request processing thread/process, remember !) or the other
> requests. I really don't think that preventing the quite improbable
> cases 4, 7 and 9 is worth the price of slowing everyone down.
> 
> So I suggest this : why not refuse to load a timed-out session ? If
> the session is timed-out, then drop it AT LOAD TIME and create a new
> one. This way, there won't be any race condition between the request
> processing and the cleaning of sessions.
> 
> Regards,
> Nicolas
> 
> 
>>Regards,
>>Jim
>>
>>
>>>On 4/13/05, Jim Gallacher <jg...@sympatico.ca> wrote:
>>>
>>>
>>>>Nicolas Lehuen wrote:
>>>>
>>>>
>>>>>>Hi Jim,
>>>>>>
>>>>>>OK, I've integrated your fixes.
>>>>>>
>>>>>>Regarding the do_cleanup locking issue, I had two ideas in mind :
>>>>>>
>>>>>>1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
>>>>>>prevent two concurrent cleanups in the case we are very unlucky (two
>>>>>>cleanups triggered at the same time).
>>>>
>>>>This could end up holding the global lock for a long time on a heavily
>>>>loaded server. I'm not too concerned about the possibility of concurrent
>>>>cleanups. If the first one deletes a particular session file before the
>>>>other, the second will simply raise an error and continue processing -
>>>>no big deal. If the trylock function described below cannot be made to
>>>>work, blocking a second do_cleanup thread while the first completes
>>>>would be a bad thing. And of course there is always a very small chance
>>>>that a THIRD do_cleanup could start up and get blocked...
>>>>
>>>>
>>>>
>>>>>>More than that, we should try to
>>>>>>hold the lock with an APR try_lock call (which I think is not
>>>>>>implemented yet in _apache.c), so that if the lock is currently held,
>>>>>>we know that a cleanup is already being performed, so we can safely
>>>>>>skip it.
>>>>
>>>>If try_lock works then the global lock would be OK.
>>>>
>>>>
>>>>
>>>>>>2) while the cleanup is being performed, hold the lock on each session
>>>>>>with a try_lock call, so that if a session is locked, it will never be
>>>>a candidate for a cleanup.
>>>>
>>>>A lock/unlock for each session seems like there would be a whole lotta
>>>>lockin' goin' on. I don't know how many sessions a site may end up with
>>>>- 1000, 10000, 50000? - but acquiring and releasing this many locks in
>>>>the duration of one request just can't be good for performance.
>>>>
>>>>
>>>>
>>>>>Nicolas Lehuen wrote:
>>>>>I have checked in an implementation of _apache._global_trylock, which
>>>>>is a shameless copy/paste of _global_lock with some adaptation to use
>>>>>apr_global_mutex_trylock. Unfortunately my tests seem to show that the
>>>>>implementation is not good in that it erroneously tells me that the
>>>>>lock is acquired even when it isn't... Must work on it a little more.
>>>>
>>>>_apachemodule.c would not compile for me without the following change:
>>>>@@ -487,7 +487,8 @@
>>>>         ap_log_error(APLOG_MARK, APLOG_WARNING, rv, s,
>>>>                      "Failed to acquire global mutex lock at index
>>>>%d", index);         PyErr_SetString(PyExc_ValueError,
>>>>-                        "Failed to acquire global mutex lock %i",rv);
>>>>+                        "Failed to acquire global mutex lock");
>>>>+        //                "Failed to acquire global mutex lock %i",rv);
>>>>         return NULL;
>>>>     }
>>>> }
>>>>
>>>>Once compiled, I tested _apache._global_trylock(req.server, "test") on
>>>>apache2-mpm-prefork-2.0.53 and got the following error in the apache log:
>>>>
>>>>[Tue Apr 12 12:29:04 2005] [warn] (70023)This function has not been
>>>>implemented on this platform: Failed to acquire global mutex lock at index 2
>>>>
>>>>I'm not sure if this is the same problem you had, or a different one.
>>>>
>>>>I've been walking through the do_cleanup code, trying to understand
>>>>exactly where there may be conflicts with do_cleanup, and any sessions
>>>>being accessed by other processes/threads. Once I have it clear in my
>>>>head, I'll post my ramblings. Also, I haven't posted any timings showing
>>>>the do_cleanup performance with the optimizations on and off, but I will
>>>>in the next day or so. You can get a 10x performance boost on a lightly
>>>>loaded server with more than 10000 session files and probably even
>>>>better when the server is under a heavy load.
>>>>
>>>>Also, I just realized that do_cleanup, when it is run, is run at the end
>>>>of BaseSession.__init__() rather than at the end of the request. The
>>>>do_cleanup method should really just be a call to
>>>>self._req.register_cleanup (do_filesession_cleanup,self) and our current
>>>>do_cleanup code moved to do_filesession_cleanup. This is the same
>>>>pattern used by DbmSession. I'll implement this and send you a patch.
>>>>
>>>>Regards,
>>>>Jim
>>>>
>>>>
>>>
>>>
>>
> 


Re: FileSession - a couple of small fixes

Posted by Nicolas Lehuen <ni...@gmail.com>.
On 4/14/05, Jim Gallacher <jg...@sympatico.ca> wrote:
> > Anyway, I've made some tests like you, and it seems that unless I've
> > done something stupid in the _global_trylock function, the
> > implementation of trylock in the APR library is flaky on Win32.  With
> > one single client, everything is OK, but as soon as I test with more
> > clients (using ab -c 30 for example), I randomly get strange errors in
> > the error log, once in a while :
> >
> 
>    < snipped error log msg >
> 
> > On top of that, it seems that trylock is not implemented on your
> > platform (what is it ?),
> 
> Linux, Debian Unstable, apache 2.0.53 mpm-prefork. I need the prefork
> model rather than the threaded model as the server is also running php4,
> which I understand is not thread safe.
> 
> > so I guess this is the wrong way to go...
> 
> So it would seem. How about this code snippet to ensure only one cleanup
> will run at a time?:
> 
> lockfile = os.path.join(sessdir, '.lck')
> 
> # grab the global lock
> # This will block if another request is already running do_cleanup,
> # but since the lock is only held for a short time it will not impact
> # performance.
> _apache._global_lock(req.server, 0)
> lock_file_exists = os.path.exists(lockfile)
> if not lock_file_exists:
>     fp = file(lockfile, 'w')
>     fp.write('lock it')
>     fp.close()
> _apache._global_unlock(req.server, 0)
> 
> if lock_file_exists:
>     # another do_cleanup is already running;
>     # exit without running this one
>     return
> else:
>     do_filesession_cleanup()
> 
> os.unlink(lockfile)
> 
> # end of code.
> 
> I haven't tested this code yet, but I think it should be ok.

Yes, although you are kind of reimplementing what should be done in the
APR. The first file existence test could also be skipped with an
exclusive file opening (using the O_EXCL flag) for which there was a
claim of atomicity on this mailing list.
 
> We still need to deal with locking individual session files. Am I being
> overly concerned about the performance impact of locking/unlocking a
> large number of sessions? If it's not that big of a deal, we can just
> lock each session and be done with it.

The sequence of events that we don't want is :

- process/thread A loads an old session S1 (older than the session
timeout) which has not been cleaned up yet. It begins processing the
request.
- process/thread B loads another session S2, the random test on
session cleanup passes, so a session cleanup is triggered. S1 is seen as
too old so it...
- process/thread A saves S1 to disk, with a brand new access time
- process/thread B ... erases the session file for S1 since it is too
old. Ooops, we've lost a perfectly valid session.

There is another scenario which seems OK to me :

- A loads an old session S1
- B loads another session S2, triggers cleanup, deletes the old session
file for S1
- A goes on processing the request then eventually saves S1 to disk

No problem, we've just resurrected the session. Well it's not exactly
true : what if a third process/thread C tries to access the session S1
between the delete and the save ? Fortunately, this is not possible
due to the individual session locking.
 
Let's try to do this methodically. There are 2 concurrent sequences of
events : in process A, we have L,P,S as Loading session, Processing
request and Saving session. In process B, we have C,D as Checking the
session timeout and Deleting the session because it's too old. Thus,
we have 10 possible schedulings of these events :

01) CDLPS => OK, the old session is lost and a new session is built and saved
02) CLDPS => the session is resurrected ; individual session locking is needed
03) CLPDS => the session is resurrected ; individual session locking is needed
04) CLPSD => this is the problem mentioned above
05) LCDPS => the session is resurrected ; individual session locking is needed
06) LCPDS => the session is resurrected ; individual session locking is needed
07) LCPSD => this is the problem mentioned above
08) LPCDS => the session is resurrected ; individual session locking is needed
09) LPCSD => this is the problem mentioned above
10) LPSCD => this won't happen since when the session is saved, its
timeout goes down to 0

First, let's notice that the problem of deleting a valid session must
be quite rare. You'll have to send a request with a session that
should have timed out, just as a randomly occurring session cleanup takes
place, and you have to fall into 3 of the 10 possible schedulings
mentioned above.

Anyway, if we really want to make this rock-solid, we'll have to find
a locking scheme that prevents the scheduling sequences 4, 7 and 9
from happening. I can see two solutions :

1) individually acquire the session lock before checking for the
session and deleting it.
2) have a reader / writer lock (a kind of lock which can be shared for
reading but is exclusive for writing), so that sessions get a read
lock and the cleanup code tries to get a write lock.

The problem in both solutions is that the lock can be held for quite a
long time, so you'll penalize either the cleanup (which runs in a
normal request processing thread/process, remember !) or the other
requests. I really don't think that preventing the quite improbable
cases 4, 7 and 9 is worth the price of slowing everyone down.

So I suggest this : why not refuse to load a timed-out session ? If
the session is timed-out, then drop it AT LOAD TIME and create a new
one. This way, there won't be any race condition between the request
processing and the cleaning of sessions.

Regards,
Nicolas

> Regards,
> Jim
> 
> > On 4/13/05, Jim Gallacher <jg...@sympatico.ca> wrote:
> >
> >>Nicolas Lehuen wrote:
> >>
> >>>>Hi Jim,
> >>>>
> >>>>OK, I've integrated your fixes.
> >>>>
> >>>>Regarding the do_cleanup locking issue, I had two ideas in mind :
> >>>>
> >>>>1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
> >>>>prevent two concurrent cleanups in the case we are very unlucky (two
> >>>>cleanups triggered at the same time).
> >>
> >>This could end up holding the global lock for a long time on a heavily
> >>loaded server. I'm not too concerned about the possibility of concurrent
> >>cleanups. If the first one deletes a particular session file before the
> >>other, the second will simply raise an error and continue processing -
> >>no big deal. If the trylock function described below cannot be made to
> >>work, blocking a second do_cleanup thread while the first completes
> >>would be a bad thing. And of course there is always a very small chance
> >>that a THIRD do_cleanup could start up and get blocked...
> >>
> >>
> >>>>More than that, we should try to
> >>>>hold the lock with an APR try_lock call (which I think is not
> >>>>implemented yet in _apache.c), so that if the lock is currently held,
> >>>>we know that a cleanup is already being performed, so we can safely
> >>>>skip it.
> >>
> >>If try_lock works then the global lock would be OK.
> >>
> >>
> >>>>2) while the cleanup is being performed, hold the lock on each session
> >>>>with a try_lock call, so that if a session is locked, it will never be
> >>a candidate for a cleanup.
> >>
> >>A lock/unlock for each session seems like there would be a whole lotta
> >>lockin' goin' on. I don't know how many sessions a site may end up with
> >>- 1000, 10000, 50000? - but acquiring and releasing this many locks in
> >>the duration of one request just can't be good for performance.
> >>
> >>
> >>>Nicolas Lehuen wrote:
> >>>I have checked in an implementation of _apache._global_trylock, which
> >>>is a shameless copy/paste of _global_lock with some adaptation to use
> >>>apr_global_mutex_trylock. Unfortunately my tests seem to show that the
> >>>implementation is not good in that it erroneously tells me that the
> >>>lock is acquired even when it isn't... Must work on it a little more.
> >>
> >>_apachemodule.c would not compile for me without the following change:
> >>@@ -487,7 +487,8 @@
> >>          ap_log_error(APLOG_MARK, APLOG_WARNING, rv, s,
> >>                       "Failed to acquire global mutex lock at index
> >>%d", index);         PyErr_SetString(PyExc_ValueError,
> >>-                        "Failed to acquire global mutex lock %i",rv);
> >>+                        "Failed to acquire global mutex lock");
> >>+        //                "Failed to acquire global mutex lock %i",rv);
> >>          return NULL;
> >>      }
> >>  }
> >>
> >>Once compiled, I tested _apache._global_trylock(req.server, "test") on
> >>apache2-mpm-prefork-2.0.53 and got the following error in the apache log:
> >>
> >>[Tue Apr 12 12:29:04 2005] [warn] (70023)This function has not been
> >>implemented on this platform: Failed to acquire global mutex lock at index 2
> >>
> >>I'm not sure if this is the same problem you had, or a different one.
> >>
> >>I've been walking through the do_cleanup code, trying to understand
> >>exactly where there may be conflicts with do_cleanup, and any sessions
> >>being accessed by other processes/threads. Once I have it clear in my
> >>head, I'll post my ramblings. Also, I haven't posted any timings showing
> >>the do_cleanup performance with the optimizations on and off, but I will
> >>in the next day or so. You can get a 10x performance boost on a lightly
> >>loaded server with more than 10000 session files and probably even
> >>better when the server is under a heavy load.
> >>
> >>Also, I just realized that do_cleanup, when it is run, is run at the end
> >>of BaseSession.__init__() rather than at the end of the request. The
> >>do_cleanup method should really just be a call to
> >>self._req.register_cleanup (do_filesession_cleanup,self) and our current
> >>do_cleanup code moved to do_filesession_cleanup. This is the same
> >>pattern used by DbmSession. I'll implement this and send you a patch.
> >>
> >>Regards,
> >>Jim
> >>
> >>
> >
> >
> 
>

Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
Nicolas Lehuen wrote:
> Weird. I don't understand why this code would not compile, as the line
> you commented out is perfectly valid. It compiles on my machine,
> moreover.

Weird is the word.

> Anyway, I've made some tests like you, and it seems that unless I've
> done something stupid in the _global_trylock function, the
> implementation of trylock in the APR library is flaky on Win32.  With
> one single client, everything is OK, but as soon as I test with more
> clients (using ab -c 30 for example), I randomly get strange errors in
> the error log, once in a while :
> 

   < snipped error log msg >

> On top of that, it seems that trylock is not implemented on your
> platform (what is it ?),

Linux, Debian Unstable, apache 2.0.53 mpm-prefork. I need the prefork 
model rather than the threaded model as the server is also running php4, 
which I understand is not thread safe.

> so I guess this is the wrong way to go...

So it would seem. How about this code snippet to ensure only one cleanup 
will run at a time?:

lockfile = os.path.join(sessdir, '.lck')

# grab the global lock
# This will block if another request is already running do_cleanup,
# but since the lock is only held for a short time it will not impact
# performance.
_apache._global_lock(req.server, 0)
lock_file_exists = os.path.exists(lockfile)
if not lock_file_exists:
    fp = file(lockfile, 'w')
    fp.write('lock it')
    fp.close()
_apache._global_unlock(req.server, 0)

if lock_file_exists:
    # another do_cleanup is already running;
    # exit without running this one
    return
else:
    do_filesession_cleanup()

os.unlink(lockfile)

# end of code.

I haven't tested this code yet, but I think it should be ok.

We still need to deal with locking individual session files. Am I being 
overly concerned about the performance impact of locking/unlocking a 
large number of sessions? If it's not that big of a deal, we can just 
lock each session and be done with it.

Regards,
Jim

> On 4/13/05, Jim Gallacher <jg...@sympatico.ca> wrote:
> 
>>Nicolas Lehuen wrote:
>>
>>>>Hi Jim,
>>>>
>>>>OK, I've integrated your fixes.
>>>>
>>>>Regarding the do_cleanup locking issue, I had two ideas in mind :
>>>>
>>>>1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
>>>>prevent two concurrent cleanups in the case we are very unlucky (two
>>>>cleanups triggered at the same time).
>>
>>This could end up holding the global lock for a long time on a heavily
>>loaded server. I'm not too concerned about the possibility of concurrent
>>cleanups. If the first one deletes a particular session file before the
>>other, the second will simply raise an error and continue processing -
>>no big deal. If the trylock function described below cannot be made to
>>work, blocking a second do_cleanup thread while the first completes
>>would be a bad thing. And of course there is always a very small chance
>>that a THIRD do_cleanup could start up and get blocked...
>>
>>
>>>>More than that, we should try to
>>>>hold the lock with an APR try_lock call (which I think is not
>>>>implemented yet in _apache.c), so that if the lock is currently held,
>>>>we know that a cleanup is already being performed, so we can safely
>>>>skip it.
>>
>>If try_lock works then the global lock would be OK.
>>
>>
>>>>2) while the cleanup is being performed, hold the lock on each session
>>>>with a try_lock call, so that if a session is locked, it will never be
>>>>a candidate for a cleanup.
>>
>>A lock/unlock for each session seems like there would be a whole lotta
>>lockin' goin' on. I don't know how many sessions a site may end up with
>>- 1000, 10000, 50000? - but acquiring and releasing this many locks in
>>the duration of one request just can't be good for performance.
>>
>>
>>>Nicolas Lehuen wrote:
>>>I have checked in an implementation of _apache._global_trylock, which
>>>is a shameless copy/paste of _global_lock with some adaptation to use
>>>apr_global_mutex_trylock. Unfortunately my tests seem to show that the
>>>implementation is not good in that it erroneously tells me that the
>>>lock is acquired even when it isn't... Must work on it a little more.
>>
>>_apachemodule.c would not compile for me without the following change:
>>@@ -487,7 +487,8 @@
>>          ap_log_error(APLOG_MARK, APLOG_WARNING, rv, s,
>>                       "Failed to acquire global mutex lock at index
>>%d", index);         PyErr_SetString(PyExc_ValueError,
>>-                        "Failed to acquire global mutex lock %i",rv);
>>+                        "Failed to acquire global mutex lock");
>>+        //                "Failed to acquire global mutex lock %i",rv);
>>          return NULL;
>>      }
>>  }
>>
>>Once compiled, I tested _apache._global_trylock(req.server, "test") on
>>apache2-mpm-prefork-2.0.53 and got the following error in the apache log:
>>
>>[Tue Apr 12 12:29:04 2005] [warn] (70023)This function has not been
>>implemented on this platform: Failed to acquire global mutex lock at index 2
>>
>>I'm not sure if this is the same problem you had, or a different one.
>>
>>I've been walking through the do_cleanup code, trying to understand
>>exactly where there may be conflicts with do_cleanup, and any sessions
>>being accessed by other processes/threads. Once I have it clear in my
>>head, I'll post my ramblings. Also, I haven't posted any timings showing
>>the do_cleanup performance with the optimizations on and off, but I will
>>in the next day or so. You can get a 10x performance boost on a lightly
>>loaded server with more than 10000 session files and probably even
>>better when the server is under a heavy load.
>>
>>Also, I just realized that do_cleanup, when it is run, is run at the end
>>of BaseSession.__init__() rather than at the end of the request. The
>>do_cleanup method should really just be a call to
>>self._req.register_cleanup (do_filesession_cleanup,self) and our current
>>do_cleanup code moved to do_filesession_cleanup. This is the same
>>pattern used by DbmSession. I'll implement this and send you a patch.
>>
>>Regards,
>>Jim
>>
>>
> 
> 


Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
Nicolas Lehuen wrote:
> Anyway, I've made some tests like you, and it seems that unless I've
> done something stupid in the _global_trylock function, the
> implementation of trylock in the APR library is flaky on Win32.  With
> one single client, everything is OK, but as soon as I test with more
> clients (using ab -c 30 for example), I randomly get strange errors in
> the error log, once in a while :
> 

I tried a new locking scheme, attached for discussion purposes. Each 
session is locked and unlocked sequentially. I have session locking on 
by default, so the request handling the cleanup has its session locked. 
I thought I was being clever by making sure that the cleanup request did 
not try to lock itself.

For my tests, I had a large number of session files, but the cleanup 
code would only scan from 2 to 28 files before deadlocking. So much for 
being clever.

Much banging of head against monitor ensued. And then the light went on...

We only have (typically) 31 mutexes available and they are stored in an 
array, not the beloved python dict. Each session, which is represented 
by an md5 hash, will be mapped to an integer between 1 and 31.

Here is the relevant bit of code for _global_lock and/or _global_trylock 
from _apachemodule.c

     // key is the session id passed into the function.
     int hash = PyObject_Hash(key);
     hash = abs(hash);
     index = (hash % (glb->nlocks-1)+1);

Scanning and locking a large number of session files, with one of them 
already locked by the request running the cleanup, will almost certainly 
result in a deadlock: sooner or later a scanned session hashes to the 
same mutex index as the one already held. When I unlocked the session of 
the cleanup request, the deadlock issue disappeared.
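
A quick illustration of why a collision is nearly guaranteed (pure
Python, assuming the default of 32 locks with index 0 reserved;
lock_index simply mirrors the C code above):

import md5
import random

NLOCKS = 32  # glb->nlocks in a default build

def lock_index(session_id):
    # fold the hash into the range 1..NLOCKS-1, as the C code does
    return abs(hash(session_id)) % (NLOCKS - 1) + 1

ids = [md5.new(str(random.random())).hexdigest() for i in range(50)]
indexes = [lock_index(s) for s in ids]
# 50 ids into 31 slots: by the pigeonhole principle some index repeats,
# so a scan that locks every session while already holding one mutex
# will eventually block on its own index - a deadlock.
print '%d ids -> %d distinct mutex indexes' % (len(ids), len(set(indexes)))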

I don't know if this is related to the random errors you've seen with 
global_trylock, but the fact that it happens when serving 30 concurrent 
requests sure feels like there is a relationship.

I would ask that you not merge the attached code until I've done some 
additional testing. Plus it is polluted with log_error calls :)

Regards,
Jim


Re: FileSession - a couple of small fixes

Posted by Nicolas Lehuen <ni...@gmail.com>.
Weird. I don't understand why this code would not compile, as the line
you commented out is perfectly valid. It compiles on my machine,
moreover.

Anyway, I've made some tests like you, and it seems that unless I've
done something stupid in the _global_trylock function, the
implementation of trylock in the APR library is flaky on Win32.  With
one single client, everything is OK, but as soon as I test with more
clients (using ab -c 30 for example), I randomly get strange errors in
the error log, once in a while :

[Thu Apr 14 21:32:31 2005] [warn] (OS 2)Le fichier spécifié est
introuvable.  : Failed to acquire global mutex lock at index 12
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher: Traceback (most recent call last):
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File
"E:\\Python24\\Lib\\site-packages\\mod_python\\apache.py", line 299,
in HandlerDispatch\n    result = object(req)
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File
"E:\\Python24\\Lib\\site-packages\\mod_python\\publisher.py", line
143, in handler\n    result = util.apply_fs_data(object, req.form,
req=req)
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File
"E:\\Python24\\Lib\\site-packages\\mod_python\\util.py", line 424, in
apply_fs_data\n    return object(**args)
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File "E:/Program Files/Apache
Group/Apache2/htdocs\\test.py", line 7, in index\n    if not
_apache._global_trylock(req.server,"toto"):
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher: Traceback (most recent call last):
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File
"E:\\Python24\\Lib\\site-packages\\mod_python\\apache.py", line 299,
in HandlerDispatch\n    result = object(req)
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File
"E:\\Python24\\Lib\\site-packages\\mod_python\\publisher.py", line
143, in handler\n    result = util.apply_fs_data(object, req.form,
req=req)
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File
"E:\\Python24\\Lib\\site-packages\\mod_python\\util.py", line 424, in
apply_fs_data\n    return object(**args)
[Thu Apr 14 21:32:31 2005] [error] [client 127.0.0.1] PythonHandler
mod_python.publisher:   File "E:/Program Files/Apache
Group/Apache2/htdocs\\test.py", line 7, in index\n    if not
_apache._global_trylock(req.server,"toto"):

On top of that, it seems that trylock is not implemented on your
platform (what is it ?), so I guess this is the wrong way to go...

Regards,
Nicolas

On 4/13/05, Jim Gallacher <jg...@sympatico.ca> wrote:
> Nicolas Lehuen wrote:
> >>Hi Jim,
> >>
> >>OK, I've integrated your fixes.
> >>
> >>Regarding the do_cleanup locking issue, I had two ideas in mind :
> >>
> >>1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
> >>prevent two concurrent cleanups in the case we are very unlucky (two
> >>cleanups triggered at the same time).
> 
> This could end up holding the global lock for a long time on a heavily
> loaded server. I'm not too concerned about the possibility of concurrent
> cleanups. If the first one deletes a particular session file before the
> other, the second will simply raise an error and continue processing -
> no big deal. If the trylock function described below cannot be made to
> work, blocking a second do_cleanup thread while the first completes
> would be a bad thing. And of course there is always a very small chance
> that a THIRD do_cleanup could start up and get blocked...
> 
> >> More than that, we should try to
> >>hold the lock with an APR try_lock call (which I think is not
> >>implemented yet in _apache.c), so that if the lock is currently held,
> >>we know that a cleanup is already being performed, so we can safely
> >>skip it.
> 
> If try_lock works then the global lock would be OK.
> 
> >>2) while the cleanup is being performed, hold the lock on each session
> >>with a try_lock call, so that if a session is locked, it will never be
> >>a candidate for a cleanup.
> 
> A lock/unlock for each session seems like there would be a whole lotta
> lockin' goin' on. I don't know how many sessions a site may end up with
> - 1000, 10000, 50000? - but acquiring and releasing this many locks in
> the duration of one request just can't be good for performance.
> 
> > Nicolas Lehuen wrote:
> > I have checked in an implementation of _apache._global_trylock, which
> > is a shameless copy/paste of _global_lock with some adaptation to use
> > apr_global_mutex_trylock. Unfortunately my tests seem to show that the
> > implementation is not good in that it erroneously tells me that the
> > lock is acquired even when it isn't... Must work on it a little more.
> 
> _apachemodule.c would not compile for me without the following change:
> @@ -487,7 +487,8 @@
>           ap_log_error(APLOG_MARK, APLOG_WARNING, rv, s,
>                        "Failed to acquire global mutex lock at index
> %d", index);         PyErr_SetString(PyExc_ValueError,
> -                        "Failed to acquire global mutex lock %i",rv);
> +                        "Failed to acquire global mutex lock");
> +        //                "Failed to acquire global mutex lock %i",rv);
>           return NULL;
>       }
>   }
> 
> Once compiled, I tested _apache._global_trylock(req.server, "test") on
> apache2-mpm-prefork-2.0.53 and got the following error in the apache log:
> 
> [Tue Apr 12 12:29:04 2005] [warn] (70023)This function has not been
> implemented on this platform: Failed to acquire global mutex lock at index 2
> 
> I'm not sure if this is the same problem you had, or a different one.
> 
> I've been walking through the do_cleanup code, trying to understand
> exactly where there may be conflicts with do_cleanup, and any sessions
> being accessed by other processes/threads. Once I have it clear in my
> head, I'll post my ramblings. Also, I haven't posted any timings showing
> the do_cleanup performance with the optimizations on and off, but I will
> in the next day or so. You can get a 10x performance boost on a lightly
> loaded server with more than 10000 session files and probably even
> better when the server is under a heavy load.
> 
> Also, I just realized that do_cleanup, when it is run, is run at the end
> of BaseSession.__init__() rather than at the end of the request. The
> do_cleanup method should really just be a call to
> self._req.register_cleanup (do_filesession_cleanup,self) and our current
> do_cleanup code moved to do_filesession_cleanup. This is the same
> pattern used by DbmSession. I'll implement this and send you a patch.
> 
> Regards,
> Jim
> 
>

Re: FileSession - a couple of small fixes

Posted by Jim Gallacher <jg...@sympatico.ca>.
Nicolas Lehuen wrote:
>>Hi Jim,
>>
>>OK, I've integrated your fixes.
>>
>>Regarding the do_cleanup locking issue, I had two ideas in mind :
>>
>>1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
>>prevent two concurrent cleanups in the case we are very unlucky (two
>>cleanups triggered at the same time).

This could end up holding the global lock for a long time on a heavily 
loaded server. I'm not too concerned about the possibility of concurrent 
cleanups. If the first one deletes a particular session file before the 
other, the second will simply raise an error and continue processing - 
no big deal. If the trylock function described below cannot be made to 
work, blocking a second do_cleanup thread while the first completes 
would be a bad thing. And of course there is always a very small chance 
that a THIRD do_cleanup could start up and get blocked...

>> More than that, we should try to
>>hold the lock with an APR try_lock call (which I think is not
>>implemented yet in _apache.c), so that if the lock is currently held,
>>we know that a cleanup is already being performed, so we can safely
>>skip it.

If try_lock works then the global lock would be OK.

>>2) while the cleanup is being performed, hold the lock on each session
>>with a try_lock call, so that if a session is locked, it will never be
>>a candidate for a cleanup.

A lock/unlock for each session seems like there would be a whole lotta 
lockin' goin' on. I don't know how many sessions a site may end up with 
- 1000, 10000, 50000? - but acquiring and releasing this many locks in 
the duration of one request just can't be good for performance.

> Nicolas Lehuen wrote:
> I have checked in an implementation of _apache._global_trylock, which
> is a shameless copy/paste of _global_lock with some adaptation to use
> apr_global_mutex_trylock. Unfortunately my tests seem to show that the
> implementation is not good in that it erroneously tells me that the
> lock is acquired even when it isn't... Must work on it a little more.

_apachemodule.c would not compile for me without the following change:
@@ -487,7 +487,8 @@
          ap_log_error(APLOG_MARK, APLOG_WARNING, rv, s,
                       "Failed to acquire global mutex lock at index 
%d", index);         PyErr_SetString(PyExc_ValueError,
-                        "Failed to acquire global mutex lock %i",rv);
+                        "Failed to acquire global mutex lock");
+        //                "Failed to acquire global mutex lock %i",rv);
          return NULL;
      }
  }

Once compiled, I tested _apache._global_trylock(req.server, "test") on 
apache2-mpm-prefork-2.0.53 and got the following error in the apache log:

[Tue Apr 12 12:29:04 2005] [warn] (70023)This function has not been 
implemented on this platform: Failed to acquire global mutex lock at index 2

I'm not sure if this is the same problem you had, or a different one.

I've been walking through the do_cleanup code, trying to understand 
exactly where there may be conflicts with do_cleanup, and any sessions 
being accessed by other processes/threads. Once I have it clear in my 
head, I'll post my ramblings. Also, I haven't posted any timings showing 
the do_cleanup performance with the optimizations on and off, but I will 
in the next day or so. You can get a 10x performance boost on a lightly 
loaded server with more than 10000 session files and probably even 
better when the server is under a heavy load.

Also, I just realized that do_cleanup, when it is run, is run at the end 
of BaseSession.__init__() rather than at the end of the request. The 
do_cleanup method should really just be a call to 
self._req.register_cleanup (do_filesession_cleanup,self) and our current 
do_cleanup code moved to do_filesession_cleanup. This is the same 
pattern used by DbmSession. I'll implement this and send you a patch.
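
In other words, something like this (a sketch of the intended shape;
the exact names are assumptions until the patch lands):

def do_cleanup(self):
    # defer the real work to request-cleanup time, as DbmSession does
    self._req.register_cleanup(do_filesession_cleanup, self)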

Regards,
Jim



Re: FileSession - a couple of small fixes

Posted by Nicolas Lehuen <ni...@gmail.com>.
On Apr 12, 2005 8:19 AM, Nicolas Lehuen <ni...@gmail.com> wrote:
> On Apr 12, 2005 6:48 AM, Jim Gallacher <ji...@jgassociates.ca> wrote:
> > I've made a couple of small changes to FileSession.
> >
> > 1. _fast_timeout now uses Session.DFT_TIMEOUT when timeout=0
> >
> > 2. Fixed stupid indent error in do_cleanup method.
> >
> > 3. Added comments to do_cleanup method to highlight possible
> > file/session locking issues.
> >
> > 4. Removed some log_error calls that slipped in from my  debugging version.
> >
> > Patch is attached. Is there any sort of convention for patch file naming?
> >
> > Jim
> 
> Hi Jim,
> 
> OK, I've integrated your fixes.
> 
> Regarding the do_cleanup locking issue, I had two ideas in mind :
> 
> 1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
> prevent two concurrent cleanups in the case we are very unlucky (two
> cleanups triggered at the same time). More than that, we should try to
> hold the lock with an APR try_lock call (which I think is not
> implemented yet in _apache.c), so that if the lock is currently held,
> we know that a cleanup is already being performed, so we can safely
> skip it.
> 
> 2) while the cleanup is being performed, hold the lock on each session
> with a try_lock call, so that if a session is locked, it will never be
> a candidate for a cleanup.
> 
> Regards,
> Nicolas

I have checked in an implementation of _apache._global_trylock, which
is a shameless copy/paste of _global_lock with some adaptation to use
apr_global_mutex_trylock. Unfortunately my tests seem to show that the
implementation is not good in that it erroneously tells me that the
lock is acquired even when it isn't... Must work on it a little more.
But for now, I'm off to real work !

Regards,
Nicolas

Re: FileSession - a couple of small fixes

Posted by Nicolas Lehuen <ni...@gmail.com>.
On Apr 12, 2005 6:48 AM, Jim Gallacher <ji...@jgassociates.ca> wrote:
> I've made a couple of small changes to FileSession.
> 
> 1. _fast_timeout now uses Session.DFT_TIMEOUT when timeout=0
> 
> 2. Fixed stupid indent error in do_cleanup method.
> 
> 3. Added comments to do_cleanup method to highlight possible
> file/session locking issues.
> 
> 4. Removed some log_error calls that slipped in from my  debugging version.
> 
> Patch is attached. Is there any sort of convention for patch file naming?
> 
> Jim

Hi Jim,

OK, I've integrated your fixes.

Regarding the do_cleanup locking issue, I had two ideas in mind :

1) Hold the lock n° 0 (AKA the "dbm lock") while doing the cleanup to
prevent two concurrent cleanups in the case we are very unlucky (two
cleanups triggered at the same time). More than that, we should try to
hold the lock with an APR try_lock call (which I think is not
implemented yet in _apache.c), so that if the lock is currently held,
we know that a cleanup is already being performed, so we can safely
skip it.

2) while the cleanup is being performed, hold the lock on each session
with a try_lock call, so that if a session is locked, it will never be
a candidate for a cleanup.
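
For idea 1, the shape would be roughly this (a sketch that assumes
_apache._global_trylock ends up mirroring _global_lock's signature and
returning true when the lock was acquired; do_cleanup_work is a
hypothetical name for the actual cleanup):

if _apache._global_trylock(req.server, 0):
    # lock 0 is the "dbm lock" mentioned above
    try:
        do_cleanup_work()
    finally:
        _apache._global_unlock(req.server, 0)
else:
    # a cleanup is already being performed elsewhere; safely skip it
    pass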

Regards,
Nicolas