
Posted to dev@lucene.apache.org by Michael McCandless <lu...@mikemccandless.com> on 2006/09/21 21:47:25 UTC

help on Lock.obtain(lockWaitTimeout)

I'm working on a LockFactory that uses java.nio.* (OS native locks)
for its locks.

This should be a big help for people who keep finding their lock files
left on disk due to abnormal shutdown, etc. (because the OS will free
the locks no matter what, "in theory").

I thought I was nearly done, but in testing the new LockFactory on
an NFS server that didn't have locks properly configured (possibly a
common situation, I think) I found a problem with how
Lock.obtain(lockWaitTimeout) works.

That function precomputes how many times to try to obtain the lock
(it just divides the lockWaitTimeout parameter by LOCK_POLL_INTERVAL)
and then tries Lock.obtain() followed by a sleep of LOCK_POLL_INTERVAL,
that many times, before timing out.

The problem is, in the above test case: the call to Lock.obtain() can
apparently take a looooong time (35 seconds, I assume some kind of
underlying timeout contacting "lockd" from the NFS client) only to
finally return "false".  But the "try N times" approach makes the
assumption that this call will take zero time.  (In fact, as things
stand now, when Lock.obtain() takes non-zero time, it causes the
timeout to be longer than what was asked for; but likely this is
typically a small amount?).
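
To make the overshoot concrete, here is a minimal sketch of the "try N
times" loop. It is not the actual Lucene source: PollingObtainSketch,
obtain and worstCaseMillis are hypothetical names, tryObtain stands in
for a single Lock.obtain() call, and the 10-second timeout and 1-second
poll interval used in the note below are assumed values chosen only for
illustration.

```java
import java.util.function.BooleanSupplier;

public class PollingObtainSketch {

    /**
     * Tries the lock a precomputed number of times, sleeping between tries.
     * tryObtain is a hypothetical stand-in for a single Lock.obtain() call.
     */
    public static boolean obtain(BooleanSupplier tryObtain,
                                 long lockWaitTimeoutMs, long pollIntervalMs) {
        // Attempt count is fixed up front; the cost of each attempt is ignored.
        long maxAttempts = lockWaitTimeoutMs / pollIntervalMs;
        for (long i = 0; i < maxAttempts; i++) {
            if (tryObtain.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(pollIntervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt for the caller
                return false;
            }
        }
        return false; // timed out
    }

    /** Worst-case wall time when every attempt fails and costs attemptCostMs. */
    public static long worstCaseMillis(long lockWaitTimeoutMs, long pollIntervalMs,
                                       long attemptCostMs) {
        long maxAttempts = lockWaitTimeoutMs / pollIntervalMs;
        return maxAttempts * (attemptCostMs + pollIntervalMs);
    }
}
```

With those assumed values and a failing attempt that takes 35 seconds,
worstCaseMillis(10_000, 1_000, 35_000) is 10 * 36,000 = 360,000 ms, i.e.
six minutes for a requested ten-second timeout.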

Anyway, my first reaction was to change this to use
System.currentTimeMillis() to measure elapsed time, but then I
remembered this is a dangerous approach because whenever the clock on
the machine is updated (e.g. by a time-sync NTP client) it would mess up
this function, causing it either to take longer than was asked for (if
the clock is moved backwards) or to time out in [much] less time than
was asked for (if the clock is moved forwards).  I've hit such issues in
the past and it's devilish.  Time zone and daylight saving time don't
matter because currentTimeMillis() is based on GMT.

So then what to do?  What's the best way to change the function to
"really" measure time?  In Java 1.5 there is now a "nanoTime()" which
is closer to what I need, but it's 1.5 (and we're still on 1.4), and
apparently it can "fall back" to currentTimeMillis() on some platforms.
In the past I've used a separate "clock" thread that just
sleeps & increments a counter, but I don't really like the idea of
spawning a whole new thread (Lucene doesn't launch its own threads
now, except for ParallelMultiSearcher).
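
The "clock" thread workaround could look roughly like the sketch below.
It is written in present-day Java (AtomicLong and lambdas are not
available on 1.4), TickClock and its methods are hypothetical names, and
the reading is only as good as Thread.sleep: it runs slow whenever a
sleep overshoots, so it is a low-resolution estimate at best.

```java
import java.util.concurrent.atomic.AtomicLong;

public class TickClock {
    private final AtomicLong ticks = new AtomicLong();
    private final long tickMillis;
    private final Thread thread;

    public TickClock(long tickMillis) {
        this.tickMillis = tickMillis;
        this.thread = new Thread(this::run, "tick-clock");
        this.thread.setDaemon(true); // never keeps the JVM alive on its own
    }

    public void start() {
        thread.start();
    }

    public void stop() {
        thread.interrupt();
    }

    private void run() {
        try {
            while (true) {
                Thread.sleep(tickMillis);
                ticks.incrementAndGet(); // one tick per completed sleep
            }
        } catch (InterruptedException e) {
            // stop() was called; let the thread exit
        }
    }

    /** Approximate elapsed time since start(), never read from the system clock. */
    public long elapsedMillis() {
        return ticks.get() * tickMillis;
    }
}
```

A caller would note elapsedMillis() before the first obtain attempt and
give up once the difference reaches the requested timeout.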

Does anyone know of a good solution?

Alternatively, since this is really a "misconfiguration" (ie the
Lock.obtain() is never going to succeed), maybe we could try to obtain
a random "test" lock on creation of the LockFactory, just to confirm
that locking even "works" at all in the current environment, and then
leave the current implementation of Lock.obtain() unchanged (when NFS
locking is properly configured it seems to be fairly fast)?

Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: help on Lock.obtain(lockWaitTimeout)

Posted by Michael McCandless <lu...@mikemccandless.com>.

Yonik Seeley wrote:
> On 9/21/06, Michael McCandless <lu...@mikemccandless.com> wrote:
>> Anyway, my first reaction was to change this to use
>> System.currentTimeMillis() to measure elapsed time, but then I
>> remembered this is a dangerous approach because whenever the clock on the
>> machine is updated (eg by a time-sync NTP client) it would mess up
>> this function, causing it to either take longer than was asked for (if
>> clock is moved backwards) or, to timeout in [much] less time than was
>> asked for (if clock was moved forwards).
> 
> Um, wow... that's thorough design work!

Thanks :) I've hit just one too many bugs due to system time changing!
Time is always a sneaky thing to work with.  Basically you can't
really use system time as a reliable way to measure elapsed time.

> In this case, I don't think it's something to worry about though.
> NTP corrections are likely to be very small, not on the scale of
> lock-obtain timeouts.
> If one can't obtain a lock, it's due to something else asynchronously
> happening, and that throws a lot bigger time variation into the
> equation anyway.

Yes, I hope so, in a well-behaved server environment that's already
converged its clock and is tracking well to "real time", has the right
command line options to ntp, and doesn't have an admin coming in and
making clock changes.  But on a more "chaotic" user's desktop, where the
user could update the clock at random times themselves, it would be
horrible to let such an event "falsely" throw a "Lock obtain timed out"
error in any desktop deployment of Lucene.

Even with lock-less commits we will still need to obtain the write
lock (e.g. for the interleaved add/delete case; until we can fix
IndexWriter to handle deletes, the write lock is being acquired fairly
"often").  Each of these obtains is then vulnerable if a [too large]
clock change is made during the call.

Lucene doesn't currently have this issue (relying on currentTimeMillis
to measure elapsed time) so I'd hate to be the one to introduce it.

Are there any objections to the "acquire a random test lock" approach?

If your locking is mis-configured, you will get an error on
creating the NativeFSLockFactory.  But if it is configured
properly, it will quickly get the lock (and release it) and move on.

Also, there is a single instance of NativeFSLockFactory per [canonical]
lock directory, so it would only be the first time (per JVM instance)
that the NativeFSLockFactory is created for the given directory that
this simple test would be performed.
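
As a rough illustration of the idea (not the NativeFSLockFactory code
itself), a probe along these lines would fail fast at creation time. It
uses FileChannel.tryLock from java.nio together with java.nio.file,
which is newer than the JDKs discussed here; LockProbe,
verifyLockingWorks and lockingWorks are hypothetical names, and the
probe file name is made up for the example.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

public class LockProbe {
    // Canonical directories already probed in this JVM.
    private static final Set<Path> VERIFIED = ConcurrentHashMap.newKeySet();

    /** Throws IOException if a native test lock cannot be taken in lockDir. */
    public static void verifyLockingWorks(Path lockDir) throws IOException {
        Path canonical = lockDir.toRealPath(); // throws if the directory is missing
        if (VERIFIED.contains(canonical)) {
            return; // only probe once per directory per JVM
        }
        Path probe = canonical.resolve("probe-" + UUID.randomUUID() + ".lock");
        try (FileChannel channel = FileChannel.open(probe,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            FileLock lock = channel.tryLock(); // null if another process holds it
            if (lock == null) {
                throw new IOException("could not obtain test lock in " + canonical);
            }
            lock.release();
        } finally {
            Files.deleteIfExists(probe); // runs after the channel is closed
        }
        VERIFIED.add(canonical);
    }

    /** Convenience wrapper that reports failure as false instead of throwing. */
    public static boolean lockingWorks(Path lockDir) {
        try {
            verifyLockingWorks(lockDir);
            return true;
        } catch (IOException e) {
            return false;
        }
    }
}
```

Note that on a badly configured NFS mount the probe itself may still be
slow; it only moves the failure to factory creation instead of every
obtain.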

Mike


Re: help on Lock.obtain(lockWaitTimeout)

Posted by Michael McCandless <lu...@mikemccandless.com>.

Doron Cohen wrote:
> For obtain(timeout), to prevent waiting too long you could compute the
> maximum number of times that obtain() can be executed (assuming, as in
> current code, that obtain() executes in no time). Then break if either it
> was executed sufficiently many times or if time is up. I don't see how to
> prevent waiting too short.

Yeah this is still relying on the system time to measure elapsed time
which is [sadly, sneakily] dangerous.  I'm afraid it will eventually
come back and haunt us (me!) if we do this.

Actually, there is a java.util.Timer, but 1) it goes and launches
another Thread under the hood, and 2) apparently (according to some
posts I found through Google) there are clock-shift cases where even
this class is in fact unreliable.

> Btw, I wonder what happens if a time change due to a sync occurs in the
> middle of the sleep - since sleep is implemented natively this must be
> taken care of correctly by the underlying OS...?

I *think* sleep is in general robust to clock shifting.  At least, I
sure hope so, because using sleep to make a separate [low resolution]
clock thread has been my workaround for this issue (not relying on
system time to measure elapsed time) in the past.  I believe (I hope!)
I had tested this in the past and came to this conclusion.

There is the spooky InterruptedException that Thread.sleep can throw
-- I think it's only if another thread explicitly interrupts this
thread but I'm not certain?  The sleep() call (in C on many Unixes)
will also stop early if a signal is received while it's sleeping.

Time is just never simple!

Mike


Re: help on Lock.obtain(lockWaitTimeout)

Posted by Doron Cohen <DO...@il.ibm.com>.

For obtain(timeout), to prevent waiting too long you could compute the
maximum number of times that obtain() can be executed (assuming, as in
current code, that obtain() executes in no time). Then break if either it
was executed sufficiently many times or if time is up. I don't see how to
prevent waiting too short.
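
One way to read this suggestion, as a sketch only: keep the attempt cap
from the current code and additionally stop once the measured time is
up. BoundedObtain is a hypothetical name, tryObtain stands in for a
single obtain() call, and the clock is passed in as a parameter (an
assumption made here so that the effect of a clock jump can be
simulated).

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

public class BoundedObtain {

    /**
     * Stops on whichever comes first: the attempt cap or the clock deadline.
     * clockMillis would normally be System::currentTimeMillis.
     */
    public static boolean obtain(BooleanSupplier tryObtain, LongSupplier clockMillis,
                                 long timeoutMs, long pollMs) {
        long maxAttempts = timeoutMs / pollMs; // same cap as the current code
        long start = clockMillis.getAsLong();
        for (long i = 0; i < maxAttempts; i++) {
            if (tryObtain.getAsBoolean()) {
                return true;
            }
            if (clockMillis.getAsLong() - start >= timeoutMs) {
                return false; // time is up according to the clock
            }
            try {
                Thread.sleep(pollMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt for the caller
                return false;
            }
        }
        return false; // attempt cap reached
    }
}
```

If the clock jumps forward by more than the timeout between two
readings, the loop gives up after a single attempt, which is the
"waiting too short" case. If the clock is stuck or moved backwards, the
attempt cap still ends the loop.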

Btw, I wonder what happens if a time change due to a sync occurs in the
middle of the sleep - since sleep is implemented natively this must be
taken care of correctly by the underlying OS...?

yseeley@gmail.com wrote on 21/09/2006 13:05:06:
> On 9/21/06, Michael McCandless <lu...@mikemccandless.com> wrote:
> > Anyway, my first reaction was to change this to use
> > System.currentTimeMillis() to measure elapsed time, but then I
> > remembered this is a dangerous approach because whenever the clock on the
> > machine is updated (eg by a time-sync NTP client) it would mess up
> > this function, causing it to either take longer than was asked for (if
> > clock is moved backwards) or, to timeout in [much] less time than was
> > asked for (if clock was moved forwards).
>
> Um, wow... that's thorough design work!
>
> In this case, I don't think it's something to worry about though.
> NTP corrections are likely to be very small, not on the scale of
> lock-obtain timeouts.
> If one can't obtain a lock, it's due to something else asynchronously
> happening, and that throws a lot bigger time variation into the
> equation anyway.
>
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search server

Re: help on Lock.obtain(lockWaitTimeout)

Posted by Yonik Seeley <yo...@apache.org>.

On 9/21/06, Michael McCandless <lu...@mikemccandless.com> wrote:
> Anyway, my first reaction was to change this to use
> System.currentTimeMillis() to measure elapsed time, but then I
> remembered this is a dangerous approach because whenever the clock on the
> machine is updated (eg by a time-sync NTP client) it would mess up
> this function, causing it to either take longer than was asked for (if
> clock is moved backwards) or, to timeout in [much] less time than was
> asked for (if clock was moved forwards).

Um, wow... that's thorough design work!

In this case, I don't think it's something to worry about though.
NTP corrections are likely to be very small, not on the scale of
lock-obtain timeouts.
If one can't obtain a lock, it's due to something else asynchronously
happening, and that throws a lot bigger time variation into the
equation anyway.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
