You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@river.apache.org by Christopher Dolan <ch...@avid.com> on 2011/04/29 20:40:12 UTC

client hang in com.sun.jini.jeri.internal.mux.Mux.start()

I've experienced occasional cases where clients get stuck in the
following block of code in Mux.start. Has anyone experienced this
problem? I have a proposed solution below. Has anyone thought about a
similar solution already?

-- Current code --
1 	    asyncSendClientConnectionHeader();
2 	    synchronized (muxLock) {
3 		while (!muxDown && !clientConnectionReady) {
4 		    try {
5 			muxLock.wait();		// REMIND: timeout?
6 		    } catch (InterruptedException e) {
7 			...
8 		    }
9 		}
10		if (muxDown) {
11		    IOException ioe = new IOException(muxDownMessage);
12		    ioe.initCause(muxDownCause);
13		    throw ioe;
14		}
15	    }

-- Explanation of the code --
This code handles the initial client-server handshake that starts a JERI
connection. In line 1, the client sends its 8-byte greeting to the
server. Then in the loop on lines 3-9, it waits for the server's
response. If the reader thread gets a satisfactory response from the
server, it sets clientConnectionReady=true and calls
muxLock.notifyAll(). In all other cases (aborted connection, mismatched
protocol version, etc) the reader invokes Mux.setDown() which sets
muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws if
the handshake was a failure.

In my scenario (which uses simple TCP sockets, nothing fancy), the
invoker thread sits on line 5 indefinitely. My problem hard to
reproduce, so I haven't found out what the server is doing in this case.
I hope to figure that out eventually, but presently I'm interested in
the "REMIND: timeout?" comment.

-- Timeout solution --
It seems obvious to me that there should be a timeout here. There are
lots of imaginable cases where the client could get stuck here:
server-side deadlock, abrupt server crash, logic error in client Mux
code. You'd expect that the server would either respond with its 8-byte
handshake very quickly or never, so a modest timeout (like 15 or 30
seconds) should be good. If that timeout is triggered, I would expect
that the code above would call Mux.setDown() and throw an IOException.
That exception would either cause a retry or be thrown up to the invoker
as a RemoteException.

-- Proposed code (untested) --
3 		long now = System.currentTimeMillis();
4 		long endTime = now + timeoutMillis;
5 		while (!muxDown && !clientConnectionReady) {
6 		    if (now >= endTime) {
7 			setDown("timeout waiting for server to respond
to handshake", null);
8 		    } else {
9 			try {
10			    muxLock.wait(endTime - now);
11			    now = System.currentTimeMillis();
12			} catch (InterruptedException e) {
13			    setDown("interrupt waiting for connection
header", e);
14			}
15		    }
16		}

This code assumes a configurable timeoutMillis parameter has been set
earlier.

I can't think of any alternative solutions. Putting the timeout in the
Reader logic seems higher risk. There's incomplete code in JERI to
implement a ping packet (see Mux.asyncSendPing, never used), but that
would only be relevant after the initial handshake and wouldn't help
here.

Thanks,
Chris

RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Christopher Dolan <ch...@avid.com>.

I created a Jira issue with a proposed patch and the test case included
below.
https://issues.apache.org/jira/browse/RIVER-397
Chris

-----Original Message-----
From: Christopher Dolan [mailto:christopher.dolan@avid.com] 
Sent: Wednesday, May 04, 2011 1:05 PM
To: dev@river.apache.org
Subject: RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Here's a test that consistently fails with the current Mux
implementation and passes with the patch I proposed at the beginning of
this thread. In my test I explicitly pretend that the server side of the
connect has blocked. In reality, all we need to agree on is that it's
possible for the server side to block.

The proposed patch needs a little more work to make the timeout be
configurable. If so, the test can be sped up by setting that timeout to
something unrealistically short.

public class MuxStartTimeout {
    @Test
    public void test() throws IOException, InterruptedException {
        // make fake input and output streams.
        OutputStream os = new ByteArrayOutputStream();
        InputStream is = new InputStream() {
            @Override
            public synchronized int read() throws IOException {
                try {
                    // block indefinitely
                    while (true)
                        wait();
                } catch (InterruptedException e) {
                    return 0;
                }
            }
        };

        final AtomicBoolean finished = new AtomicBoolean(false);
        final AtomicBoolean succeeded = new AtomicBoolean(false);
        final AtomicBoolean failed = new AtomicBoolean(false);
        final MuxClient muxClient = new MuxClient(os, is);
        try {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        muxClient.start();
                        succeeded.set(true);
                    } catch (IOException e) {
                        failed.set(true);
                    }
                    finished.set(true);
                }
            });
            t.start();
            t.join(20000);
            Assert.assertTrue(finished.get());
            Assert.assertFalse(succeeded.get());
            Assert.assertTrue(failed.get());
            if (!t.isInterrupted())
                t.interrupt();
        } finally {
            muxClient.shutdown("end of test");
        }
    }
}

Chris

P.S. Amusingly, I actually compiled the test against
org.testng.annotations.Test org.testng.Assert but it should also work as
written against org.junit.Test and org.junit.Assert

-----Original Message-----
From: Patricia Shanahan [mailto:pats@acm.org] 
Sent: Wednesday, May 04, 2011 11:24 AM
To: dev@river.apache.org
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

This raises a more general question that has been troubling me: What 
should we do about theoretical deadlocks and similar concurrency issues 
that have not been demonstrated in practice?

On the one hand, I like to have a test to show that a change really 
fixed something. On the other hand, a concurrency problem can contribute

to general flakiness without ever reaching the point of being reported 
as a bug or having a test that demonstrates it.

Patricia

On 5/4/2011 8:47 AM, Christopher Dolan wrote:
...
> I haven't conclusively witnessed that specific deadlock, but I've had
a
> closely related problem where another process coincidentally grabs
port
> 4160 before Reggie gets it. This happens because Win2k, WinXP and
Win2k3
> use 1024-5000 for their dynamic port range, contrary to IANA
> recommendations. I suspect the deadlock described above happens in
real
> life, but I've never gotten detailed enough logs to prove it, just
> client stack traces showing the hang in Mux.start().
...

RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Christopher Dolan <ch...@avid.com>.

Here's a test that consistently fails with the current Mux
implementation and passes with the patch I proposed at the beginning of
this thread. In my test I explicitly pretend that the server side of the
connect has blocked. In reality, all we need to agree on is that it's
possible for the server side to block.

The proposed patch needs a little more work to make the timeout be
configurable. If so, the test can be sped up by setting that timeout to
something unrealistically short.

public class MuxStartTimeout {
    @Test
    public void test() throws IOException, InterruptedException {
        // make fake input and output streams.
        OutputStream os = new ByteArrayOutputStream();
        InputStream is = new InputStream() {
            @Override
            public synchronized int read() throws IOException {
                try {
                    // block indefinitely
                    while (true)
                        wait();
                } catch (InterruptedException e) {
                    return 0;
                }
            }
        };

        final AtomicBoolean finished = new AtomicBoolean(false);
        final AtomicBoolean succeeded = new AtomicBoolean(false);
        final AtomicBoolean failed = new AtomicBoolean(false);
        final MuxClient muxClient = new MuxClient(os, is);
        try {
            Thread t = new Thread(new Runnable() {
                public void run() {
                    try {
                        muxClient.start();
                        succeeded.set(true);
                    } catch (IOException e) {
                        failed.set(true);
                    }
                    finished.set(true);
                }
            });
            t.start();
            t.join(20000);
            Assert.assertTrue(finished.get());
            Assert.assertFalse(succeeded.get());
            Assert.assertTrue(failed.get());
            if (!t.isInterrupted())
                t.interrupt();
        } finally {
            muxClient.shutdown("end of test");
        }
    }
}

Chris

P.S. Amusingly, I actually compiled the test against
org.testng.annotations.Test org.testng.Assert but it should also work as
written against org.junit.Test and org.junit.Assert

-----Original Message-----
From: Patricia Shanahan [mailto:pats@acm.org] 
Sent: Wednesday, May 04, 2011 11:24 AM
To: dev@river.apache.org
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

This raises a more general question that has been troubling me: What 
should we do about theoretical deadlocks and similar concurrency issues 
that have not been demonstrated in practice?

On the one hand, I like to have a test to show that a change really 
fixed something. On the other hand, a concurrency problem can contribute

to general flakiness without ever reaching the point of being reported 
as a bug or having a test that demonstrates it.

Patricia

On 5/4/2011 8:47 AM, Christopher Dolan wrote:
...
> I haven't conclusively witnessed that specific deadlock, but I've had
a
> closely related problem where another process coincidentally grabs
port
> 4160 before Reggie gets it. This happens because Win2k, WinXP and
Win2k3
> use 1024-5000 for their dynamic port range, contrary to IANA
> recommendations. I suspect the deadlock described above happens in
real
> life, but I've never gotten detailed enough logs to prove it, just
> client stack traces showing the hang in Mux.start().
...

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Patricia Shanahan <pa...@acm.org>.

This raises a more general question that has been troubling me: What 
should we do about theoretical deadlocks and similar concurrency issues 
that have not been demonstrated in practice?

On the one hand, I like to have a test to show that a change really 
fixed something. On the other hand, a concurrency problem can contribute 
to general flakiness without ever reaching the point of being reported 
as a bug or having a test that demonstrates it.

Patricia

On 5/4/2011 8:47 AM, Christopher Dolan wrote:
...
> I haven't conclusively witnessed that specific deadlock, but I've had a
> closely related problem where another process coincidentally grabs port
> 4160 before Reggie gets it. This happens because Win2k, WinXP and Win2k3
> use 1024-5000 for their dynamic port range, contrary to IANA
> recommendations. I suspect the deadlock described above happens in real
> life, but I've never gotten detailed enough logs to prove it, just
> client stack traces showing the hang in Mux.start().
...

RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Christopher Dolan <ch...@avid.com>.

Here's another more obscure scenario caused by dynamic port allocation.
The server process ends without informing Reggie (OS or JVM crash,
perhaps). On server restart, a non-JERI TCP connection gets assigned
that port number. Then, the client tries to connect to the port and
sends its 8-byte JERI handshake. The client waits for server to respond,
but the server also waits for more bytes from the client because it's
using a different protocol --> deadlock. A timeout would solve break
that deadlock.

I haven't conclusively witnessed that specific deadlock, but I've had a
closely related problem where another process coincidentally grabs port
4160 before Reggie gets it. This happens because Win2k, WinXP and Win2k3
use 1024-5000 for their dynamic port range, contrary to IANA
recommendations. I suspect the deadlock described above happens in real
life, but I've never gotten detailed enough logs to prove it, just
client stack traces showing the hang in Mux.start().

Chris

-----Original Message-----
From: Gregg Wonderly [mailto:greggwon@gmail.com] 
Sent: Tuesday, May 03, 2011 11:49 PM
To: dev@river.apache.org
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Right.  So an obvious issue is a network segmentation event where some
form of working routing is no longer present.  We'd then have normal TCP
timers to deal with, keep-alive not withstanding.  These are not unusual
problems, just well known, but often unexpected.  When networks were
slow and intermittently connected, TCP timeout values on the order of
minutes were sensible (time to redial the modem).  Much shorter
time-outs can be sensible on modern networks as long as you don't exceed
the systems limits to transition through the needed states.

Gregg

Sent from my iPad

On May 3, 2011, at 2:28 AM, Dan Creswell <da...@gmail.com> wrote:

> Mmm, agreed though I think to do it requires that Chris get to the
> root cause that is producing the need for a client timeout.
> 
> On 2 May 2011 23:17, Gregg Wonderly <ge...@cox.net> wrote:
>> Yes I think a timeout is what will work the best.  I just want to
make sure we can release any associated server resources at the same
time so that this doesn't result in a resource leak in the server.
>> 
>> Gregg
>> 
>> Sent from my iPhone
>> 
>> On May 2, 2011, at 11:27 AM, "Christopher Dolan"
<ch...@avid.com> wrote:
>> 
>>> I strongly agree  with Gregg's comments about timeouts in general,
but I
>>> think this might be a special case. In the Mux.start() data flow,
the
>>> client sends an 8-byte handshake and expects the server to respond
with
>>> a similar 8-byte handshake (or error) promptly. I'm seeing
indefinite
>>> stalls in real-world cases, so I need to do something and a timeout
is
>>> the only solution I can think of.
>>> 
>>> Chris
>>> 
>>> -----Original Message-----
>>> From: Gregg Wonderly [mailto:gergg@cox.net]
>>> Sent: Saturday, April 30, 2011 10:03 AM
>>> To: dev@river.apache.org
>>> Cc: Mike McGrady
>>> Subject: Re: client hang in
com.sun.jini.jeri.internal.mux.Mux.start()
>>> 
>>> In the history of original Java threading model, and the NIO
development
>>> to use
>>> "select/poll" from your own thread, rather than registering call
back
>>> methods
>>> (via an interface) kept a lot of development from using a model
where
>>> threading
>>> was managed internally by the package or by the JVM.  As a result,
we
>>> have
>>> structures like today where notifications are less common.  In this
code
>>> though,
>>> I think the structure is internal enough that it's not necessary to
>>> really use
>>> Future or some other mechanisms.
>>> 
>>> Timeouts are always a "hard way" to manage "loss of functionality"
>>> because you
>>> really don't know when things are "not working", only that something
is
>>> taking
>>> longer than your timeout accounted for.   Timeouts can make it
possible
>>> for more
>>> pending work to pile up on the other end that might slow the results
>>> even more.
>>>  E.g. you wait 30seconds and retry while the actual work on the
other
>>> end is
>>> taking 35 seconds to get to and thus the queue rate exceeds the
dequeue
>>> rate and
>>> things start piling up.
>>> 
>>> If you are going to use a timeout, we need to have some sort of
>>> indication from
>>> both ends perspective that the attempt has been aborted, as early as
>>> possible.
>>> For cases where I/O traffic is written/read, that usually means
closing
>>> the
>>> socket.  I'm not sure of the ramifications of doing that, since I
>>> haven't looked
>>> too hard at this code.
>>> 
>>> Gregg Wonderly
>>> 
>>> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>>>> Throwing my two cents in here just to state my opinion.  This is an
>>> effort that
>>>> could pay dividends if it were done with a view toward the future -
>>> absolutely
>>>> no pun intended.  I do not know the details of the code but if
>>> futures could be
>>>> useful here, they would be welcomed by myself.
>>>> 
>>>> MG
>>>> 
>>>> 
>>>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>>>> 
>>>>> Thanks, Tom.
>>>>> 
>>>>> I don't really understand your Future suggestion.  Are you
suggesting
>>>>> changing the async handshake to a Future? If so, that sounds like
a
>>> very
>>>>> involved change, touching a lot of code in Mux and its subclasses.
>>>>> 
>>>>> setDown() changes the value of the muxDown boolean to true, so
it's a
>>>>> valid way out of the loop.
>>>>> 
>>>>> Chris
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>>>> Sent: Friday, April 29, 2011 2:27 PM
>>>>> To: dev@river.apache.org
>>>>> Subject: Re: client hang in
>>> com.sun.jini.jeri.internal.mux.Mux.start()
>>>>> 
>>>>> The proposed code looks fine to me.  Two points leap out, more
>>>>> discussion
>>>>> points than anything else.
>>>>> 
>>>>> For some reason I've recently developed an aversion to writing my
on
>>>>> timeout
>>>>> logic.  did you consider using something like a Future here or
might
>>>>> that be
>>>>> serious overkill (it wouldn't surprise me if it was)?
>>>>> 
>>>>> Also is Setdown intended to break out of the while loop?  Because
I
>>>>> can't
>>>>> see a way to escape it.  (I don't have the rest of the code in
front
>>> of
>>>>> me)
>>>>> 
>>>>> Thanks for keep raising these issues - especially because you
usually
>>>>> supply
>>>>> fixes for them!
>>>>> 
>>>>> Tom
>>>>> 
>>>>> On 29 Apr 2011 19:41, "Christopher
Dolan"<ch...@avid.com>
>>>>> wrote:
>>>>>> I've experienced occasional cases where clients get stuck in the
>>>>>> following block of code in Mux.start. Has anyone experienced this
>>>>>> problem? I have a proposed solution below. Has anyone thought
about
>>> a
>>>>>> similar solution already?
>>>>>> 
>>>>>> -- Current code --
>>>>>> 1 asyncSendClientConnectionHeader();
>>>>>> 2 synchronized (muxLock) {
>>>>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>>>>> 4 try {
>>>>>> 5 muxLock.wait(); // REMIND: timeout?
>>>>>> 6 } catch (InterruptedException e) {
>>>>>> 7 ...
>>>>>> 8 }
>>>>>> 9 }
>>>>>> 10 if (muxDown) {
>>>>>> 11 IOException ioe = new IOException(muxDownMessage);
>>>>>> 12 ioe.initCause(muxDownCause);
>>>>>> 13 throw ioe;
>>>>>> 14 }
>>>>>> 15 }
>>>>>> 
>>>>>> -- Explanation of the code --
>>>>>> This code handles the initial client-server handshake that starts
a
>>>>> JERI
>>>>>> connection. In line 1, the client sends its 8-byte greeting to
the
>>>>>> server. Then in the loop on lines 3-9, it waits for the server's
>>>>>> response. If the reader thread gets a satisfactory response from
the
>>>>>> server, it sets clientConnectionReady=true and calls
>>>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>>>> mismatched
>>>>>> protocol version, etc) the reader invokes Mux.setDown() which
sets
>>>>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
>>> throws
>>>>> if
>>>>>> the handshake was a failure.
>>>>>> 
>>>>>> In my scenario (which uses simple TCP sockets, nothing fancy),
the
>>>>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>>>>> reproduce, so I haven't found out what the server is doing in
this
>>>>> case.
>>>>>> I hope to figure that out eventually, but presently I'm
interested
>>> in
>>>>>> the "REMIND: timeout?" comment.
>>>>>> 
>>>>>> -- Timeout solution --
>>>>>> It seems obvious to me that there should be a timeout here. There
>>> are
>>>>>> lots of imaginable cases where the client could get stuck here:
>>>>>> server-side deadlock, abrupt server crash, logic error in client
Mux
>>>>>> code. You'd expect that the server would either respond with its
>>>>> 8-byte
>>>>>> handshake very quickly or never, so a modest timeout (like 15 or
30
>>>>>> seconds) should be good. If that timeout is triggered, I would
>>> expect
>>>>>> that the code above would call Mux.setDown() and throw an
>>> IOException.
>>>>>> That exception would either cause a retry or be thrown up to the
>>>>> invoker
>>>>>> as a RemoteException.
>>>>>> 
>>>>>> -- Proposed code (untested) --
>>>>>> 3 long now = System.currentTimeMillis();
>>>>>> 4 long endTime = now + timeoutMillis;
>>>>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>>>>> 6 if (now>= endTime) {
>>>>>> 7 setDown("timeout waiting for server to respond
>>>>>> to handshake", null);
>>>>>> 8 } else {
>>>>>> 9 try {
>>>>>> 10 muxLock.wait(endTime - now);
>>>>>> 11 now = System.currentTimeMillis();
>>>>>> 12 } catch (InterruptedException e) {
>>>>>> 13 setDown("interrupt waiting for connection
>>>>>> header", e);
>>>>>> 14 }
>>>>>> 15 }
>>>>>> 16 }
>>>>>> 
>>>>>> This code assumes a configurable timeoutMillis parameter has been
>>> set
>>>>>> earlier.
>>>>>> 
>>>>>> I can't think of any alternative solutions. Putting the timeout
in
>>> the
>>>>>> Reader logic seems higher risk. There's incomplete code in JERI
to
>>>>>> implement a ping packet (see Mux.asyncSendPing, never used), but
>>> that
>>>>>> would only be relevant after the initial handshake and wouldn't
help
>>>>>> here.
>>>>>> 
>>>>>> Thanks,
>>>>>> Chris
>>>>>> 
>>>> 
>>>> Michael McGrady
>>>> Chief Systems Architect
>>>> Topia Technology, Inc.
>>>> Cel 1.253.720.3365
>>>> Work 1.253.572.9712 extension 2037
>>>> Skype ID: michael.mcgrady5
>>>> mmcgrady@topiatechnology.com
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Gregg Wonderly <gr...@gmail.com>.

Right.  So an obvious issue is a network segmentation event where some form of working routing is no longer present.  We'd then have normal TCP timers to deal with, keep-alive not withstanding.  These are not unusual problems, just well known, but often unexpected.  When networks were slow and intermittently connected, TCP timeout values on the order of minutes were sensible (time to redial the modem).  Much shorter time-outs can be sensible on modern networks as long as you don't exceed the systems limits to transition through the needed states.

Gregg

Sent from my iPad

On May 3, 2011, at 2:28 AM, Dan Creswell <da...@gmail.com> wrote:

> Mmm, agreed though I think to do it requires that Chris get to the
> root cause that is producing the need for a client timeout.
> 
> On 2 May 2011 23:17, Gregg Wonderly <ge...@cox.net> wrote:
>> Yes I think a timeout is what will work the best.  I just want to make sure we can release any associated server resources at the same time so that this doesn't result in a resource leak in the server.
>> 
>> Gregg
>> 
>> Sent from my iPhone
>> 
>> On May 2, 2011, at 11:27 AM, "Christopher Dolan" <ch...@avid.com> wrote:
>> 
>>> I strongly agree  with Gregg's comments about timeouts in general, but I
>>> think this might be a special case. In the Mux.start() data flow, the
>>> client sends an 8-byte handshake and expects the server to respond with
>>> a similar 8-byte handshake (or error) promptly. I'm seeing indefinite
>>> stalls in real-world cases, so I need to do something and a timeout is
>>> the only solution I can think of.
>>> 
>>> Chris
>>> 
>>> -----Original Message-----
>>> From: Gregg Wonderly [mailto:gergg@cox.net]
>>> Sent: Saturday, April 30, 2011 10:03 AM
>>> To: dev@river.apache.org
>>> Cc: Mike McGrady
>>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>> 
>>> In the history of original Java threading model, and the NIO development
>>> to use
>>> "select/poll" from your own thread, rather than registering call back
>>> methods
>>> (via an interface) kept a lot of development from using a model where
>>> threading
>>> was managed internally by the package or by the JVM.  As a result, we
>>> have
>>> structures like today where notifications are less common.  In this code
>>> though,
>>> I think the structure is internal enough that it's not necessary to
>>> really use
>>> Future or some other mechanisms.
>>> 
>>> Timeouts are always a "hard way" to manage "loss of functionality"
>>> because you
>>> really don't know when things are "not working", only that something is
>>> taking
>>> longer than your timeout accounted for.   Timeouts can make it possible
>>> for more
>>> pending work to pile up on the other end that might slow the results
>>> even more.
>>>  E.g. you wait 30seconds and retry while the actual work on the other
>>> end is
>>> taking 35 seconds to get to and thus the queue rate exceeds the dequeue
>>> rate and
>>> things start piling up.
>>> 
>>> If you are going to use a timeout, we need to have some sort of
>>> indication from
>>> both ends perspective that the attempt has been aborted, as early as
>>> possible.
>>> For cases where I/O traffic is written/read, that usually means closing
>>> the
>>> socket.  I'm not sure of the ramifications of doing that, since I
>>> haven't looked
>>> too hard at this code.
>>> 
>>> Gregg Wonderly
>>> 
>>> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>>>> Throwing my two cents in here just to state my opinion.  This is an
>>> effort that
>>>> could pay dividends if it were done with a view toward the future -
>>> absolutely
>>>> no pun intended.  I do not know the details of the code but if
>>> futures could be
>>>> useful here, they would be welcomed by myself.
>>>> 
>>>> MG
>>>> 
>>>> 
>>>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>>>> 
>>>>> Thanks, Tom.
>>>>> 
>>>>> I don't really understand your Future suggestion.  Are you suggesting
>>>>> changing the async handshake to a Future? If so, that sounds like a
>>> very
>>>>> involved change, touching a lot of code in Mux and its subclasses.
>>>>> 
>>>>> setDown() changes the value of the muxDown boolean to true, so it's a
>>>>> valid way out of the loop.
>>>>> 
>>>>> Chris
>>>>> 
>>>>> -----Original Message-----
>>>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>>>> Sent: Friday, April 29, 2011 2:27 PM
>>>>> To: dev@river.apache.org
>>>>> Subject: Re: client hang in
>>> com.sun.jini.jeri.internal.mux.Mux.start()
>>>>> 
>>>>> The proposed code looks fine to me.  Two points leap out, more
>>>>> discussion
>>>>> points than anything else.
>>>>> 
>>>>> For some reason I've recently developed an aversion to writing my on
>>>>> timeout
>>>>> logic.  did you consider using something like a Future here or might
>>>>> that be
>>>>> serious overkill (it wouldn't surprise me if it was)?
>>>>> 
>>>>> Also is Setdown intended to break out of the while loop?  Because I
>>>>> can't
>>>>> see a way to escape it.  (I don't have the rest of the code in front
>>> of
>>>>> me)
>>>>> 
>>>>> Thanks for keep raising these issues - especially because you usually
>>>>> supply
>>>>> fixes for them!
>>>>> 
>>>>> Tom
>>>>> 
>>>>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>>>>> wrote:
>>>>>> I've experienced occasional cases where clients get stuck in the
>>>>>> following block of code in Mux.start. Has anyone experienced this
>>>>>> problem? I have a proposed solution below. Has anyone thought about
>>> a
>>>>>> similar solution already?
>>>>>> 
>>>>>> -- Current code --
>>>>>> 1 asyncSendClientConnectionHeader();
>>>>>> 2 synchronized (muxLock) {
>>>>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>>>>> 4 try {
>>>>>> 5 muxLock.wait(); // REMIND: timeout?
>>>>>> 6 } catch (InterruptedException e) {
>>>>>> 7 ...
>>>>>> 8 }
>>>>>> 9 }
>>>>>> 10 if (muxDown) {
>>>>>> 11 IOException ioe = new IOException(muxDownMessage);
>>>>>> 12 ioe.initCause(muxDownCause);
>>>>>> 13 throw ioe;
>>>>>> 14 }
>>>>>> 15 }
>>>>>> 
>>>>>> -- Explanation of the code --
>>>>>> This code handles the initial client-server handshake that starts a
>>>>> JERI
>>>>>> connection. In line 1, the client sends its 8-byte greeting to the
>>>>>> server. Then in the loop on lines 3-9, it waits for the server's
>>>>>> response. If the reader thread gets a satisfactory response from the
>>>>>> server, it sets clientConnectionReady=true and calls
>>>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>>>> mismatched
>>>>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>>>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
>>> throws
>>>>> if
>>>>>> the handshake was a failure.
>>>>>> 
>>>>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>>>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>>>>> reproduce, so I haven't found out what the server is doing in this
>>>>> case.
>>>>>> I hope to figure that out eventually, but presently I'm interested
>>> in
>>>>>> the "REMIND: timeout?" comment.
>>>>>> 
>>>>>> -- Timeout solution --
>>>>>> It seems obvious to me that there should be a timeout here. There
>>> are
>>>>>> lots of imaginable cases where the client could get stuck here:
>>>>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>>>>> code. You'd expect that the server would either respond with its
>>>>> 8-byte
>>>>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>>>>> seconds) should be good. If that timeout is triggered, I would
>>> expect
>>>>>> that the code above would call Mux.setDown() and throw an
>>> IOException.
>>>>>> That exception would either cause a retry or be thrown up to the
>>>>> invoker
>>>>>> as a RemoteException.
>>>>>> 
>>>>>> -- Proposed code (untested) --
>>>>>> 3 long now = System.currentTimeMillis();
>>>>>> 4 long endTime = now + timeoutMillis;
>>>>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>>>>> 6 if (now>= endTime) {
>>>>>> 7 setDown("timeout waiting for server to respond
>>>>>> to handshake", null);
>>>>>> 8 } else {
>>>>>> 9 try {
>>>>>> 10 muxLock.wait(endTime - now);
>>>>>> 11 now = System.currentTimeMillis();
>>>>>> 12 } catch (InterruptedException e) {
>>>>>> 13 setDown("interrupt waiting for connection
>>>>>> header", e);
>>>>>> 14 }
>>>>>> 15 }
>>>>>> 16 }
>>>>>> 
>>>>>> This code assumes a configurable timeoutMillis parameter has been
>>> set
>>>>>> earlier.
>>>>>> 
>>>>>> I can't think of any alternative solutions. Putting the timeout in
>>> the
>>>>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>>>>> implement a ping packet (see Mux.asyncSendPing, never used), but
>>> that
>>>>>> would only be relevant after the initial handshake and wouldn't help
>>>>>> here.
>>>>>> 
>>>>>> Thanks,
>>>>>> Chris
>>>>>> 
>>>> 
>>>> Michael McGrady
>>>> Chief Systems Architect
>>>> Topia Technology, Inc.
>>>> Cel 1.253.720.3365
>>>> Work 1.253.572.9712 extension 2037
>>>> Skype ID: michael.mcgrady5
>>>> mmcgrady@topiatechnology.com
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Dan Creswell <da...@gmail.com>.

I think maybe the right tradeoff is to firstly figure out root
cause/fix the problem and then later, as part of a general move
towards JDK 5 re-structure to make use of "Java 5 timeout stuff".

My basic concern would be we would otherwise run the risk of
introducing complications/more bugs whilst we try and sort the
original.

On 3 May 2011 09:12, Tom Hobbs <tv...@googlemail.com> wrote:
> Dan raises a good point, which is probably of higher importance than the one
> I did.
>
> My only comment about the use of a timeout was questioning whether we could
> start to use some Java 5 timeout stuff.  You guys have been looking at that
> bit of the code more than me, so if the answer is "no" then I can accept
> that.
>
> On 3 May 2011 08:29, "Dan Creswell" <da...@gmail.com> wrote:
>> Mmm, agreed though I think to do it requires that Chris get to the
>> root cause that is producing the need for a client timeout.
>>
>> On 2 May 2011 23:17, Gregg Wonderly <ge...@cox.net> wrote:
>>> Yes I think a timeout is what will work the best.  I just want to make
> sure we can release any associated server resources at the same time so that
> this doesn't result in a resource leak in the server.
>>>
>>> Gregg
>>>
>>> Sent from my iPhone
>>>
>>> On May 2, 2011, at 11:27 AM, "Christopher Dolan" <
> christopher.dolan@avid.com> wrote:
>>>
>>>> I strongly agree  with Gregg's comments about timeouts in general, but I
>>>> think this might be a special case. In the Mux.start() data flow, the
>>>> client sends an 8-byte handshake and expects the server to respond with
>>>> a similar 8-byte handshake (or error) promptly. I'm seeing indefinite
>>>> stalls in real-world cases, so I need to do something and a timeout is
>>>> the only solution I can think of.
>>>>
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Gregg Wonderly [mailto:gergg@cox.net]
>>>> Sent: Saturday, April 30, 2011 10:03 AM
>>>> To: dev@river.apache.org
>>>> Cc: Mike McGrady
>>>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>>>
>>>> In the history of original Java threading model, and the NIO development
>>>> to use
>>>> "select/poll" from your own thread, rather than registering call back
>>>> methods
>>>> (via an interface) kept a lot of development from using a model where
>>>> threading
>>>> was managed internally by the package or by the JVM.  As a result, we
>>>> have
>>>> structures like today where notifications are less common.  In this code
>>>> though,
>>>> I think the structure is internal enough that it's not necessary to
>>>> really use
>>>> Future or some other mechanisms.
>>>>
>>>> Timeouts are always a "hard way" to manage "loss of functionality"
>>>> because you
>>>> really don't know when things are "not working", only that something is
>>>> taking
>>>> longer than your timeout accounted for.   Timeouts can make it possible
>>>> for more
>>>> pending work to pile up on the other end that might slow the results
>>>> even more.
>>>>  E.g. you wait 30seconds and retry while the actual work on the other
>>>> end is
>>>> taking 35 seconds to get to and thus the queue rate exceeds the dequeue
>>>> rate and
>>>> things start piling up.
>>>>
>>>> If you are going to use a timeout, we need to have some sort of
>>>> indication from
>>>> both ends perspective that the attempt has been aborted, as early as
>>>> possible.
>>>> For cases where I/O traffic is written/read, that usually means closing
>>>> the
>>>> socket.  I'm not sure of the ramifications of doing that, since I
>>>> haven't looked
>>>> too hard at this code.
>>>>
>>>> Gregg Wonderly
>>>>
>>>> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>>>>> Throwing my two cents in here just to state my opinion.  This is an
>>>> effort that
>>>>> could pay dividends if it were done with a view toward the future -
>>>> absolutely
>>>>> no pun intended.  I do not know the details of the code but if
>>>> futures could be
>>>>> useful here, they would be welcomed by myself.
>>>>>
>>>>> MG
>>>>>
>>>>>
>>>>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>>>>>
>>>>>> Thanks, Tom.
>>>>>>
>>>>>> I don't really understand your Future suggestion.  Are you suggesting
>>>>>> changing the async handshake to a Future? If so, that sounds like a
>>>> very
>>>>>> involved change, touching a lot of code in Mux and its subclasses.
>>>>>>
>>>>>> setDown() changes the value of the muxDown boolean to true, so it's a
>>>>>> valid way out of the loop.
>>>>>>
>>>>>> Chris
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>>>>> Sent: Friday, April 29, 2011 2:27 PM
>>>>>> To: dev@river.apache.org
>>>>>> Subject: Re: client hang in
>>>> com.sun.jini.jeri.internal.mux.Mux.start()
>>>>>>
>>>>>> The proposed code looks fine to me.  Two points leap out, more
>>>>>> discussion
>>>>>> points than anything else.
>>>>>>
>>>>>> For some reason I've recently developed an aversion to writing my on
>>>>>> timeout
>>>>>> logic.  did you consider using something like a Future here or might
>>>>>> that be
>>>>>> serious overkill (it wouldn't surprise me if it was)?
>>>>>>
>>>>>> Also is Setdown intended to break out of the while loop?  Because I
>>>>>> can't
>>>>>> see a way to escape it.  (I don't have the rest of the code in front
>>>> of
>>>>>> me)
>>>>>>
>>>>>> Thanks for keep raising these issues - especially because you usually
>>>>>> supply
>>>>>> fixes for them!
>>>>>>
>>>>>> Tom
>>>>>>
>>>>>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>>>>>> wrote:
>>>>>>> I've experienced occasional cases where clients get stuck in the
>>>>>>> following block of code in Mux.start. Has anyone experienced this
>>>>>>> problem? I have a proposed solution below. Has anyone thought about
>>>> a
>>>>>>> similar solution already?
>>>>>>>
>>>>>>> -- Current code --
>>>>>>> 1 asyncSendClientConnectionHeader();
>>>>>>> 2 synchronized (muxLock) {
>>>>>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>>>>>> 4 try {
>>>>>>> 5 muxLock.wait(); // REMIND: timeout?
>>>>>>> 6 } catch (InterruptedException e) {
>>>>>>> 7 ...
>>>>>>> 8 }
>>>>>>> 9 }
>>>>>>> 10 if (muxDown) {
>>>>>>> 11 IOException ioe = new IOException(muxDownMessage);
>>>>>>> 12 ioe.initCause(muxDownCause);
>>>>>>> 13 throw ioe;
>>>>>>> 14 }
>>>>>>> 15 }
>>>>>>>
>>>>>>> -- Explanation of the code --
>>>>>>> This code handles the initial client-server handshake that starts a
>>>>>> JERI
>>>>>>> connection. In line 1, the client sends its 8-byte greeting to the
>>>>>>> server. Then in the loop on lines 3-9, it waits for the server's
>>>>>>> response. If the reader thread gets a satisfactory response from the
>>>>>>> server, it sets clientConnectionReady=true and calls
>>>>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>>>>> mismatched
>>>>>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>>>>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
>>>> throws
>>>>>> if
>>>>>>> the handshake was a failure.
>>>>>>>
>>>>>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>>>>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>>>>>> reproduce, so I haven't found out what the server is doing in this
>>>>>> case.
>>>>>>> I hope to figure that out eventually, but presently I'm interested
>>>> in
>>>>>>> the "REMIND: timeout?" comment.
>>>>>>>
>>>>>>> -- Timeout solution --
>>>>>>> It seems obvious to me that there should be a timeout here. There
>>>> are
>>>>>>> lots of imaginable cases where the client could get stuck here:
>>>>>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>>>>>> code. You'd expect that the server would either respond with its
>>>>>> 8-byte
>>>>>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>>>>>> seconds) should be good. If that timeout is triggered, I would
>>>> expect
>>>>>>> that the code above would call Mux.setDown() and throw an
>>>> IOException.
>>>>>>> That exception would either cause a retry or be thrown up to the
>>>>>> invoker
>>>>>>> as a RemoteException.
>>>>>>>
>>>>>>> -- Proposed code (untested) --
>>>>>>> 3 long now = System.currentTimeMillis();
>>>>>>> 4 long endTime = now + timeoutMillis;
>>>>>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>>>>>> 6 if (now>= endTime) {
>>>>>>> 7 setDown("timeout waiting for server to respond
>>>>>>> to handshake", null);
>>>>>>> 8 } else {
>>>>>>> 9 try {
>>>>>>> 10 muxLock.wait(endTime - now);
>>>>>>> 11 now = System.currentTimeMillis();
>>>>>>> 12 } catch (InterruptedException e) {
>>>>>>> 13 setDown("interrupt waiting for connection
>>>>>>> header", e);
>>>>>>> 14 }
>>>>>>> 15 }
>>>>>>> 16 }
>>>>>>>
>>>>>>> This code assumes a configurable timeoutMillis parameter has been
>>>> set
>>>>>>> earlier.
>>>>>>>
>>>>>>> I can't think of any alternative solutions. Putting the timeout in
>>>> the
>>>>>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>>>>>> implement a ping packet (see Mux.asyncSendPing, never used), but
>>>> that
>>>>>>> would only be relevant after the initial handshake and wouldn't help
>>>>>>> here.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Chris
>>>>>>>
>>>>>
>>>>> Michael McGrady
>>>>> Chief Systems Architect
>>>>> Topia Technology, Inc.
>>>>> Cel 1.253.720.3365
>>>>> Work 1.253.572.9712 extension 2037
>>>>> Skype ID: michael.mcgrady5
>>>>> mmcgrady@topiatechnology.com
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Tom Hobbs <tv...@googlemail.com>.

Dan raises a good point, which is probably of higher importance than the one
I did.

My only comment about the use of a timeout was questioning whether we could
start to use some Java 5 timeout stuff.  You guys have been looking at that
bit of the code more than me, so if the answer is "no" then I can accept
that.

On 3 May 2011 08:29, "Dan Creswell" <da...@gmail.com> wrote:
> Mmm, agreed though I think to do it requires that Chris get to the
> root cause that is producing the need for a client timeout.
>
> On 2 May 2011 23:17, Gregg Wonderly <ge...@cox.net> wrote:
>> Yes I think a timeout is what will work the best.  I just want to make
sure we can release any associated server resources at the same time so that
this doesn't result in a resource leak in the server.
>>
>> Gregg
>>
>> Sent from my iPhone
>>
>> On May 2, 2011, at 11:27 AM, "Christopher Dolan" <
christopher.dolan@avid.com> wrote:
>>
>>> I strongly agree  with Gregg's comments about timeouts in general, but I
>>> think this might be a special case. In the Mux.start() data flow, the
>>> client sends an 8-byte handshake and expects the server to respond with
>>> a similar 8-byte handshake (or error) promptly. I'm seeing indefinite
>>> stalls in real-world cases, so I need to do something and a timeout is
>>> the only solution I can think of.
>>>
>>> Chris
>>>
>>> -----Original Message-----
>>> From: Gregg Wonderly [mailto:gergg@cox.net]
>>> Sent: Saturday, April 30, 2011 10:03 AM
>>> To: dev@river.apache.org
>>> Cc: Mike McGrady
>>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>>
>>> In the history of original Java threading model, and the NIO development
>>> to use
>>> "select/poll" from your own thread, rather than registering call back
>>> methods
>>> (via an interface) kept a lot of development from using a model where
>>> threading
>>> was managed internally by the package or by the JVM.  As a result, we
>>> have
>>> structures like today where notifications are less common.  In this code
>>> though,
>>> I think the structure is internal enough that it's not necessary to
>>> really use
>>> Future or some other mechanisms.
>>>
>>> Timeouts are always a "hard way" to manage "loss of functionality"
>>> because you
>>> really don't know when things are "not working", only that something is
>>> taking
>>> longer than your timeout accounted for.   Timeouts can make it possible
>>> for more
>>> pending work to pile up on the other end that might slow the results
>>> even more.
>>>  E.g. you wait 30seconds and retry while the actual work on the other
>>> end is
>>> taking 35 seconds to get to and thus the queue rate exceeds the dequeue
>>> rate and
>>> things start piling up.
>>>
>>> If you are going to use a timeout, we need to have some sort of
>>> indication from
>>> both ends perspective that the attempt has been aborted, as early as
>>> possible.
>>> For cases where I/O traffic is written/read, that usually means closing
>>> the
>>> socket.  I'm not sure of the ramifications of doing that, since I
>>> haven't looked
>>> too hard at this code.
>>>
>>> Gregg Wonderly
>>>
>>> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>>>> Throwing my two cents in here just to state my opinion.  This is an
>>> effort that
>>>> could pay dividends if it were done with a view toward the future -
>>> absolutely
>>>> no pun intended.  I do not know the details of the code but if
>>> futures could be
>>>> useful here, they would be welcomed by myself.
>>>>
>>>> MG
>>>>
>>>>
>>>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>>>>
>>>>> Thanks, Tom.
>>>>>
>>>>> I don't really understand your Future suggestion.  Are you suggesting
>>>>> changing the async handshake to a Future? If so, that sounds like a
>>> very
>>>>> involved change, touching a lot of code in Mux and its subclasses.
>>>>>
>>>>> setDown() changes the value of the muxDown boolean to true, so it's a
>>>>> valid way out of the loop.
>>>>>
>>>>> Chris
>>>>>
>>>>> -----Original Message-----
>>>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>>>> Sent: Friday, April 29, 2011 2:27 PM
>>>>> To: dev@river.apache.org
>>>>> Subject: Re: client hang in
>>> com.sun.jini.jeri.internal.mux.Mux.start()
>>>>>
>>>>> The proposed code looks fine to me.  Two points leap out, more
>>>>> discussion
>>>>> points than anything else.
>>>>>
>>>>> For some reason I've recently developed an aversion to writing my on
>>>>> timeout
>>>>> logic.  did you consider using something like a Future here or might
>>>>> that be
>>>>> serious overkill (it wouldn't surprise me if it was)?
>>>>>
>>>>> Also is Setdown intended to break out of the while loop?  Because I
>>>>> can't
>>>>> see a way to escape it.  (I don't have the rest of the code in front
>>> of
>>>>> me)
>>>>>
>>>>> Thanks for keep raising these issues - especially because you usually
>>>>> supply
>>>>> fixes for them!
>>>>>
>>>>> Tom
>>>>>
>>>>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>>>>> wrote:
>>>>>> I've experienced occasional cases where clients get stuck in the
>>>>>> following block of code in Mux.start. Has anyone experienced this
>>>>>> problem? I have a proposed solution below. Has anyone thought about
>>> a
>>>>>> similar solution already?
>>>>>>
>>>>>> -- Current code --
>>>>>> 1 asyncSendClientConnectionHeader();
>>>>>> 2 synchronized (muxLock) {
>>>>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>>>>> 4 try {
>>>>>> 5 muxLock.wait(); // REMIND: timeout?
>>>>>> 6 } catch (InterruptedException e) {
>>>>>> 7 ...
>>>>>> 8 }
>>>>>> 9 }
>>>>>> 10 if (muxDown) {
>>>>>> 11 IOException ioe = new IOException(muxDownMessage);
>>>>>> 12 ioe.initCause(muxDownCause);
>>>>>> 13 throw ioe;
>>>>>> 14 }
>>>>>> 15 }
>>>>>>
>>>>>> -- Explanation of the code --
>>>>>> This code handles the initial client-server handshake that starts a
>>>>> JERI
>>>>>> connection. In line 1, the client sends its 8-byte greeting to the
>>>>>> server. Then in the loop on lines 3-9, it waits for the server's
>>>>>> response. If the reader thread gets a satisfactory response from the
>>>>>> server, it sets clientConnectionReady=true and calls
>>>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>>>> mismatched
>>>>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>>>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
>>> throws
>>>>> if
>>>>>> the handshake was a failure.
>>>>>>
>>>>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>>>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>>>>> reproduce, so I haven't found out what the server is doing in this
>>>>> case.
>>>>>> I hope to figure that out eventually, but presently I'm interested
>>> in
>>>>>> the "REMIND: timeout?" comment.
>>>>>>
>>>>>> -- Timeout solution --
>>>>>> It seems obvious to me that there should be a timeout here. There
>>> are
>>>>>> lots of imaginable cases where the client could get stuck here:
>>>>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>>>>> code. You'd expect that the server would either respond with its
>>>>> 8-byte
>>>>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>>>>> seconds) should be good. If that timeout is triggered, I would
>>> expect
>>>>>> that the code above would call Mux.setDown() and throw an
>>> IOException.
>>>>>> That exception would either cause a retry or be thrown up to the
>>>>> invoker
>>>>>> as a RemoteException.
>>>>>>
>>>>>> -- Proposed code (untested) --
>>>>>> 3 long now = System.currentTimeMillis();
>>>>>> 4 long endTime = now + timeoutMillis;
>>>>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>>>>> 6 if (now>= endTime) {
>>>>>> 7 setDown("timeout waiting for server to respond
>>>>>> to handshake", null);
>>>>>> 8 } else {
>>>>>> 9 try {
>>>>>> 10 muxLock.wait(endTime - now);
>>>>>> 11 now = System.currentTimeMillis();
>>>>>> 12 } catch (InterruptedException e) {
>>>>>> 13 setDown("interrupt waiting for connection
>>>>>> header", e);
>>>>>> 14 }
>>>>>> 15 }
>>>>>> 16 }
>>>>>>
>>>>>> This code assumes a configurable timeoutMillis parameter has been
>>> set
>>>>>> earlier.
>>>>>>
>>>>>> I can't think of any alternative solutions. Putting the timeout in
>>> the
>>>>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>>>>> implement a ping packet (see Mux.asyncSendPing, never used), but
>>> that
>>>>>> would only be relevant after the initial handshake and wouldn't help
>>>>>> here.
>>>>>>
>>>>>> Thanks,
>>>>>> Chris
>>>>>>
>>>>
>>>> Michael McGrady
>>>> Chief Systems Architect
>>>> Topia Technology, Inc.
>>>> Cel 1.253.720.3365
>>>> Work 1.253.572.9712 extension 2037
>>>> Skype ID: michael.mcgrady5
>>>> mmcgrady@topiatechnology.com
>>>>
>>>>
>>>>
>>>>
>>>
>>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Dan Creswell <da...@gmail.com>.

Mmm, agreed though I think to do it requires that Chris get to the
root cause that is producing the need for a client timeout.

On 2 May 2011 23:17, Gregg Wonderly <ge...@cox.net> wrote:
> Yes I think a timeout is what will work the best.  I just want to make sure we can release any associated server resources at the same time so that this doesn't result in a resource leak in the server.
>
> Gregg
>
> Sent from my iPhone
>
> On May 2, 2011, at 11:27 AM, "Christopher Dolan" <ch...@avid.com> wrote:
>
>> I strongly agree  with Gregg's comments about timeouts in general, but I
>> think this might be a special case. In the Mux.start() data flow, the
>> client sends an 8-byte handshake and expects the server to respond with
>> a similar 8-byte handshake (or error) promptly. I'm seeing indefinite
>> stalls in real-world cases, so I need to do something and a timeout is
>> the only solution I can think of.
>>
>> Chris
>>
>> -----Original Message-----
>> From: Gregg Wonderly [mailto:gergg@cox.net]
>> Sent: Saturday, April 30, 2011 10:03 AM
>> To: dev@river.apache.org
>> Cc: Mike McGrady
>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>
>> In the history of original Java threading model, and the NIO development
>> to use
>> "select/poll" from your own thread, rather than registering call back
>> methods
>> (via an interface) kept a lot of development from using a model where
>> threading
>> was managed internally by the package or by the JVM.  As a result, we
>> have
>> structures like today where notifications are less common.  In this code
>> though,
>> I think the structure is internal enough that it's not necessary to
>> really use
>> Future or some other mechanisms.
>>
>> Timeouts are always a "hard way" to manage "loss of functionality"
>> because you
>> really don't know when things are "not working", only that something is
>> taking
>> longer than your timeout accounted for.   Timeouts can make it possible
>> for more
>> pending work to pile up on the other end that might slow the results
>> even more.
>>  E.g. you wait 30seconds and retry while the actual work on the other
>> end is
>> taking 35 seconds to get to and thus the queue rate exceeds the dequeue
>> rate and
>> things start piling up.
>>
>> If you are going to use a timeout, we need to have some sort of
>> indication from
>> both ends perspective that the attempt has been aborted, as early as
>> possible.
>> For cases where I/O traffic is written/read, that usually means closing
>> the
>> socket.  I'm not sure of the ramifications of doing that, since I
>> haven't looked
>> too hard at this code.
>>
>> Gregg Wonderly
>>
>> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>>> Throwing my two cents in here just to state my opinion.  This is an
>> effort that
>>> could pay dividends if it were done with a view toward the future -
>> absolutely
>>> no pun intended.  I do not know the details of the code but if
>> futures could be
>>> useful here, they would be welcomed by myself.
>>>
>>> MG
>>>
>>>
>>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>>>
>>>> Thanks, Tom.
>>>>
>>>> I don't really understand your Future suggestion.  Are you suggesting
>>>> changing the async handshake to a Future? If so, that sounds like a
>> very
>>>> involved change, touching a lot of code in Mux and its subclasses.
>>>>
>>>> setDown() changes the value of the muxDown boolean to true, so it's a
>>>> valid way out of the loop.
>>>>
>>>> Chris
>>>>
>>>> -----Original Message-----
>>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>>> Sent: Friday, April 29, 2011 2:27 PM
>>>> To: dev@river.apache.org
>>>> Subject: Re: client hang in
>> com.sun.jini.jeri.internal.mux.Mux.start()
>>>>
>>>> The proposed code looks fine to me.  Two points leap out, more
>>>> discussion
>>>> points than anything else.
>>>>
>>>> For some reason I've recently developed an aversion to writing my on
>>>> timeout
>>>> logic.  did you consider using something like a Future here or might
>>>> that be
>>>> serious overkill (it wouldn't surprise me if it was)?
>>>>
>>>> Also is Setdown intended to break out of the while loop?  Because I
>>>> can't
>>>> see a way to escape it.  (I don't have the rest of the code in front
>> of
>>>> me)
>>>>
>>>> Thanks for keep raising these issues - especially because you usually
>>>> supply
>>>> fixes for them!
>>>>
>>>> Tom
>>>>
>>>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>>>> wrote:
>>>>> I've experienced occasional cases where clients get stuck in the
>>>>> following block of code in Mux.start. Has anyone experienced this
>>>>> problem? I have a proposed solution below. Has anyone thought about
>> a
>>>>> similar solution already?
>>>>>
>>>>> -- Current code --
>>>>> 1 asyncSendClientConnectionHeader();
>>>>> 2 synchronized (muxLock) {
>>>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>>>> 4 try {
>>>>> 5 muxLock.wait(); // REMIND: timeout?
>>>>> 6 } catch (InterruptedException e) {
>>>>> 7 ...
>>>>> 8 }
>>>>> 9 }
>>>>> 10 if (muxDown) {
>>>>> 11 IOException ioe = new IOException(muxDownMessage);
>>>>> 12 ioe.initCause(muxDownCause);
>>>>> 13 throw ioe;
>>>>> 14 }
>>>>> 15 }
>>>>>
>>>>> -- Explanation of the code --
>>>>> This code handles the initial client-server handshake that starts a
>>>> JERI
>>>>> connection. In line 1, the client sends its 8-byte greeting to the
>>>>> server. Then in the loop on lines 3-9, it waits for the server's
>>>>> response. If the reader thread gets a satisfactory response from the
>>>>> server, it sets clientConnectionReady=true and calls
>>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>>> mismatched
>>>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
>> throws
>>>> if
>>>>> the handshake was a failure.
>>>>>
>>>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>>>> reproduce, so I haven't found out what the server is doing in this
>>>> case.
>>>>> I hope to figure that out eventually, but presently I'm interested
>> in
>>>>> the "REMIND: timeout?" comment.
>>>>>
>>>>> -- Timeout solution --
>>>>> It seems obvious to me that there should be a timeout here. There
>> are
>>>>> lots of imaginable cases where the client could get stuck here:
>>>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>>>> code. You'd expect that the server would either respond with its
>>>> 8-byte
>>>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>>>> seconds) should be good. If that timeout is triggered, I would
>> expect
>>>>> that the code above would call Mux.setDown() and throw an
>> IOException.
>>>>> That exception would either cause a retry or be thrown up to the
>>>> invoker
>>>>> as a RemoteException.
>>>>>
>>>>> -- Proposed code (untested) --
>>>>> 3 long now = System.currentTimeMillis();
>>>>> 4 long endTime = now + timeoutMillis;
>>>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>>>> 6 if (now>= endTime) {
>>>>> 7 setDown("timeout waiting for server to respond
>>>>> to handshake", null);
>>>>> 8 } else {
>>>>> 9 try {
>>>>> 10 muxLock.wait(endTime - now);
>>>>> 11 now = System.currentTimeMillis();
>>>>> 12 } catch (InterruptedException e) {
>>>>> 13 setDown("interrupt waiting for connection
>>>>> header", e);
>>>>> 14 }
>>>>> 15 }
>>>>> 16 }
>>>>>
>>>>> This code assumes a configurable timeoutMillis parameter has been
>> set
>>>>> earlier.
>>>>>
>>>>> I can't think of any alternative solutions. Putting the timeout in
>> the
>>>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>>>> implement a ping packet (see Mux.asyncSendPing, never used), but
>> that
>>>>> would only be relevant after the initial handshake and wouldn't help
>>>>> here.
>>>>>
>>>>> Thanks,
>>>>> Chris
>>>>>
>>>
>>> Michael McGrady
>>> Chief Systems Architect
>>> Topia Technology, Inc.
>>> Cel 1.253.720.3365
>>> Work 1.253.572.9712 extension 2037
>>> Skype ID: michael.mcgrady5
>>> mmcgrady@topiatechnology.com
>>>
>>>
>>>
>>>
>>
>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Gregg Wonderly <ge...@cox.net>.

Yes I think a timeout is what will work the best.  I just want to make sure we can release any associated server resources at the same time so that this doesn't result in a resource leak in the server.

Gregg

Sent from my iPhone

On May 2, 2011, at 11:27 AM, "Christopher Dolan" <ch...@avid.com> wrote:

> I strongly agree  with Gregg's comments about timeouts in general, but I
> think this might be a special case. In the Mux.start() data flow, the
> client sends an 8-byte handshake and expects the server to respond with
> a similar 8-byte handshake (or error) promptly. I'm seeing indefinite
> stalls in real-world cases, so I need to do something and a timeout is
> the only solution I can think of.
> 
> Chris
> 
> -----Original Message-----
> From: Gregg Wonderly [mailto:gergg@cox.net] 
> Sent: Saturday, April 30, 2011 10:03 AM
> To: dev@river.apache.org
> Cc: Mike McGrady
> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
> 
> In the history of original Java threading model, and the NIO development
> to use 
> "select/poll" from your own thread, rather than registering call back
> methods 
> (via an interface) kept a lot of development from using a model where
> threading 
> was managed internally by the package or by the JVM.  As a result, we
> have 
> structures like today where notifications are less common.  In this code
> though, 
> I think the structure is internal enough that it's not necessary to
> really use 
> Future or some other mechanisms.
> 
> Timeouts are always a "hard way" to manage "loss of functionality"
> because you 
> really don't know when things are "not working", only that something is
> taking 
> longer than your timeout accounted for.   Timeouts can make it possible
> for more 
> pending work to pile up on the other end that might slow the results
> even more. 
>  E.g. you wait 30seconds and retry while the actual work on the other
> end is 
> taking 35 seconds to get to and thus the queue rate exceeds the dequeue
> rate and 
> things start piling up.
> 
> If you are going to use a timeout, we need to have some sort of
> indication from 
> both ends perspective that the attempt has been aborted, as early as
> possible. 
> For cases where I/O traffic is written/read, that usually means closing
> the 
> socket.  I'm not sure of the ramifications of doing that, since I
> haven't looked 
> too hard at this code.
> 
> Gregg Wonderly
> 
> On 4/29/2011 4:38 PM, Mike McGrady wrote:
>> Throwing my two cents in here just to state my opinion.  This is an
> effort that
>> could pay dividends if it were done with a view toward the future -
> absolutely
>> no pun intended.  I do not know the details of the code but if
> futures could be
>> useful here, they would be welcomed by myself.
>> 
>> MG
>> 
>> 
>> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>> 
>>> Thanks, Tom.
>>> 
>>> I don't really understand your Future suggestion.  Are you suggesting
>>> changing the async handshake to a Future? If so, that sounds like a
> very
>>> involved change, touching a lot of code in Mux and its subclasses.
>>> 
>>> setDown() changes the value of the muxDown boolean to true, so it's a
>>> valid way out of the loop.
>>> 
>>> Chris
>>> 
>>> -----Original Message-----
>>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>>> Sent: Friday, April 29, 2011 2:27 PM
>>> To: dev@river.apache.org
>>> Subject: Re: client hang in
> com.sun.jini.jeri.internal.mux.Mux.start()
>>> 
>>> The proposed code looks fine to me.  Two points leap out, more
>>> discussion
>>> points than anything else.
>>> 
>>> For some reason I've recently developed an aversion to writing my on
>>> timeout
>>> logic.  did you consider using something like a Future here or might
>>> that be
>>> serious overkill (it wouldn't surprise me if it was)?
>>> 
>>> Also is Setdown intended to break out of the while loop?  Because I
>>> can't
>>> see a way to escape it.  (I don't have the rest of the code in front
> of
>>> me)
>>> 
>>> Thanks for keep raising these issues - especially because you usually
>>> supply
>>> fixes for them!
>>> 
>>> Tom
>>> 
>>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>>> wrote:
>>>> I've experienced occasional cases where clients get stuck in the
>>>> following block of code in Mux.start. Has anyone experienced this
>>>> problem? I have a proposed solution below. Has anyone thought about
> a
>>>> similar solution already?
>>>> 
>>>> -- Current code --
>>>> 1 asyncSendClientConnectionHeader();
>>>> 2 synchronized (muxLock) {
>>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>>> 4 try {
>>>> 5 muxLock.wait(); // REMIND: timeout?
>>>> 6 } catch (InterruptedException e) {
>>>> 7 ...
>>>> 8 }
>>>> 9 }
>>>> 10 if (muxDown) {
>>>> 11 IOException ioe = new IOException(muxDownMessage);
>>>> 12 ioe.initCause(muxDownCause);
>>>> 13 throw ioe;
>>>> 14 }
>>>> 15 }
>>>> 
>>>> -- Explanation of the code --
>>>> This code handles the initial client-server handshake that starts a
>>> JERI
>>>> connection. In line 1, the client sends its 8-byte greeting to the
>>>> server. Then in the loop on lines 3-9, it waits for the server's
>>>> response. If the reader thread gets a satisfactory response from the
>>>> server, it sets clientConnectionReady=true and calls
>>>> muxLock.notifyAll(). In all other cases (aborted connection,
>>> mismatched
>>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
> throws
>>> if
>>>> the handshake was a failure.
>>>> 
>>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>>> reproduce, so I haven't found out what the server is doing in this
>>> case.
>>>> I hope to figure that out eventually, but presently I'm interested
> in
>>>> the "REMIND: timeout?" comment.
>>>> 
>>>> -- Timeout solution --
>>>> It seems obvious to me that there should be a timeout here. There
> are
>>>> lots of imaginable cases where the client could get stuck here:
>>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>>> code. You'd expect that the server would either respond with its
>>> 8-byte
>>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>>> seconds) should be good. If that timeout is triggered, I would
> expect
>>>> that the code above would call Mux.setDown() and throw an
> IOException.
>>>> That exception would either cause a retry or be thrown up to the
>>> invoker
>>>> as a RemoteException.
>>>> 
>>>> -- Proposed code (untested) --
>>>> 3 long now = System.currentTimeMillis();
>>>> 4 long endTime = now + timeoutMillis;
>>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>>> 6 if (now>= endTime) {
>>>> 7 setDown("timeout waiting for server to respond
>>>> to handshake", null);
>>>> 8 } else {
>>>> 9 try {
>>>> 10 muxLock.wait(endTime - now);
>>>> 11 now = System.currentTimeMillis();
>>>> 12 } catch (InterruptedException e) {
>>>> 13 setDown("interrupt waiting for connection
>>>> header", e);
>>>> 14 }
>>>> 15 }
>>>> 16 }
>>>> 
>>>> This code assumes a configurable timeoutMillis parameter has been
> set
>>>> earlier.
>>>> 
>>>> I can't think of any alternative solutions. Putting the timeout in
> the
>>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>>> implement a ping packet (see Mux.asyncSendPing, never used), but
> that
>>>> would only be relevant after the initial handshake and wouldn't help
>>>> here.
>>>> 
>>>> Thanks,
>>>> Chris
>>>> 
>> 
>> Michael McGrady
>> Chief Systems Architect
>> Topia Technology, Inc.
>> Cel 1.253.720.3365
>> Work 1.253.572.9712 extension 2037
>> Skype ID: michael.mcgrady5
>> mmcgrady@topiatechnology.com
>> 
>> 
>> 
>> 
>

RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Christopher Dolan <ch...@avid.com>.

I strongly agree  with Gregg's comments about timeouts in general, but I
think this might be a special case. In the Mux.start() data flow, the
client sends an 8-byte handshake and expects the server to respond with
a similar 8-byte handshake (or error) promptly. I'm seeing indefinite
stalls in real-world cases, so I need to do something and a timeout is
the only solution I can think of.

Chris

-----Original Message-----
From: Gregg Wonderly [mailto:gergg@cox.net] 
Sent: Saturday, April 30, 2011 10:03 AM
To: dev@river.apache.org
Cc: Mike McGrady
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

In the history of original Java threading model, and the NIO development
to use 
"select/poll" from your own thread, rather than registering call back
methods 
(via an interface) kept a lot of development from using a model where
threading 
was managed internally by the package or by the JVM.  As a result, we
have 
structures like today where notifications are less common.  In this code
though, 
I think the structure is internal enough that it's not necessary to
really use 
Future or some other mechanisms.

Timeouts are always a "hard way" to manage "loss of functionality"
because you 
really don't know when things are "not working", only that something is
taking 
longer than your timeout accounted for.   Timeouts can make it possible
for more 
pending work to pile up on the other end that might slow the results
even more. 
  E.g. you wait 30seconds and retry while the actual work on the other
end is 
taking 35 seconds to get to and thus the queue rate exceeds the dequeue
rate and 
things start piling up.

If you are going to use a timeout, we need to have some sort of
indication from 
both ends perspective that the attempt has been aborted, as early as
possible. 
For cases where I/O traffic is written/read, that usually means closing
the 
socket.  I'm not sure of the ramifications of doing that, since I
haven't looked 
too hard at this code.

Gregg Wonderly

On 4/29/2011 4:38 PM, Mike McGrady wrote:
> Throwing my two cents in here just to state my opinion.  This is an
effort that
 > could pay dividends if it were done with a view toward the future -
absolutely
 > no pun intended.  I do not know the details of the code but if
futures could be
 > useful here, they would be welcomed by myself.
>
> MG
>
>
> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>
>> Thanks, Tom.
>>
>> I don't really understand your Future suggestion.  Are you suggesting
>> changing the async handshake to a Future? If so, that sounds like a
very
>> involved change, touching a lot of code in Mux and its subclasses.
>>
>> setDown() changes the value of the muxDown boolean to true, so it's a
>> valid way out of the loop.
>>
>> Chris
>>
>> -----Original Message-----
>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>> Sent: Friday, April 29, 2011 2:27 PM
>> To: dev@river.apache.org
>> Subject: Re: client hang in
com.sun.jini.jeri.internal.mux.Mux.start()
>>
>> The proposed code looks fine to me.  Two points leap out, more
>> discussion
>> points than anything else.
>>
>> For some reason I've recently developed an aversion to writing my on
>> timeout
>> logic.  did you consider using something like a Future here or might
>> that be
>> serious overkill (it wouldn't surprise me if it was)?
>>
>> Also is Setdown intended to break out of the while loop?  Because I
>> can't
>> see a way to escape it.  (I don't have the rest of the code in front
of
>> me)
>>
>> Thanks for keep raising these issues - especially because you usually
>> supply
>> fixes for them!
>>
>> Tom
>>
>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>> wrote:
>>> I've experienced occasional cases where clients get stuck in the
>>> following block of code in Mux.start. Has anyone experienced this
>>> problem? I have a proposed solution below. Has anyone thought about
a
>>> similar solution already?
>>>
>>> -- Current code --
>>> 1 asyncSendClientConnectionHeader();
>>> 2 synchronized (muxLock) {
>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>> 4 try {
>>> 5 muxLock.wait(); // REMIND: timeout?
>>> 6 } catch (InterruptedException e) {
>>> 7 ...
>>> 8 }
>>> 9 }
>>> 10 if (muxDown) {
>>> 11 IOException ioe = new IOException(muxDownMessage);
>>> 12 ioe.initCause(muxDownCause);
>>> 13 throw ioe;
>>> 14 }
>>> 15 }
>>>
>>> -- Explanation of the code --
>>> This code handles the initial client-server handshake that starts a
>> JERI
>>> connection. In line 1, the client sends its 8-byte greeting to the
>>> server. Then in the loop on lines 3-9, it waits for the server's
>>> response. If the reader thread gets a satisfactory response from the
>>> server, it sets clientConnectionReady=true and calls
>>> muxLock.notifyAll(). In all other cases (aborted connection,
>> mismatched
>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it
throws
>> if
>>> the handshake was a failure.
>>>
>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>> reproduce, so I haven't found out what the server is doing in this
>> case.
>>> I hope to figure that out eventually, but presently I'm interested
in
>>> the "REMIND: timeout?" comment.
>>>
>>> -- Timeout solution --
>>> It seems obvious to me that there should be a timeout here. There
are
>>> lots of imaginable cases where the client could get stuck here:
>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>> code. You'd expect that the server would either respond with its
>> 8-byte
>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>> seconds) should be good. If that timeout is triggered, I would
expect
>>> that the code above would call Mux.setDown() and throw an
IOException.
>>> That exception would either cause a retry or be thrown up to the
>> invoker
>>> as a RemoteException.
>>>
>>> -- Proposed code (untested) --
>>> 3 long now = System.currentTimeMillis();
>>> 4 long endTime = now + timeoutMillis;
>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>> 6 if (now>= endTime) {
>>> 7 setDown("timeout waiting for server to respond
>>> to handshake", null);
>>> 8 } else {
>>> 9 try {
>>> 10 muxLock.wait(endTime - now);
>>> 11 now = System.currentTimeMillis();
>>> 12 } catch (InterruptedException e) {
>>> 13 setDown("interrupt waiting for connection
>>> header", e);
>>> 14 }
>>> 15 }
>>> 16 }
>>>
>>> This code assumes a configurable timeoutMillis parameter has been
set
>>> earlier.
>>>
>>> I can't think of any alternative solutions. Putting the timeout in
the
>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>> implement a ping packet (see Mux.asyncSendPing, never used), but
that
>>> would only be relevant after the initial handshake and wouldn't help
>>> here.
>>>
>>> Thanks,
>>> Chris
>>>
>
> Michael McGrady
> Chief Systems Architect
> Topia Technology, Inc.
> Cel 1.253.720.3365
> Work 1.253.572.9712 extension 2037
> Skype ID: michael.mcgrady5
> mmcgrady@topiatechnology.com
>
>
>
>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Gregg Wonderly <ge...@cox.net>.

In the history of original Java threading model, and the NIO development to use 
"select/poll" from your own thread, rather than registering call back methods 
(via an interface) kept a lot of development from using a model where threading 
was managed internally by the package or by the JVM.  As a result, we have 
structures like today where notifications are less common.  In this code though, 
I think the structure is internal enough that it's not necessary to really use 
Future or some other mechanisms.

Timeouts are always a "hard way" to manage "loss of functionality" because you 
really don't know when things are "not working", only that something is taking 
longer than your timeout accounted for.   Timeouts can make it possible for more 
pending work to pile up on the other end that might slow the results even more. 
  E.g. you wait 30seconds and retry while the actual work on the other end is 
taking 35 seconds to get to and thus the queue rate exceeds the dequeue rate and 
things start piling up.

If you are going to use a timeout, we need to have some sort of indication from 
both ends perspective that the attempt has been aborted, as early as possible. 
For cases where I/O traffic is written/read, that usually means closing the 
socket.  I'm not sure of the ramifications of doing that, since I haven't looked 
too hard at this code.

Gregg Wonderly

On 4/29/2011 4:38 PM, Mike McGrady wrote:
> Throwing my two cents in here just to state my opinion.  This is an effort that
 > could pay dividends if it were done with a view toward the future - absolutely
 > no pun intended.  I do not know the details of the code but if futures could be
 > useful here, they would be welcomed by myself.
>
> MG
>
>
> On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:
>
>> Thanks, Tom.
>>
>> I don't really understand your Future suggestion.  Are you suggesting
>> changing the async handshake to a Future? If so, that sounds like a very
>> involved change, touching a lot of code in Mux and its subclasses.
>>
>> setDown() changes the value of the muxDown boolean to true, so it's a
>> valid way out of the loop.
>>
>> Chris
>>
>> -----Original Message-----
>> From: Tom Hobbs [mailto:tvhobbs@googlemail.com]
>> Sent: Friday, April 29, 2011 2:27 PM
>> To: dev@river.apache.org
>> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
>>
>> The proposed code looks fine to me.  Two points leap out, more
>> discussion
>> points than anything else.
>>
>> For some reason I've recently developed an aversion to writing my on
>> timeout
>> logic.  did you consider using something like a Future here or might
>> that be
>> serious overkill (it wouldn't surprise me if it was)?
>>
>> Also is Setdown intended to break out of the while loop?  Because I
>> can't
>> see a way to escape it.  (I don't have the rest of the code in front of
>> me)
>>
>> Thanks for keep raising these issues - especially because you usually
>> supply
>> fixes for them!
>>
>> Tom
>>
>> On 29 Apr 2011 19:41, "Christopher Dolan"<ch...@avid.com>
>> wrote:
>>> I've experienced occasional cases where clients get stuck in the
>>> following block of code in Mux.start. Has anyone experienced this
>>> problem? I have a proposed solution below. Has anyone thought about a
>>> similar solution already?
>>>
>>> -- Current code --
>>> 1 asyncSendClientConnectionHeader();
>>> 2 synchronized (muxLock) {
>>> 3 while (!muxDown&&  !clientConnectionReady) {
>>> 4 try {
>>> 5 muxLock.wait(); // REMIND: timeout?
>>> 6 } catch (InterruptedException e) {
>>> 7 ...
>>> 8 }
>>> 9 }
>>> 10 if (muxDown) {
>>> 11 IOException ioe = new IOException(muxDownMessage);
>>> 12 ioe.initCause(muxDownCause);
>>> 13 throw ioe;
>>> 14 }
>>> 15 }
>>>
>>> -- Explanation of the code --
>>> This code handles the initial client-server handshake that starts a
>> JERI
>>> connection. In line 1, the client sends its 8-byte greeting to the
>>> server. Then in the loop on lines 3-9, it waits for the server's
>>> response. If the reader thread gets a satisfactory response from the
>>> server, it sets clientConnectionReady=true and calls
>>> muxLock.notifyAll(). In all other cases (aborted connection,
>> mismatched
>>> protocol version, etc) the reader invokes Mux.setDown() which sets
>>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws
>> if
>>> the handshake was a failure.
>>>
>>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>>> invoker thread sits on line 5 indefinitely. My problem hard to
>>> reproduce, so I haven't found out what the server is doing in this
>> case.
>>> I hope to figure that out eventually, but presently I'm interested in
>>> the "REMIND: timeout?" comment.
>>>
>>> -- Timeout solution --
>>> It seems obvious to me that there should be a timeout here. There are
>>> lots of imaginable cases where the client could get stuck here:
>>> server-side deadlock, abrupt server crash, logic error in client Mux
>>> code. You'd expect that the server would either respond with its
>> 8-byte
>>> handshake very quickly or never, so a modest timeout (like 15 or 30
>>> seconds) should be good. If that timeout is triggered, I would expect
>>> that the code above would call Mux.setDown() and throw an IOException.
>>> That exception would either cause a retry or be thrown up to the
>> invoker
>>> as a RemoteException.
>>>
>>> -- Proposed code (untested) --
>>> 3 long now = System.currentTimeMillis();
>>> 4 long endTime = now + timeoutMillis;
>>> 5 while (!muxDown&&  !clientConnectionReady) {
>>> 6 if (now>= endTime) {
>>> 7 setDown("timeout waiting for server to respond
>>> to handshake", null);
>>> 8 } else {
>>> 9 try {
>>> 10 muxLock.wait(endTime - now);
>>> 11 now = System.currentTimeMillis();
>>> 12 } catch (InterruptedException e) {
>>> 13 setDown("interrupt waiting for connection
>>> header", e);
>>> 14 }
>>> 15 }
>>> 16 }
>>>
>>> This code assumes a configurable timeoutMillis parameter has been set
>>> earlier.
>>>
>>> I can't think of any alternative solutions. Putting the timeout in the
>>> Reader logic seems higher risk. There's incomplete code in JERI to
>>> implement a ping packet (see Mux.asyncSendPing, never used), but that
>>> would only be relevant after the initial handshake and wouldn't help
>>> here.
>>>
>>> Thanks,
>>> Chris
>>>
>
> Michael McGrady
> Chief Systems Architect
> Topia Technology, Inc.
> Cel 1.253.720.3365
> Work 1.253.572.9712 extension 2037
> Skype ID: michael.mcgrady5
> mmcgrady@topiatechnology.com
>
>
>
>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Mike McGrady <mi...@gmail.com>.

Throwing my two cents in here just to state my opinion.  This is an effort that could pay dividends if it were done with a view toward the future - absolutely no pun intended.  I do not know the details of the code but if futures could be useful here, they would be welcomed by myself.

MG


On Apr 29, 2011, at 1:33 PM, Christopher Dolan wrote:

> Thanks, Tom.
> 
> I don't really understand your Future suggestion.  Are you suggesting
> changing the async handshake to a Future? If so, that sounds like a very
> involved change, touching a lot of code in Mux and its subclasses.
> 
> setDown() changes the value of the muxDown boolean to true, so it's a
> valid way out of the loop.
> 
> Chris
> 
> -----Original Message-----
> From: Tom Hobbs [mailto:tvhobbs@googlemail.com] 
> Sent: Friday, April 29, 2011 2:27 PM
> To: dev@river.apache.org
> Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()
> 
> The proposed code looks fine to me.  Two points leap out, more
> discussion
> points than anything else.
> 
> For some reason I've recently developed an aversion to writing my on
> timeout
> logic.  did you consider using something like a Future here or might
> that be
> serious overkill (it wouldn't surprise me if it was)?
> 
> Also is Setdown intended to break out of the while loop?  Because I
> can't
> see a way to escape it.  (I don't have the rest of the code in front of
> me)
> 
> Thanks for keep raising these issues - especially because you usually
> supply
> fixes for them!
> 
> Tom
> 
> On 29 Apr 2011 19:41, "Christopher Dolan" <ch...@avid.com>
> wrote:
>> I've experienced occasional cases where clients get stuck in the
>> following block of code in Mux.start. Has anyone experienced this
>> problem? I have a proposed solution below. Has anyone thought about a
>> similar solution already?
>> 
>> -- Current code --
>> 1 asyncSendClientConnectionHeader();
>> 2 synchronized (muxLock) {
>> 3 while (!muxDown && !clientConnectionReady) {
>> 4 try {
>> 5 muxLock.wait(); // REMIND: timeout?
>> 6 } catch (InterruptedException e) {
>> 7 ...
>> 8 }
>> 9 }
>> 10 if (muxDown) {
>> 11 IOException ioe = new IOException(muxDownMessage);
>> 12 ioe.initCause(muxDownCause);
>> 13 throw ioe;
>> 14 }
>> 15 }
>> 
>> -- Explanation of the code --
>> This code handles the initial client-server handshake that starts a
> JERI
>> connection. In line 1, the client sends its 8-byte greeting to the
>> server. Then in the loop on lines 3-9, it waits for the server's
>> response. If the reader thread gets a satisfactory response from the
>> server, it sets clientConnectionReady=true and calls
>> muxLock.notifyAll(). In all other cases (aborted connection,
> mismatched
>> protocol version, etc) the reader invokes Mux.setDown() which sets
>> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws
> if
>> the handshake was a failure.
>> 
>> In my scenario (which uses simple TCP sockets, nothing fancy), the
>> invoker thread sits on line 5 indefinitely. My problem hard to
>> reproduce, so I haven't found out what the server is doing in this
> case.
>> I hope to figure that out eventually, but presently I'm interested in
>> the "REMIND: timeout?" comment.
>> 
>> -- Timeout solution --
>> It seems obvious to me that there should be a timeout here. There are
>> lots of imaginable cases where the client could get stuck here:
>> server-side deadlock, abrupt server crash, logic error in client Mux
>> code. You'd expect that the server would either respond with its
> 8-byte
>> handshake very quickly or never, so a modest timeout (like 15 or 30
>> seconds) should be good. If that timeout is triggered, I would expect
>> that the code above would call Mux.setDown() and throw an IOException.
>> That exception would either cause a retry or be thrown up to the
> invoker
>> as a RemoteException.
>> 
>> -- Proposed code (untested) --
>> 3 long now = System.currentTimeMillis();
>> 4 long endTime = now + timeoutMillis;
>> 5 while (!muxDown && !clientConnectionReady) {
>> 6 if (now >= endTime) {
>> 7 setDown("timeout waiting for server to respond
>> to handshake", null);
>> 8 } else {
>> 9 try {
>> 10 muxLock.wait(endTime - now);
>> 11 now = System.currentTimeMillis();
>> 12 } catch (InterruptedException e) {
>> 13 setDown("interrupt waiting for connection
>> header", e);
>> 14 }
>> 15 }
>> 16 }
>> 
>> This code assumes a configurable timeoutMillis parameter has been set
>> earlier.
>> 
>> I can't think of any alternative solutions. Putting the timeout in the
>> Reader logic seems higher risk. There's incomplete code in JERI to
>> implement a ping packet (see Mux.asyncSendPing, never used), but that
>> would only be relevant after the initial handshake and wouldn't help
>> here.
>> 
>> Thanks,
>> Chris
>> 

Michael McGrady
Chief Systems Architect
Topia Technology, Inc.
Cel 1.253.720.3365
Work 1.253.572.9712 extension 2037
Skype ID: michael.mcgrady5
mmcgrady@topiatechnology.com

RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Christopher Dolan <ch...@avid.com>.

Thanks, Tom.

I don't really understand your Future suggestion.  Are you suggesting
changing the async handshake to a Future? If so, that sounds like a very
involved change, touching a lot of code in Mux and its subclasses.

setDown() changes the value of the muxDown boolean to true, so it's a
valid way out of the loop.

Chris

-----Original Message-----
From: Tom Hobbs [mailto:tvhobbs@googlemail.com] 
Sent: Friday, April 29, 2011 2:27 PM
To: dev@river.apache.org
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

The proposed code looks fine to me.  Two points leap out, more
discussion
points than anything else.

For some reason I've recently developed an aversion to writing my on
timeout
logic.  did you consider using something like a Future here or might
that be
serious overkill (it wouldn't surprise me if it was)?

Also is Setdown intended to break out of the while loop?  Because I
can't
see a way to escape it.  (I don't have the rest of the code in front of
me)

Thanks for keep raising these issues - especially because you usually
supply
fixes for them!

Tom

On 29 Apr 2011 19:41, "Christopher Dolan" <ch...@avid.com>
wrote:
> I've experienced occasional cases where clients get stuck in the
> following block of code in Mux.start. Has anyone experienced this
> problem? I have a proposed solution below. Has anyone thought about a
> similar solution already?
>
> -- Current code --
> 1 asyncSendClientConnectionHeader();
> 2 synchronized (muxLock) {
> 3 while (!muxDown && !clientConnectionReady) {
> 4 try {
> 5 muxLock.wait(); // REMIND: timeout?
> 6 } catch (InterruptedException e) {
> 7 ...
> 8 }
> 9 }
> 10 if (muxDown) {
> 11 IOException ioe = new IOException(muxDownMessage);
> 12 ioe.initCause(muxDownCause);
> 13 throw ioe;
> 14 }
> 15 }
>
> -- Explanation of the code --
> This code handles the initial client-server handshake that starts a
JERI
> connection. In line 1, the client sends its 8-byte greeting to the
> server. Then in the loop on lines 3-9, it waits for the server's
> response. If the reader thread gets a satisfactory response from the
> server, it sets clientConnectionReady=true and calls
> muxLock.notifyAll(). In all other cases (aborted connection,
mismatched
> protocol version, etc) the reader invokes Mux.setDown() which sets
> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws
if
> the handshake was a failure.
>
> In my scenario (which uses simple TCP sockets, nothing fancy), the
> invoker thread sits on line 5 indefinitely. My problem hard to
> reproduce, so I haven't found out what the server is doing in this
case.
> I hope to figure that out eventually, but presently I'm interested in
> the "REMIND: timeout?" comment.
>
> -- Timeout solution --
> It seems obvious to me that there should be a timeout here. There are
> lots of imaginable cases where the client could get stuck here:
> server-side deadlock, abrupt server crash, logic error in client Mux
> code. You'd expect that the server would either respond with its
8-byte
> handshake very quickly or never, so a modest timeout (like 15 or 30
> seconds) should be good. If that timeout is triggered, I would expect
> that the code above would call Mux.setDown() and throw an IOException.
> That exception would either cause a retry or be thrown up to the
invoker
> as a RemoteException.
>
> -- Proposed code (untested) --
> 3 long now = System.currentTimeMillis();
> 4 long endTime = now + timeoutMillis;
> 5 while (!muxDown && !clientConnectionReady) {
> 6 if (now >= endTime) {
> 7 setDown("timeout waiting for server to respond
> to handshake", null);
> 8 } else {
> 9 try {
> 10 muxLock.wait(endTime - now);
> 11 now = System.currentTimeMillis();
> 12 } catch (InterruptedException e) {
> 13 setDown("interrupt waiting for connection
> header", e);
> 14 }
> 15 }
> 16 }
>
> This code assumes a configurable timeoutMillis parameter has been set
> earlier.
>
> I can't think of any alternative solutions. Putting the timeout in the
> Reader logic seems higher risk. There's incomplete code in JERI to
> implement a ping packet (see Mux.asyncSendPing, never used), but that
> would only be relevant after the initial handshake and wouldn't help
> here.
>
> Thanks,
> Chris
>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Tom Hobbs <tv...@googlemail.com>.

The proposed code looks fine to me.  Two points leap out, more discussion
points than anything else.

For some reason I've recently developed an aversion to writing my on timeout
logic.  did you consider using something like a Future here or might that be
serious overkill (it wouldn't surprise me if it was)?

Also is Setdown intended to break out of the while loop?  Because I can't
see a way to escape it.  (I don't have the rest of the code in front of me)

Thanks for keep raising these issues - especially because you usually supply
fixes for them!

Tom

On 29 Apr 2011 19:41, "Christopher Dolan" <ch...@avid.com>
wrote:
> I've experienced occasional cases where clients get stuck in the
> following block of code in Mux.start. Has anyone experienced this
> problem? I have a proposed solution below. Has anyone thought about a
> similar solution already?
>
> -- Current code --
> 1 asyncSendClientConnectionHeader();
> 2 synchronized (muxLock) {
> 3 while (!muxDown && !clientConnectionReady) {
> 4 try {
> 5 muxLock.wait(); // REMIND: timeout?
> 6 } catch (InterruptedException e) {
> 7 ...
> 8 }
> 9 }
> 10 if (muxDown) {
> 11 IOException ioe = new IOException(muxDownMessage);
> 12 ioe.initCause(muxDownCause);
> 13 throw ioe;
> 14 }
> 15 }
>
> -- Explanation of the code --
> This code handles the initial client-server handshake that starts a JERI
> connection. In line 1, the client sends its 8-byte greeting to the
> server. Then in the loop on lines 3-9, it waits for the server's
> response. If the reader thread gets a satisfactory response from the
> server, it sets clientConnectionReady=true and calls
> muxLock.notifyAll(). In all other cases (aborted connection, mismatched
> protocol version, etc) the reader invokes Mux.setDown() which sets
> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws if
> the handshake was a failure.
>
> In my scenario (which uses simple TCP sockets, nothing fancy), the
> invoker thread sits on line 5 indefinitely. My problem hard to
> reproduce, so I haven't found out what the server is doing in this case.
> I hope to figure that out eventually, but presently I'm interested in
> the "REMIND: timeout?" comment.
>
> -- Timeout solution --
> It seems obvious to me that there should be a timeout here. There are
> lots of imaginable cases where the client could get stuck here:
> server-side deadlock, abrupt server crash, logic error in client Mux
> code. You'd expect that the server would either respond with its 8-byte
> handshake very quickly or never, so a modest timeout (like 15 or 30
> seconds) should be good. If that timeout is triggered, I would expect
> that the code above would call Mux.setDown() and throw an IOException.
> That exception would either cause a retry or be thrown up to the invoker
> as a RemoteException.
>
> -- Proposed code (untested) --
> 3 long now = System.currentTimeMillis();
> 4 long endTime = now + timeoutMillis;
> 5 while (!muxDown && !clientConnectionReady) {
> 6 if (now >= endTime) {
> 7 setDown("timeout waiting for server to respond
> to handshake", null);
> 8 } else {
> 9 try {
> 10 muxLock.wait(endTime - now);
> 11 now = System.currentTimeMillis();
> 12 } catch (InterruptedException e) {
> 13 setDown("interrupt waiting for connection
> header", e);
> 14 }
> 15 }
> 16 }
>
> This code assumes a configurable timeoutMillis parameter has been set
> earlier.
>
> I can't think of any alternative solutions. Putting the timeout in the
> Reader logic seems higher risk. There's incomplete code in JERI to
> implement a ping packet (see Mux.asyncSendPing, never used), but that
> would only be relevant after the initial handshake and wouldn't help
> here.
>
> Thanks,
> Chris
>

RE: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Christopher Dolan <ch...@avid.com>.

Peter,
My patch for this problem was accepted into River 2.2.0 via
https://issues.apache.org/jira/browse/RIVER-397
Chris

-----Original Message-----
From: Peter Jones [mailto:pcj@roundroom.net] 
Sent: Monday, June 20, 2011 5:14 PM
To: dev@river.apache.org
Subject: Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Chris,

A bit late but FWIW: your reasoning and low-risk solution look right to
me-- I believe that it's pretty much what that "REMIND" comment was
intending.

(I think that there was some thought that this method shouldn't actually
block for the handshake response, instead just letting it get processed
asynchronously and only blocking when necessary later.  But, if I
remember correctly, given how the upper layers actually use this code, I
don't think that such a change would have a practical benefit for the
added complexity.)

Cheers,

-- Peter


On Apr 29, 2011, at 2:40 PM, Christopher Dolan wrote:

> I've experienced occasional cases where clients get stuck in the
> following block of code in Mux.start. Has anyone experienced this
> problem? I have a proposed solution below. Has anyone thought about a
> similar solution already?
> 
> -- Current code --
> 1 	    asyncSendClientConnectionHeader();
> 2 	    synchronized (muxLock) {
> 3 		while (!muxDown && !clientConnectionReady) {
> 4 		    try {
> 5 			muxLock.wait();		// REMIND: timeout?
> 6 		    } catch (InterruptedException e) {
> 7 			...
> 8 		    }
> 9 		}
> 10		if (muxDown) {
> 11		    IOException ioe = new IOException(muxDownMessage);
> 12		    ioe.initCause(muxDownCause);
> 13		    throw ioe;
> 14		}
> 15	    }
> 
> -- Explanation of the code --
> This code handles the initial client-server handshake that starts a
JERI
> connection. In line 1, the client sends its 8-byte greeting to the
> server. Then in the loop on lines 3-9, it waits for the server's
> response. If the reader thread gets a satisfactory response from the
> server, it sets clientConnectionReady=true and calls
> muxLock.notifyAll(). In all other cases (aborted connection,
mismatched
> protocol version, etc) the reader invokes Mux.setDown() which sets
> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws
if
> the handshake was a failure.
> 
> In my scenario (which uses simple TCP sockets, nothing fancy), the
> invoker thread sits on line 5 indefinitely. My problem hard to
> reproduce, so I haven't found out what the server is doing in this
case.
> I hope to figure that out eventually, but presently I'm interested in
> the "REMIND: timeout?" comment.
> 
> -- Timeout solution --
> It seems obvious to me that there should be a timeout here. There are
> lots of imaginable cases where the client could get stuck here:
> server-side deadlock, abrupt server crash, logic error in client Mux
> code. You'd expect that the server would either respond with its
8-byte
> handshake very quickly or never, so a modest timeout (like 15 or 30
> seconds) should be good. If that timeout is triggered, I would expect
> that the code above would call Mux.setDown() and throw an IOException.
> That exception would either cause a retry or be thrown up to the
invoker
> as a RemoteException.
> 
> -- Proposed code (untested) --
> 3 		long now = System.currentTimeMillis();
> 4 		long endTime = now + timeoutMillis;
> 5 		while (!muxDown && !clientConnectionReady) {
> 6 		    if (now >= endTime) {
> 7 			setDown("timeout waiting for server to respond
> to handshake", null);
> 8 		    } else {
> 9 			try {
> 10			    muxLock.wait(endTime - now);
> 11			    now = System.currentTimeMillis();
> 12			} catch (InterruptedException e) {
> 13			    setDown("interrupt waiting for connection
> header", e);
> 14			}
> 15		    }
> 16		}
> 
> This code assumes a configurable timeoutMillis parameter has been set
> earlier.
> 
> I can't think of any alternative solutions. Putting the timeout in the
> Reader logic seems higher risk. There's incomplete code in JERI to
> implement a ping packet (see Mux.asyncSendPing, never used), but that
> would only be relevant after the initial handshake and wouldn't help
> here.
> 
> Thanks,
> Chris
>

Re: client hang in com.sun.jini.jeri.internal.mux.Mux.start()

Posted by Peter Jones <pc...@roundroom.net>.

Chris,

A bit late but FWIW: your reasoning and low-risk solution look right to me-- I believe that it's pretty much what that "REMIND" comment was intending.

(I think that there was some thought that this method shouldn't actually block for the handshake response, instead just letting it get processed asynchronously and only blocking when necessary later.  But, if I remember correctly, given how the upper layers actually use this code, I don't think that such a change would have a practical benefit for the added complexity.)

Cheers,

-- Peter


On Apr 29, 2011, at 2:40 PM, Christopher Dolan wrote:

> I've experienced occasional cases where clients get stuck in the
> following block of code in Mux.start. Has anyone experienced this
> problem? I have a proposed solution below. Has anyone thought about a
> similar solution already?
> 
> -- Current code --
> 1 	    asyncSendClientConnectionHeader();
> 2 	    synchronized (muxLock) {
> 3 		while (!muxDown && !clientConnectionReady) {
> 4 		    try {
> 5 			muxLock.wait();		// REMIND: timeout?
> 6 		    } catch (InterruptedException e) {
> 7 			...
> 8 		    }
> 9 		}
> 10		if (muxDown) {
> 11		    IOException ioe = new IOException(muxDownMessage);
> 12		    ioe.initCause(muxDownCause);
> 13		    throw ioe;
> 14		}
> 15	    }
> 
> -- Explanation of the code --
> This code handles the initial client-server handshake that starts a JERI
> connection. In line 1, the client sends its 8-byte greeting to the
> server. Then in the loop on lines 3-9, it waits for the server's
> response. If the reader thread gets a satisfactory response from the
> server, it sets clientConnectionReady=true and calls
> muxLock.notifyAll(). In all other cases (aborted connection, mismatched
> protocol version, etc) the reader invokes Mux.setDown() which sets
> muxDown=true and calls muxLock.notifyAll(). In lines 10-14, it throws if
> the handshake was a failure.
> 
> In my scenario (which uses simple TCP sockets, nothing fancy), the
> invoker thread sits on line 5 indefinitely. My problem hard to
> reproduce, so I haven't found out what the server is doing in this case.
> I hope to figure that out eventually, but presently I'm interested in
> the "REMIND: timeout?" comment.
> 
> -- Timeout solution --
> It seems obvious to me that there should be a timeout here. There are
> lots of imaginable cases where the client could get stuck here:
> server-side deadlock, abrupt server crash, logic error in client Mux
> code. You'd expect that the server would either respond with its 8-byte
> handshake very quickly or never, so a modest timeout (like 15 or 30
> seconds) should be good. If that timeout is triggered, I would expect
> that the code above would call Mux.setDown() and throw an IOException.
> That exception would either cause a retry or be thrown up to the invoker
> as a RemoteException.
> 
> -- Proposed code (untested) --
> 3 		long now = System.currentTimeMillis();
> 4 		long endTime = now + timeoutMillis;
> 5 		while (!muxDown && !clientConnectionReady) {
> 6 		    if (now >= endTime) {
> 7 			setDown("timeout waiting for server to respond
> to handshake", null);
> 8 		    } else {
> 9 			try {
> 10			    muxLock.wait(endTime - now);
> 11			    now = System.currentTimeMillis();
> 12			} catch (InterruptedException e) {
> 13			    setDown("interrupt waiting for connection
> header", e);
> 14			}
> 15		    }
> 16		}
> 
> This code assumes a configurable timeoutMillis parameter has been set
> earlier.
> 
> I can't think of any alternative solutions. Putting the timeout in the
> Reader logic seems higher risk. There's incomplete code in JERI to
> implement a ping packet (see Mux.asyncSendPing, never used), but that
> would only be relevant after the initial handshake and wouldn't help
> here.
> 
> Thanks,
> Chris
>