You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@river.apache.org by Mark Brouwer <ma...@cheiron.org> on 2007/05/04 01:27:31 UTC

[BUG] NPE in com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop

Recently I encountered an NPE similar to one I reported more than 2
years ago
(http://archives.java.sun.com/cgi-bin/wa?A2=ind0502&L=jini-users&D=0&I=-3&P=2811) 

and that has been fixed in the meantime, attached you will find the
stacktrace.

The test were done on a network with 5 dual 2.7G Xeon servers
and all clients and services were running on Seven with NIO enabled
for the Tcp(Server)Endpoints and all services shared the same
TcpServerEndpoint.

One server was hosting a JavaSpace while the other servers represented
clients. In total 12 clients were trying each from 100 threads to write
a ~ 15KBytes Entry into the space and at the same time each service
tried to obtain these entries from another 100 threads. All bandwidth
was consumed and it was just a matter of time for the JavaSpace to
crash, OutOfMemoryErrors were showing up in the log in the muxer for the
container hosting the JavaSpace just before the JVM passed away.

No problem one would think, although I also noticed that 2 out 3 JVMs in
which 8 out 12 'clients' were running seem to be dead as well although
their JVMs were still there, the discovery protocols were doing fine as
well as the internal JAR file server.

Looking into the log files for these 2 containers I noticed each request
resulted into the exceptions as can be seen in the attachment. The log
files were rather large and I couldn't find another type of cause that
quickly and due to some time stupid action of me I forget to set them
apart in a place that wasn't going to be wiped out.

Some environment data:

Linux x.y.z 2.6.9-42.ELsmp #1 SMP Sat Aug 12 09:39:11 CDT 2006 i686 i686
i386 GNU/Linux

java version "1.5.0_11"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_11-b03)
Java HotSpot(TM) Server VM (build 1.5.0_11-b03, mixed mode)

I analyzed the code and I'm flabbergasted by this NPE (for the record, I
made no modifications to the code) and so far I can only come to the
conclusion either the memory-model for the Sun JVM 1.5.0_11 for Linux is
broken or the fix done for Porter was not 100% perfect although I
haven't seen the NPE for a long time.

I don't have the log files for close inspection but given the fact 2 out
of 3 JVMs had the some behavior and showed the same stacktraces there
must have been something strange going on beyond flipping a random bit.
Unfortunately I don't have access any longer to the environment to take
some time to investigate further.

Might this also indicate we should give 
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857137 some priority 
... Also I have no clue how many people enable NIO for Jini ERI. I've 
done it since the beginning and have been bitten a few time by it. But 
maybe we should enable it by default so more people can do the proper 
field testing. If we don't make it the default it will always stay 
suspicious to some extend.
-- 
Mark

Re: [BUG] NPE in com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop

Posted by Gregg Wonderly <ge...@cox.net>.

Peter Jones wrote:
> With the current implementation-- and BTW on much older multiprocessor
> hardware (this was quite a while ago)-- we did not see such a benefit.
> Of course, other processors would be busy making or dispatching remote
> invocations, including the higher-level marshalling and unmarshalling
> of all of that data.

Contention for resources that is not immediately visible can make extra threads 
seem non-beneficial too.  Given the experiences and examples that Doug Lea and 
company have illustrated and JVM changes to better manage locking, it might be 
interesting to run such tests on newer JVMs to see what's possible now.

Gregg Wonderly

Re: [BUG] NPE in com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop

Posted by Peter Jones <pe...@sun.com>.

On Wed, May 09, 2007 at 04:06:29PM +0200, Mark Brouwer wrote:
> 
> Peter Jones wrote:
> 
> >Also, the NPE shown in your old post above caused termination of the
> >selecting thread, but because of another fix in Porter (2.1):
> >
> >	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6304782
> >
> >such an unexpected exception will no longer terminate the select loop:
> >instead it will retry (with some throttling in the case of recurring
> >unexpected exceptions).  If the unexpected exception was somehow
> >destabilizing (see below), there could be strange behavior on retry.
> >Do you know if this NPE occurred repeatedly (unprovoked by further I/O
> >activity)?
> >
> > <snip/>
> >
> >What do you mean by "seem to be dead"-- just not process any
> >Tcp(Server)Endpoint communication, or something more than that?
> 
> In the IncaX service browser all services seems to be there but trying 
> to open the ServiceUI to Seven itself didn't work (as part of creating 
> the ServiceUI it contacts Seven through some remote method calls).
> 
> Downloading JAR files from the JVM seemed to be working otherwise a 
> restarted Jini Service Browser wouldn't be able to display the various 
> service attributes. also trying to destroy the load test servers didn't 
> have any effect.

That seems plausibly consistent with just Tcp(Server)Endpoint
communication hanging-- which, when using NIO sockets, would happen if
the select loop fails with NPE.

> I went to the log files and found a huge number of NPEs at the end (the 
> ones I attached earlier), given the fact I was with ssh on the remote 
> server my view on that huge amount of exceptions wasn't that great but I 
> was under the impression that they kept appearing. Whether I did a tail 
> on the log file I'm not sure of but I believe there were bursts of NPEs. 
> To me it appeared I initiated them with the calls to the services, but 
> maybe that had nothing to do with it. At the same time it would have 
> been completely normal the lookup service was making callbacks to the 
> services as they use SDM under the hood, so that could have been the 
> cause to.

If there was somehow a key with a null attachment (other than the key
for the wakeup pipe) stuck in the selected-key set, I would expect the
NPE to keep occurring because nothing will remove that key from the
set (of course I wouldn't expect such a key to be in there in the
first place...).  In that case the bursty aspect would have been (with
Porter's 6304782 fix) the select loop's throttling in the event of
repeated unexpected exceptions (it sleeps ten seconds after ten
unexpected exceptions have occurred in less than five seconds).

> >I am also (as before) pretty stumped by what could be the cause of
> >this NPE with the 6229032 fix in Porter.
> >
> >Do you know if an OutOfMemoryError was logged as having occurred in
> >the affected VMs' select loops before this NPE?  The only explanation
> >that comes to my mind so far is if an OutOfMemoryError occurred during
> >the SelectableChannel.register invocation in
> >SelectionManager.processRenewQueue:
> >
> >		    key.selectionKey = key.channel.register(selector,
> >							    renewMask);
> >		    key.selectionKey.attach(key);
> >
> >
> >after the channel had been internally registered with the selector
> >(which appears possible from code inspection: objects are allocated
> >after that step takes effect), so that the selection key's attachment
> >is left null.  (This is an illustration of how destabilizing an
> >OutOfMemoryError can be!  Much like an asynchronous exception.  Even
> >without the attachment problem, the SelectableChannel's state could be
> >corrupted.)  It does seem very unlikely that this would happen in two
> >of your three VMs during the same test run, though (unless perhaps
> >OutOfMemoryError was occurring repeatedly in those VMs?).
[snip]
> 
> I stepped through the same code as you did (on Windows XP) but I
> couldn't find a location where an unchecked exception could have been
> thrown (direct buffers or something like that) and I'm pretty sure the 3
> client JVM had enough memory. The JavaSpace passed away after eating
> about 2GB, but the client JVM didn't consume that much memory, but I'm
> not going to bet my head on this one.
> 
> All servers were identical and all 3 'client' JVM were running exactly 
> the same tests, so it seems rather unlikely that only 2 of them were 
> having OutOfMemoryErrors repeatedly.

Yes, it seems that my wild guess was very unlikely to have been the
real explanation.  I am very curious what it was, though.

> >>Might this also indicate we should give 
> >>http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857137 some priority 
> >>...
> >
> >I think so.  It had been committed to Porter but didn't make the cut
> >in the end.  It should be a straightforward fix (just a simplification
> >of the wakeup mechanism currently there).
> 
> Can't wait to see the code flowing into the River.

Yup.

> >>Also I have no clue how many people enable NIO for Jini ERI. I've 
> >>done it since the beginning and have been bitten a few time by it. But 
> >>maybe we should enable it by default so more people can do the proper 
> >>field testing. If we don't make it the default it will always stay 
> >>suspicious to some extend.
> >
> >It would certainly be nice to reduce thread count in the default case.
> >There are some changes that I think would be good to make in this NIO
> >code before considering that, especially:
> >
> >	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6305650
> >	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6304047
> >
> >(Like 4857137, 6305650 would largely seem to be a simplification.)
> 
> Cool!
> 
> With regard to 6305650 do you envision going to a maximum of 1 selector
> thread or to multiple selector threads with a round-robin or random
> assignment policy or something else.

I think that supporting the option of multiple selector threads in one
of those ways (as opposed to the way of the current implementation)
would be desirable as part of this change.

> 6305650 mentions "greater numbers have not shown to be beneficial in
> practice ..." but I find it hard to comprehend that running Jini ERI
> with NIO on a T1000 with 8 cores and trying to writing very large
> amounts of bytes to the other side or reading very large amounts of data
> would not benefit from more selector threads.

With the current implementation-- and BTW on much older multiprocessor
hardware (this was quite a while ago)-- we did not see such a benefit.
Of course, other processors would be busy making or dispatching remote
invocations, including the higher-level marshalling and unmarshalling
of all of that data.

-- Peter

Re: [BUG] NPE in com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop

Posted by Mark Brouwer <ma...@cheiron.org>.

Hi Peter,

Peter Jones wrote:

> Also, the NPE shown in your old post above caused termination of the
> selecting thread, but because of another fix in Porter (2.1):
> 
> 	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6304782
> 
> such an unexpected exception will no longer terminate the select loop:
> instead it will retry (with some throttling in the case of recurring
> unexpected exceptions).  If the unexpected exception was somehow
> destabilizing (see below), there could be strange behavior on retry.
> Do you know if this NPE occurred repeatedly (unprovoked by further I/O
> activity)?
 >
 > <snip/>
> 
> What do you mean by "seem to be dead"-- just not process any
> Tcp(Server)Endpoint communication, or something more than that?

In the IncaX service browser all services seems to be there but trying 
to open the ServiceUI to Seven itself didn't work (as part of creating 
the ServiceUI it contacts Seven through some remote method calls).

Downloading JAR files from the JVM seemed to be working otherwise a 
restarted Jini Service Browser wouldn't be able to display the various 
service attributes. also trying to destroy the load test servers didn't 
have any effect.

I went to the log files and found a huge number of NPEs at the end (the 
ones I attached earlier), given the fact I was with ssh on the remote 
server my view on that huge amount of exceptions wasn't that great but I 
was under the impression that they kept appearing. Whether I did a tail 
on the log file I'm not sure of but I believe there were bursts of NPEs. 
To me it appeared I initiated them with the calls to the services, but 
maybe that had nothing to do with it. At the same time it would have 
been completely normal the lookup service was making callbacks to the 
services as they use SDM under the hood, so that could have been the 
cause to.

I can't recall seeing any OutOfMemoryError exceptions in the these JVMs, 
but I could have easily missed it as most logging went into one single 
file but with 800 threads "firing at will" [1] that resulted in many 
RemoteExceptions, therefore it was kind of looking for a needle in a hay 
stack with just vi in some log files.

[1] the load tester was just a quick hack of me to find out about some 
throughput figures so exception handling was lacking, but I was 
expecting a 'feel good' moment for all there and not the scenario that 
happened ;-).

> I am also (as before) pretty stumped by what could be the cause of
> this NPE with the 6229032 fix in Porter.
> 
> Do you know if an OutOfMemoryError was logged as having occurred in
> the affected VMs' select loops before this NPE?  The only explanation
> that comes to my mind so far is if an OutOfMemoryError occurred during
> the SelectableChannel.register invocation in
> SelectionManager.processRenewQueue:
> 
> 		    key.selectionKey = key.channel.register(selector,
> 							    renewMask);
> 		    key.selectionKey.attach(key);
 >
 >
> after the channel had been internally registered with the selector
> (which appears possible from code inspection: objects are allocated
> after that step takes effect), so that the selection key's attachment
> is left null.  (This is an illustration of how destabilizing an
> OutOfMemoryError can be!  Much like an asynchronous exception.  Even
> without the attachment problem, the SelectableChannel's state could be
> corrupted).  It does seem very unlikely that this would happen in two
> of your three VMs during the same test run, though (unless perhaps
> OutOfMemoryError was occurring repeatedly in those VMs?).
 >
> [Incidentally, the above code should just use the overload of
> SelectableChannel.register that takes the initial value for the
> SelectionKey's attachment-- I believe that the only reason it doesn't
> do that is because it was written before that overload was added
> during 1.4 development, back when this code was part of the JDK...]

I stepped through the same code as you did (on Windows XP) but I
couldn't find a location where an unchecked exception could have been
thrown (direct buffers or something like that) and I'm pretty sure the 3
client JVM had enough memory. The JavaSpace passed away after eating
about 2GB, but the client JVM didn't consume that much memory, but I'm
not going to bet my head on this one.

All servers were identical and all 3 'client' JVM were running exactly 
the same tests, so it seems rather unlikely that only 2 of them were 
having OutOfMemoryErrors repeatedly.

Without the log files that were destroyed as result of a rather 
brilliant action of my side I doubt we will ever find out the exact cause.

>> Might this also indicate we should give 
>> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857137 some priority 
>> ...
> 
> I think so.  It had been committed to Porter but didn't make the cut
> in the end.  It should be a straightforward fix (just a simplification
> of the wakeup mechanism currently there).

Can't wait to see the code flowing into the River.

>> Also I have no clue how many people enable NIO for Jini ERI. I've 
>> done it since the beginning and have been bitten a few time by it. But 
>> maybe we should enable it by default so more people can do the proper 
>> field testing. If we don't make it the default it will always stay 
>> suspicious to some extend.
> 
> It would certainly be nice to reduce thread count in the default case.
> There are some changes that I think would be good to make in this NIO
> code before considering that, especially:
> 
> 	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6305650
> 	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6304047
> 
> (Like 4857137, 6305650 would largely seem to be a simplification.)

Cool!

With regard to 6305650 do you envision going to a maximum of 1 selector
thread or to multiple selector threads with a round-robin or random
assignment policy or something else.

6305650 mentions "greater numbers have not shown to be beneficial in
practice ..." but I find it hard to comprehend that running Jini ERI
with NIO on a T1000 with 8 cores and trying to writing very large
amounts of bytes to the other side or reading very large amounts of data
would not benefit from more selector threads.
-- 
Mark

Re: [BUG] NPE in com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop

Posted by Peter Jones <pe...@sun.com>.

On Fri, May 04, 2007 at 01:27:31AM +0200, Mark Brouwer wrote:
> Recently I encountered an NPE similar to one I reported more than 2
> years ago
> (http://archives.java.sun.com/cgi-bin/wa?A2=ind0502&L=jini-users&D=0&I=-3&P=2811) 
> 
> and that has been fixed in the meantime, attached you will find the
> stacktrace.

You are referring to this fix:

	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6229032

Also, the NPE shown in your old post above caused termination of the
selecting thread, but because of another fix in Porter (2.1):

	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6304782

such an unexpected exception will no longer terminate the select loop:
instead it will retry (with some throttling in the case of recurring
unexpected exceptions).  If the unexpected exception was somehow
destabilizing (see below), there could be strange behavior on retry.
Do you know if this NPE occurred repeatedly (unprovoked by further I/O
activity)?

> The test were done on a network with 5 dual 2.7G Xeon servers
> and all clients and services were running on Seven with NIO enabled
> for the Tcp(Server)Endpoints and all services shared the same
> TcpServerEndpoint.
> 
> One server was hosting a JavaSpace while the other servers represented
> clients. In total 12 clients were trying each from 100 threads to write
> a ~ 15KBytes Entry into the space and at the same time each service
> tried to obtain these entries from another 100 threads. All bandwidth
> was consumed and it was just a matter of time for the JavaSpace to
> crash, OutOfMemoryErrors were showing up in the log in the muxer for the
> container hosting the JavaSpace just before the JVM passed away.
> 
> No problem one would think, although I also noticed that 2 out 3 JVMs in
> which 8 out 12 'clients' were running seem to be dead as well although
> their JVMs were still there, the discovery protocols were doing fine as
> well as the internal JAR file server.

What do you mean by "seem to be dead"-- just not process any
Tcp(Server)Endpoint communication, or something more than that?

> Looking into the log files for these 2 containers I noticed each request
> resulted into the exceptions as can be seen in the attachment. The log
> files were rather large and I couldn't find another type of cause that
> quickly and due to some time stupid action of me I forget to set them
> apart in a place that wasn't going to be wiped out.
> 
> Some environment data:
> 
> Linux x.y.z 2.6.9-42.ELsmp #1 SMP Sat Aug 12 09:39:11 CDT 2006 i686 i686
> i386 GNU/Linux
> 
> java version "1.5.0_11"
> Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_11-b03)
> Java HotSpot(TM) Server VM (build 1.5.0_11-b03, mixed mode)
> 
> I analyzed the code and I'm flabbergasted by this NPE (for the record, I
> made no modifications to the code) and so far I can only come to the
> conclusion either the memory-model for the Sun JVM 1.5.0_11 for Linux is
> broken or the fix done for Porter was not 100% perfect although I
> haven't seen the NPE for a long time.
> 
> I don't have the log files for close inspection but given the fact 2 out
> of 3 JVMs had the some behavior and showed the same stacktraces there
> must have been something strange going on beyond flipping a random bit.
> Unfortunately I don't have access any longer to the environment to take
> some time to investigate further.

I am also (as before) pretty stumped by what could be the cause of
this NPE with the 6229032 fix in Porter.

Do you know if an OutOfMemoryError was logged as having occurred in
the affected VMs' select loops before this NPE?  The only explanation
that comes to my mind so far is if an OutOfMemoryError occurred during
the SelectableChannel.register invocation in
SelectionManager.processRenewQueue:

		    key.selectionKey = key.channel.register(selector,
							    renewMask);
		    key.selectionKey.attach(key);

after the channel had been internally registered with the selector
(which appears possible from code inspection: objects are allocated
after that step takes effect), so that the selection key's attachment
is left null.  (This is an illustration of how destabilizing an
OutOfMemoryError can be!  Much like an asynchronous exception.  Even
without the attachment problem, the SelectableChannel's state could be
corrupted).  It does seem very unlikely that this would happen in two
of your three VMs during the same test run, though (unless perhaps
OutOfMemoryError was occurring repeatedly in those VMs?).

[Incidentally, the above code should just use the overload of
SelectableChannel.register that takes the initial value for the
SelectionKey's attachment-- I believe that the only reason it doesn't
do that is because it was written before that overload was added
during 1.4 development, back when this code was part of the JDK...]

> Might this also indicate we should give 
> http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4857137 some priority 
> ...

I think so.  It had been committed to Porter but didn't make the cut
in the end.  It should be a straightforward fix (just a simplification
of the wakeup mechanism currently there).

> Also I have no clue how many people enable NIO for Jini ERI. I've 
> done it since the beginning and have been bitten a few time by it. But 
> maybe we should enable it by default so more people can do the proper 
> field testing. If we don't make it the default it will always stay 
> suspicious to some extend.

It would certainly be nice to reduce thread count in the default case.
There are some changes that I think would be good to make in this NIO
code before considering that, especially:

	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6305650
	http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6304047

(Like 4857137, 6305650 would largely seem to be a simplification.)

-- Peter

> Apr 27, 2007 6:15:27 PM com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop run (oid=1.3.6.1.4.12543.2.0, group=default, instance=master)
> WARNING: select loop throws
> java.lang.NullPointerException
>         at com.sun.jini.jeri.internal.runtime.SelectionManager.waitForReadyKey(SelectionManager.java:419)
>         at com.sun.jini.jeri.internal.runtime.SelectionManager.access$600(SelectionManager.java:80)
>         at com.sun.jini.jeri.internal.runtime.SelectionManager$SelectLoop.run(SelectionManager.java:287)
>         at com.sun.jini.thread.ThreadPool$Worker.run(ThreadPool.java:136)
>         at java.lang.Thread.run(Thread.java:595)