Posted to user@river.apache.org by Patrick Wright <pd...@gmail.com> on 2009/08/17 15:36:33 UTC

Problem with SDM/LookupCache when LUS unavailable

Hi

We ran into a situation last week where, after a firewall update on a
server hosting an LUS, the LUS was not available for about 45 minutes.
On that server setup, we have two LUS running (LUS-1 and LUS-2), both
configured with the same group and no other attributes--essentially, a
simple redundant setup. All services are registered with both LUS.

On all our clients, we use a ServiceDiscoveryManager and perform
lookups via a LookupCache. What we observed was that, on one
particular client, the client continued to attempt to access a service
instance which we believe was registered with LUS-1, the one which was
no longer reachable; we base this on extra logging output we have in a
DiscoveryListener attached to the SDM, which was throwing an exception
(ConnectionException, timeout) when trying to call
serviceRegistrar.getLocator().toString(). This exception was being
thrown throughout the 45 minutes until the firewall issue was fixed.
Along with those log entries, the client was also reporting that a
given service was not available, although we know it was available on
LUS-2. The exception is not the issue--the issue is that this
particular client was not failing over to the same service instances
registered with LUS-2.

On our other (many) Jini clients, we did not see the same behavior. In
the logs where we've taken a look, the client did report the same
exception, but just once, on trying to call
serviceRegistrar.getLocator().toString(); it then appeared to
continue using the service instances registered with LUS-2 without
problems.

What is unclear to us, from the documentation, is how cases of a
ServiceRegistrar outage are handled by the SDM and the LookupCache.
What we imagine is that the event lease between the client and the
registrar must fail to renew, and at that point, the SDM should note
the problem and remove the registrar from the cache.

In particular, what we're not sure of is if we ourselves have to add
some special handling for this case (e.g. creating a new cache,
calling discard on the service instances) or if this should be handled
automagically, and there is some problem with configuration on our
end.


Thanks in advance,
Patrick

Re: Problem with SDM/LookupCache when LUS unavailable

Posted by Patrick Wright <pd...@gmail.com>.
Hi Mark

Thanks for your reply. I'll try to clear a few things up.

First, LUS-1 and LUS-2 are both tagged with the same public group, no
other attributes, and are reached via multicast by all clients.

What we observe in our logs is that a particular client continued to
try and access a service which (as far as we can tell) it had located
via LUS-1, and calls to that service from that client failed, for the
entire time LUS-1 was unreachable. We also see DiscoveryEvent
notifications arriving at that client referring to LUS-1 during that
45 minute period (we don't have all the details of the events in the
logs, unfortunately), and that LUS-1 was unreachable from that client
during that time.

What confuses us is that other clients complained (so to speak) at
most once when LUS-1 was unavailable, then continued to operate, we
assume, using LUS-2. So we were wondering if there was something we
didn't understand about how an SDM and its LookupCache react to an LUS
no longer being reachable. What we expect is that the SDM will,
perhaps on a lease expiration, remove all entries in the cache related
to that registrar. However, this didn't appear to happen, and we
haven't found any documentation that indicates what it does or should
do.


> Are you suggesting that
> some clients didn't find a particular service in the LookupCache that was
> registered with LUS-2 while LUS-1 was not reachable.

We have (at least) one client which appeared to continue to try and
work with LUS-1 for over 45 minutes after it was last available, and
it appeared to also attempt to retrieve a service proxy, continuously,
from LUS-1 during that period, again unsuccessfully.

>
> In case the SDM in your client was able to see LUS-2 it shouldn't have any
> problem seeing your service even in case LUS-1 became unreachable, assuming
> no other problems than LUS-1 not being reachable occurred.

This is what appeared to occur on other Jini clients in the network,
and what we want.


> is used for finding your lookup service. In that case are you sure that
> LUS-2 was found by the SDM of your client? A good way to find out is to
> configure logging for the logger documented in
> http://java.sun.com/products/jini/2.1/doc/api/net/jini/discovery/LookupDiscovery.html,
> set the level to FINEST.

I think we have a discovery listener and logger of our own configured,
but I am not sure whether we log all events by default.


> The spec of ServiceDiscoveryListener
> (http://java.sun.com/products/jini/2.1/doc/api/net/jini/lookup/ServiceDiscoveryListener.html)
> talks a lot about these cases.

This spec seems related to service events, not registrar events.


> I've used multiple lookup services for redundancy problems and failure of
> one shouldn't result in a registered service becoming 'invisible' if the
> others were still reachable.

This is what we expect and generally, I think it's worked for us as
well. The particular firewall/iptables mess was a new situation we
hadn't faced in this server configuration before.


Thanks!
Patrick

Re: Problem with SDM/LookupCache when LUS unavailable

Posted by Mark Brouwer <ma...@marbro.org>.
Hi Patrick,

Patrick Wright wrote:
> Hi
> 
> We ran into a situation last week where, after a firewall update on a
> server hosting an LUS, the LUS was not available for about 45 minutes.
> On that server setup, we have two LUS running (LUS-1 and LUS-2), both
> configured with the same group and no other attributes--essentially, a
> simple redundant setup. All services are registered with both LUS.
> 
> On all our clients, we use a ServiceDiscoveryManager and perform
> lookups via a LookupCache. What we observed was that, on one
> particular client, the client continued to attempt to access a service
> instance which we believe was registered with LUS-1, the one which was
> no longer reachable; we base this on extra logging output we have in a
> DiscoveryListener attached to the SDM, which was throwing an exception
> (ConnectionException, timeout) when trying to call
> serviceRegistrar.getLocator().toString(). This exception was being
> thrown throughout the 45 minutes until the firewall issue was fixed.
> Along with those log entries, the client was also reporting that a
> given service was not available, although we know it was available on
> LUS-2. The exception is not the issue--the issue is that this
> particular client was not failing over to the same service instances
> registered with LUS-2.
>
> On our other (many) Jini clients, we did not see the same behavior. In
> the logs where we've taken a look, the client did report the same
> exception, but just once, on trying to call
> serviceRegistrar.getLocator().toString(); it then appeared to
> continue using the service instances registered with LUS-2 without
> problems.

I am a bit confused by your outline of your problem. Are you suggesting 
that some clients didn't find a particular service in the LookupCache 
that was registered with LUS-2 while LUS-1 was not reachable?

In case the SDM in your client was able to see LUS-2 it shouldn't have 
any problem seeing your service even in case LUS-1 became unreachable, 
assuming no other problems than LUS-1 not being reachable occurred.

I don't know whether the lookup services are to be found based on
multicast and/or unicast; from your reference to 'same group' I think
only multicast is used for finding your lookup service. In that case are
you sure that LUS-2 was found by the SDM of your client? A good way to 
find out is to configure logging for the logger documented in 
http://java.sun.com/products/jini/2.1/doc/api/net/jini/discovery/LookupDiscovery.html, 
set the level to FINEST.
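
For reference, a minimal sketch of switching that logger to FINEST
programmatically with java.util.logging. It assumes the logger name is
net.jini.discovery.LookupDiscovery, as given in the javadoc linked
above; the class and method names (DiscoveryLogging, enable) are made
up for illustration. An equivalent entry in a logging.properties file
should work just as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical helper, not part of Jini. */
class DiscoveryLogging {
    // java.util.logging holds loggers weakly, so keep strong references to
    // the ones configured here or their settings may be lost.
    private static final List<Logger> CONFIGURED = new ArrayList<>();

    /** Sets the named logger to the given level and prints its records on the console. */
    static Logger enable(String loggerName, Level level) {
        Logger logger = Logger.getLogger(loggerName);
        logger.setLevel(level);
        // The default console handler only passes INFO and above, so attach
        // a handler that lets the requested level through.
        Handler handler = new ConsoleHandler();
        handler.setLevel(level);
        logger.addHandler(handler);
        // Do not forward to the root handlers, to avoid duplicate lines.
        logger.setUseParentHandlers(false);
        CONFIGURED.add(logger);
        return logger;
    }

    public static void main(String[] args) {
        // Assumed logger name, taken from the LookupDiscovery javadoc.
        enable("net.jini.discovery.LookupDiscovery", Level.FINEST);
    }
}
```

Calling enable(...) early in client startup, before discovery begins,
should be enough; any other logger name can be passed the same way.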

> What is unclear to us, from the documentation, is how cases of a
> ServiceRegistrar outage are handled by the SDM and the LookupCache.
> What we imagine is that the event lease between the client and the
> registrar must fail to renew, and at that point, the SDM should note
> the problem and remove the registrar from the cache.
> 
> In particular, what we're not sure of is if we ourselves have to add
> some special handling for this case (e.g. creating a new cache,
> calling discard on the service instances) or if this should be handled
> automagically, and there is some problem with configuration on our
> end.

The spec of ServiceDiscoveryListener 
(http://java.sun.com/products/jini/2.1/doc/api/net/jini/lookup/ServiceDiscoveryListener.html) 
talks a lot about these cases.

I've used multiple lookup services for redundancy purposes, and failure
of one shouldn't result in a registered service becoming 'invisible' if
the others were still reachable.

Regards,
-- 
Mark

Re: Problem with SDM/LookupCache when LUS unavailable

Posted by Patrick Wright <pd...@gmail.com>.
>> Well actually that seems OK with me assuming that the SDM also has a
>> registration with LUS-2. The SDM implementation does remove all instances
>> associated with LUS-1 from the lookup cache when LUS-1 is discarded, but if
>> the service is also registered with LUS-2 it will remain in the view
>> provided by the lookup cache.
>
> There is a subtle but important issue here.  If LUS-1 and LUS-2 are both
> active and LUS-2 can see the service advertisement from the LUS-1 machine,
> it will continue to advertise that service, even though the LUS-1 machine is
> unreachable from the client.

Interesting! Didn't realize that.


> You said the LUS-1 machine (and thus associated services) was unreachable.
>  Was that only from the client's perspective, or from the LUS-2 machine's
> perspective as well?

LUS-1 was completely unreachable by any machine on the network. Hard
reboot required. Networking hosed.


Thanks
Patrick

Re: Problem with SDM/LookupCache when LUS unavailable

Posted by Gregg Wonderly <ge...@cox.net>.
On Aug 17, 2009, at 5:14 PM, Mark Brouwer wrote:
> Well actually that seems OK with me assuming that the SDM also has a  
> registration with LUS-2. The SDM implementation does remove all  
> instances associated with LUS-1 from the lookup cache when LUS-1 is  
> discarded, but if the service is also registered with LUS-2 it will  
> remain in the view provided by the lookup cache.

There is a subtle but important issue here.  If LUS-1 and LUS-2 are  
both active and LUS-2 can see the service advertisement from the LUS-1  
machine, it will continue to advertise that service, even though the  
LUS-1 machine is unreachable from the client.

You said the LUS-1 machine (and thus associated services) was  
unreachable.  Was that only from the client's perspective, or from the
LUS-2 machine's perspective as well?

Gregg Wonderly

Re: Problem with SDM/LookupCache when LUS unavailable

Posted by Patrick Wright <pd...@gmail.com>.
Hi Mark

> From the above it is not completely clear whether LUS-1 and LUS-2 were
> running on the same server, bound to the same IP number or that only LUS-1
> was affected by the firewall update and that LUS-2 is on a different server
> or bound to another IP number and not affected by the firewall update.

They were running on two different servers.

>
> Also what went exactly wrong with the firewall update, was nothing reachable
> or was it that just certain services were blocked.

From my understanding of the situation, an admin loaded changes to
iptables config and the box where LUS-1 was running was thereafter
completely unreachable until a hard reboot.


> What was exactly misconfigured with the firewall update. What if the event
> registration fails because certain ports being blocked, while multicast and
> unicast discovery is allowed through the firewall.

I don't know the details, but know enough to say that the box was
unreachable over the network until it was rebooted with the prior
iptables config.


>
> At INFO level for net.jini.lookup.ServiceDiscoveryManager a failure of lease
> creation or renewal should be visible in the logs.

OK, I will make sure we have this enabled in the future.


>
> In case the SDM (by means of an implementation of DiscoveryManagement)
> encounters a definite failure of a lookup service it will discard that
> lookup service, but that lookup service will be eligible for (re)discovery,
> meaning that when the SDM receives another multicast message that indicates
> the lookup service is available on the network it will try to register with
> that lookup service. That will fail in your case and it will be discarded.

OK, thanks for the clarification. I just found the section of
http://java.sun.com/products/jini/2.1/doc/specs/html/servicediscutil-spec.html
(under "The DiscoveryManagement Interface") which describes this.
However, it's not clear to me how a lookup helper class (our clients
are configured to use LookupDiscoveryManager) "determines" whether a lookup
service is no longer available. In the Discovery Utilities Spec
(http://java.sun.com/products/jini/2.1/doc/specs/html/discoveryutil-spec.html),
I find:

"Currently, there exist utilities such as the LookupDiscovery and
LookupDiscoveryManager helper utilities that will, on behalf of a
discovering entity, automatically discard a lookup service upon
determining that the lookup service has become unreachable or
uninteresting. Although most entities will typically employ such a
utility to help with both its discovery as well as its discard duties,
it is important to note that if the entity itself determines that the
lookup service is unavailable, it is the responsibility of the entity
to invoke the discard method. This scenario usually happens when the
entity attempts to interact with a lookup service, but encounters an
exceptional condition (for example, a communication failure). When the
entity actively discards a lookup service, the discarded lookup
service becomes eligible to be re-discovered. Allowing unreachable
lookup services to remain in the managed set can result in repeated
and unnecessary attempts to interact with lookup services with which
the entity can no longer communicate. Thus, the mechanism provided by
this method is intended to provide a way to remove such "stale" lookup
service references from the managed set."

However, I don't find any more detail on the topic. Thus it is unclear
if we need to call discard(registrar) when we believe the registrar is
no longer available. At least in this one case, it appears that the
registrar may not have been discarded.
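
To make the question concrete, here is a rough sketch of what "discard
on communication failure" could look like on the client side. It uses
stand-in interfaces so that it compiles without the Jini jars:
Registrar, ManagedSet, RegistrarProbe and probeOrDiscard are
hypothetical names. In real code the corresponding calls would be
ServiceRegistrar.getLocator(), which can throw RemoteException, and
DiscoveryManagement.discard(ServiceRegistrar). Whether the helper
utilities already do this for us is exactly what is unclear here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for a lookup service proxy; hypothetical, not the Jini interface. */
interface Registrar {
    /** Stand-in for a remote call that fails when the lookup service is unreachable. */
    String describe() throws IOException;
}

/** Stand-in for the managed set of discovered lookup services; hypothetical. */
interface ManagedSet {
    void discard(Registrar registrar);
}

class RegistrarProbe {
    /**
     * Calls the registrar once. On a communication failure the registrar is
     * discarded from the managed set, which makes it eligible for rediscovery.
     * Returns true if the registrar answered.
     */
    static boolean probeOrDiscard(Registrar registrar, ManagedSet managedSet) {
        try {
            registrar.describe();
            return true;
        } catch (IOException e) {
            managedSet.discard(registrar);
            return false;
        }
    }

    public static void main(String[] args) {
        List<Registrar> discarded = new ArrayList<>();
        Registrar reachable = () -> "lus-2";
        Registrar unreachable = () -> { throw new IOException("connect timed out"); };

        System.out.println(probeOrDiscard(reachable, discarded::add));   // true
        System.out.println(probeOrDiscard(unreachable, discarded::add)); // false
        System.out.println(discarded.size());                            // 1
    }
}
```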


Thanks a lot for helping out with this, Mark. I'm going to rework the
logging and then see if I can reproduce this, or at least have better
logging enabled if it reappears. May be some confusion on our end.


Regards
Patrick

Re: Problem with SDM/LookupCache when LUS unavailable

Posted by Mark Brouwer <ma...@marbro.org>.
Hi Patrick,

Patrick Wright wrote:
> Hi
> 
> We ran into a situation last week where, after a firewall update on a
> server hosting an LUS, the LUS was not available for about 45 minutes.
> On that server setup, we have two LUS running (LUS-1 and LUS-2), both
> configured with the same group and no other attributes--essentially, a
> simple redundant setup. All services are registered with both LUS.

From the above it is not completely clear whether LUS-1 and LUS-2 were
running on the same server, bound to the same IP number, or whether only
LUS-1 was affected by the firewall update while LUS-2 is on a
different server or bound to another IP number and not affected by the
firewall update.

Also, what exactly went wrong with the firewall update: was nothing
reachable, or was it that just certain services were blocked?

> On all our clients, we use a ServiceDiscoveryManager and perform
> lookups via a LookupCache. What we observed was that, on one
> particular client, the client continued to attempt to access a service
> instance which we believe was registered with LUS-1, the one which was
> no longer reachable; we base this on extra logging output we have in a
> DiscoveryListener attached to the SDM, which was throwing an exception
> (ConnectionException, timeout) when trying to call
> serviceRegistrar.getLocator().toString(). This exception was being
> thrown throughout the 45 minutes until the firewall issue was fixed.
> Along with those log entries, the client was also reporting that a
> given service was not available, although we know it was available on
> LUS-2. The exception is not the issue--the issue is that this
> particular client was not failing over to the same service instances
> registered with LUS-2.
>
> On our other (many) Jini clients, we did not see the same behavior. In
> the logs where we've taken a look, the client did report the same
> exception, but just once, on trying to call
> serviceRegistrar.getLocator().toString(); it then appeared to
> continue using the service instances registered with LUS-2 without
> problems.
> 
> What is unclear to us, from the documentation, is how cases of a
> ServiceRegistrar outage are handled by the SDM and the LookupCache.
> What we imagine is that the event lease between the client and the
> registrar must fail to renew, and at that point, the SDM should note
> the problem and remove the registrar from the cache.

What exactly was misconfigured with the firewall update? What if the
event registration fails because certain ports are blocked, while
multicast and unicast discovery are allowed through the firewall?

At INFO level for net.jini.lookup.ServiceDiscoveryManager a failure of 
lease creation or renewal should be visible in the logs.

In case the SDM (by means of an implementation of DiscoveryManagement) 
encounters a definite failure of a lookup service it will discard that 
lookup service, but that lookup service will be eligible for 
(re)discovery, meaning that when the SDM receives another multicast 
message that indicates the lookup service is available on the network it 
will try to register with that lookup service. That will fail in your 
case and it will be discarded.

In the other email you wrote:

"What we observed was that, on one
particular client, the client continued to attempt to access a service
instance which we believe was registered with LUS-1, the one which was
no longer reachable;"

Well actually that seems OK with me assuming that the SDM also has a 
registration with LUS-2. The SDM implementation does remove all 
instances associated with LUS-1 from the lookup cache when LUS-1 is 
discarded, but if the service is also registered with LUS-2 it will
remain in the view provided by the lookup cache.
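
As a toy illustration of that behaviour (not the actual SDM code), the
cache can be pictured as a map from each service to the set of lookup
services it was seen in. CacheViewModel and its methods are
hypothetical names, and service and LUS identities are reduced to
plain strings.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of a lookup cache view; hypothetical, not the SDM implementation. */
class CacheViewModel {
    // service ID -> names of the lookup services the service was seen in
    private final Map<String, Set<String>> seenIn = new HashMap<>();

    /** Records that the service is registered with the given lookup service. */
    void discovered(String serviceId, String lus) {
        seenIn.computeIfAbsent(serviceId, k -> new HashSet<>()).add(lus);
    }

    /** Drops every association with the discarded lookup service. */
    void lusDiscarded(String lus) {
        seenIn.values().forEach(set -> set.remove(lus));
        // A service leaves the view only when no lookup service still lists it.
        seenIn.values().removeIf(Set::isEmpty);
    }

    /** True if the service is still visible in the cache view. */
    boolean contains(String serviceId) {
        return seenIn.containsKey(serviceId);
    }

    public static void main(String[] args) {
        CacheViewModel cache = new CacheViewModel();
        cache.discovered("service-A", "LUS-1");
        cache.discovered("service-A", "LUS-2");
        cache.discovered("service-B", "LUS-1");

        cache.lusDiscarded("LUS-1");

        System.out.println(cache.contains("service-A")); // true: still listed by LUS-2
        System.out.println(cache.contains("service-B")); // false: only LUS-1 listed it
    }
}
```

The model only tracks registrations; it says nothing about whether the
service behind a cached proxy is itself reachable.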

Regards,
-- 
Mark