Posted to dev@river.apache.org by Dan Creswell <da...@dcrdev.demon.co.uk> on 2007/06/01 12:11:40 UTC

Re: SourceAliveRemoteEvent Part II

So I'm still confused about what the exact use cases are here, so I'll
guess:

One seems related to firewall traversal etc. and whether or not callbacks
can be performed.

A second appears to be attempting to divine whether or not you've lost
some events.

A third appears to be attempting to divine the health of a remote service.
	
Mark Brouwer wrote:
> Hi Dan,
> 
> Dan Creswell wrote:
>> Hi all,
>>
>> It started with a discussion under the Javaspaces.notify() not reliable
>> conversation and I've now had a bit more time to formulate my thoughts.
>>
>> Without this extra feature we do something like the following in the
>> client:
>>
>> (1)    Setup a watchdog timer with a suitable expiry
>> (2)    On receiving a remote event, reset our watchdog timer
>> (3)    If timer expires, check to see if our source is still alive, check
>> to see if we might've missed an event.
>>
>> What's being proposed, if I understand correctly, is that the source, if
>> it's alive and hasn't generated events in a particular time period,
>> confirms that by posting a SourceAliveRemoteEvent to the client.
> 
> The idea has 3 aspects:
> 
> 1) the SourceAliveRemoteEvent (SARE) protocol is triggered by a
>    QoS invocation constraints set upon registration;
> 
> 2) the source must send a SARE as the first event (this is helpful in
>    finding out whether callbacks are possible);
> 
> 3) the source should send a SARE in case a certain time after the last
>    remote event sent has elapsed.
> 
> Below I will try to clarify why I consider this having advantages over
> performing a ping.
> 
>> This would potentially change the above client code to reset the timer
>> on just a SourceAliveRemoteEvent (SARE).
>>
>> Things of note:
>>
>> (1)    The original solution places the responsibility and load on the
>> client (bar the pinging of the server).  This naturally scales out quite
>> well as the server only has to respond to pings and chances are a client
>> only maintains timers for a few services.  If client timeouts are tuned
>> appropriately to event frequency/typical pause, pings will be rare.
> 
> The SARE protocol is 'triggered' based on a QoS invocation constraint,
> i.e. only clients that have interest in SAREs will register for
> receiving them with their event registration. A server won't be sending
> SAREs for those who have shown no interest, also the constraints can be
> rejected in case the timeout period requested would be too small and
> the server wants to refuse, i.e. the server has a say in the 'tuning'.
> Preventing a client from invoking ping because it sets a very small
> time-out seems to be much harder to control.
> 
> I must say it really depends on what the ping constitutes before I would
> be able to say ping is a trivial operation for the server.
> 
>> (2)    The new solution places much of the responsibility with the
>> server.
>>  I believe there may be a scaling problem here.  In contrast to the
>> client-side approach a server might have a large number of clients to
>> cope with.  This potentially means the server has significant load
>> tracking a large number of timer events for all its clients and posting
>> SAREs in addition to what it already does.
> 
> No denial the proposal brings additional complexity to those services
> that wish to support the constraint.
> 
> I've been implementing SARE in Seven last week and I have it working,
> the event framework became more complex although due to experience in
> building a few of these similar mechanisms at the application layer I
> was able to make some optimizations in the code that gives me the
> impression the overhead is quite minimal assuming a time-out is used
> that relates to the average expected event rate.
> 
> Therefore I'm not that afraid of scalability issues given the fact the
> time-out period is expected to be in line with and probably larger than
> the event rate at which you will be sending events. Or in other words,
> the time-out is likely only small in case you expect a high remote event
> frequency, meaning SAREs won't be sent that often. If they are, your
> server is likely capable of dealing with a large number of events anyway.
> 
> And on the positive side one must find a proper usage for all these
> multi-core/CMT CPUs coming our way.
> 
>> (3)    The only difference between old and new approach from a client
>> coding perspective is what causes a reset of the watchdog timer.
> 
> For a client it seems to me SARE is easier than performing a remote
> method invocation (ping) that might take some time to return. I expect
> few if any ordinary remote method invocations (ping) to take place with
> SARE, so clients are less likely to have to take the additional
> round-trip time (and the possibility of timing out) of these calls into
> account (the calls are exceptions and not the norm). In the ping case
> your timer probably will hand off to something that will perform the ping
> asynchronously to prevent it from interfering with the timer itself.
> 

Yes my timer will indeed hand off but most of what's needed is already
in the JDK.  About all I need to do is write the logic to allow a
programmer to pass the RemoteEvent stream through the watchdog and
provide some kind of callback to invoke if the stream appears to have
been interrupted.

And for the client-side approach as per your solution, pinging or some
other client action will only be triggered in the case where no remote
events arrived in the programmer's defined time period such that the
watchdog fired.
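
A minimal sketch of such a client-side watchdog, using only what the JDK
provides. The class name EventWatchdog and its methods are hypothetical and
not part of any Jini API; the callback is whatever the programmer wants to
happen when the stream appears to have been interrupted (ping the source,
switch to a backup, raise an alarm).

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

/** Hypothetical watchdog: runs onSilence if touch() is not called in time. */
final class EventWatchdog implements AutoCloseable {
    private final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "event-watchdog");
                t.setDaemon(true); // do not keep the JVM alive for the timer
                return t;
            });
    private final long timeoutMillis;
    private final Runnable onSilence;
    private ScheduledFuture<?> pending; // guarded by this

    EventWatchdog(long timeoutMillis, Runnable onSilence) {
        this.timeoutMillis = timeoutMillis;
        this.onSilence = onSilence;
        touch(); // arm the timer at registration time
    }

    /** Call for every event received; cancels and re-arms the timer. */
    synchronized void touch() {
        if (timer.isShutdown()) {
            return; // closed: ignore late events
        }
        if (pending != null) {
            pending.cancel(false);
        }
        // onSilence runs on the timer thread, so it should hand off any
        // slow work (such as a remote ping) to another thread.
        pending = timer.schedule(onSilence, timeoutMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public synchronized void close() {
        if (pending != null) {
            pending.cancel(false);
        }
        timer.shutdownNow();
    }
}
```

In a Jini client, touch() would be called from the RemoteEventListener's
notify method before the event is passed on. After the callback has run the
timer stays disarmed until the next touch().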

The key difference is that my client might erroneously decide to switch;
however, I don't see SARE solving that problem because it's nigh on
impossible to guarantee the event will arrive at all and/or on time.

At this stage I'm trying to fathom how generating an additional event
which can be lost/not-delivered in timely fashion is of much use in
dealing with an environment that loses events in general.

> When your watchdog goes off with SARE you know some QoS criterion hasn't
> been met by your source, versus having to figure out whether it did send
> events which haven't arrived. In many cases with SARE you won't perform a
> request to your source, you might go straight for a backup service and
> ignore the service altogether, or you ring the alarm bell of some
> Network Operations Center. But of course there will be cases you want to
> be a bit more persistent about your event registration.
> 
>> (4)    SAREs, like any other event, can be lost - if one is lost the client
>> watchdog will trigger just as it would in the old approach given
>> sufficient time between RemoteEvents.
> 
> Indeed it is possible a SARE will be lost. Although for most types
> of services I've coded (no multiple hops and no event payload provided
> by "I mess up the codebase clients") the chance a SARE will be forever
> lost due to a transitory failure I consider small compared to the other
> expected failures.
> 
>> (5)    If the source has sent events but they've been lost it won't
>> send an SARE and, again, the client watchdog will timeout and ping.
>>
>> Based on the above it seems to me that whilst an SARE might save a few
>> pings there's additional complexity and greater server load.  If I've
>> missed some subtleties, please shout because right now I don't see
>> enough benefit in this to justify the "pain".
> 
> So far I'm not sure what exactly you mean by a 'ping' in the above. Is
> it just a way to check whether the service is alive or do you envision
> more, something that has a correlation with the event registration and
> internal event framework and that can say meaningful things about its
> ability to deliver events? If it is only something to check whether the
> service is alive/reachable I consider SARE a much richer concept for
> getting info about the ability to deliver events, also because it
> follows the exact route of event delivery. Ping doesn't represent the
> invocation path in case of Jini Distributed Events which (especially in
> the case of security and network topology) might be failing just because
> of these differences.
> 
> In the proposal I also use SARE as the first event to be sent to
> verify whether a callback is possible, so besides a 'source alive' it
> also serves another purpose, namely to find out whether event delivery
> can work at all.
> 
> One thing we haven't covered yet is that a ping for reachability might
> be successful, even while the source is not able to deliver events
> timely due to being overloaded, deadlocks, etc., while SARE will show
> the source is not able to deliver events properly. As such it tells
> me more about the state the event producer is in and its ability to
> serve me.
> 

I agree ping can be misleading but I don't see how SARE really helps -
the absence of a SARE arriving leaves you wondering what exactly
happened.  Did you get overloaded, did you lose an event, did the server
fail?  It seems to me SARE is a hint just as the results of a ping are a
hint just as the timeout due to lack of arriving events that drives the
need to ping is a hint.

> To conclude, a ping (assuming in its simple and generic form) doesn't
> give me enough information about the capabilities of a source to deliver
> events, where SARE can do this better. Yes it will lead to complexity at
> the server, maybe a slight reduction in scalability, but in most cases a
> simplification at the client side and the ability to get indications you
> won't be able to get with ping.
> 
> I'm not saying this is the only way, but to me it represents a pattern I
> have often used and see value in being part of the standard toolbox, but
> so might mechanisms to test for reachability/availability (the ones
> Dennis mentioned).
> 

All understood - it's precisely why I'm asking you the questions - to
determine what might be best (which includes subjective measures of
simplicity, reusability, bang for buck etc)

> My hope is that the common patterns people use can be
> standardized/formalized so that we see more support for them, either
> through frameworks, utilities or whatever people like to see or fit to
> them. But at least in a way they don't stay proprietary in many small
> corners of the Jini empire.

No issue with that!

Dan.

Re: SourceAliveRemoteEvent Part II

Posted by Mark Brouwer <ma...@cheiron.org>.
Dan Creswell wrote:
> So I'm still confused about what the exact use cases are here, so I'll
> guess:
> 
> One seems related to firewall traversal etc. and whether or not callbacks
> can be performed.

The issue here is that a route from host A to host B can be
fundamentally different from the route from host B to host A. Firewall
traversal is just one example; routing, NAT, proxying and DNS are others
I can think of.

If from host A I register for events with a service at host B and
succeed, that tells me nothing about the ability of the event producer
on host B to send events to host A. For that reason some event protocols
have an event equivalent of 'ping' as an agreed-upon first event.

SARE covers this case in its current specification, something not
covered by 'ping'.

> A second appears to be attempting to divine whether or not you've lost
> some events.

That really depends upon whether the actual event protocol (JavaSpaces
notify, the Lookup Service service discovery event protocol) has the
notion of strictly increasing sequence numbers, i.e. no gaps allowed.
JavaSpaces allows for gaps, the Lookup Service spec does not. In general
an event protocol that is used to build a state from events will have
strictly increasing sequence numbering.

Given the semantics for SARE to have a sequence number that equals the
last sent sequence number, you might infer whether you missed an event,
depending on whether the event protocol is strictly increasing.

However, the spec with regard to sequence numbers for a SARE has been
chosen mainly so as not to interfere with the service specific event
protocols.
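
As an illustration of that inference, here is a small sketch of a client-side
gap check. It assumes a protocol whose sequence numbers increase by exactly
one per event (no gaps allowed) and a SARE that carries the last sent
sequence number; the class SequenceGapDetector and its method names are made
up for this example and are not part of the proposal.

```java
/** Hypothetical helper: infers missed events from sequence numbers. */
final class SequenceGapDetector {
    private boolean started; // false until the first event sets a baseline
    private long lastSeen;   // highest sequence number observed so far

    /** Ordinary event: returns how many events appear to be missing before it. */
    synchronized long onEvent(long seq) {
        return advance(seq, seq - lastSeen - 1);
    }

    /** SARE: its number repeats the last sent one, so any difference is a gap. */
    synchronized long onSourceAlive(long lastSentSeq) {
        return advance(lastSentSeq, lastSentSeq - lastSeen);
    }

    private long advance(long seq, long gap) {
        // The first event only establishes the baseline; nothing can be
        // inferred about what was sent before it.
        long missed = started ? Math.max(0, gap) : 0;
        lastSeen = started ? Math.max(lastSeen, seq) : seq;
        started = true;
        return missed;
    }
}
```

A non-zero result is still only a hint: an event that is merely delayed or
delivered out of order will be counted as missing until it shows up. For a
protocol that allows gaps, such as JavaSpaces notify, the check says nothing.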

Asking about the last event sent with a ping I consider more tricky, as
there might be an undefined number of events in transit at the moment
the server receives your ping request. The larger the interval between
your decision to ping and the time the server processes the ping, the
greater the chance you get a false positive answer to your question
whether you missed something.

Assuming you want to minimize the number of false alarms you have to
act upon, I think SARE might do a better job here in most cases.

> A third appears to be attempting to divine the health of a remote service.

Yes, although I wouldn't use 'divine' here. All conclusions one can
derive from data points in our industry are often best guesses with
those related to distributed computing having the lowest credibility :-)

The only thing I would say is that apparently a QoS aspect I agreed upon
with the service has not been met. Any consequences I derive from that
are really a combination of my knowledge of the client, service,
environment, etc. and the value of that particular event registration.
These are all kinds of aspects that are not explicit in the service
specification itself, but that you often take into account when building
systems.

So SARE is nothing more and nothing less than an optional event protocol
that can be multiplexed with existing remote event protocols and that in
certain situations makes it easier for a client to make certain
decisions. It ain't a silver bullet, but to me it often provides value
compared to what a ping can provide me, for reasons I hope were clear in
the previous posting and that I mentioned above.
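
To make the source side of the protocol concrete, here is a rough sketch of
the per-registration bookkeeping as described in this thread: a SARE as the
first event, and another one whenever the agreed period has passed since the
last event sent, carrying the last sent sequence number. All names are
hypothetical, this is not the code in Seven, and the sequence number used for
the very first SARE is an assumption since the thread does not define it.

```java
import java.util.function.LongConsumer;

/** Hypothetical per-registration SARE bookkeeping on the event source. */
final class SourceAliveEmitter {
    private final long idleMillis;              // period agreed at registration
    private final LongConsumer sendSourceAlive; // delivers a SARE with the given number
    private long lastSentSeq;
    private long lastSendTime;

    SourceAliveEmitter(long idleMillis, long initialSeq, long now,
                       LongConsumer sendSourceAlive) {
        this.idleMillis = idleMillis;
        this.sendSourceAlive = sendSourceAlive;
        this.lastSentSeq = initialSeq;  // assumption: numbering of the first SARE
        this.lastSendTime = now;
        sendSourceAlive.accept(initialSeq); // the first event is always a SARE
    }

    /** Record that an ordinary remote event was sent for this registration. */
    synchronized void eventSent(long seq, long now) {
        lastSentSeq = seq;
        lastSendTime = now;
    }

    /** Call periodically; sends a SARE only if the source has been idle long enough. */
    synchronized boolean tick(long now) {
        if (now - lastSendTime < idleMillis) {
            return false;
        }
        sendSourceAlive.accept(lastSentSeq);
        lastSendTime = now;
        return true;
    }
}
```

Time is passed in as a parameter to keep the logic easy to test; a real
service would more likely drive tick from a shared scheduler so that one
timer thread serves many registrations.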


I decided not to respond to the other remarks because I think they are
either too detailed or represent us using different percentages for
the probability of losing events, etc., which, given our possibly
different experience with distributed systems, might be true for both of
us.

So I hope we (and I also hope others form an opinion) can get to the
point where it is agreed that SARE represents a valuable protocol
enhancement that is generally applicable, or that it ain't and that I
should pursue it as part of my own toolbox.
-- 
Mark





Re: SourceAliveRemoteEvent Part II

Posted by Mark Brouwer <ma...@cheiron.org>.
Hi Gregg,

Gregg Wonderly wrote:
> Dan Creswell wrote:
>> I agree ping can be misleading but I don't see how SARE really helps -
>> the absence of a SARE arriving leaves you wondering what exactly
>> happened.  Did you get overloaded, did you lose an event, did the server
>> fail?  It seems to me SARE is a hint just as the results of a ping are a
>> hint just as the timeout due to lack of arriving events that drives the
>> need to ping is a hint.
> 
> One of the things about TCP that I've learned to deal with in different
> ways is the fact that if you don't have keep alive in place, you really
> have to have both ends sending stuff to make the retry timers eventually
> report a disconnect.  Otherwise, the receiving end will just block on a
> read forever.  So, if you can turn on keep alive on the endpoint, that's
> useful.  If the endpoint is not TCP based, but is just a serial cable or
> some other interface, without TCP and keep alive available, then your 
> invocation layer, or application layer has to do the reachability test.
> 
> I've basically learned that if I want something like this to be part of 
> the application, then it needs to be near the application layer, away 
> from the pluggable parts such as JERI provides.
> 
> I have an NIO based client and server class that we use for reachability 
> (not liveness of a particular application, invocation point) of hosts 
> when we have larger fanouts and interdependencies.
> 
> There are lots of ways to do this type of thing.  I believe that we 
> really should look to see if there isn't a common behavior that we can 
> include a pattern/some code to support.  Certainly having the invocation 
> layer discuss reachability between the two ends is helpful, but at what
> point do you "complain" about it, and how do you complain, if this is
> going on asynchronously for a listener kind of notification mechanism?

Given your introduction of the invocation versus application layer and
your mentioning "Certainly having the invocation layer discuss
reachability between the two ends is helpful" I have a bit of a problem
with positioning (sorry, don't know of a less loaded word) your
viewpoint: whether you think SARE is useful or not, whether ping
is sufficient or not, or whether you have something different in mind.

Due to the lack of being able to do the brain-melt I'm going to repeat
something you probably already are aware of. So just to be sure: SARE is
at the application level and not at the invocation layer level. It is
only that some frameworks might provide support for it so that a service
code doesn't need to take care of all the details and as such could be
seen as shielded from it, but nowhere did I intend to assume that it had
its place in the Jini ERI stack.

Can you also explain why you really need an additional NIO based
server and client to test for reachability, and why that can't be
arranged by performing plain RMI calls with proper timeout constraints
set? What concerns me is that you have an additional network path which
consumes an extra port and might need to be secured, plus the usual
stuff such as extra configuration, routability, being able to proxy the
traffic and firewall traversal.
-- 
Mark

Re: SourceAliveRemoteEvent Part II

Posted by Gregg Wonderly <gr...@wonderly.org>.
Dan Creswell wrote:
> I agree ping can be misleading but I don't see how SARE really helps -
> the absence of a SARE arriving leaves you wondering what exactly
> happened.  Did you get overloaded, did you lose an event, did the server
> fail?  It seems to me SARE is a hint just as the results of a ping are a
> hint just as the timeout due to lack of arriving events that drives the
> need to ping is a hint.

One of the things about TCP that I've learned to deal with in different ways is
the fact that if you don't have keep alive in place, you really have to have
both ends sending stuff to make the retry timers eventually report a disconnect.
Otherwise, the receiving end will just block on a read forever.  So, if you
can turn on keep alive on the endpoint, that's useful.  If the endpoint is not
TCP based, but is just a serial cable or some other interface, without TCP and 
keep alive available, then your invocation layer, or application layer has to do 
the reachability test.
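
For a plain TCP endpoint, turning on keep alive in Java looks roughly like
the sketch below. Socket.setKeepAlive is the standard JDK call; the wrapper
class and method name are only for illustration. Whether a JERI endpoint
exposes its sockets this way depends on the socket factory in use, which is
not shown here.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.Socket;

/** Illustrative helper, not part of any Jini or JERI API. */
final class KeepAliveExample {
    /** Opens a TCP connection with SO_KEEPALIVE enabled. */
    static Socket connectWithKeepAlive(InetAddress host, int port) throws IOException {
        Socket socket = new Socket(host, port);
        try {
            // Ask the TCP stack to probe an idle connection so that a dead
            // peer eventually surfaces as an error on a blocked read.
            socket.setKeepAlive(true);
            return socket;
        } catch (IOException e) {
            socket.close(); // do not leak the socket if the option fails
            throw e;
        }
    }
}
```

The probe interval is decided by the operating system rather than by Java,
and the defaults are typically long, so keep alive alone detects a dead peer
much later than an application-level timeout would.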

I've basically learned that if I want something like this to be part of the 
application, then it needs to be near the application layer, away from the 
pluggable parts such as JERI provides.

I have an NIO based client and server class that we use for reachability (not 
liveness of a particular application, invocation point) of hosts when we have 
larger fanouts and interdependencies.

There are lots of ways to do this type of thing.  I believe that we really 
should look to see if there isn't a common behavior that we can include a 
pattern/some code to support.  Certainly having the invocation layer discuss 
reachability between the two ends is helpful, but at what point do you 
"complain" about it, and how do you complain, if this is going on asynchronously 
for a listener kind of notification mechanism?

Gregg Wonderly