
Posted to dev@ode.apache.org by Tammo van Lessen <tv...@gmail.com> on 2008/10/22 16:28:59 UTC

Extensions and Reliability

Hi guys,

there is one open issue regarding the extension activity implementation
I'd like to discuss with you.

Currently, the extension framework supports two models: a) implementing
an extension operation as a very short-running, synchronous operation,
and b) giving the developer a bit more control over when to complete the
activity by implementing it asynchronously (i.e. directly calling
complete() or completeWithFault(), as opposed to having this logic wrap
a runSync() method that the extension developer implements).

complete() and completeWithFault() use Jacob channels to notify that the
activity has been completed.
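
To make the two models concrete, here is a minimal sketch in Java. Only
complete(), completeWithFault() and runSync() come from the description
above; every type name (ExtensionContext, SyncExtensionOperation,
AsyncExtensionOperation, RecordingContext) is made up for illustration
and is not the actual ODE extension API. The Jacob channel behind
complete() is left out entirely.

```java
// Illustrative sketch only: every type name here is hypothetical,
// not the actual ODE extension API.

/** Callback handle the engine passes to an extension (hypothetical). */
interface ExtensionContext {
    void complete();
    void completeWithFault(Throwable fault);
}

/** Model a): the developer only implements runSync(); the wrapper signals completion. */
abstract class SyncExtensionOperation {
    protected abstract void runSync(ExtensionContext ctx) throws Exception;

    public final void run(ExtensionContext ctx) {
        try {
            runSync(ctx);
            ctx.complete();              // normal completion
        } catch (Exception e) {
            ctx.completeWithFault(e);    // any exception becomes a fault
        }
    }
}

/** Model b): the developer decides when to call complete() or completeWithFault(). */
interface AsyncExtensionOperation {
    void run(ExtensionContext ctx);
}

/** Stand-in context that records which completion method was called. */
class RecordingContext implements ExtensionContext {
    boolean completed;
    Throwable fault;

    @Override public void complete() { completed = true; }
    @Override public void completeWithFault(Throwable t) { fault = t; }
}
```

In model b) the extension typically keeps the context around and calls
complete() later from its own thread or callback. That reference lives
only in memory, which is where the reliability question comes from.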

Now the question: The approach described above works more or less fine
for short-running activity implementations. The problem I see is how to
deal with an engine crash after the extension activity has been started
and before it has been completed. In this case we have a problem, as we
cannot recover the extension code once the process instance's (PI's)
state has been recovered. Thus, we would wait forever for the extension
activity to complete.

One possibility would be to also restart the extension implementation
(how?) when the PI recovers, but this might be harmful when the
extension has already done something that must not be repeated. So I
guess it depends on the individual case, but I'm wondering how to deal
with this problem appropriately.

Any ideas? I hope so ;)

Cheers,
  Tammo

Re: Extensions and Reliability

Posted by Matthieu Riou <ma...@offthelip.org>.

On Wed, Oct 22, 2008 at 7:28 AM, Tammo van Lessen <tv...@gmail.com> wrote:

> Hi guys,
>
> there is one open issue regarding the extension activity implementation
> I'd like to discuss with you.
>
> Currently, the extension framework supports two models: a) implementing
> an extension operation as a very short-running, synchronous operation,
> and b) giving the developer a bit more control over when to complete the
> activity by implementing it asynchronously (i.e. directly calling
> complete() or completeWithFault(), as opposed to having this logic wrap
> a runSync() method that the extension developer implements).
>
> complete() and completeWithFault() use Jacob channels to notify that the
> activity has been completed.
>
> Now the question: The approach described above works more or less fine
> for short-running activity implementations. The problem I see is how to
> deal with an engine crash after the extension activity has been started
> and before it has been completed.

Just making sure I understand correctly: that's only for the asynchronous
cases, correct? That is, when people rely on the completion channel.

> In this case we have a problem, as we cannot recover the extension
> code once the process instance's (PI's) state has been recovered.
> Thus, we would wait forever for the extension activity to complete.
>

I've implemented something similar for invoke a few weeks ago. The problem
is pretty much the same (IIUC): if the server crashes while an invoke is
taking place, we never get a reply (assuming it's a two-way) and the invoke
just hangs there.

I fixed this by scheduling a task that checks the invoke after the timeout
period. If we get the reply properly, the task is cancelled (had to add a
cancel operation on the scheduler). If you really get a timeout, the invoke
enters normal recovery and the task will just get discarded when it
executes. But if the server crashes, when it gets restarted the task will
trigger a check of the invoke, see that it hasn't got any reply and that
it's not in recovery and force the recovery.
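
Roughly, the decision logic of that check looks like the sketch below.
This is a simplified in-memory model, not the actual ODE code: the class
and method names (InvokeWatchdog, startInvoke, runCheck, ...) are invented
for illustration, and the scheduledChecks set merely stands in for jobs
that the real scheduler would persist so they survive a restart.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout; a model of the idea, not ODE's scheduler.
enum InvokeState { WAITING, REPLIED, IN_RECOVERY }

enum CheckResult { DISCARD, FORCE_RECOVERY }

final class InvokeWatchdog {
    private final Map<String, InvokeState> invokes = new HashMap<>();
    // Stands in for scheduler jobs persisted in the database.
    private final Set<String> scheduledChecks = new HashSet<>();

    /** Starting an invoke also schedules a check for after the timeout period. */
    void startInvoke(String id) {
        invokes.put(id, InvokeState.WAITING);
        scheduledChecks.add(id);
    }

    /** A proper reply cancels the scheduled check. */
    void onReply(String id) {
        invokes.put(id, InvokeState.REPLIED);
        scheduledChecks.remove(id);
    }

    /** A real timeout sends the invoke into normal recovery; the check stays scheduled. */
    void onTimeout(String id) {
        invokes.put(id, InvokeState.IN_RECOVERY);
    }

    /** Runs when the scheduled check fires, possibly after a server restart. */
    CheckResult runCheck(String id) {
        scheduledChecks.remove(id);
        if (invokes.get(id) == InvokeState.WAITING) {
            // No reply and not in recovery: the server must have crashed mid-invoke.
            invokes.put(id, InvokeState.IN_RECOVERY);
            return CheckResult.FORCE_RECOVERY;
        }
        return CheckResult.DISCARD;
    }

    boolean hasScheduledCheck(String id) { return scheduledChecks.contains(id); }

    InvokeState state(String id) { return invokes.get(id); }
}
```

The crash case is simply the one where neither onReply() nor onTimeout()
ever ran before the check fires.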

So I'm thinking we could generalize this for extension activities? The
problem I see is that for the invoke we have some data external to the
process (the message exchange) that we can check without reloading the whole
thing just to see what it looks like. For extension activities I'm not sure
there's an equivalent. So for those async extensions you might also need an
additional table just to track the status of the "call" externally to the
process.
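
As a rough idea of what such a table would need to hold, here is a small
sketch. It is purely hypothetical: ExtensionCallTable, CallStatus and the
method names are invented here, and a LinkedHashMap stands in for what
would really be a database table keyed by some call identifier.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for a real database table.
enum CallStatus { STARTED, COMPLETED, FAULTED }

final class ExtensionCallTable {
    private final Map<String, CallStatus> rows = new LinkedHashMap<>();

    /** Written when the async extension activity starts. */
    void recordStart(String callId) {
        rows.put(callId, CallStatus.STARTED);
    }

    /** Written when complete() or completeWithFault() is called. */
    void recordEnd(String callId, boolean faulted) {
        rows.put(callId, faulted ? CallStatus.FAULTED : CallStatus.COMPLETED);
    }

    CallStatus status(String callId) {
        return rows.get(callId);
    }

    /** Calls that started but never finished, found without loading any process instance. */
    List<String> unfinishedCalls() {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, CallStatus> row : rows.entrySet()) {
            if (row.getValue() == CallStatus.STARTED) {
                result.add(row.getKey());
            }
        }
        return result;
    }
}
```

A scheduled check after restart could then look only at this table to
decide whether an extension activity needs recovery.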

Does that help?

Matthieu

>
> One possibility would be to also restart the extension implementation
> (how?) when the PI recovers, but this might be harmful when the
> extension has already done something that must not be repeated. So I
> guess it depends on the individual case, but I'm wondering how to deal
> with this problem appropriately.
>
> Any ideas? I hope so ;)
>
> Cheers,
>   Tammo
>