Posted to dev@sling.apache.org by Julian Sedding <js...@gmail.com> on 2016/02/01 10:10:12 UTC

Re: SLING-5421 - Allow JCR installer to recover from being paused indefinitely

Hi Carsten

There are two things to consider:

(1) moving the implementation of the repository based signalling into
the installer (essentially encapsulation)
(2) implementation of a robust protocol for signalling a block to
installers on other cluster nodes

So far I have talked about (1) but didn't go into the details of (2).
What I have in mind for (2) is a content structure that records three
pieces of information:

- Sling ID in order to identify on which cluster node the block was triggered
- Service PID (i.e. the fully qualified class name of the
implementation) in order to know which service triggered the block
- Creation timestamp for information/debugging

The content structure would look like the following:

/system/sling/installer/jcr/pauseInstallation
    <sling-id>/
        <service-pid>/
            <random-uuid>/
                jcr:created = <datetime>

It is important that, as a general rule, any node without children is
eagerly deleted. This means that the installer is blocked if
"pauseInstallation" has at least one child node and unblocked if it
has none (or does not exist itself).

The structure would allow a single service to hold multiple blocks
(each <random-uuid> node representing one).
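
To make this concrete, below is a minimal sketch of what such
block/unblock helpers could look like using plain JCR and Jackrabbit's
JcrUtils. The class and method names (InstallerBlocks, block, unblock,
isBlocked) are made up for illustration, and the sketch uses
nt:unstructured nodes with an ordinary "created" property instead of
the protected jcr:created, so that it stays self-contained:

    import java.util.Calendar;
    import java.util.UUID;
    import javax.jcr.Node;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;
    import org.apache.jackrabbit.commons.JcrUtils;

    public class InstallerBlocks {

        private static final String ROOT =
                "/system/sling/installer/jcr/pauseInstallation";

        // The installer is blocked iff the pause root exists and has
        // at least one child node.
        public static boolean isBlocked(Session session)
                throws RepositoryException {
            return session.nodeExists(ROOT)
                    && session.getNode(ROOT).hasNodes();
        }

        // Creates <root>/<slingId>/<servicePid>/<randomUuid> and
        // returns the path of the new block token node.
        public static String block(Session session, String slingId,
                String servicePid) throws RepositoryException {
            Node root = JcrUtils.getOrCreateByPath(ROOT,
                    "nt:unstructured", session);
            Node token = JcrUtils.getOrAddNode(
                    JcrUtils.getOrAddNode(root, slingId), servicePid)
                    .addNode(UUID.randomUUID().toString());
            token.setProperty("created", Calendar.getInstance());
            session.save();
            return token.getPath();
        }

        // Removes the token and eagerly deletes any now-empty
        // ancestors, per the general rule stated above.
        public static void unblock(Session session, String tokenPath)
                throws RepositoryException {
            Node node = session.getNode(tokenPath);
            while (!node.getPath().equals(ROOT) && !node.hasNodes()) {
                Node parent = node.getParent();
                node.remove();
                node = parent;
            }
            session.save();
        }
    }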

Normally we would assume that a service blocks the installer and later
unblocks it again, ideally using a try/finally block (a short sketch
follows the list below). However, it gets interesting when edge cases
are considered:
- the repository service may get stopped (or restarted), in which case
the unblock can fail
- a cluster node can be killed or disconnected before the unblock can be done
- I have seen log files where the "blocking" service was restarted
while it still blocked the installer, because the installer was
asynchronously processing a batch from a previous block. However, it
is unclear why the unblock did not happen in this case: there were no
exceptions in the log, and I don't believe they were swallowed, because
when I provoked a similar scenario exceptions were written to the log.
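
For reference, the happy path with the hypothetical helpers sketched
above would look like this; deployPackages() is a stand-in for the
deployer's actual work:

    // Hypothetical deployer method; session, slingId and servicePid
    // are assumed to come from the caller's context.
    void deployWithPause(Session session, String slingId,
            String servicePid) throws RepositoryException {
        String token = InstallerBlocks.block(session, slingId, servicePid);
        try {
            deployPackages(); // stand-in for the real deployment work
        } finally {
            // exactly the call that can fail in the edge cases above
            InstallerBlocks.unblock(session, token);
        }
    }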

To recover from such failure scenarios, the installer needs to be unblocked:
- if a blocking service is stopped. A stopped service may still exist
in the JVM and finish execution, so this could be solved using weak
references to a block token and a reference queue (see the sketch
after this list), or alternatively by using a timeout in such cases.
- if a cluster node disappears from a topology, its <sling-id> node
should be removed after a timeout
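
A minimal sketch of the weak-reference variant, assuming a hypothetical
BlockToken object is handed to the blocking service. If the token
becomes unreachable without an explicit unblock (e.g. the service was
stopped and dropped its reference), a periodic cleanup learns the
token's path from the reference queue and can remove the node:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    public class BlockTokenTracker {

        // Opaque token held by the blocking service while it blocks.
        public static final class BlockToken {
            final String path;
            BlockToken(String path) { this.path = path; }
        }

        // Weak reference that remembers the token's repository path.
        private static final class TokenRef
                extends WeakReference<BlockToken> {
            final String path;
            TokenRef(BlockToken token, ReferenceQueue<BlockToken> q) {
                super(token, q);
                this.path = token.path;
            }
        }

        private final ReferenceQueue<BlockToken> queue =
                new ReferenceQueue<>();
        // Keeps the TokenRefs themselves strongly reachable.
        private final Set<TokenRef> live = ConcurrentHashMap.newKeySet();

        public BlockToken register(String tokenPath) {
            BlockToken token = new BlockToken(tokenPath);
            live.add(new TokenRef(token, queue));
            return token;
        }

        // Called periodically; returns paths of tokens that became
        // unreachable without an explicit unblock. An explicit unblock
        // would remove its TokenRef from "live" first (omitted here).
        public List<String> drainAbandoned() {
            List<String> abandoned = new ArrayList<>();
            Reference<? extends BlockToken> ref;
            while ((ref = queue.poll()) != null) {
                live.remove(ref);
                abandoned.add(((TokenRef) ref).path);
            }
            return abandoned;
        }
    }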

There is a danger, however, that unblocking the installer due to
recovery causes a partial deployment to be installed. This may put the
system into an unusable state (e.g. bundles may not be resolvable,
because their dependencies were not updated/installed). I don't know
how we could address this.

Maybe an entirely different approach would be to provide a list of
deployables (e.g. repository paths?) to the installer, which then only
installs the deployables if all are available (ignoring deployables
with extensions it does not handle). This list would need to be
communicated in a cluster as well, however.
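
As a sketch of such an availability check, assuming deployables are
identified by repository paths:

    import java.util.List;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class DeployableBatch {

        // Defer installation until every listed deployable is present.
        public boolean allAvailable(Session session,
                List<String> deployablePaths) throws RepositoryException {
            for (String path : deployablePaths) {
                if (!session.nodeExists(path)) {
                    return false; // the deployment is still incomplete
                }
            }
            return true;
        }
    }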

Regards
Julian


On Sun, Jan 31, 2016 at 10:34 AM, Carsten Ziegeler <cz...@apache.org> wrote:
> Julian Sedding wrote
>> Hi Carsten
>>
>>> Offline discussions don't make it transparent why you came to this
>>> conclusion.
>>> Please enclose the relevant information either here or in the issue.
>>
>> Sure, I thought that I included some information both in the email and
>> in the issue. But it is probably worthwhile to expand on it a bit
>> more.
>>
>> The current implementation is based on a convention rather than a
>> contract: place a node under a specific parent node and the JCR
>> installer will pause its activities.
>>
>> It turns out that this convention in this simple form has limitations
>> when things go wrong:
>>
>> - If a "deployer" implementation fails to delete the pause-marker node
>> (no matter what the reasons are), whose responsibility is it to delete
>> this node to recover the system?
>> - If a "deployer" on cluster node A creates a pause-marker node and
>> then cluster node A is shut down/crashes/disconnects, whose
>> responsibility is it to delete this node to recover the system?
>>
>> Both these questions require a more sophisticated convention IMHO.
>> This becomes a burden for implementers, makes fixing the convention
>> nearly impossible (in case we miss an edge case) and is brittle,
>> because "deployers" may have bugs in their implementations.
>>
>> So the logical conclusion is to move the implementation of this
>> "convention" into Sling and expose it via a simple API.
>>
>> The convention basically becomes an implementation detail, which is
>> needed to distribute the information about blocking the installer
>> within a cluster.
>>
>> Does this answer your questions?
>>
>
> Thanks, yes, however :) I can't follow the conclusion. How is having an
> API with, for example, a pause/resume method to call any different
> from, or easier than, adding/removing a node, and how does it avoid
> the problems?
>
> Carsten
>
>
>
> --
> Carsten Ziegeler
> Adobe Research Switzerland
> cziegeler@apache.org

Re: SLING-5421 - Allow JCR installer to recover from being paused indefinitely

Posted by Carsten Ziegeler <cz...@apache.org>.
Julian Sedding wrote
> Hi Carsten
> 
> Thanks for your comments. I agree that it would be nice if we could
> avoid pausing the installer altogether.

If I remember correctly, that's what we said when we added the current
pausing solution: it is only temporary until we change the package
installer to use the OSGi installer :)

> 
> However, I see some challenges:
> - How do we make sure all nodes in a cluster install the bundles into
> their OSGi environments in a single batch?

Ok, good point - right. Well, if a content package were installed
with a single save, this would be easy :)

> - Currently content packages contain bundles that are installed into
> the repository. How could we prevent duplicate installation (by the
> JCR installer triggered via observation and directly by the OSGi
> installer)?

If the content package were installed through the OSGi installer,
the bundles would still be installed through observation, but due to
the single installer thread only after all content is installed.

> 
> Do you think it is realistic to solve these issues in the short term?

Not sure; however, changing the mechanism as suggested doesn't sound
that easy to me either.
> 
> Even if we can solve them, we will still need reliable
> communication/coordination between cluster nodes. This part, as
> Bertrand suggested in the issue, could be made generic. AFAIK
> ZooKeeper, etcd et al. provide such mechanisms. Maybe we need to
> provide an implementation agnostic API for this in the discovery
> module.

Well, a lot of things are doable - but starting at the real problem and
ending up with such a massive solution spread across bundles/APIs
doesn't sound appealing to me. I would rather spend the energy thinking
about the best way to update an installation and derive a possible
solution from there. Especially with containers like Docker, instances
are not updated; instead a new instance with the new configuration is
started. Therefore I'm wondering if we should really go this far and
add all these things all over the place.

I think for now the immediate issue to resolve is recovering from being
paused indefinitely. The simplest solution is to require clients to add
a timestamp to the node they create; the node will then be removed
after a (long) timeout.
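
A sketch of that cleanup, assuming each marker carries an ordinary
"created" timestamp property and using an illustrative one-hour
timeout:

    import java.util.Calendar;
    import javax.jcr.Node;
    import javax.jcr.NodeIterator;
    import javax.jcr.RepositoryException;
    import javax.jcr.Session;

    public class StalePauseMarkerCleaner {

        private static final String PAUSE_ROOT =
                "/system/sling/installer/jcr/pauseInstallation";
        private static final long TIMEOUT_MS = 60 * 60 * 1000L; // assumed

        // Removes markers whose timestamp is older than the timeout.
        public void removeStaleMarkers(Session session)
                throws RepositoryException {
            if (!session.nodeExists(PAUSE_ROOT)) {
                return;
            }
            long cutoff = System.currentTimeMillis() - TIMEOUT_MS;
            NodeIterator markers = session.getNode(PAUSE_ROOT).getNodes();
            while (markers.hasNext()) {
                Node marker = markers.nextNode();
                if (marker.hasProperty("created")) {
                    Calendar created =
                            marker.getProperty("created").getDate();
                    if (created.getTimeInMillis() < cutoff) {
                        marker.remove();
                    }
                }
            }
            session.save();
        }
    }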

Regards
Carsten
-- 
Carsten Ziegeler
Adobe Research Switzerland
cziegeler@apache.org

Re: SLING-5421 - Allow JCR installer to recover from being paused indefinitely

Posted by Julian Sedding <js...@gmail.com>.
Hi Carsten

Thanks for your comments. I agree that it would be nice if we could
avoid pausing the installer altogether.

However, I see some challenges:
- How do we make sure all nodes in a cluster install the bundles into
their OSGi environments in a single batch?
- Currently content packages contain bundles that are installed into
the repository. How could we prevent duplicate installation (by the
JCR installer triggered via observation and directly by the OSGi
installer)?

Do you think it is realistic to solve these issues in the short term?

Even if we can solve them, we will still need reliable
communication/coordination between cluster nodes. This part, as
Bertrand suggested in the issue, could be made generic. AFAIK
ZooKeeper, etcd et al. provide such mechanisms. Maybe we need to
provide an implementation agnostic API for this in the discovery
module.

Regards
Julian



On Mon, Feb 1, 2016 at 12:49 PM, Carsten Ziegeler <cz...@apache.org> wrote:
> Julian Sedding wrote
>> Hi Carsten
>>
>> There are two things to consider:
>>
>> (1) moving the implementation of the repository based signalling into
>> the installer (essentially encapsulation)
>
> when you say "installer", do you mean "jcr installer" or "osgi
> installer"? From below I assume jcr installer.
>
>
>> (2) implementation of a robust protocol for signalling a block to
>> installers on other cluster nodes
>>
>> So far I have talked about (1) but didn't go into the details of (2).
>> What I have in mind for (2) is a content structure that records three
>> pieces of information:
>>
>> - Sling ID in order to identify on which cluster node the block was triggered
>> - Service PID (i.e. the fully qualified class name of the
>> implementation) in order to know which service triggered the block
>> - Creation timestamp for information/debugging
>>
>> The content structure would look like the following:
>>
>> /system/sling/installer/jcr/pauseInstallation
>>     <sling-id>/
>>         <service-pid>/
>>             <random-uuid>/
>>                 jcr:created = <datetime>
>>
>> It is important that, as a general rule, any node without children is
>> eagerly deleted. This means that the installer is blocked if
>> "pauseInstallation" has at least one child node and unblocked if it
>> has none (or does not exist itself).
>>
>> The structure would allow a single service to hold multiple blocks
>> (each <random-uuid> node representing one).
>>
>> Normally we would assume that a service blocks the installer and later
>> unblocks it again, ideally using a try/finally block. However, it gets
>> interesting when edge cases are considered:
>> - the repository service may get stopped (or restarted), in which case
>> the unblock can fail
>> - a cluster node can be killed or disconnected before the unblock can be done
>> - I have seen log files where the "blocking" service was restarted
>> while it blocked the installer, because the installer was
>> asynchronously processing a batch from a previous block. However, it
>> is unclear why the unblock did not happen in this case: there were no
>> exceptions in the log and I don't believe they were swallowed, because
>>  when I provoked a similar scenario exceptions were written to the
>> log.
>>
>> To recover from such failure scenarios, the installer needs to be unblocked:
>> - if a blocking service is stopped. A stopped service may still exist
>> in the JVM and finish execution, therefore this could be solved using
>> weak references to a block-token and a reference queue. Or
>> alternatively by using a timeout in such cases.
>> - if a cluster node disappears from a topology, its <sling-id> node
>> should be removed after a timeout
>>
>> There is a danger, however, that unblocking the installer due to
>> recovery causes a partial deployment to be installed. This may put the
>> system into an unusable state (e.g. bundles may not be resolvable,
>> because their dependencies were not updated/installed). I don't know
>> how we could address this.
>>
>> Maybe an entirely different approach would be to provide a list of
>> deployables (e.g. repository paths?) to the installer, which then only
>> installs the deployables if all are available (ignoring deployables
>> with extensions it does not handle). This list would need to be
>> communicated in a cluster as well, however.
>>
>
> Thanks Julian, now I understand your idea. This might work; however,
> it sounds a little bit complex to me.
> Now, obviously, the easier solution is that the content package which
> installs the bundles in the first place is installed through the OSGi
> installer - as the OSGi installer is single-threaded, bundles installed
> through content packages would be installed only after all content is
> installed. And no pausing would be needed either.
> So, I think pausing and trying to recover etc. is a home-made problem
> which could be avoided if the root cause were solved.
>
> Carsten
>
>> Regards
>> Julian
>>
>>
>> On Sun, Jan 31, 2016 at 10:34 AM, Carsten Ziegeler <cz...@apache.org> wrote:
>>> Julian Sedding wrote
>>>> Hi Carsten
>>>>
>>>>> Offline discussions don't make it transparent why you came to this
>>>>> conclusion.
>>>>> Please enclose the relevant information either here or in the issue.
>>>>
>>>> Sure, I thought that I included some information both in the email and
>>>> in the issue. But it is probably worthwhile to expand on it a bit
>>>> more.
>>>>
>>>> The current implementation is based on a convention rather than a
>>>> contract: place a node under a specific parent node and the JCR
>>>> installer will pause its activities.
>>>>
>>>> It turns out that this convention in this simple form has limitations
>>>> when things go wrong:
>>>>
>>>> - If a "deployer" implementation fails to delete the pause-marker node
>>>> (no matter what the reasons are), whose responsibility is it to delete
>>>> this node to recover the system?
>>>> - If a "deployer" on cluster node A creates a pause-marker node and
>>>> then cluster node A is shut down/crashes/disconnects, whose
>>>> responsibility is it to delete this node to recover the system?
>>>>
>>>> Both these questions require a more sophisticated convention IMHO.
>>>> This becomes a burden for implementers, makes fixing the convention
>>>> nearly impossible (in case we miss an edge case) and is brittle,
>>>> because "deployers" may have bugs in their implementations.
>>>>
>>>> So the logical conclusion is to move the implementation of this
>>>> "convention" into Sling and expose it via a simple API.
>>>>
>>>> The convention basically becomes an implementation detail, which is
>>>> needed to distribute the information about blocking the installer
>>>> within a cluster.
>>>>
>>>> Does this answer your questions?
>>>>
>>>
>>> Thanks, yes, however :) I can't follow the conclusion. How is having an
>>> API with, for example, a pause/resume method to call any different
>>> from, or easier than, adding/removing a node, and how does it avoid
>>> the problems?
>>>
>>> Carsten
>>>
>>>
>>>
>>> --
>>> Carsten Ziegeler
>>> Adobe Research Switzerland
>>> cziegeler@apache.org
>>
>
>
>
> --
> Carsten Ziegeler
> Adobe Research Switzerland
> cziegeler@apache.org

Re: SLING-5421 - Allow JCR installer to recover from being paused indefinitely

Posted by Carsten Ziegeler <cz...@apache.org>.
Julian Sedding wrote
> Hi Carsten
> 
> There are two things to consider:
> 
> (1) moving the implementation of the repository based signalling into
> the installer (essentially encapsulation)

when you say "installer", do you mean "jcr installer" or "osgi
installer"? From below I assume jcr installer.


> (2) implementation of a robust protocol for signalling a block to
> installers on other cluster nodes
> 
> So far I have talked about (1) but didn't go into the details of (2).
> What I have in mind for (2) is a content structure that records three
> pieces of information:
> 
> - Sling ID in order to identify on which cluster node the block was triggered
> - Service PID (i.e. the fully qualified class name of the
> implementation) in order to know which service triggered the block
> - Creation timestamp for information/debugging
> 
> The content structure would look like the following:
> 
> /system/sling/installer/jcr/pauseInstallation
>     <sling-id>/
>         <service-pid>/
>             <random-uuid>/
>                 jcr:created = <datetime>
> 
> It is important that, as a general rule, any node without children is
> eagerly deleted. This means that the installer is blocked if
> "pauseInstallation" has at least one child node and unblocked if it
> has none (or does not exist itself).
> 
> The structure would allow a single service to hold multiple blocks
> (each <random-uuid> node representing one).
> 
> Normally we would assume that a service blocks the installer and later
> unblocks it again, ideally using a try/finally block. However, it gets
> interesting when edge cases are considered:
> - the repository service may get stopped (or restarted), in which case
> the unblock can fail
> - a cluster node can be killed or disconnected before the unblock can be done
> - I have seen log files where the "blocking" service was restarted
> while it blocked the installer, because the installer was
> asynchronously processing a batch from a previous block. However, it
> is unclear why the unblock did not happen in this case: there were no
> exceptions in the log and I don't believe they were swallowed, because
>  when I provoked a similar scenario exceptions were written to the
> log.
> 
> To recover from such failure scenarios, the installer needs to be unblocked:
> - if a blocking service is stopped. A stopped service may still exist
> in the JVM and finish execution, therefore this could be solved using
> weak references to a block-token and a reference queue. Or
> alternatively by using a timeout in such cases.
> - if a cluster node disappears from a topology, its <sling-id> node
> should be removed after a timeout
> 
> There is a danger, however, that unblocking the installer due to
> recovery causes a partial deployment to be installed. This may put the
> system into an unusable state (e.g. bundles may not be resolvable,
> because their dependencies were not updated/installed). I don't know
> how we could address this.
> 
> Maybe an entirely different approach would be to provide a list of
> deployables (e.g. repository paths?) to the installer, which then only
> installs the deployables if all are available (ignoring deployables
> with extensions it does not handle). This list would need to be
> communicated in a cluster as well, however.
> 

Thanks Julian, now I understand your idea. This might work; however,
it sounds a little bit complex to me.
Now, obviously, the easier solution is that the content package which
installs the bundles in the first place is installed through the OSGi
installer - as the OSGi installer is single-threaded, bundles installed
through content packages would be installed only after all content is
installed. And no pausing would be needed either.
So, I think pausing and trying to recover etc. is a home-made problem
which could be avoided if the root cause were solved.

Carsten

> Regards
> Julian
> 
> 
> On Sun, Jan 31, 2016 at 10:34 AM, Carsten Ziegeler <cz...@apache.org> wrote:
>> Julian Sedding wrote
>>> Hi Carsten
>>>
>>>> Offline discussions don't make it transparent why you came to this
>>>> conclusion.
>>>> Please enclose the relevant information either here or in the issue.
>>>
>>> Sure, I thought that I included some information both in the email and
>>> in the issue. But it is probably worthwhile to expand on it a bit
>>> more.
>>>
>>> The current implementation is based on a convention rather than a
>>> contract: place a node under a specific parent node and the JCR
>>> installer will pause its activities.
>>>
>>> It turns out that this convention in this simple form has limitations
>>> when things go wrong:
>>>
>>> - If a "deployer" implementation fails to delete the pause-marker node
>>> (no matter what the reasons are), whose responsibility is it to delete
>>> this node to recover the system?
>>> - If a "deployer" on cluster node A creates a pause-marker node and
>>> then cluster node A is shut down/crashes/disconnects, whose
>>> responsibility is it to delete this node to recover the system?
>>>
>>> Both these questions require a more sophisticated convention IMHO.
>>> This becomes a burden for implementers, makes fixing the convention
>>> nearly impossible (in case we miss an edge case) and is brittle,
>>> because "deployers" may have bugs in their implementations.
>>>
>>> So the logical conclusion is to move the implementation of this
>>> "convention" into Sling and expose it via a simple API.
>>>
>>> The convention basically becomes an implementation detail, which is
>>> needed to distribute the information about blocking the installer
>>> within a cluster.
>>>
>>> Does this answer your questions?
>>>
>>
>> Thanks, yes, however :) I can't follow the conclusion. How is having an
>> API with, for example, a pause/resume method to call any different
>> from, or easier than, adding/removing a node, and how does it avoid
>> the problems?
>>
>> Carsten
>>
>>
>>
>> --
>> Carsten Ziegeler
>> Adobe Research Switzerland
>> cziegeler@apache.org
> 


 
-- 
Carsten Ziegeler
Adobe Research Switzerland
cziegeler@apache.org