Posted to dev@sling.apache.org by "Stefan Egli (JIRA)" <ji...@apache.org> on 2016/02/04 10:24:39 UTC

[jira] [Resolved] (SLING-3432) pseudo network partition causes job deserialization issue in a cluster (when reading while job is being reassigned)

     [ https://issues.apache.org/jira/browse/SLING-3432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stefan Egli resolved SLING-3432.
--------------------------------
       Resolution: Fixed
         Assignee: Stefan Egli
    Fix Version/s:     (was: Discovery Impl 1.2.8)
                   Discovery Impl 1.2.2
                   Discovery Oak 1.2.2

Marking as fixed. Resolution is as follows:
* discovery.oak addresses network partitioning by relying on a deterministic storage (the DocumentStore), a reliable lease mechanism, and lease timeouts. If an instance does not update its lease in time, the other instances remove it from the cluster and the local instance shuts down. Under no circumstances would discovery.oak, together with the lease check in Oak, allow a pseudo network partitioning situation. So for discovery.oak this is handled fine.
* discovery.impl: all issues except the mentioned SLING-4640 are addressed. As mentioned, SLING-4640 will not be fixed for discovery.impl because it is not feasible. However, SLING-5195 and SLING-5280 add safety checks that also try to help with large repository delays. Still, if the repository delays are very asymmetric (i.e. reads are very slow for one instance while writes are fast), then SLING-4640 can still happen. To address those cases, the recommendation is to switch to discovery.oak.
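
For illustration only, the lease-timeout idea described for discovery.oak can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual discovery.oak or Oak code: the class LeaseSketch, its methods, and the 1000 ms timeout are all hypothetical names and values, and a plain in-memory map stands in for the shared DocumentStore.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical illustration only; not the discovery.oak or Oak API. */
public class LeaseSketch {
    private final long leaseTimeoutMillis;
    // instance id -> time of the last lease renewal (stand-in for the shared store)
    private final Map<String, Long> lastRenewal = new TreeMap<>();

    public LeaseSketch(long leaseTimeoutMillis) {
        this.leaseTimeoutMillis = leaseTimeoutMillis;
    }

    /** Records that the given instance renewed its lease at nowMillis. */
    public void renew(String instanceId, long nowMillis) {
        lastRenewal.put(instanceId, nowMillis);
    }

    /** A lease is valid only if it was renewed less than one timeout ago. */
    public boolean isLeaseValid(String instanceId, long nowMillis) {
        Long last = lastRenewal.get(instanceId);
        return last != null && nowMillis - last < leaseTimeoutMillis;
    }

    /** Instances whose lease is still valid; all others count as gone from the cluster. */
    public Set<String> activeInstances(long nowMillis) {
        Set<String> active = new TreeSet<>();
        for (String id : lastRenewal.keySet()) {
            if (isLeaseValid(id, nowMillis)) {
                active.add(id);
            }
        }
        return active;
    }

    public static void main(String[] args) {
        LeaseSketch leases = new LeaseSketch(1000);
        leases.renew("instance-1", 0);
        leases.renew("instance-2", 0);
        leases.renew("instance-1", 900); // instance-2 never renews again
        System.out.println(leases.activeInstances(1500)); // prints [instance-1]
    }
}
```

In this sketch every instance applies the same rule to the same stored renewal times, so all of them reach the same view of who is in the cluster. An instance that finds its own lease expired would shut down rather than continue with a diverging view.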

> pseudo network partition causes job deserialization issue in a cluster (when reading while job is being reassigned)
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: SLING-3432
>                 URL: https://issues.apache.org/jira/browse/SLING-3432
>             Project: Sling
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: Discovery Impl 1.0.2
>            Reporter: Stefan Egli
>            Assignee: Stefan Egli
>             Fix For: Discovery Oak 1.2.2, Discovery Impl 1.2.2
>
>
> There is a race condition between two instances in a cluster (e.g. Oak or CRX): instance 1 is writing a job with a binary property while instance 2 is reading the job (likely triggered by discovery sending it a topology-changed event). It looks like instance 2 reads the job while instance 1 is still in the process of writing the job, or at least the binary, which results in the following exception:
> 04.03.2014 06:55:39.667 *WARN* [Apache Sling Job Background Loader] org.apache.sling.event.impl.jobs.JobManagerImpl Unable to read job from /var/eventing/jobs/assigned/e4337f8f-47d2-41df-b3ab-0d40b1b2acd4/slingevent:eventadmin/2014/3/3/8/45/cq.wcm.msm.job.pageEvent_9718d7db-85b4-4930-a2ba-11a80d772970_172
> java.lang.Exception: Unable to deserialize property 'pageEvent'
>         at org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:213)
>         at org.apache.sling.event.impl.jobs.JobManagerImpl.readJob(JobManagerImpl.java:538)
>         at org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobInTheBackground(BackgroundLoader.java:318)
>         at org.apache.sling.event.impl.jobs.BackgroundLoader.loadJobsInTheBackground(BackgroundLoader.java:294)
>         at org.apache.sling.event.impl.jobs.BackgroundLoader.run(BackgroundLoader.java:203)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: java.io.EOFException: null
>         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2280)
>         at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2749)
>         at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:779)
>         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:279)
>         at org.apache.sling.event.impl.support.ResourceHelper.cloneValueMap(ResourceHelper.java:208)
>         ... 5 common frames omitted
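
The "Caused by" part of the trace above can be reproduced in isolation. The following is a minimal sketch, not Sling code: the class TruncatedStreamDemo and its helper methods are hypothetical, and the byte arrays stand in for a binary property that a reader sees before the writer has finished. It shows that the java.io.ObjectInputStream constructor reads the stream header and throws EOFException when that header is missing or cut off.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Arrays;

/** Hypothetical demo class; not part of Sling. */
public class TruncatedStreamDemo {

    /** Serializes a value the standard Java way. */
    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    /** Returns true if deserializing the given bytes ends in an EOFException. */
    static boolean failsWithEof(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            in.readObject();
            return false;
        } catch (EOFException e) {
            // Thrown by the constructor (readStreamHeader) for empty or header-truncated input.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] complete = serialize("pageEvent payload");
        byte[] empty = new byte[0];                         // binary not written yet
        byte[] headerCutOff = Arrays.copyOf(complete, 1);   // only the first header byte visible

        System.out.println(failsWithEof(complete));     // prints false
        System.out.println(failsWithEof(empty));        // prints true
        System.out.println(failsWithEof(headerCutOff)); // prints true
    }
}
```

The empty case matches the frames in the trace: ObjectInputStream.<init> calls readStreamHeader, which calls readShort, which fails in readFully because no bytes are available.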



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)