Posted to dev@qpid.apache.org by "Justin Ross (JIRA)" <ji...@apache.org> on 2014/06/12 23:59:03 UTC

[jira] [Updated] (QPID-5719) HA becomes unresponsive once any of the brokers are SIGSTOPed

     [ https://issues.apache.org/jira/browse/QPID-5719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Justin Ross updated QPID-5719:
------------------------------

    Fix Version/s: 0.29

> HA becomes unresponsive once any of the brokers are SIGSTOPed
> -------------------------------------------------------------
>
>                 Key: QPID-5719
>                 URL: https://issues.apache.org/jira/browse/QPID-5719
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Clustering
>    Affects Versions: 0.28
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>             Fix For: 0.29
>
>         Attachments: ha-heartbeat.diff
>
>
> See also: https://bugzilla.redhat.com/show_bug.cgi?id=1086638
> Description of problem:
> qpid HA becomes unresponsive once any of the brokers are SIGSTOPed.
> There are three different cases:
> a] stopped ALL brokers
> b] stopped the primary
> c] stopped a backup
> In each of the cases listed above, the following observations were made:
> a-c]    RHCS clustat is unaffected and reports that everything is OK
> a-c]    qpid-ha (status --all) hangs
> a,b,c*] any other clients are indefinitely blocked
>         a-b] cases directly at the beginning
>         c] case at the end; the client is able to recover after a minute or so,
>            due to connection timeout
> In fact, this defect also shows that qpid-ha can be out of sync with clustat, as tracked by BZ.
> The expectations are:
>  * a] quorum lost, HA down (same as kill -9 on all nodes)
>       no clients able to communicate
>  * b] promotion of a new primary; there has to be a mechanism to get rid of the stopped process
>       clients should be able to communicate after recovery
>  * c] the unresponsive backup should be restarted
>       clients should be able to communicate once the backup has been detected as unresponsive
>  * Generally, better integration between the Qpid HA environment and RHCS is needed,
>    i.e. SIGSTOP detection
>  * A heartbeat between primary and backups is probably needed
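
To illustrate the heartbeat idea in the last two points: a SIGSTOPed process still exists and its sockets stay open, so a process-level check can pass while the broker answers nothing. Only the absence of timely heartbeats gives it away. The following is a minimal sketch of that detection logic, not the code in the attached ha-heartbeat.diff and not Qpid's API. The class name HeartbeatMonitor, its methods, the peer names and the 2-second timeout are all hypothetical.

```python
import time


class HeartbeatMonitor:
    """Hypothetical helper: remember when each peer last sent a heartbeat."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout      # seconds of silence tolerated per peer
        self.clock = clock          # injectable so the logic can be tested
        self.last_seen = {}         # peer name -> time of last heartbeat

    def beat(self, peer):
        """Record a heartbeat received from `peer` at the current time."""
        self.last_seen[peer] = self.clock()

    def unresponsive(self):
        """Return the peers whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            peer
            for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    # Simulated clock: the "primary" stops sending heartbeats (as a stopped
    # process would), while "backup1" keeps sending them.
    fake_now = [0.0]
    monitor = HeartbeatMonitor(timeout=2.0, clock=lambda: fake_now[0])
    monitor.beat("primary")
    monitor.beat("backup1")

    fake_now[0] = 1.5
    monitor.beat("backup1")
    print(monitor.unresponsive())

    fake_now[0] = 3.0
    print(monitor.unresponsive())
```

In a real deployment the caller would act on the returned list, for example by promoting a new primary or restarting the stopped backup, as the expectations above describe. How that ties into RHCS is outside the scope of this sketch.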


