Posted to dev@oozie.apache.org by "Ryota Egashira (JIRA)" <ji...@apache.org> on 2014/02/26 19:29:19 UTC

[jira] [Commented] (OOZIE-1492) Make sure HA works with HCat and SLA notifications

    [ https://issues.apache.org/jira/browse/OOZIE-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13913273#comment-13913273 ] 

Ryota Egashira commented on OOZIE-1492:
---------------------------------------

Hi, I did a gap analysis of several cases regarding HA support for HCat. Please correct me if anything is missing or wrong.

- Case 1 (straightforward case: no server down, and a coord action submitted/started on one server)
Suppose a coord job is submitted to Oozie server X, and the coord action is materialized there.
CoordMaterializeTransitionXCommand registers the missing partitions of the coord action in the in-memory dependencyCache, and also registers the topic (table name) with JMS.
When a partition becomes available, a notification is sent from JMS to Oozie X, and once all partitions are available, the coord action becomes ready.
Fine so far.
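
To make the Case 1 flow concrete, here is a minimal sketch of the bookkeeping involved. It is not Oozie's actual code: the class DependencyCacheSketch, its method names, and the "table/partition" key format are all made up for illustration, and the JMS subscription is reduced to remembering the topic name.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the in-memory dependency cache; not Oozie's real class.
final class DependencyCacheSketch {
    // missing partition key -> coord action IDs waiting on it
    private final Map<String, Set<String>> waiting = new HashMap<>();
    // coord action ID -> partition keys it is still missing
    private final Map<String, Set<String>> missing = new HashMap<>();
    // topics (table names) this server has subscribed to
    private final Set<String> topics = new HashSet<>();

    // What materialization does in Case 1: record missing partitions and the topic.
    void register(String actionId, String table, Collection<String> partitions) {
        topics.add(table);
        for (String partition : partitions) {
            String key = table + "/" + partition; // assumed key format
            waiting.computeIfAbsent(key, k -> new HashSet<>()).add(actionId);
            missing.computeIfAbsent(actionId, k -> new HashSet<>()).add(key);
        }
    }

    // Called when a notification arrives; returns the actions that became ready.
    List<String> onPartitionAvailable(String table, String partition) {
        String key = table + "/" + partition;
        List<String> ready = new ArrayList<>();
        Set<String> actions = waiting.remove(key);
        if (actions == null) {
            return ready; // nobody on this server was waiting for the partition
        }
        for (String actionId : actions) {
            Set<String> stillMissing = missing.get(actionId);
            stillMissing.remove(key);
            if (stillMissing.isEmpty()) {
                missing.remove(actionId);
                ready.add(actionId);
            }
        }
        return ready;
    }

    boolean isSubscribed(String table) {
        return topics.contains(table);
    }
}
```

The point of the sketch is that both maps and the topic set live only in the memory of the server that did the registration, which is why the remaining cases need extra care.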

- Case 2 (server down after coord action materialized)
Suppose Oozie X goes down after materialization of the coord action.
After a while (currently 10 min by default), the coord action will be picked up by RecoveryService on another Oozie server (say Y), which queues CoordPushDependencyCheckXCommand. That command polls HCatalog to get the list of currently missing partitions, registers them in the dependency cache on Oozie Y, and registers the topic with JMS from Oozie Y (or makes the coord action ready if all partitions are available). Afterwards, notifications will be sent to Oozie Y.
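
The recovery path on Oozie Y can be sketched roughly as below. This is only an illustration under assumptions: RecoverySketch, isStale, and recover are hypothetical names, the "available" set stands in for the result of polling HCatalog, and a plain map stands in for the dependency cache on Y.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of what the recovery path does on Oozie Y; not Oozie's real code.
final class RecoverySketch {
    enum Outcome { READY, WAITING }

    // An action is a recovery candidate once it has been untouched for the threshold
    // (10 minutes by default, per the comment above).
    static boolean isStale(Instant lastModified, Instant now, Duration threshold) {
        return Duration.between(lastModified, now).compareTo(threshold) >= 0;
    }

    // "available" stands in for the partitions HCatalog reports as present when polled.
    static Outcome recover(String actionId,
                           List<String> required,
                           Set<String> available,
                           Map<String, Set<String>> localCache) {
        Set<String> stillMissing = new LinkedHashSet<>(required);
        stillMissing.removeAll(available);
        if (stillMissing.isEmpty()) {
            return Outcome.READY; // nothing to wait for, so nothing is cached
        }
        // Re-register on this server so later notifications are handled here.
        localCache.put(actionId, stillMissing);
        return Outcome.WAITING;
    }
}
```

Because the missing partitions are recomputed from a fresh poll, nothing from X's lost in-memory state is needed; the cost is the wait before the action counts as stale.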

- Case 3 (server down after coord job submission but before materialization)
The coord job is in PREP status, and the recovery service needs to pick it up (it seems that it is not picked up in the current code).

- Case 4 (no server down, but coord action picked up by the recovery service on another Oozie server)
Suppose the coord job is submitted and the coord action materialized on Oozie X, but the coord action is picked up by the RecoveryService of another Oozie server, Y.
This is similar to Case 2: the dependency cache is updated and the JMS topic is registered from Oozie Y, and things are fine afterwards.
However, Oozie X is left with an outdated dependency cache and is still a subscriber of the topic, both of which need to be cleaned up.
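
One possible shape for the cleanup on Oozie X is sketched below. How X learns that an action has been taken over by another server is not decided here; the sketch simply assumes X is handed a set of such action IDs. StaleEntryCleanupSketch and purge are hypothetical names, not existing Oozie code.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical cleanup for the stale state left on Oozie X; not Oozie's real code.
final class StaleEntryCleanupSketch {
    // waitingByTopic: topic (table name) -> coord action IDs waiting on it on this server.
    // handledElsewhere: action IDs this server has learned are owned by another server
    // (how it learns that is an assumption left open here).
    // Returns the topics that no longer have a local waiter and can be unsubscribed.
    static Set<String> purge(Map<String, Set<String>> waitingByTopic,
                             Set<String> handledElsewhere) {
        Set<String> toUnsubscribe = new TreeSet<>();
        Iterator<Map.Entry<String, Set<String>>> it = waitingByTopic.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Set<String>> entry = it.next();
            entry.getValue().removeAll(handledElsewhere);
            if (entry.getValue().isEmpty()) {
                toUnsubscribe.add(entry.getKey());
                it.remove(); // drop the stale cache entry for this topic
            }
        }
        return toUnsubscribe;
    }
}
```

A topic is only returned for unsubscription when no other local action still waits on it, so actions that X legitimately owns keep receiving notifications.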

Additional code is needed for Cases 3 and 4, but not much.
One disadvantage of this approach (relying on the recovery service to pick up the coord action when an Oozie server goes down) is latency.
Also, according to the messaging service team (which uses JMS) at Y!, there is no issue with the same topic being registered from different Oozie servers (each Oozie server simply becomes a subscriber of the topic).


> Make sure HA works with HCat and SLA notifications
> --------------------------------------------------
>
>                 Key: OOZIE-1492
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1492
>             Project: Oozie
>          Issue Type: Improvement
>          Components: HA
>    Affects Versions: trunk
>            Reporter: Robert Kanter
>
> We need to make sure HA works with HCat integration and SLA notifications. Both have in-memory data structures, and HA will impact them.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)