Posted to dev@oozie.apache.org by "Shwetha G S (JIRA)" <ji...@apache.org> on 2014/04/21 09:29:15 UTC

[jira] [Commented] (OOZIE-1527) Fix scalability issues with coordinator materialization

    [ https://issues.apache.org/jira/browse/OOZIE-1527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13975461#comment-13975461 ] 

Shwetha G S commented on OOZIE-1527:
------------------------------------

{quote}
The lookupInterval and the scheduling interval of CoordMaterializeTriggerRunnable are the same. The lookup interval selects jobs whose next materialization time falls within the lookup interval. Earlier we reduced it to 2 mins so that materialization runs frequently, but that also means jobs are only picked up for materialization 2 mins before their nominal time is reached. If a lot of coord jobs have to be materialized at once (which happens especially on hour boundaries), they all get delayed badly and are only picked up for materialization after the nominal time has actually passed, causing SLA misses. That is why we want two different settings for the lookup and schedule intervals, so that we can schedule frequently and look up further in advance. For example, lookupInterval can be set to 10 mins and the schedule interval to 2 mins. This way we materialize 10 mins in advance instead of just 2 mins before the nominal time and have some breathing room to meet the SLA.
{quote}
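To make the quoted proposal concrete, here is a minimal sketch of how the lead time changes once the lookup interval is decoupled from the schedule interval. It is not the actual CoordMaterializeTriggerRunnable code; the class and method names are hypothetical, and it assumes the trigger runs at exact multiples of the schedule interval starting from time 0.

```java
// Illustrative sketch only; names are hypothetical and not part of Oozie.
final class MaterializationLookupSketch {

    /** A job is picked up when its next materialization time is within the lookup window. */
    static boolean isPickedUp(long nowMs, long nextMaterializeTimeMs, long lookupIntervalMs) {
        return nextMaterializeTimeMs <= nowMs + lookupIntervalMs;
    }

    /**
     * How far ahead of the nominal time the job is first picked up, assuming
     * the trigger runs at t = 0, scheduleIntervalMs, 2 * scheduleIntervalMs, ...
     */
    static long leadTimeMs(long nominalMs, long scheduleIntervalMs, long lookupIntervalMs) {
        long earliestEligibleMs = nominalMs - lookupIntervalMs;
        if (earliestEligibleMs <= 0) {
            return nominalMs; // already inside the window at the very first run
        }
        // First trigger run at or after the moment the job enters the lookup window.
        long runIndex = (earliestEligibleMs + scheduleIntervalMs - 1) / scheduleIntervalMs;
        long firstRunMs = runIndex * scheduleIntervalMs;
        return nominalMs - firstRunMs;
    }
}
```

With a 2 min schedule interval, a job with nominal time at the 60 min mark gets 2 mins of lead time when the lookup interval is also 2 mins, and 10 mins of lead time when the lookup interval is 10 mins.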
When too many hourly jobs are scheduled for the beginning of the hour, some coord actions are created almost 45 mins late. This approach will fix the issue. Even though the coord action is then created early, we should queue CoordActionInputCheckXCommand with a delay of (nominal time - now) to avoid unnecessary input checks. This matters most for minutely jobs, since otherwise too many unnecessary input checks will sit in the queue.
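
A minimal sketch of the delay computation suggested above, assuming times are epoch milliseconds. The helper name is hypothetical, and clamping to zero for actions whose nominal time has already passed is an assumption, not something taken from the Oozie code.

```java
// Illustrative sketch only; the class and method names are hypothetical.
final class InputCheckDelaySketch {

    /**
     * Delay before the first input check of an action that was materialized early.
     * Returns 0 when the nominal time has already passed, so late actions are
     * checked immediately (an assumption of this sketch).
     */
    static long inputCheckDelayMs(long nominalTimeMs, long nowMs) {
        return Math.max(0L, nominalTimeMs - nowMs);
    }
}
```

The returned value would be what gets passed as the delay when queuing the input check command for the newly created action.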

> Fix scalability issues with coordinator materialization
> -------------------------------------------------------
>
>                 Key: OOZIE-1527
>                 URL: https://issues.apache.org/jira/browse/OOZIE-1527
>             Project: Oozie
>          Issue Type: Bug
>          Components: coordinator
>    Affects Versions: trunk
>            Reporter: Mona Chitnis
>            Assignee: Purshotam Shah
>             Fix For: trunk
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> In certain situations, when there is a large number of coordinators in the system, they have been observed to build up a huge materialization backlog and to progress much more slowly than expected. This patch can be viewed as either a bug fix or an enhancement, addressing the following points:
> 1. 'materialization.system.limit' brings in Coord jobs in LRU fashion, but some of them may already have reached their maximum number of actions to materialize (= throttle), so fewer than 'limit' jobs may actually undergo materialization. This patch does a second iteration of loading jobs for materialization to reduce the backlog.
> 2. A 'materialization.window' of 1 hour may work in most cases, but hourly jobs are seen to face significant slowdown at times because a lot of other minute jobs are getting materialized. Therefore, the window can be doubled (i.e. 2 hours) when the job is hourly/daily.
> 3. For hourly coordinators, it is consistently seen that materialization occurs only near the end of the hour, e.g. for an action whose nominal time is 2:00, the action creation time is 1:59; if the nominal time is 3:00, the creation time is 2:58, and so on. If the window is an hour into the future, that doesn't explain why materialization doesn't occur anywhere in the middle of the preceding hour.
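
Point 2 of the description (doubling the window for hourly/daily jobs) could look roughly like the sketch below. The enum and the method are hypothetical stand-ins rather than Oozie's actual types, and the 3600 second base value simply mirrors the 1 hour window mentioned in the description.

```java
// Illustrative sketch only; Frequency and windowSeconds are hypothetical names.
final class MaterializationWindowSketch {

    enum Frequency { MINUTE, HOUR, DAY }

    /** Base window of 1 hour, as mentioned in the issue description. */
    static final long BASE_WINDOW_SECONDS = 3600L;

    /** Doubles the materialization window for hourly and daily jobs. */
    static long windowSeconds(Frequency frequency) {
        boolean coarse = frequency == Frequency.HOUR || frequency == Frequency.DAY;
        return coarse ? 2 * BASE_WINDOW_SECONDS : BASE_WINDOW_SECONDS;
    }
}
```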



--
This message was sent by Atlassian JIRA
(v6.2#6252)