Posted to commits@samza.apache.org by "Shanthoosh Venkataraman (JIRA)" <ji...@apache.org> on 2018/01/18 20:10:00 UTC

[jira] [Commented] (SAMZA-1561) JobModel upgrade consistency problem.

    [ https://issues.apache.org/jira/browse/SAMZA-1561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16331116#comment-16331116 ] 

Shanthoosh Venkataraman commented on SAMZA-1561:
------------------------------------------------

 

[~boryas]

The short-term fix is to add a simple comparator check to the upgrade sequence (without this patch, the problem prevents standalone jobs from running in production without manual intervention).
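
One possible reading of that comparator check, as a minimal sketch: before publishing, compare the version recorded in jobModelVersion with the versions already present under jobModels, and pick a number above both. The function name and arguments are hypothetical, and whether the actual patch skips the orphaned node or reuses it is not stated here.

```python
def next_job_model_version(recorded_version, published_versions):
    """Return a version above both the recorded and the highest published one.

    recorded_version: integer read from the jobModelVersion node.
    published_versions: integers parsed from the children of jobModels.
    """
    highest_published = max(published_versions, default=recorded_version)
    return max(recorded_version, highest_published) + 1


# A dead leader already published version 3, but the recorded version is still 2.
print(next_job_model_version(2, [1, 2, 3]))  # 4
# Normal case: nothing published beyond the recorded version.
print(next_job_model_version(2, [1, 2]))     # 3
```

With a check like this, a newly elected leader never tries to create a node that a previous leader left behind.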

The ideal fix is to remove the dependency on the jobModelVersion zkNode and migrate all existing code to watch the jobModels directory for JobModel changes (this involves changes in several layers and will be revisited after this change goes in).
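
A rough sketch of that idea, using an in-memory stand-in rather than a real ZooKeeper client (the class and method names are hypothetical): the latest version is derived from the jobModels directory itself, and followers are notified when a new entry appears, so there is no second node to keep in sync.

```python
class JobModelsDirectory:
    """In-memory stand-in for the jobModels directory; not a ZooKeeper client."""

    def __init__(self):
        self.models = {}    # version -> serialized JobModel
        self.watchers = []  # callbacks invoked with each newly added version

    def watch(self, callback):
        self.watchers.append(callback)

    def publish(self, job_model):
        # The version comes from the directory contents, so it cannot
        # disagree with a separately stored version number.
        version = max(self.models, default=0) + 1
        self.models[version] = job_model
        for callback in self.watchers:
            callback(version)
        return version

    def latest_version(self):
        return max(self.models, default=0)


directory = JobModelsDirectory()
seen = []
directory.watch(seen.append)  # a follower reacting to JobModel changes
directory.publish("model v1")
directory.publish("model v2")
print(seen, directory.latest_version())
```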

 

 

> JobModel upgrade consistency problem.
> -------------------------------------
>
>                 Key: SAMZA-1561
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1561
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Shanthoosh Venkataraman
>            Assignee: Shanthoosh Venkataraman
>            Priority: Major
>
> JobModel upgrade sequence is the following:
>  
> A. Read previousJobModelVersion from JobModelBasePath/jobModelVersion.
> B. Publish the new JobModel with version (previousJobModelVersion + 1) to JobModelBasePath/jobModels.
> C. Create a barrier with version (previousJobModelVersion + 1).
> D. Update jobModelVersion path with value (previousJobModelVersion + 1).
> Followers watch the jobModelVersion path for JobModel upgrades.
> If the leader dies before executing the last step of the upgrade sequence, any processor subsequently elected leader will be unable to publish the new JobModel and will fail with ZkNodeExistsException. For instance, suppose the previous JobModel version is 2 for a processor group [P1, P2]. P1 is the leader; it creates the zkNode jobModelBasePath/jobModels/3 to publish the JobModel and dies without updating the jobModelVersion path (which stays at 2). If P2 becomes leader, it will generate the same JobModel version, try to create the node jobModelBasePath/jobModels/3, and fail.
>  
> This behavior was observed while testing one of the Samza standalone jobs.
> JobModelBasePath/jobModels is the source of truth for the latest jobModelVersion in a processor group. By maintaining the version in a separate ZooKeeper node, without the ability to update both atomically, we run into this consistency problem.
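
The failure described above can be reproduced with a small simulation. This is only a sketch: FakeZk is an in-memory stand-in rather than a real ZooKeeper client, NodeExistsError stands in for ZkNodeExistsException, and the barrier path is an assumption, since the description does not name it.

```python
class NodeExistsError(Exception):
    """Stand-in for ZkNodeExistsException (hypothetical name)."""


class FakeZk:
    """Tiny in-memory stand-in for a ZooKeeper tree; not a real client."""

    def __init__(self):
        self.nodes = {}

    def create(self, path, data):
        if path in self.nodes:
            raise NodeExistsError(path)
        self.nodes[path] = data

    def read(self, path):
        return self.nodes[path]

    def write(self, path, data):
        self.nodes[path] = data


BASE = "jobModelBasePath"  # placeholder root path


def upgrade(zk, job_model, die_before_step_d=False):
    previous = int(zk.read(f"{BASE}/jobModelVersion"))       # step A
    new_version = previous + 1
    zk.create(f"{BASE}/jobModels/{new_version}", job_model)  # step B
    zk.create(f"{BASE}/barrier/{new_version}", "")           # step C (path assumed)
    if die_before_step_d:
        return None  # the leader crashes here
    zk.write(f"{BASE}/jobModelVersion", str(new_version))    # step D
    return new_version


zk = FakeZk()
zk.write(f"{BASE}/jobModelVersion", "2")
upgrade(zk, "model from P1", die_before_step_d=True)  # P1 dies after steps B and C

try:
    upgrade(zk, "model from P2")  # P2 takes over as leader
    outcome = "published"
except NodeExistsError as error:
    outcome = f"failed: {error} already exists"

print(outcome)  # failed: jobModelBasePath/jobModels/3 already exists
```

Because jobModelVersion still reads 2, P2 computes version 3 again and collides with the node P1 left behind.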


