You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@samza.apache.org by "Prateek Maheshwari (JIRA)" <ji...@apache.org> on 2019/03/19 17:16:00 UTC

[jira] [Commented] (SAMZA-1992) Hot standby containers for improving recovery-time of Stateful jobs

    [ https://issues.apache.org/jira/browse/SAMZA-1992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16796297#comment-16796297 ] 

Prateek Maheshwari commented on SAMZA-1992:
-------------------------------------------

[~rayman7718] OK to close ticket now?

> Hot standby containers for improving recovery-time of Stateful jobs
> -------------------------------------------------------------------
>
>                 Key: SAMZA-1992
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1992
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Rayman
>            Priority: Major
>
> Recovery time for jobs with large state increases significantly in host-failure scenarios. This is problematic because of two reasons: 
> a) causes _unavailability_ , 
> and  
> b) causes _backlog_ _buildup_ in case of jobs with high input QPS, requiring scaleup (then scale-down) or causes increased catch-up time.
>  
> The solution comprises of two parts 
> *1. Increasing restore parallelism*: Samza restores stores at SamzaContainer startup sequentially for each task (using TaskStorageManager and TaskSideInputStorage Manager). Parallelizing task restores. We can parallelize store restores either in SamzaContainer (using a bounded/configurable threadpool) or in the TaskInstance or in the TaskStorageManager.
> *2. Hot-Standby Containers:* 
> A  Samza container which consumes input, reads or updates state, or invokes external services or produces outputs is called an *“active container.”*  
> A *hot-standby container* is one which carries updated or hot KV state, and guarantees that, when it is used for a failover, its KV state corresponds to atleast the last checkpoint of the corresponding active container (for each task). There are multiple ways of building such hot-standby container implementations, see this section for pros and cons. Restore time With hot-standby container, restore times should be similar to host-affinity restore-times -- 10s of seconds _regardless of state size_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)