You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Rayman (JIRA)" <ji...@apache.org> on 2018/11/13 19:55:00 UTC

[jira] [Created] (SAMZA-1992) Hot standby containers for improving recovery-time of Stateful jobs

Rayman created SAMZA-1992:
-----------------------------

             Summary: Hot standby containers for improving recovery-time of Stateful jobs
                 Key: SAMZA-1992
                 URL: https://issues.apache.org/jira/browse/SAMZA-1992
             Project: Samza
          Issue Type: Improvement
            Reporter: Rayman


Recovery time for jobs with large state increases significantly in host-failure scenarios. This is problematic because of two reasons: 
a) causes _unavailability_ , 

and  
b) causes _backlog_ _buildup_ in case of jobs with high input QPS, requiring scaleup (then scale-down) or causes increased catch-up time.

 

The solution comprises of two parts 

*1. Increasing restore parallelism*: Samza restores stores at SamzaContainer startup sequentially for each task (using TaskStorageManager and TaskSideInputStorage Manager). Parallelizing task restores. We can parallelize store restores either in SamzaContainer (using a bounded/configurable threadpool) or in the TaskInstance or in the TaskStorageManager.

*2. Hot-Standby Containers:* 
A  Samza container which consumes input, reads or updates state, or invokes external services or produces outputs is called an *“active container.”*  
A *hot-standby container* is one which carries updated or hot KV state, and guarantees that, when it is used for a failover, its KV state corresponds to atleast the last checkpoint of the corresponding active container (for each task). There are multiple ways of building such hot-standby container implementations, see this section for pros and cons. Restore time With hot-standby container, restore times should be similar to host-affinity restore-times -- 10s of seconds _regardless of state size_.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)