You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@samza.apache.org by "Yi Pan (Data Infrastructure) (JIRA)" <ji...@apache.org> on 2015/10/23 19:14:27 UTC

[jira] [Updated] (SAMZA-798) Performance and stability issue after combining checkpoint and coordinator stream

     [ https://issues.apache.org/jira/browse/SAMZA-798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yi Pan (Data Infrastructure) updated SAMZA-798:
-----------------------------------------------
    Fix Version/s: 0.10.0

> Performance and stability issue after combining checkpoint and coordinator stream
> ---------------------------------------------------------------------------------
>
>                 Key: SAMZA-798
>                 URL: https://issues.apache.org/jira/browse/SAMZA-798
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Yi Pan (Data Infrastructure)
>             Fix For: 0.10.0
>
>
> When testing 0.10 release candidate w/ large number of topic partitions and containers, we have observed that there is a serious stability issue when combining checkpoint and coordinator streams together.
> The reasons being the following:
> 1) The current JobCoordinator's use case of coordinator includes the following:
>    * job configuration
>    * task changelog partition map
>    * container locality info
>    * misc (like migration marker, etc.)
>    * checkpoint
> Out of all the above, checkpoint creates the biggest problem:
> 1) It is much more dynamic than others. Trying to keep up-to-state w/ the coordinator stream's tail becomes impossible due to checkpointing from all containers. Mixing checkpoint w/ other message together in one stream makes it impossible to differentiate the cases whether there is more important configuration/status information that has to be read immediately versus there are just checkpoint updates in the coordinator stream.
> 2) In case of checkpoint, it is not necessary for JobCoordinator to be in the middle of the path. Our previous checkpoint model actually works properly: the checkpoint is written by the containers and read by the containers, and it is very clear that read only happens when recover/restart the container while write happens during the container runtime. Making all containers rely on the JobCoordinator to read the latest checkpoint actually makes JobCoordinator the single point bottleneck.
> 3) With the current change, removing CheckpointManagerFactory also disable the possibility for users to plugin their own checkpoint system.
> Bottom line is: the write rate to coordinator stream should just be configuration and other low volume information. High-volume traffic should not be sent to coordinator stream, which is often used by JobCoordinator to build an up-to-date job status.
> In addition, it would be preferred to move the checkpoint back before this release to avoid the unnecessary migration of checkpoint to coordinator stream.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)