Posted to commits@samza.apache.org by "Shanthoosh Venkataraman (JIRA)" <ji...@apache.org> on 2017/07/10 22:27:00 UTC

[jira] [Commented] (SAMZA-1282) Spinning up more containers than the number of tasks kills leader

    [ https://issues.apache.org/jira/browse/SAMZA-1282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081282#comment-16081282 ] 

Shanthoosh Venkataraman commented on SAMZA-1282:
------------------------------------------------

Scenario: There are more stream processors than tasks [X is the number of stream processors, Y is the number of tasks, X > Y].

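Current behavior: The leader fails with a RuntimeException (the IllegalArgumentException thrown by the task grouper, shown in the stack trace quoted below).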

Possible solutions: 
Solution A:

Sort the stream processors by the unique ZooKeeper sequential id associated with each processor. Generate the job model using the Y lexicographically smallest stream processors and kill the remaining X - Y stream processors.

Pros: 
* Straightforward and doesn’t require much change.

Cons: 
* The additional stream processors are killed instead of being kept as replacements for when existing members of the processor group die.
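
The selection step in Solution A could be sketched roughly as follows. This is only an illustration, not the Samza implementation: the class ProcessorSelection and its methods are hypothetical names, and processor ids are assumed to be plain strings that end in ZooKeeper's zero-padded sequence number, so that lexicographic order matches creation order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Samza: splits the live processors into
// the ones that get tasks and the surplus ones.
final class ProcessorSelection {

    // Returns the numTasks lexicographically smallest processor ids.
    static List<String> selectActive(List<String> processorIds, int numTasks) {
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted);
        int cutoff = Math.min(numTasks, sorted.size());
        return new ArrayList<>(sorted.subList(0, cutoff));
    }

    // Returns the processors left over after selectActive; under Solution A
    // these are the ones that would be killed.
    static List<String> selectSurplus(List<String> processorIds, int numTasks) {
        List<String> sorted = new ArrayList<>(processorIds);
        Collections.sort(sorted);
        int cutoff = Math.min(numTasks, sorted.size());
        return new ArrayList<>(sorted.subList(cutoff, sorted.size()));
    }
}
```

With three processors and two tasks, selectActive would return the two oldest processors and selectSurplus the newest one. The job model would then be generated from the active list only, so the grouper never sees more containers than tasks.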

Solution B:

Sort the stream processors by the unique ZooKeeper sequential id associated with each processor. 
Generate the job model using the Y lexicographically smallest stream processors and let the additional stream processors stay alive as standbys (one could join the group when any chosen stream processor dies). This requires each stream processor to hold local state (whether or not it is part of the group) and to ignore ZooKeeper events while it is not part of the group.

Pros:
* Improved fault tolerance to stream processor deaths in a group.

Cons: 
* Some performance cost is expected, since standby processors consume system resources and still receive ZooKeeper events.
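
The local state that Solution B requires could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not Samza code: StandbyAwareProcessor and its methods are hypothetical names, ids are assumed to sort lexicographically in creation order, and a standby is assumed to keep reacting to membership changes (so it can join later) while ignoring every other event.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of Samza: a processor that tracks locally
// whether it belongs to the active group.
final class StandbyAwareProcessor {
    private final String processorId;
    private final int numTasks;
    private boolean inGroup = false; // local state: part of the group or not

    StandbyAwareProcessor(String processorId, int numTasks) {
        this.processorId = processorId;
        this.numTasks = numTasks;
    }

    // Called whenever the set of live processors changes. The processor is in
    // the group if its id is among the numTasks smallest live ids.
    boolean onMembershipChange(List<String> liveProcessorIds) {
        List<String> sorted = new ArrayList<>(liveProcessorIds);
        Collections.sort(sorted);
        int rank = sorted.indexOf(processorId);
        inGroup = rank >= 0 && rank < numTasks;
        return inGroup;
    }

    // Called for any other coordination event (for example a new job model).
    // Returns false when the event is ignored because this processor is a standby.
    boolean onGroupEvent(String event) {
        if (!inGroup) {
            return false; // standby: ignore the event
        }
        // An active processor would act on the event here.
        return true;
    }

    boolean isInGroup() {
        return inGroup;
    }
}
```

With two tasks and three live processors, the newest processor starts as a standby and ignores group events; once one of the two older processors disappears from the live list, the next onMembershipChange call promotes it into the group.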

> Spinning up more containers than the number of tasks kills leader
> -----------------------------------------------------------------
>
>                 Key: SAMZA-1282
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1282
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.13.0
>            Reporter: Bharath Kumarasubramanian
>            Assignee: Shanthoosh Venkataraman
>             Fix For: 0.13.1
>
>
> When a user tries to spin up more containers than the max partitions or tasks, the leader process gets killed. 
> We throw an exception in the TaskNameGrouper for the above scenario and that needs to be handled gracefully by the leader and kill the newly spun containers as opposed bailing out.
> Here is the stack trace 
> {code}
>  2017-05-10 15:13:24.526 [debounce-thread-0] ScheduleAfterDebounceTime [ERROR] OnProcessorChange threw an exception.
> java.lang.IllegalArgumentException: number of containers 2 is bigger than number of tasks 1
> 	at org.apache.samza.container.grouper.task.GroupByContainerIds.group(GroupByContainerIds.java:68)
> 	at org.apache.samza.coordinator.JobModelManager$.readJobModel(JobModelManager.scala:258)
> 	at org.apache.samza.coordinator.JobModelManager.readJobModel(JobModelManager.scala)
> 	at org.apache.samza.zk.ZkJobCoordinator.generateNewJobModel(ZkJobCoordinator.java:212)
> 	at org.apache.samza.zk.ZkJobCoordinator.doOnProcessorChange(ZkJobCoordinator.java:125)
> 	at org.apache.samza.zk.ZkJobCoordinator.lambda$onProcessorChange$1(ZkJobCoordinator.java:120)
> 	at org.apache.samza.zk.ScheduleAfterDebounceTime.lambda$scheduleAfterDebounceTime$0(ScheduleAfterDebounceTime.java:89)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:748)
> {code}


