Posted to commits@samza.apache.org by "Shanthoosh Venkataraman (JIRA)" <ji...@apache.org> on 2017/11/21 20:07:32 UTC

[jira] [Updated] (SAMZA-1506) Potential orphaned containers problem in SamzaContainer.

     [ https://issues.apache.org/jira/browse/SAMZA-1506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shanthoosh Venkataraman updated SAMZA-1506:
-------------------------------------------
    Summary: Potential orphaned containers problem in SamzaContainer.  (was: Potential orphaned containers  in LocalContainerRunner.)

> Potential orphaned containers problem in SamzaContainer.
> --------------------------------------------------------
>
>                 Key: SAMZA-1506
>                 URL: https://issues.apache.org/jira/browse/SAMZA-1506
>             Project: Samza
>          Issue Type: Bug
>            Reporter: Shanthoosh Venkataraman
>            Assignee: Abhishek Shivanna
>             Fix For: 0.14.0
>
>
> We observed an orphaned container in the LinkedIn production environment (running samza-yarn).
> The ContainerHeartbeatMonitor, added as part of SAMZA-871 to solve this problem, was still alive in the orphaned container's Java process but did not shut it down.
> ContainerHeartbeatMonitor uses a single-threaded ScheduledExecutorService to periodically check whether the container is orphaned.
> The following thread dump shows that the worker thread in the ScheduledExecutorService found the task queue empty and entered the WAITING state (expecting new tasks to be added to the queue).
> {code:java}
> "Samza-ContainerHeartbeatMonitor-0" #34 prio=5 os_prio=0 tid=0x00007f9322896800 nid=0x38af waiting on condition [0x00007f92f363e000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x000000070078a0e8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1081)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:809)
>         at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>         at java.lang.Thread.run(Thread.java:745)
> {code}
> If any execution of a Runnable submitted to ScheduledExecutorService.scheduleAtFixedRate throws an exception, all subsequent executions are suppressed.
> The existing ContainerHeartBeatClient implementation, which calls the ApplicationMaster HTTP endpoint to get container liveness, handles only IOException. Any unchecked exception thrown from that code path will stop the ContainerHeartbeatMonitor's scheduled checks (this is the suspected cause).
> This requires further investigation.
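
The suppression behaviour described above can be reproduced outside Samza. The following is a minimal sketch, not the actual ContainerHeartbeatMonitor code: the class name HeartbeatSuppressionDemo, the method countRuns, and the simulated failure are all hypothetical and exist only for illustration. It schedules a task that throws an unchecked exception on its second run, once unguarded and once wrapped in a try/catch, and counts how often the task body starts.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of Samza.
public class HeartbeatSuppressionDemo {

    // Schedules a periodic task that throws an unchecked exception on its
    // second run and returns how many times the task body was started.
    static int countRuns(boolean guard) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger runs = new AtomicInteger();

        // Stand-in for a liveness check that fails with an unchecked exception.
        Runnable check = () -> {
            if (runs.incrementAndGet() == 2) {
                throw new IllegalStateException("simulated unchecked failure");
            }
        };

        // Same check, but the exception never escapes the Runnable, so the
        // executor keeps rescheduling it.
        Runnable guarded = () -> {
            try {
                check.run();
            } catch (RuntimeException e) {
                // A real monitor would log the failure here and carry on.
            }
        };

        scheduler.scheduleAtFixedRate(guard ? guarded : check, 0, 20, TimeUnit.MILLISECONDS);
        try {
            // Leave enough time for many periods to elapse.
            Thread.sleep(500);
            scheduler.shutdownNow();
            scheduler.awaitTermination(1, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            scheduler.shutdownNow();
        }
        return runs.get();
    }

    public static void main(String[] args) {
        // The unguarded task stops for good after the run that threw.
        System.out.println("unguarded runs: " + countRuns(false));
        // The guarded task keeps running until the scheduler is shut down.
        System.out.println("guarded runs:   " + countRuns(true));
    }
}
```

In the unguarded case the executor thread stays alive but has nothing left in its delay queue, which matches the parked worker thread in the dump above. Catching unchecked exceptions inside the Runnable is one possible mitigation; whether it is the right fix for ContainerHeartbeatMonitor depends on the investigation mentioned in the issue.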



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)