Posted to commits@samza.apache.org by "Bharath Kumarasubramanian (Jira)" <ji...@apache.org> on 2020/05/27 03:38:00 UTC

[jira] [Updated] (SAMZA-2491) AM should log uncaught exceptions and System.exit to ensure that the process dies on errors

     [ https://issues.apache.org/jira/browse/SAMZA-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharath Kumarasubramanian updated SAMZA-2491:
---------------------------------------------
    Fix Version/s: 1.5

> AM should log uncaught exceptions and System.exit to ensure that the process dies on errors
> -------------------------------------------------------------------------------------------
>
>                 Key: SAMZA-2491
>                 URL: https://issues.apache.org/jira/browse/SAMZA-2491
>             Project: Samza
>          Issue Type: Improvement
>            Reporter: Hai Lu
>            Assignee: Hai Lu
>            Priority: Major
>             Fix For: 1.5
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> From: pmaheshw
> Symptom: A job deployment timed out waiting for application attempt to transition from New to Running.
> Cause: ClusterBasedJobCoordinator threw an exception during startup due to a misconfiguration, but did not kill the AM process (likely due to non-daemon threads).
> Suggested fixes:
> 1. ClusterBasedJobCoordinator#main does not install an uncaught exception handler, and does not catch and log exceptions thrown from the ClusterBasedJobCoordinator constructor or from run(). We should fix this. Uncaught exceptions go to stderr instead of the logs and carry no timestamp, which makes debugging difficult. E.g.:
> Exception in thread "main" org.apache.samza.SamzaException: Cannot get systemAdmin for system aggregate-tracking
> at org.apache.samza.system.SystemAdmins.getSystemAdmin(SystemAdmins.java:63)
> at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:66)
> at org.apache.samza.system.StreamMetadataCache$$anonfun$3.apply(StreamMetadataCache.scala:64)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
> at scala.collection.immutable.Map$Map2.foreach(Map.scala:137)
> at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
> at org.apache.samza.system.StreamMetadataCache.getStreamMetadata(StreamMetadataCache.scala:64)
> at org.apache.samza.coordinator.StreamPartitionCountMonitor.getMetadata(StreamPartitionCountMonitor.java:92)
> at org.apache.samza.coordinator.StreamPartitionCountMonitor.<init>(StreamPartitionCountMonitor.java:113)
> at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.getPartitionCountMonitor(ClusterBasedJobCoordinator.java:343)
> at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.<init>(ClusterBasedJobCoordinator.java:207)
> at org.apache.samza.clustermanager.ClusterBasedJobCoordinator.main(ClusterBasedJobCoordinator.java:441)
> 2. The JC should call System.exit on returning from main (cleanly or on exception) and from the uncaught exception handler, so that the AM process dies on these errors and does not leave the deployment hanging. We've also seen this issue when client libraries (datavault, brooklin, kafka etc.) create non-daemon threads and do not stop them cleanly. See LocalContainerRunner for reference, which does kill the process on returning from the main thread. E.g., in this case it's threads like this:
> "AsyncHttpClient-27-1" #134 prio=5 os_prio=0 tid=0x00007faead675000 nid=0x4151 runnable [0x00007fae9c9da000]
> java.lang.Thread.State: RUNNABLE
> at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
> at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
> at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
> at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
>  - locked <0x00000000fe6a2f40> (a com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySet)
>  - locked <0x00000000fe6fe9c0> (a java.util.Collections$UnmodifiableSet)
>  - locked <0x00000000fe6a3f68> (a sun.nio.ch.EPollSelectorImpl)
> at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
> at com.linkedin.mario.shaded.io.netty.channel.nio.SelectedSelectionKeySetSelector.select(SelectedSelectionKeySetSelector.java:62)
> at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:824)
> at com.linkedin.mario.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:457)
> at com.linkedin.mario.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044)
> at com.linkedin.mario.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at com.linkedin.mario.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
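
The two suggested fixes can be sketched together as follows. This is a minimal illustration, not the actual Samza patch: the class name CoordinatorLauncher, both helper methods, and the use of java.util.logging (chosen only to keep the example self-contained) are assumptions, and the Runnable stands in for constructing and running the real coordinator. The exit action is passed in as a parameter so the handler can be exercised without terminating the JVM.

```java
import java.util.function.IntConsumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical launcher; names are illustrative and not taken from Samza.
final class CoordinatorLauncher {
  private static final Logger LOG = Logger.getLogger(CoordinatorLauncher.class.getName());

  // Fix 1: catch and log anything thrown during construction or run(),
  // so the failure lands in the timestamped logs rather than bare stderr.
  static int runAndGetExitCode(Runnable coordinator) {
    try {
      coordinator.run();
      return 0;
    } catch (Throwable t) {
      LOG.log(Level.SEVERE, "Coordinator failed; exiting", t);
      return 1;
    }
  }

  // Fix 1 and 2: log exceptions escaping any thread, then trigger the exit action.
  static void installUncaughtExceptionHandler(IntConsumer exit) {
    Thread.setDefaultUncaughtExceptionHandler((thread, throwable) -> {
      LOG.log(Level.SEVERE, "Uncaught exception in thread " + thread.getName(), throwable);
      exit.accept(1);
    });
  }

  public static void main(String[] args) {
    installUncaughtExceptionHandler(System::exit);
    int code = runAndGetExitCode(() -> {
      // Placeholder: construct and run the real coordinator here.
    });
    // Fix 2: exit explicitly so lingering non-daemon threads cannot keep the JVM alive.
    System.exit(code);
  }
}
```

Calling System.exit at the end of main, on both the clean and the failing path, is what prevents a stray non-daemon thread such as the AsyncHttpClient event loop above from keeping the process running.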



--
This message was sent by Atlassian Jira
(v8.3.4#803005)