Posted to issues@tez.apache.org by "Jeff Zhang (JIRA)" <ji...@apache.org> on 2015/06/24 10:42:05 UTC

[jira] [Resolved] (TEZ-2550) DAGAppMaster gets locked up due to ATS

     [ https://issues.apache.org/jira/browse/TEZ-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang resolved TEZ-2550.
-----------------------------
    Resolution: Duplicate

Resolved in TEZ-2548

> DAGAppMaster gets locked up due to ATS
> --------------------------------------
>
>                 Key: TEZ-2550
>                 URL: https://issues.apache.org/jira/browse/TEZ-2550
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>
> {noformat}
> Thread 30453: (state = IN_NATIVE)
>  - java.net.SocketInputStream.socketRead0(java.io.FileDescriptor, byte[], int, int, int) @bci=0 (Compiled frame; information may be imprecise)
>  - java.net.SocketInputStream.read(byte[], int, int, int) @bci=79, line=150 (Compiled frame)
>  - java.net.SocketInputStream.read(byte[], int, int) @bci=11, line=121 (Compiled frame)
>  - java.io.BufferedInputStream.fill() @bci=214, line=246 (Compiled frame)
>  - java.io.BufferedInputStream.read1(byte[], int, int) @bci=44, line=286 (Compiled frame)
>  - java.io.BufferedInputStream.read(byte[], int, int) @bci=49, line=345 (Compiled frame)
>  - sun.net.www.http.HttpClient.parseHTTPHeader(sun.net.www.MessageHeader, sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) @bci=51, line=703 (Compiled frame)
>  - sun.net.www.http.HttpClient.parseHTTP(sun.net.www.MessageHeader, sun.net.ProgressSource, sun.net.www.protocol.http.HttpURLConnection) @bci=56, line=647 (Compiled frame)
>  - sun.net.www.protocol.http.HttpURLConnection.getInputStream0() @bci=327, line=1534 (Compiled frame)
>  - sun.net.www.protocol.http.HttpURLConnection.getInputStream() @bci=52, line=1439 (Compiled frame)
>  - java.net.HttpURLConnection.getResponseCode() @bci=16, line=480 (Compiled frame)
>  - com.sun.jersey.client.urlconnection.URLConnectionClientHandler._invoke(com.sun.jersey.api.client.ClientRequest) @bci=272, line=240 (Interpreted frame)
>  - com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(com.sun.jersey.api.client.ClientRequest) @bci=2, line=147 (Interpreted frame)
>  - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter$1.run() @bci=11, line=226 (Interpreted frame)
>  - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientConnectionRetry.retryOn(org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineClientRetryOp) @bci=11, line=162 (Interpreted frame)
>  - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl$TimelineJerseyRetryFilter.handle(com.sun.jersey.api.client.ClientRequest) @bci=18, line=237 (Interpreted frame)
>  - com.sun.jersey.api.client.Client.handle(com.sun.jersey.api.client.ClientRequest) @bci=35, line=648 (Interpreted frame)
>  - com.sun.jersey.api.client.WebResource.handle(java.lang.Class, com.sun.jersey.api.client.ClientRequest) @bci=10, line=670 (Interpreted frame)
>  - com.sun.jersey.api.client.WebResource.access$200(com.sun.jersey.api.client.WebResource, java.lang.Class, com.sun.jersey.api.client.ClientRequest) @bci=3, line=74 (Compiled frame)
>  - com.sun.jersey.api.client.WebResource$Builder.post(java.lang.Class, java.lang.Object) @bci=12, line=563 (Compiled frame)
>  - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPostingObject(java.lang.Object, java.lang.String) @bci=41, line=472 (Compiled frame)
>  - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.doPosting(java.lang.Object, java.lang.String) @bci=3, line=321 (Compiled frame)
>  - org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.putEntities(org.apache.hadoop.yarn.api.records.timeline.TimelineEntity[]) @bci=55, line=301 (Compiled frame)
>  - org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.handleEvents(java.util.List) @bci=188, line=343 (Compiled frame)
>  - org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService.serviceStop() @bci=273, line=229 (Interpreted frame)
>  - org.apache.hadoop.service.AbstractService.stop() @bci=32, line=221 (Interpreted frame)
>  - org.apache.hadoop.service.ServiceOperations.stop(org.apache.hadoop.service.Service) @bci=5, line=52 (Interpreted frame)
>  - org.apache.hadoop.service.ServiceOperations.stopQuietly(org.apache.commons.logging.Log, org.apache.hadoop.service.Service) @bci=1, line=80 (Interpreted frame)
>  - org.apache.hadoop.service.CompositeService.stop(int, boolean) @bci=115, line=157 (Interpreted frame)
>  - org.apache.hadoop.service.CompositeService.serviceStop() @bci=58, line=131 (Interpreted frame)
>  - org.apache.tez.dag.history.HistoryEventHandler.serviceStop() @bci=11, line=80 (Interpreted frame)
>  - org.apache.hadoop.service.AbstractService.stop() @bci=32, line=221 (Interpreted frame)
>  - org.apache.hadoop.service.ServiceOperations.stop(org.apache.hadoop.service.Service) @bci=5, line=52 (Interpreted frame)
>  - org.apache.hadoop.service.ServiceOperations.stopQuietly(org.apache.commons.logging.Log, org.apache.hadoop.service.Service) @bci=1, line=80 (Interpreted frame)
>  - org.apache.hadoop.service.ServiceOperations.stopQuietly(org.apache.hadoop.service.Service) @bci=4, line=65 (Interpreted frame)
>  - org.apache.tez.dag.app.DAGAppMaster.stopServices() @bci=137, line=1675 (Interpreted frame)
>  - org.apache.tez.dag.app.DAGAppMaster.serviceStop() @bci=30, line=1831 (Interpreted frame)
>  - org.apache.hadoop.service.AbstractService.stop() @bci=32, line=221 (Interpreted frame)
>  - org.apache.tez.dag.app.DAGAppMaster$DAGAppMasterShutdownHandler$AMShutdownRunnable.run() @bci=48, line=840 (Interpreted frame)
>  - java.lang.Thread.run() @bci=11, line=745 (Interpreted frame)
> .....
> .....
> .....
> .....
> Thread 26211: (state = BLOCKED)
>  - org.apache.tez.dag.app.DAGAppMaster.shutdownTezAM() @bci=0, line=1176 (Interpreted frame)
>  - org.apache.tez.dag.api.client.DAGClientHandler.shutdownAM() @bci=22, line=124 (Interpreted frame)
>  - org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolBlockingPBServerImpl.shutdownSession(com.google.protobuf.RpcController, org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$ShutdownSessionRequestProto) @bci=55, line=179 (Interpreted frame)
>  - org.apache.tez.dag.api.client.rpc.DAGClientAMProtocolRPC$DAGClientAMProtocol$2.callBlockingMethod(com.google.protobuf.Descriptors$MethodDescriptor, com.google.protobuf.RpcController, com.google.protobuf.Message) @bci=152, line=7473 (Compiled frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(org.apache.hadoop.ipc.RPC$Server, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=246, line=619 (Compiled frame)
>  - org.apache.hadoop.ipc.RPC$Server.call(org.apache.hadoop.ipc.RPC$RpcKind, java.lang.String, org.apache.hadoop.io.Writable, long) @bci=9, line=962 (Compiled frame)
>  - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=38, line=2039 (Compiled frame)
>  - org.apache.hadoop.ipc.Server$Handler$1.run() @bci=1, line=2035 (Compiled frame)
>  - java.security.AccessController.doPrivileged(java.security.PrivilegedExceptionAction, java.security.AccessControlContext) @bci=0 (Compiled frame)
>  - javax.security.auth.Subject.doAs(javax.security.auth.Subject, java.security.PrivilegedExceptionAction) @bci=42, line=422 (Compiled frame)
>  - org.apache.hadoop.security.UserGroupInformation.doAs(java.security.PrivilegedExceptionAction) @bci=14, line=1628 (Compiled frame)
>  - org.apache.hadoop.ipc.Server$Handler.run() @bci=308, line=2033 (Interpreted frame)
> {noformat}
> DAGAppMaster.serviceStop() acquires a lock that is never released, because the shutdown thread is stuck in a blocking read on the ATS connection. I expected a socket read timeout to kick in, but the call never returned (waited for more than 10-15 minutes). As a result, shutdownTezAM() gets blocked and ends up occupying the slot.
> This happened with the latest tez master (commit ce26b3f52761d2a48a612a7613d99b712a320204). Not sure whether this is consistently reproducible; creating this ticket as a placeholder.
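
The hang described above comes down to an HTTP read with no effective timeout. As a rough illustration only (this is not the fix made in TEZ-2548, and it does not touch TimelineClientImpl), the sketch below shows how a plain java.net.HttpURLConnection read can be bounded with setReadTimeout. The class name ReadTimeoutSketch, the method timesOut, and the silent local server are all hypothetical, chosen just for this example.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

// Hypothetical illustration; not code from Tez or Hadoop.
class ReadTimeoutSketch {

    /**
     * Sends a GET to the given URL with explicit connect and read timeouts.
     * Returns true if waiting for the response timed out, false if a
     * response code was received.
     */
    static boolean timesOut(URL url, int timeoutMillis) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        conn.setConnectTimeout(timeoutMillis);
        // Without a read timeout (the default is 0, meaning infinite),
        // getResponseCode() can block indefinitely on a silent peer.
        conn.setReadTimeout(timeoutMillis);
        try {
            conn.getResponseCode();
            return false;
        } catch (SocketTimeoutException e) {
            return true;
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // A listening socket that never accepts or answers: the OS completes
        // the TCP handshake from the backlog, but no HTTP response is written.
        try (ServerSocket silent = new ServerSocket(0)) {
            URL url = URI.create(
                "http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            System.out.println("timed out: " + timesOut(url, 500));
        }
    }
}
```

If the blocking call runs while a lock is held, as in the trace above, a bounded read at least lets the holder give up and release the lock, so that other callers such as shutdownTezAM() are not blocked indefinitely.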


