Posted to common-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2024/02/01 21:24:00 UTC

[jira] [Commented] (HADOOP-19061) Capture exception in rpcRequestSender.start() in IPC.Connection.run()

    [ https://issues.apache.org/jira/browse/HADOOP-19061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17813410#comment-17813410 ] 

ASF GitHub Bot commented on HADOOP-19061:
-----------------------------------------

xinglin opened a new pull request, #6519:
URL: https://github.com/apache/hadoop/pull/6519

   …onnection.run()
   
   
   ### Description of PR
   rpcRequestThread.start() is called outside of the try-catch{} block in Connection.run(). However, it can throw an OutOfMemoryError (OOM). In such cases, neither the Connection thread nor the Connection.rpcRequestThread thread ends up running. This OOM is not caught in Connection.setupIOStreams() either. Instead, that function returns normally, getConnection() returns a Connection object, and we continue with connection.sendRpcRequest(call). sendRpcRequest() then hangs forever in its while loop, because the connection is never marked as closed and there is no rpcRequestSender thread to poll the request from the queue.
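   
   The hang mechanism can be shown in isolation. The thread dump in the Jira description shows sendRpcRequest() blocked in SynchronousQueue.offer(), and a SynchronousQueue only accepts an item when another thread is waiting to take it. The sketch below is not Hadoop code: the class name `OfferWithoutConsumer`, the helper `offerWithRetries`, and the bounded retry count are assumptions made for illustration (the real loop retries until shouldCloseConnection is set).

   ```java
   import java.util.concurrent.SynchronousQueue;
   import java.util.concurrent.TimeUnit;

   // Hypothetical demo class; not part of Hadoop.
   class OfferWithoutConsumer {
     // Try to hand off one item, giving up after maxAttempts timed-out offers.
     // Returns the 1-based attempt that succeeded, or -1 if none did.
     static int offerWithRetries(SynchronousQueue<String> queue, String item,
         int maxAttempts, long timeoutMillis) throws InterruptedException {
       for (int attempt = 1; attempt <= maxAttempts; attempt++) {
         if (queue.offer(item, timeoutMillis, TimeUnit.MILLISECONDS)) {
           return attempt;
         }
       }
       return -1;
     }

     public static void main(String[] args) throws Exception {
       // No consumer thread: every timed offer returns false.
       SynchronousQueue<String> orphaned = new SynchronousQueue<>();
       System.out.println("no consumer:   "
           + offerWithRetries(orphaned, "call", 3, 50));

       // With a consumer thread (playing the role of the sender thread),
       // the hand-off succeeds.
       SynchronousQueue<String> served = new SynchronousQueue<>();
       Thread sender = new Thread(() -> {
         try {
           served.take();
         } catch (InterruptedException e) {
           Thread.currentThread().interrupt();
         }
       });
       sender.start();
       System.out.println("with consumer: "
           + offerWithRetries(served, "call", 3, 1000));
       sender.join();
     }
   }
   ```

   Without a consumer, the only way out of an unbounded version of this loop is the closed flag, which is why a connection that is never marked closed leaves callers stuck.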
   
   Please see more details in the jira description.
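   
   For reviewers who want to see the effect of catching the error, here is a minimal, self-contained sketch. It assumes the change amounts to calling start() inside the existing try block so that the catch (Throwable) path marks the connection closed; the actual diff is in the PR. `ConnectionSketch` is a hypothetical stand-in for Client.Connection, the OOM is simulated by overriding start(), and the boolean return value of sendRpcRequest() is an illustration-only simplification.

   ```java
   import java.io.IOException;
   import java.util.concurrent.SynchronousQueue;
   import java.util.concurrent.TimeUnit;
   import java.util.concurrent.atomic.AtomicBoolean;

   // Hypothetical stand-in for Client.Connection; not the Hadoop class.
   class ConnectionSketch extends Thread {
     final AtomicBoolean shouldCloseConnection = new AtomicBoolean(false);
     final SynchronousQueue<String> rpcRequestQueue = new SynchronousQueue<>();
     volatile IOException closeException;

     // Simulates a sender thread whose start() fails the way Thread.start()
     // can when the JVM is unable to create a new native thread.
     private final Thread rpcRequestThread = new Thread() {
       @Override
       public void start() {
         throw new OutOfMemoryError(
             "unable to create new native thread (simulated)");
       }
     };

     void markClosed(IOException e) {
       if (shouldCloseConnection.compareAndSet(false, true)) {
         closeException = e;
       }
     }

     @Override
     public void run() {
       try {
         rpcRequestThread.start(); // inside the try block in this sketch
         // ... the response-reading loop would go here ...
       } catch (Throwable t) {
         markClosed(new IOException("Error starting or running connection", t));
       }
     }

     // Same loop shape as in the Jira description. Returns true if the call
     // was queued, false if the connection was closed first.
     boolean sendRpcRequest(String call) throws InterruptedException {
       while (!shouldCloseConnection.get()) {
         if (rpcRequestQueue.offer(call, 1, TimeUnit.SECONDS)) {
           return true;
         }
       }
       return false;
     }

     public static void main(String[] args) throws Exception {
       ConnectionSketch c = new ConnectionSketch();
       c.start();
       c.join();
       System.out.println("queued = " + c.sendRpcRequest("call-1"));
       System.out.println("closed by: " + c.closeException.getCause());
     }
   }
   ```

   Because the simulated OOM is caught and the connection is marked closed, sendRpcRequest() skips its loop and returns instead of waiting on a queue that nobody polls.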
   
   ### How was this patch tested?
   
   Trivial change. Let Jenkins build.
   
   ### For code changes:
   
   - [ ] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   
   




> Capture exception in rpcRequestSender.start() in IPC.Connection.run()
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-19061
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19061
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 3.5.0
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Major
>
> _rpcRequestThread.start()_ is called outside of the try-catch{} block. However, it can throw an OOM. In such cases, neither the Connection thread nor the Connection.rpcRequestThread thread ends up running. This OOM is not caught in _Connection.setupIOStreams()_ either. Instead, that function returns normally, getConnection() returns a Connection object, and we continue with _connection.sendRpcRequest(call)_. _sendRpcRequest()_ then hangs forever in its while loop, because the connection is never marked as closed and there is no rpcRequestSender thread to poll the request from the queue.
> {code:java}
> // IPC.Connection.run()
> @Override
> public void run() {
>   // Don't start the ipc parameter sending thread until we start this
>   // thread, because the shutdown logic only gets triggered if this
>   // thread is started.
>   rpcRequestThread.start();
>   if (LOG.isDebugEnabled())
>     LOG.debug(getName() + ": starting, having connections "
>         + connections.size());
>   try {
>     while (waitForWork()) { // wait here for work - read or close connection
>       receiveRpcResponse();
>     }
>   } catch (Throwable t) {
>     // This truly is unexpected, since we catch IOException in receiveResponse
>     // -- this is only to be really sure that we don't leave a client hanging
>     // forever.
>     LOG.warn("Unexpected error reading responses on connection " + this, t);
>     markClosed(new IOException("Error reading responses", t));
>   }
>   // ... (rest of run() not shown)
> }{code}
> The while loop in sendRpcRequest():
> {code:java}
> while (!shouldCloseConnection.get()) {
>   if (rpcRequestQueue.offer(Pair.of(call, buf), 1, TimeUnit.SECONDS)) {
>     break;
>   }
> }{code}
> OOM exception thrown when starting the rpcRequestSender thread:
> {code:java}
> Exception in thread "IPC Client (1664093259) connection to nn01.grid.linkedin.com/IP-Address:portNum from kafkaetl" java.lang.OutOfMemoryError: unable to create new native thread
> 	at java.lang.Thread.start0(Native Method)
> 	at java.lang.Thread.start(Thread.java:717)
> 	at org.apache.hadoop.ipc.Client$Connection.run(Client.java:1034)
> {code}
> Multiple threads are blocked in queue.offer(), and we did not find any "IPC Client" or "IPC Parameter Sending Thread" threads in the thread dump:
> {code:java}
> Thread 2156123: (state = BLOCKED)
>  - sun.misc.Unsafe.park(boolean, long) @bci=0 (Compiled frame; information may be imprecise)
>  - java.util.concurrent.locks.LockSupport.parkNanos(java.lang.Object, long) @bci=20, line=215 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(java.util.concurrent.SynchronousQueue$TransferQueue$QNode, java.lang.Object, boolean, long) @bci=156, line=764 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue$TransferQueue.transfer(java.lang.Object, boolean, long) @bci=148, line=695 (Compiled frame)
>  - java.util.concurrent.SynchronousQueue.offer(java.lang.Object, long, java.util.concurrent.TimeUnit) @bci=24, line=895 (Compiled frame)
>  - org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(org.apache.hadoop.ipc.Client$Call) @bci=88, line=1134 (Compiled frame)
>  - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, int, java.util.concurrent.atomic.AtomicBoolean, org.apache.hadoop.ipc.AlignmentContext) @bci=36, line=1402 (Interpreted frame)
>  - org.apache.hadoop.ipc.Client.call(org.apache.hadoop.ipc.RPC$RpcKind, org.apache.hadoop.io.Writable, org.apache.hadoop.ipc.Client$ConnectionId, java.util.concurrent.atomic.AtomicBoolean, org.apache.hadoop.ipc.AlignmentContext) @bci=9, line=1349 (Compiled frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=248, line=230 (Compiled frame)
>  - org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(java.lang.Object, java.lang.reflect.Method, java.lang.Object[]) @bci=4, line=118 (Compiled frame)
>  - com.sun.proxy.$Proxy11.getBlockLocations({code}


