You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues-all@impala.apache.org by "Quanlong Huang (Jira)" <ji...@apache.org> on 2022/04/11 02:25:00 UTC
[jira] [Commented] (IMPALA-11234) impalad keeps reporting ShortCircuitCache slot release failures in heavy workload

    [ https://issues.apache.org/jira/browse/IMPALA-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520270#comment-17520270 ] 

Quanlong Huang commented on IMPALA-11234:
-----------------------------------------

I pick one shmId(140d269194fcb432a23c81c3ba255163) and analyze the related logs in impalad and DataNodes.

In impalad.INFO, I can see logs like this. slotIdx has values from 0 to 88. The socket path are all "/var/lib/hadoop-hdfs/socket.31000":
{code:java}
E0411 08:15:28.107564 11924 ShortCircuitCache.java:232] ShortCircuitCache(0xe20957e): failed to release short-circuit shared memory slot Slot(slotIdx=0, shm=DfsClientShm(140d269194fcb432a23c81c3ba255163)) by sending ReleaseShortCircuitAccessRequestProto to /var/lib/hadoop-hdfs/socket.31000.  Closing shared memory segment.
E0411 08:15:28.130812 11924 ShortCircuitCache.java:232] ShortCircuitCache(0xe20957e): failed to release short-circuit shared memory slot Slot(slotIdx=1, shm=DfsClientShm(140d269194fcb432a23c81c3ba255163)) by sending ReleaseShortCircuitAccessRequestProto to /var/lib/hadoop-hdfs/socket.31000.  Closing shared memory segment.
E0411 08:15:28.184429 11924 ShortCircuitCache.java:232] ShortCircuitCache(0xe20957e): failed to release short-circuit shared memory slot Slot(slotIdx=85, shm=DfsClientShm(140d269194fcb432a23c81c3ba255163)) by sending ReleaseShortCircuitAccessRequestProto to /var/lib/hadoop-hdfs/socket.31000.  Closing shared memory segment.{code}
There are 3 DataNodes running on my desktop machine. node-3 has the port 31000. However, in its logs I can only find request of 52 slots. They distribute randomly. In logs/cluster/cdh7-node-3/hadoop-hdfs/hdfs-datanode.log:
{code:java}
2022-04-11 08:15:25,263 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: cliID: DFSClient_NONMAPREDUCE_2061951479_1, src: 127.0.0.1, dest: 127.0.0.1, op: REQUEST_SHORT_CIRCUIT_SHM, shmId: 140d269194fcb432a23c81c3ba255163, srvID: 0097f112-0937-4cbe-b842-17aef9191461, success: true
2022-04-11 08:15:28,130 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 140d269194fcb432a23c81c3ba255163, slotIdx: 1, srvID: 0097f112-0937-4cbe-b842-17aef9191461, success: true
2022-04-11 08:15:28,184 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 140d269194fcb432a23c81c3ba255163, slotIdx: 85, srvID: 0097f112-0937-4cbe-b842-17aef9191461, success: true
2022-04-11 08:15:28,192 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 140d269194fcb432a23c81c3ba255163, slotIdx: 2, srvID: 0097f112-0937-4cbe-b842-17aef9191461, success: true
{code}
The wird thing is, I found the missing 37 requests in another DataNode's log. In logs/cluster/cdh7-node-2/hadoop-hdfs/hdfs-datanode.log:
{code:java}
2022-04-11 08:15:28,103 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 140d269194fcb432a23c81c3ba255163, slotIdx: 0, srvID: f681b686-3c87-4a46-a4c1-a10ad5980a01, success: true
2022-04-11 08:15:28,205 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 127.0.0.1, dest: 127.0.0.1, op: RELEASE_SHORT_CIRCUIT_FDS, shmId: 140d269194fcb432a23c81c3ba255163, slotIdx: 87, srvID: f681b686-3c87-4a46-a4c1-a10ad5980a01, success: true
{code}
"srvID" in the logs is the service Id of the DataNode. They identifies different DataNodes.

It seems the hdfs client in impalad sends RELEASE_SHORT_CIRCUIT_FDS requests to wrong DataNodes.

> impalad keeps reporting ShortCircuitCache slot release failures in heavy workload
> ---------------------------------------------------------------------------------
>
>                 Key: IMPALA-11234
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11234
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Major
>
> I keep seeing this error during a local perf test on my desktop machine:
> {code:java}
> E0410 07:04:10.691095   430 ShortCircuitCache.java:232] ShortCircuitCache(0x6e76c6a7): failed to release short-circuit shared memory slot Slot(slotIdx=0, shm=DfsClientShm(1effcf56a590fbc371938a368987f4e9)) by sending ReleaseShortCircuitAccessRequestProto to /var/lib/hadoop-hdfs/socket.31001.  Closing shared memory segment.
> Java exception follows:
> java.io.IOException: ERROR_INVALID: there is no shared memory segment registered with shmId 1effcf56a590fbc371938a368987f4e9
>         at org.apache.hadoop.hdfs.shortcircuit.ShortCircuitCache$SlotReleaser.run(ShortCircuitCache.java:214)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
>  {code}
> I can also find it in our Jenkins jobs, but it only happens in the data-loading phase. So I suspend it only happens in heavy workloads.
> HDFS-14701 mentioned that this happens when the DataNode is stopped/restarted. But I didn't restart my HDFS cluster and I'm still able to see this error log.
> It worth investigating if we are doing something wrong in short-circuit related stuffs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org