You are viewing a plain text version of this content. The canonical link for it is here.

Posted to notifications@iotdb.apache.org by "Xinyu Tan (Jira)" <ji...@apache.org> on 2023/05/05 10:48:00 UTC

[jira] [Created] (IOTDB-5840) Avoid the problem that the insertRecords interface may cause the number of threads to balloon when there are too many data regions

Xinyu Tan created IOTDB-5840:
--------------------------------

Summary: Avoid the problem that the insertRecords interface may cause the number of threads to balloon when there are too many data regions
Key: IOTDB-5840
URL: https://issues.apache.org/jira/browse/IOTDB-5840
Project: Apache IoTDB
Issue Type: Improvement
Reporter: Xinyu Tan
Assignee: Xinyu Tan

On a machine with sufficient CPU resources (for example, 32 cores), if the number of Dataregions is too small, the write pressure in the cluster is concentrated on the locks of these regions. As a result, the write latency is high and the throughput cannot be increased. When the number of DataRegion is large, for an InsertRecords request with a large batchSize such as 10000, its write request may involve many DataRegion. Once the concurrency is high, It takes hundreds of internalServiceClient to dispatch the planNode. Under the current threading model of BIO, this would also increase the number of InternalServiceRPC threads in the cluster to hundreds or thousands.

For example, in a user test environment, coreSize of the clientManager is set to 600 and maxSize is set to 1000 to prevent concurrent write requests from blocking each other while obtaining internalServiceClient. The result is that each node has nearly 1000 InternalServiceRPC threads. If the client increases concurrency further, a "connection reset by peer" error is reported. This error should be caused by the default parameters of the linux kernel not supporting so many connections.

The current mpp framework splits Plannodes by region only. Therefore, the number of RPCS to be sent per write request is closely related to the number of dataregion involved in the request rather than the number of Datanodes.

The solution to this problem is to aggregate RPC requests sent to the same datanode. This reduces the pressure on the clientManager and reduces the number of InternalServiceRPC threads. Avoid sending the connection reset by peer error to the client again.

After the optimization, the number of RPC service threads was reduced from 1000 to 200. The connection reset by peer error was cleared. And we can increase the number of regions to make full use of cluster cpu resources

--
This message was sent by Atlassian Jira
(v8.20.10#820010)