Posted to issues@spark.apache.org by "jinhai (Jira)" <ji...@apache.org> on 2021/10/22 11:01:00 UTC

[jira] [Updated] (SPARK-37006) MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading

     [ https://issues.apache.org/jira/browse/SPARK-37006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

jinhai updated SPARK-37006:
---------------------------
    Description: 
When ShuffleBlockFetcherIterator.fetchHostLocalBlocks runs, it must first obtain the hostLocalDirs value by sending an RPC request through ExternalBlockStoreClient or NettyBlockTransferService. It then reads the shuffle data according to blockId and localDirs.

We can add localDirs to the BlockManagerId class carried by MapStatus, so that localDirs is available directly when fetching host-local blocks, without sending RPC requests.

The benefits are:
1. There is no need to send an RPC request for the localDirs value in fetchHostLocalBlocks.
2. When the external shuffle service is enabled, there is no need to register ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist the ExecutorShuffleInfo data there through LevelDB.
3. Also, there is no need to cache host-local dirs in the HostLocalDirManager class.
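
The idea can be modelled in a few lines. The sketch below is not Spark code: the Python classes are hypothetical stand-ins for BlockManagerId and MapStatus, the local_dirs field is the addition this issue proposes (it does not exist in Spark 3.1.2), and DirLookupRpc only imitates the getHostLocalDirs round trip so that the number of lookups can be counted.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class BlockManagerId:
    """Hypothetical stand-in for Spark's BlockManagerId."""
    executor_id: str
    host: str
    port: int
    # Proposed addition: the executor's local dirs travel with the id.
    # An empty tuple models today's behaviour, where they are absent.
    local_dirs: Tuple[str, ...] = ()


@dataclass(frozen=True)
class MapStatus:
    """Hypothetical stand-in for Spark's MapStatus."""
    location: BlockManagerId
    map_id: int


class DirLookupRpc:
    """Imitates the getHostLocalDirs request and counts how often it is sent."""

    def __init__(self, dirs_by_executor: Dict[str, Tuple[str, ...]]):
        self._dirs_by_executor = dirs_by_executor
        self.calls = 0

    def get_host_local_dirs(self, executor_id: str) -> Tuple[str, ...]:
        self.calls += 1
        return self._dirs_by_executor[executor_id]


def dirs_for_host_local_fetch(status: MapStatus, rpc: DirLookupRpc) -> Tuple[str, ...]:
    """Use the dirs carried by the map status; fall back to RPC if absent."""
    dirs = status.location.local_dirs
    if dirs:
        return dirs
    return rpc.get_host_local_dirs(status.location.executor_id)
```

With local_dirs populated, a reader resolving a host-local block never touches the RPC path; with it empty, the lookup falls back to the current behaviour. A real change would also have to account for the larger serialized MapStatus, which this sketch ignores.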

  was:
During shuffle reading, fetchHostLocalBlocks needs the hostLocalDirs value, which requires ExternalBlockStoreClient or NettyBlockTransferService to make an RPC request.

And when externalShuffleServiceEnabled is set, there is no need to call registerExecutor and related methods in the ExternalShuffleBlockResolver class.

Throughout the Spark shuffle module, a lot of code exists only to deal with localDirs.

We can add localDirs directly to the BlockManagerId class of MapStatus and use it to locate the data file and index file.


> MapStatus adds localDirs to avoid the rpc request by method getHostLocalDirs when shuffle reading
> -------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37006
>                 URL: https://issues.apache.org/jira/browse/SPARK-37006
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 3.1.2
>            Reporter: jinhai
>            Priority: Major
>
> When ShuffleBlockFetcherIterator.fetchHostLocalBlocks runs, it must first obtain the hostLocalDirs value by sending an RPC request through ExternalBlockStoreClient or NettyBlockTransferService. It then reads the shuffle data according to blockId and localDirs.
> We can add localDirs to the BlockManagerId class carried by MapStatus, so that localDirs is available directly when fetching host-local blocks, without sending RPC requests.
> The benefits are:
> 1. There is no need to send an RPC request for the localDirs value in fetchHostLocalBlocks.
> 2. When the external shuffle service is enabled, there is no need to register ExecutorShuffleInfo in the ExternalShuffleBlockResolver class, nor to persist the ExecutorShuffleInfo data there through LevelDB.
> 3. Also, there is no need to cache host-local dirs in the HostLocalDirManager class.


