Posted to commits@uniffle.apache.org by js...@apache.org on 2022/07/01 06:57:48 UTC
[incubator-uniffle] 04/17: [Doc] Update readme with features like multiple remote storage support etc (#191)
This is an automated email from the ASF dual-hosted git repository.
jshao pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git
commit 11a8594e868db3aaf55af9baa1903e8cbd17413e
Author: Colin <co...@tencent.com>
AuthorDate: Wed Jun 22 16:38:27 2022 +0800
[Doc] Update readme with features like multiple remote storage support etc (#191)
What changes were proposed in this pull request?
Update the README for the latest features, e.g., multiple remote storage support, dynamic client conf, etc.
Why are the changes needed?
The doc should be updated to cover the latest features.
Does this PR introduce any user-facing change?
No
How was this patch tested?
No need
---
README.md | 46 ++++++++++++++++++++++++++++++++++------------
1 file changed, 34 insertions(+), 12 deletions(-)
diff --git a/README.md b/README.md
index e134f0f..50903ce 100644
--- a/README.md
+++ b/README.md
@@ -11,7 +11,7 @@ Coordinator will collect status of shuffle server and do the assignment for the
Shuffle server will receive the shuffle data, merge them and write to storage.
-Depend on different situation, Firestorm supports Memory & Local, Memory & Remote Storage(eg, HDFS), Local only, Remote Storage only.
+Depending on the situation, Firestorm supports Memory & Local, Memory & Remote Storage (e.g., HDFS), and Memory & Local & Remote Storage (recommended for production environments).
## Shuffle Process with Firestorm
@@ -74,9 +74,25 @@ rss-xxx.tgz will be generated for deployment
rss.coordinator.server.heartbeat.timeout 30000
rss.coordinator.app.expired 60000
rss.coordinator.shuffle.nodes.max 5
- rss.coordinator.exclude.nodes.file.path RSS_HOME/conf/exclude_nodes
- ```
-4. start Coordinator
+ # enable dynamicClientConf, and the coordinator will be responsible for most of the client conf
+ rss.coordinator.dynamicClientConf.enabled true
+ # configure the path of the client conf file
+ rss.coordinator.dynamicClientConf.path <RSS_HOME>/conf/dynamic_client.conf
+ # configure the path of the excluded shuffle server list
+ rss.coordinator.exclude.nodes.file.path <RSS_HOME>/conf/exclude_nodes
+ ```
+4. update <RSS_HOME>/conf/dynamic_client.conf; the rss client will get the default conf from the coordinator, eg,
+ ```
+ # MEMORY_LOCALFILE_HDFS is recommended for production environments
+ rss.storage.type MEMORY_LOCALFILE_HDFS
+ # multiple remote storages are supported, and the client will get its assignment from the coordinator
+ rss.coordinator.remote.storage.path hdfs://cluster1/path,hdfs://cluster2/path
+ rss.writer.require.memory.retryMax 1200
+ rss.client.retry.max 100
+ rss.writer.send.check.timeout 600000
+ rss.client.read.buffer.size 14m
+ ```
+5. start Coordinator
```
bash RSS_HOME/bin/start-coordnator.sh
```
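The dynamic client conf file introduced above is a plain space-separated key/value list, one entry per line, with `#` comments. As a sketch of how such a file can be read (the file name and sample values mirror the snippet in the diff; the awk-based lookup is illustrative, not part of Firestorm itself):

```shell
# Sketch: write and query a space-separated key/value conf file
# shaped like dynamic_client.conf; values mirror the diff above.
mkdir -p conf
cat > conf/dynamic_client.conf <<'EOF'
# MEMORY_LOCALFILE_HDFS is recommended for production environments
rss.storage.type MEMORY_LOCALFILE_HDFS
rss.client.retry.max 100
EOF
# Look up a single key; comment lines never match the first-field test
storage_type=$(awk '$1 == "rss.storage.type" {print $2}' conf/dynamic_client.conf)
echo "$storage_type"
```

Clients receive these defaults from the coordinator at registration, so per-job Spark/MR confs only need the coordinator quorum.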
@@ -90,14 +106,17 @@ rss-xxx.tgz will be generated for deployment
HADOOP_HOME=<hadoop home>
XMX_SIZE="80g"
```
-3. update RSS_HOME/conf/server.conf, the following demo is for memory + local storage only, eg,
+3. update RSS_HOME/conf/server.conf, eg,
```
rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.rpc.executor.size 2000
- rss.storage.type MEMORY_LOCALFILE
+ # it should be configured the same as in the coordinator
+ rss.storage.type MEMORY_LOCALFILE_HDFS
rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
+ # local storage path for shuffle server
rss.storage.basePath /data1/rssdata,/data2/rssdata....
+ # it's better to configure the thread number according to the number of local disks
rss.server.flush.thread.alive 5
rss.server.flush.threadPool.size 10
rss.server.buffer.capacity 40g
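The comment above suggests sizing the flush threads by the number of local disks in `rss.storage.basePath`. A small sketch of that sizing step (the 2-threads-per-disk ratio is an illustrative assumption, not a documented formula):

```shell
# Sketch: derive a flush thread pool size from the comma-separated
# rss.storage.basePath list; the x2 ratio is an assumption for illustration.
base_path="/data1/rssdata,/data2/rssdata"
disk_num=$(echo "$base_path" | awk -F',' '{print NF}')
thread_pool_size=$((disk_num * 2))
echo "$disk_num disks -> rss.server.flush.threadPool.size $thread_pool_size"
```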
@@ -108,6 +127,10 @@ rss-xxx.tgz will be generated for deployment
rss.server.preAllocation.expired 120000
rss.server.commit.timeout 600000
rss.server.app.expired.withoutHeartbeat 120000
+ # note: the default value of rss.server.flush.cold.storage.threshold.size is 64m
+ # no data will be written to DFS if it is set to 100g, even with rss.storage.type=MEMORY_LOCALFILE_HDFS
+ # please set a proper value if DFS is used, eg, 64m, 128m.
+ rss.server.flush.cold.storage.threshold.size 100g
```
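The threshold note above hinges on how large "100g" really is compared to typical shuffle block sizes. A quick sketch for converting such conf size strings to bytes (the k/m/g suffix handling is an assumption for illustration, matching the values used in the diff):

```shell
# Sketch: convert size strings like "64m" or "100g" to bytes, to sanity check
# rss.server.flush.cold.storage.threshold.size; suffix handling is illustrative.
to_bytes() {
  local n=${1%[kKmMgG]}            # strip a trailing size suffix, if any
  case "$1" in
    *[kK]) echo $((n * 1024)) ;;
    *[mM]) echo $((n * 1024 * 1024)) ;;
    *[gG]) echo $((n * 1024 * 1024 * 1024)) ;;
    *)     echo "$n" ;;            # plain number: already bytes
  esac
}
to_bytes 64m    # the 64m default
to_bytes 100g   # the value that effectively disables DFS writes
```

At 100g, a single shuffle flush essentially never crosses the cold-storage threshold, which is why nothing reaches DFS.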
4. start Shuffle Server
```
@@ -121,12 +144,11 @@ rss-xxx.tgz will be generated for deployment
The jar for Spark3 is located in <RSS_HOME>/jars/client/spark3/rss-client-XXXXX-shaded.jar
-2. Update Spark conf to enable Firestorm, the following demo is for local storage only, eg,
+2. Update Spark conf to enable Firestorm, eg,
```
spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
- spark.rss.storage.type MEMORY_LOCALFILE
```
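With dynamic client conf enabled on the coordinator, the client no longer sets `spark.rss.storage.type` itself, which is why the diff removes it. A minimal client-side conf sketch reflecting that (coordinator addresses are placeholders from the diff):

```shell
# Sketch: minimal client-side Spark conf once dynamicClientConf is enabled;
# the coordinator quorum addresses are placeholders.
cat > spark-defaults.conf <<'EOF'
spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
spark.rss.coordinator.quorum coordinatorIp1:19999,coordinatorIp2:19999
EOF
manager=$(awk '$1 == "spark.shuffle.manager" {print $2}' spark-defaults.conf)
echo "$manager"
```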
### Support Spark dynamic allocation
@@ -140,17 +162,16 @@ After apply the patch and rebuild spark, add following configuration in spark co
spark.dynamicAllocation.enabled true
```
-## Deploy MapReduce Client
+### Deploy MapReduce Client
1. Add client jar to the classpath of each NodeManager, e.g., <HADOOP>/share/hadoop/mapreduce/
The jar for MapReduce is located in <RSS_HOME>/jars/client/mr/rss-client-mr-XXXXX-shaded.jar
-2. Update MapReduce conf to enable Firestorm, the following demo is for local storage only, eg,
+2. Update MapReduce conf to enable Firestorm, eg,
```
-Dmapreduce.rss.coordinator.quorum=<coordinatorIp1>:19999,<coordinatorIp2>:19999
- -Dmapreduce.rss.storage.type=MEMORY_LOCALFILE
-Dyarn.app.mapreduce.am.command-opts=org.apache.hadoop.mapreduce.v2.app.RssMRAppMaster
-Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.RssMapOutputCollector
-Dmapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.RssShuffle
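The MapReduce options above are typically collected into one string and passed on the job submission command line. A sketch of assembling them (quorum addresses are placeholders; the variable name is illustrative):

```shell
# Sketch: collect the MapReduce client -D options from the diff into one
# variable for job submission; quorum addresses are placeholders.
rss_opts="-Dmapreduce.rss.coordinator.quorum=coordinatorIp1:19999,coordinatorIp2:19999"
rss_opts="$rss_opts -Dyarn.app.mapreduce.am.command-opts=org.apache.hadoop.mapreduce.v2.app.RssMRAppMaster"
rss_opts="$rss_opts -Dmapreduce.job.map.output.collector.class=org.apache.hadoop.mapred.RssMapOutputCollector"
rss_opts="$rss_opts -Dmapreduce.job.reduce.shuffle.consumer.plugin.class=org.apache.hadoop.mapreduce.task.reduce.RssShuffle"
# Count the assembled -D options as a quick sanity check
opt_count=$(echo "$rss_opts" | tr ' ' '\n' | grep -c '^-D')
echo "$opt_count"
```

Note that `mapreduce.rss.storage.type` is gone: as with Spark, the coordinator supplies it through the dynamic client conf.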
@@ -168,9 +189,10 @@ The important configuration is listed as following.
|Property Name|Default| Description|
|---|---|---|
|rss.coordinator.server.heartbeat.timeout|30000|Timeout if can't get heartbeat from shuffle server|
-|rss.coordinator.assignment.strategy|BASIC|Strategy for assigning shuffle server, only BASIC support|
+|rss.coordinator.assignment.strategy|PARTITION_BALANCE|Strategy for assigning shuffle servers; PARTITION_BALANCE should be used for workload balance|
|rss.coordinator.app.expired|60000|Application expired time (ms), the heartbeat interval should be less than it|
|rss.coordinator.shuffle.nodes.max|9|The max number of shuffle server when do the assignment|
+|rss.coordinator.dynamicClientConf.path|-|The path of the configuration file which has the default conf for the rss client|
|rss.coordinator.exclude.nodes.file.path|-|The path of configuration file which have exclude nodes|
|rss.coordinator.exclude.nodes.check.interval.ms|60000|Update interval (ms) for exclude nodes|
|rss.rpc.server.port|-|RPC port for coordinator|