Posted to commits@uniffle.apache.org by ro...@apache.org on 2022/09/24 03:25:34 UTC

[incubator-uniffle] branch master updated: Set the default disk capacity to the total space (#237)

This is an automated email from the ASF dual-hosted git repository.

roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git


The following commit(s) were added to refs/heads/master by this push:
     new f0fbdc48 Set the default disk capacity to the total space (#237)
f0fbdc48 is described below

commit f0fbdc48f662221f61dfbcdfbbcfb44ee58efc1a
Author: Junfan Zhang <ju...@outlook.com>
AuthorDate: Sat Sep 24 11:25:29 2022 +0800

    Set the default disk capacity to the total space (#237)
    
    ### What changes were proposed in this pull request?
    Set the default disk capacity to the total space
    
    ### Why are the changes needed?
    When shuffle servers are colocated with NodeManagers, the disk capacity is hard to set: if it is set equal to the total disk space, most shuffle servers will fail to start due to the free-space check, while a smaller value wastes disk resources.
    
    Besides, the current default disk capacity of 1TB is not an out-of-the-box value. So this PR sets the default value of 'rss.server.disk.capacity' to the disk's total space.
    
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    No need.
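The capacity fallback this PR introduces in LocalStorage can be sketched roughly as follows. This is a minimal standalone sketch, not the actual class; `resolveCapacity` is a hypothetical helper name introduced only for illustration:

```java
import java.io.File;

// Sketch of the capacity-fallback logic added to LocalStorage in this PR.
// DiskCapacityFallback and resolveCapacity are illustrative names only,
// not part of the Uniffle codebase.
public class DiskCapacityFallback {

    // A negative configured capacity (the new default, -1) means
    // "use the disk's total space"; a non-negative value is still
    // validated against the currently free space, as before.
    static long resolveCapacity(File baseFolder, long configuredCapacity) {
        if (configuredCapacity < 0L) {
            return baseFolder.getTotalSpace();
        }
        long freeSpace = baseFolder.getFreeSpace();
        if (freeSpace < configuredCapacity) {
            throw new IllegalArgumentException(
                "Available capacity " + freeSpace + " of " + baseFolder
                    + " is smaller than the configured capacity " + configuredCapacity);
        }
        return configuredCapacity;
    }

    public static void main(String[] args) {
        File tmp = new File(System.getProperty("java.io.tmpdir"));
        // With the new default of -1, the whole disk's total space is used.
        System.out.println("resolved=" + resolveCapacity(tmp, -1L));
    }
}
```

With the old behavior, a capacity equal to the total space always failed the free-space check at startup; the sketch shows why the negative sentinel sidesteps that check entirely.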
---
 README.md                                          | 32 ++---------
 docs/server_guide.md                               | 62 ++++++++++++++++++++++
 .../apache/uniffle/server/ShuffleServerConf.java   |  6 +--
 .../uniffle/storage/common/LocalStorage.java       | 16 ++++--
 4 files changed, 81 insertions(+), 35 deletions(-)

diff --git a/README.md b/README.md
index 4ab67422..8ec523c4 100644
--- a/README.md
+++ b/README.md
@@ -235,33 +235,11 @@ and job recovery (i.e., `yarn.app.mapreduce.am.job.recovery.enable=false`)
 
 The important configuration is listed as following.
 
-### Coordinator
-
-For more details of advanced configuration, please see [Uniffle Coordinator Guide](https://github.com/apache/incubator-uniffle/blob/master/docs/coordinator_guide.md).
-
-### Shuffle Server
-
-|Property Name|Default|Description|
-|---|---|---|
-|rss.coordinator.quorum|-|Coordinator quorum|
-|rss.rpc.server.port|-|RPC port for Shuffle server|
-|rss.jetty.http.port|-|Http port for Shuffle server|
-|rss.server.buffer.capacity|-|Max memory of buffer manager for shuffle server|
-|rss.server.memory.shuffle.highWaterMark.percentage|75.0|Threshold of spill data to storage, percentage of rss.server.buffer.capacity|
-|rss.server.memory.shuffle.lowWaterMark.percentage|25.0|Threshold of keep data in memory, percentage of rss.server.buffer.capacity|
-|rss.server.read.buffer.capacity|-|Max size of buffer for reading data|
-|rss.server.heartbeat.interval|10000|Heartbeat interval to Coordinator (ms)|
-|rss.server.flush.threadPool.size|10|Thread pool for flush data to file|
-|rss.server.commit.timeout|600000|Timeout when commit shuffle data (ms)|
-|rss.storage.type|-|Supports MEMORY_LOCALFILE, MEMORY_HDFS, MEMORY_LOCALFILE_HDFS|
-|rss.server.flush.cold.storage.threshold.size|64M| The threshold of data size for LOACALFILE and HDFS if MEMORY_LOCALFILE_HDFS is used|
-|rss.server.tags|-|The comma-separated list of tags to indicate the shuffle server's attributes. It will be used as the assignment basis for the coordinator|
-|rss.server.single.buffer.flush.enabled|false|Whether single buffer flush when size exceeded rss.server.single.buffer.flush.threshold|
-|rss.server.single.buffer.flush.threshold|64M|The threshold of single shuffle buffer flush|
-
-### Shuffle Client
-
-For more details of advanced configuration, please see [Uniffle Shuffle Client Guide](https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md).
+|Role|Link|
+|---|---|
+|coordinator|[Uniffle Coordinator Guide](https://github.com/apache/incubator-uniffle/blob/master/docs/coordinator_guide.md)|
+|shuffle server|[Uniffle Shuffle Server Guide](https://github.com/apache/incubator-uniffle/blob/master/docs/server_guide.md)|
+|client|[Uniffle Shuffle Client Guide](https://github.com/apache/incubator-uniffle/blob/master/docs/client_guide.md)|
 
 ## Security:Hadoop kerberos authentication
 The primary goals of the Uniffle Kerberos security are:
diff --git a/docs/server_guide.md b/docs/server_guide.md
index e959839f..5e2a97ec 100644
--- a/docs/server_guide.md
+++ b/docs/server_guide.md
@@ -20,3 +20,65 @@ license: |
   limitations under the License.
 ---
 # Uniffle Shuffle Server Guide
+
+## Deploy
+This document will introduce how to deploy Uniffle shuffle servers.
+
+### Steps
+1. unzip package to RSS_HOME
+2. update RSS_HOME/bin/rss-env.sh, e.g.,
+   ```
+     JAVA_HOME=<java_home>
+     HADOOP_HOME=<hadoop home>
+     XMX_SIZE="80g"
+   ```
+3. update RSS_HOME/conf/server.conf, e.g.,
+   ```
+     rss.rpc.server.port 19999
+     rss.jetty.http.port 19998
+     rss.rpc.executor.size 2000
+     # it should be configured the same as in the coordinator
+     rss.storage.type MEMORY_LOCALFILE_HDFS
+     rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
+     # local storage path for shuffle server
+     rss.storage.basePath /data1/rssdata,/data2/rssdata....
+     # it's better to configure the thread number according to the number of local disks
+     rss.server.flush.thread.alive 5
+     rss.server.flush.threadPool.size 10
+     rss.server.buffer.capacity 40g
+     rss.server.read.buffer.capacity 20g
+     rss.server.heartbeat.timeout 60000
+     rss.server.heartbeat.interval 10000
+     rss.rpc.message.max.size 1073741824
+     rss.server.preAllocation.expired 120000
+     rss.server.commit.timeout 600000
+     rss.server.app.expired.withoutHeartbeat 120000
+     # note: the default value of rss.server.flush.cold.storage.threshold.size is 64m
+     # if it is set to 100g, no data will be written to DFS even when rss.storage.type=MEMORY_LOCALFILE_HDFS
+     # please set a proper value if DFS is used, e.g., 64m, 128m.
+     rss.server.flush.cold.storage.threshold.size 100g
+   ```
+4. start Shuffle Server
+   ```
+    bash RSS_HOME/bin/start-shuffle-server.sh
+   ```
+
+## Configuration
+|Property Name|Default|Description|
+|---|---|---|
+|rss.coordinator.quorum|-|Coordinator quorum|
+|rss.rpc.server.port|-|RPC port for Shuffle server|
+|rss.jetty.http.port|-|Http port for Shuffle server|
+|rss.server.buffer.capacity|-|Max memory of buffer manager for shuffle server|
+|rss.server.memory.shuffle.highWaterMark.percentage|75.0|Threshold of spill data to storage, percentage of rss.server.buffer.capacity|
+|rss.server.memory.shuffle.lowWaterMark.percentage|25.0|Threshold of keep data in memory, percentage of rss.server.buffer.capacity|
+|rss.server.read.buffer.capacity|-|Max size of buffer for reading data|
+|rss.server.heartbeat.interval|10000|Heartbeat interval to Coordinator (ms)|
+|rss.server.flush.threadPool.size|10|Thread pool for flush data to file|
+|rss.server.commit.timeout|600000|Timeout when commit shuffle data (ms)|
+|rss.storage.type|-|Supports MEMORY_LOCALFILE, MEMORY_HDFS, MEMORY_LOCALFILE_HDFS|
+|rss.server.flush.cold.storage.threshold.size|64M|The threshold of data size for LOCALFILE and HDFS if MEMORY_LOCALFILE_HDFS is used|
+|rss.server.tags|-|The comma-separated list of tags to indicate the shuffle server's attributes. It will be used as the assignment basis for the coordinator|
+|rss.server.single.buffer.flush.enabled|false|Whether to flush a single buffer when its size exceeds rss.server.single.buffer.flush.threshold|
+|rss.server.single.buffer.flush.threshold|64M|The threshold of single shuffle buffer flush|
+|rss.server.disk.capacity|-1|Disk capacity that shuffle server can use. If it's negative, the disk's total space will be used|
diff --git a/server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java b/server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java
index 7336f6d5..9e580842 100644
--- a/server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java
+++ b/server/src/main/java/org/apache/uniffle/server/ShuffleServerConf.java
@@ -177,9 +177,9 @@ public class ShuffleServerConf extends RssBaseConf {
   public static final ConfigOption<Long> DISK_CAPACITY = ConfigOptions
       .key("rss.server.disk.capacity")
       .longType()
-      .checkValue(ConfigUtils.POSITIVE_LONG_VALIDATOR, "disk capacity must be positive")
-      .defaultValue(1024L * 1024L * 1024L * 1024L)
-      .withDescription("Disk capacity that shuffle server can use");
+      .defaultValue(-1L)
+      .withDescription("Disk capacity that shuffle server can use. "
+          + "If it's negative, the disk's total space will be used");
 
   public static final ConfigOption<Long> SHUFFLE_EXPIRED_TIMEOUT_MS = ConfigOptions
       .key("rss.server.shuffle.expired.timeout.ms")
diff --git a/storage/src/main/java/org/apache/uniffle/storage/common/LocalStorage.java b/storage/src/main/java/org/apache/uniffle/storage/common/LocalStorage.java
index a2fa0847..19c886ee 100644
--- a/storage/src/main/java/org/apache/uniffle/storage/common/LocalStorage.java
+++ b/storage/src/main/java/org/apache/uniffle/storage/common/LocalStorage.java
@@ -44,7 +44,7 @@ public class LocalStorage extends AbstractStorage {
   private static final Logger LOG = LoggerFactory.getLogger(LocalStorage.class);
   public static final String STORAGE_HOST = "local";
 
-  private final long capacity;
+  private long capacity;
   private final String basePath;
   private final double cleanupThreshold;
   private final long cleanIntervalMs;
@@ -77,10 +77,16 @@ public class LocalStorage extends AbstractStorage {
       LOG.warn("Init base directory " + basePath + " fail, the disk should be corrupted", ioe);
       throw new RuntimeException(ioe);
     }
-    long freeSpace = baseFolder.getFreeSpace();
-    if (freeSpace < capacity) {
-      throw new IllegalArgumentException("The Disk of " + basePath + " Available Capacity " + freeSpace
-          + " is smaller than configuration");
+    if (capacity < 0L) {
+      this.capacity = baseFolder.getTotalSpace();
+      LOG.info("Use the disk total space as the capacity because \"rss.server.disk.capacity\" is not specified "
+          + "or is negative");
+    } else {
+      long freeSpace = baseFolder.getFreeSpace();
+      if (freeSpace < capacity) {
+        throw new IllegalArgumentException("The Disk of " + basePath + " Available Capacity " + freeSpace
+            + " is smaller than configuration");
+      }
     }
   }