Posted to commits@uniffle.apache.org by ro...@apache.org on 2023/01/22 13:34:36 UTC
[incubator-uniffle] branch master updated: [ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions (#505)

This is an automated email from the ASF dual-hosted git repository.

roryqi pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git


The following commit(s) were added to refs/heads/master by this push:
     new 3ba56c54 [ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions (#505)
3ba56c54 is described below

commit 3ba56c5496f39032cd723906e1776ec129d3cc01
Author: Junfan Zhang <zu...@apache.org>
AuthorDate: Sun Jan 22 21:34:31 2023 +0800

    [ISSUE-378][HugePartition][Part-4] Supplement doc about huge partitions (#505)
    
    ### What changes were proposed in this pull request?
    Add doc about huge partitions
    
    ### Why are the changes needed?
    Guide for users
    
    
    ### Does this PR introduce _any_ user-facing change?
    No
    
    ### How was this patch tested?
    Don't need
---
 docs/server_guide.md | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 48 insertions(+), 2 deletions(-)

diff --git a/docs/server_guide.md b/docs/server_guide.md
index 09decaae..c646d03c 100644
--- a/docs/server_guide.md
+++ b/docs/server_guide.md
@@ -86,8 +86,6 @@ This document will introduce how to deploy Uniffle shuffle servers.
 |rss.server.leak.shuffledata.check.interval|3600000|The interval of leak shuffle data check (ms)|
 |rss.server.max.concurrency.of.single.partition.writer|1|The max concurrency of single partition writer, the data partition file number is equal to this value. Default value is 1. This config could improve the writing speed, especially for huge partition.|
 |rss.metrics.reporter.class|-|The class of metrics reporter.|
-|rss.server.huge-partition.size.threshold|20g|Threshold of huge partition size, once exceeding threshold, memory usage limitation and huge partition buffer flushing will be triggered.|
-|rss.server.huge-partition.memory.limit.ratio|0.2|The memory usage limit ratio for huge partition, it will only triggered when partition's size exceeds the threshold of 'rss.server.huge-partition.size.threshold'|
 
 ### Advanced Configurations
 |Property Name|Default| Description                                                                                                                                                                                 |
@@ -104,3 +102,51 @@ PrometheusPushGatewayMetricReporter is one of the built-in metrics reporter, whi
 |rss.metrics.prometheus.pushgateway.groupingkey|-| Specifies the grouping key which is the group and global labels of all metrics. The label name and value are separated by '=', and labels are separated by ';', e.g., k1=v1;k2=v2. Please ensure that your grouping key meets the [Prometheus requirements](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels). |
 |rss.metrics.prometheus.pushgateway.jobname|-| The job name under which metrics will be pushed.                                                                                                                                                                                                                                                                                      |
 |rss.metrics.prometheus.pushgateway.report.interval.seconds|10| The interval in seconds for the reporter to report metrics.                                                                                                                                                                                                                                                                                     |
+
+### Huge Partition Optimization
+Huge partitions are a common problem in Spark, MapReduce, and similar engines. They are caused by data skew and can make the shuffle server unstable. To address this, Uniffle limits writes to huge partitions so that regular partitions are not affected; more details can be found in [ISSUE-378](https://github.com/apache/incubator-uniffle/issues/378). Two mechanisms are used: a memory usage limit, and flushing individual buffers directly to persistent storage.
+
+#### Memory usage limit
+The memory usage limit is controlled by the following extra configs:
+
+|Property Name|Default|Description|
+|---|---|---|
+|rss.server.huge-partition.size.threshold|20g|Threshold of huge partition size. Once a partition exceeds this threshold, the memory usage limit and huge partition buffer flushing are triggered. The value depends on the per-disk capacity of the shuffle server. For example, if each disk has a capacity of 1TB and at most 5 huge partitions land on each disk, the total size of huge partitions on a local disk is 100g (10%), which is an acceptable config value. Once reaching this threshold, it will be better to [...]
+to HDFS directly, which could be handled by multiple storage manager fallback strategy|
+|rss.server.huge-partition.memory.limit.ratio|0.2|The memory usage limit ratio for a huge partition. It is only triggered when the partition's size exceeds the threshold set by `rss.server.huge-partition.size.threshold`. If the buffer capacity is 10g, the default memory usage limit for a huge partition is 2g. Similarly, this value depends on the max size of huge partitions on each shuffle server.|
+
+#### Data flush
+Once the huge partition threshold is reached, the partition is marked as a huge partition and single buffer flush is triggered, so its data is written to persistent storage as soon as possible. Normally, single buffer flush is enabled only by setting `rss.server.single.buffer.flush.enabled`, but it takes effect automatically for huge partitions.
+
+If you don't use HDFS, the huge partition may be flushed to local disk, which is dangerous if the partition size is larger than the free disk space. Therefore, it is recommended to use a mixed storage type, including HDFS or other distributed file systems.
+
+When using HDFS, set `rss.server.single.buffer.flush.threshold` to a value greater than `rss.server.flush.cold.storage.threshold.size`, so that single-buffer flushes go directly to HDFS.
+
+Finally, to improve the speed of writing to HDFS for a single partition, the values of `rss.server.max.concurrency.of.single.partition.writer` and `rss.server.flush.threadPool.size` could be increased to 10 or 20.
+
+#### Example of server conf
+```
+rss.rpc.server.port 19999
+rss.jetty.http.port 19998
+rss.rpc.executor.size 2000
+rss.storage.type MEMORY_LOCALFILE_HDFS
+rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
+rss.storage.basePath /data1/rssdata,/data2/rssdata....
+rss.server.flush.thread.alive 10
+rss.server.buffer.capacity 40g
+rss.server.read.buffer.capacity 20g
+rss.server.heartbeat.timeout 60000
+rss.server.heartbeat.interval 10000
+rss.rpc.message.max.size 1073741824
+rss.server.preAllocation.expired 120000
+rss.server.commit.timeout 600000
+rss.server.app.expired.withoutHeartbeat 120000
+
+# For huge partitions
+rss.server.flush.threadPool.size 20
+rss.server.flush.cold.storage.threshold.size 128m
+rss.server.single.buffer.flush.threshold 129m
+rss.server.max.concurrency.of.single.partition.writer 20
+rss.server.huge-partition.size.threshold 20g
+rss.server.huge-partition.memory.limit.ratio 0.2
+```
\ No newline at end of file
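
The sizing rules described in the added doc section can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not part of Uniffle: the helper names (`parse_size`, `parse_conf`, and the three check functions) are hypothetical, and it assumes size suffixes such as `g` and `m` are 1024-based. It reads the huge-partition keys from the example server conf and computes the memory limit (buffer capacity times ratio), the share of one disk taken by huge partitions, and whether the single buffer flush threshold exceeds the cold storage threshold.

```python
import re

# Assumption: size suffixes are binary multiples (k/m/g/t are 1024-based).
UNITS = {"k": 1024, "m": 1024**2, "g": 1024**3, "t": 1024**4}

# Only the keys relevant to huge partitions, copied from the example conf.
EXAMPLE_CONF = """
rss.server.buffer.capacity 40g
# For huge partitions
rss.server.flush.cold.storage.threshold.size 128m
rss.server.single.buffer.flush.threshold 129m
rss.server.huge-partition.size.threshold 20g
rss.server.huge-partition.memory.limit.ratio 0.2
"""


def parse_size(text):
    """Convert a size string such as '20g' or '128m' into bytes."""
    match = re.fullmatch(r"(\d+)([kmgt]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * UNITS.get(unit, 1)


def parse_conf(text):
    """Parse 'key value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition(" ")
            conf[key] = value.strip()
    return conf


def huge_partition_memory_limit(conf):
    """Memory limit in bytes: buffer capacity multiplied by the limit ratio."""
    capacity = parse_size(conf["rss.server.buffer.capacity"])
    ratio = float(conf["rss.server.huge-partition.memory.limit.ratio"])
    return round(capacity * ratio)


def huge_partition_disk_share(conf, huge_partitions_per_disk, disk_capacity):
    """Fraction of one disk used if each huge partition stops at the threshold."""
    threshold = parse_size(conf["rss.server.huge-partition.size.threshold"])
    return huge_partitions_per_disk * threshold / parse_size(disk_capacity)


def single_buffer_flush_goes_to_cold_storage(conf):
    """True when the single buffer flush threshold exceeds the cold storage one."""
    single = parse_size(conf["rss.server.single.buffer.flush.threshold"])
    cold = parse_size(conf["rss.server.flush.cold.storage.threshold.size"])
    return single > cold


conf = parse_conf(EXAMPLE_CONF)
print(huge_partition_memory_limit(conf) // UNITS["g"], "g memory limit for huge partitions")
print(f"{huge_partition_disk_share(conf, 5, '1t'):.1%} of each disk used by huge partitions")
print("direct-to-HDFS flush:", single_buffer_flush_goes_to_cold_storage(conf))
```

The disk share uses the doc's own example of 5 huge partitions on a 1TB disk; both numbers are inputs to the sketch and should be replaced with values that match the actual deployment.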