Posted to commits@uniffle.apache.org by ck...@apache.org on 2023/01/17 02:55:30 UTC

[incubator-uniffle] 04/04: Improve README (#427)

This is an automated email from the ASF dual-hosted git repository.

ckj pushed a commit to branch branch-0.6
in repository https://gitbox.apache.org/repos/asf/incubator-uniffle.git

commit c4259be963a019559a4908bfbcf166cdb89699e7
Author: Kaijie Chen <ck...@apache.org>
AuthorDate: Thu Dec 15 20:23:21 2022 +0800

    Improve README (#427)
    
    1. Update the introduction section in README:
        * "compute engines" is changed to "distributed compute engines".
        * "MapReduce" is changed to "Apache Hadoop MapReduce".
        * Remove the "Total lines" badge, see [OSSInsight][1] for better insight.
    2. Fix typos.
    
    [1]: https://ossinsight.io/analyze/apache/incubator-uniffle#lines-of-code-changed "apache/incubator-uniffle | OSSInsight"
    
    Improve the README.
    
    No.
    
    Preview: https://github.com/kaijchen/incubator-uniffle/tree/readme#apache-uniffle-incubating
---
 README.md | 48 +++++++++++++++++++++++++-----------------------
 1 file changed, 25 insertions(+), 23 deletions(-)

diff --git a/README.md b/README.md
index 84fd8099..c6f2a77b 100644
--- a/README.md
+++ b/README.md
@@ -17,23 +17,25 @@
 
 # Apache Uniffle (Incubating)
 
-Uniffle is a unified remote shuffle service for compute engines.
+Uniffle is a unified remote shuffle service for distributed compute engines.
 It provides the ability to aggregate and store shuffle data on remote servers,
 thus improving the performance and reliability of large jobs.
-Currently it supports [Apache Spark](https://spark.apache.org) and [MapReduce](https://hadoop.apache.org).
+Currently it supports [Apache Spark](https://spark.apache.org) and [Apache Hadoop MapReduce](https://hadoop.apache.org).
 
 [![Build](https://github.com/apache/incubator-uniffle/actions/workflows/build.yml/badge.svg?branch=master&event=push)](https://github.com/apache/incubator-uniffle/actions/workflows/build.yml)
 [![Codecov](https://codecov.io/gh/apache/incubator-uniffle/branch/master/graph/badge.svg)](https://codecov.io/gh/apache/incubator-uniffle)
+[![License](https://img.shields.io/github/license/apache/incubator-uniffle)](https://github.com/apache/incubator-uniffle/blob/master/LICENSE)
+[![Release](https://img.shields.io/github/v/release/apache/incubator-uniffle)](https://github.com/apache/incubator-uniffle/releases)
 
 ## Architecture
 ![Rss Architecture](docs/asset/rss_architecture.png)
-Uniffle contains coordinator cluster, shuffle server cluster and remote storage(eg, HDFS) if necessary.
+A Uniffle cluster consists of three components: a coordinator cluster, a shuffle server cluster, and an optional remote storage (e.g., HDFS).
 
-Coordinator will collect status of shuffle server and do the assignment for the job.
+The coordinator collects the status of the shuffle servers and assigns shuffle servers to jobs according to its assignment strategy.
 
 The shuffle server receives the shuffle data, merges it, and writes it to storage.
 
-Depend on different situation, Uniffle supports Memory & Local, Memory & Remote Storage(eg, HDFS), Memory & Local & Remote Storage(recommendation for production environment).
+Depending on the situation, Uniffle supports Memory & Local, Memory & Remote Storage (e.g., HDFS), and Memory & Local & Remote Storage (recommended for production environments), as sketched below.
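+
+For instance, the storage mode is selected with `rss.storage.type`. A minimal sketch of the three modes, assuming the value names `MEMORY_LOCALFILE` and `MEMORY_HDFS` (only `MEMORY_LOCALFILE_HDFS` is used later in this guide; check the configuration guide of your release for the exact supported values):
+
+```
+# Memory & Local: buffer shuffle data in memory and spill to local disks
+# (value name assumed)
+rss.storage.type MEMORY_LOCALFILE
+
+# Memory & Remote Storage: buffer in memory and flush to a remote storage such as HDFS
+# (value name assumed)
+rss.storage.type MEMORY_HDFS
+
+# Memory & Local & Remote Storage: recommended for production environments
+rss.storage.type MEMORY_LOCALFILE_HDFS
+```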
 
 ## Shuffle Process with Uniffle
 
@@ -50,20 +52,20 @@ Depend on different situation, Uniffle supports Memory & Local, Memory & Remote
    8. After writing data, the task reports all blockIds to the shuffle server; this step is used later for data validation
    9. Store taskAttemptId in MapStatus to support Spark speculation
 
-* Depend on different storage type, spark task read shuffle data from shuffle server or remote storage or both of them.
+Depending on the storage type, the Spark task reads shuffle data from the shuffle server, the remote storage, or both.
 
 ## Shuffle file format
-The shuffle data is stored with index file and data file. Data file has all blocks for specific partition and index file has metadata for every block.
+The shuffle data is stored as an index file and a data file. The data file contains all blocks for a specific partition, and the index file contains metadata for every block.
 
 ![Rss Shuffle_Write](docs/asset/rss_data_format.png)
 
 ## Supported Spark Version
-Current support Spark 2.3.x, Spark 2.4.x, Spark3.0.x, Spark 3.1.x, Spark 3.2.x
+Uniffle currently supports Spark 2.3.x, Spark 2.4.x, Spark 3.0.x, Spark 3.1.x, and Spark 3.2.x.
 
 Note: To support dynamic allocation, the patch (included in the client-spark/patch folder) should be applied to Spark.
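+
+A minimal sketch of applying it, assuming a Spark source checkout (the paths and patch file name below are placeholders; pick the file under client-spark/patch that matches your Spark version):
+
+    cd /path/to/spark-source                 # a Spark source tree, not a binary distribution
+    git apply /path/to/incubator-uniffle/client-spark/patch/<patch-for-your-spark-version>
+    ./build/mvn -DskipTests clean package    # rebuild Spark with the patch applied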
 
 ## Supported MapReduce Version
-Current support Hadoop 2.8.5's MapReduce framework.
+Uniffle currently supports the MapReduce framework of Hadoop 2.8.5.
 
 ## Building Uniffle
 > note: currently Uniffle requires JDK 1.8 to build; adding support for later JDKs is on our roadmap.
@@ -73,11 +75,11 @@ To build it, run:
 
     mvn -DskipTests clean package
 
-Build against profile Spark2(2.4.6)
+Build against profile Spark 2 (2.4.6)
 
     mvn -DskipTests clean package -Pspark2
 
-Build against profile Spark3(3.1.2)
+Build against profile Spark 3 (3.1.2)
 
     mvn -DskipTests clean package -Pspark3
 
@@ -108,13 +110,13 @@ rss-xxx.tgz will be generated for deployment
 ### Deploy Coordinator
 
 1. unzip package to RSS_HOME
-2. update RSS_HOME/bin/rss-env.sh, eg,
+2. update RSS_HOME/bin/rss-env.sh, e.g.,
    ```
      JAVA_HOME=<java_home>
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="16g"
    ```
-3. update RSS_HOME/conf/coordinator.conf, eg,
+3. update RSS_HOME/conf/coordinator.conf, e.g.,
    ```
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
@@ -128,9 +130,9 @@ rss-xxx.tgz will be generated for deployment
      # config the path of excluded shuffle server
      rss.coordinator.exclude.nodes.file.path <RSS_HOME>/conf/exclude_nodes
    ```
-4. update <RSS_HOME>/conf/dynamic_client.conf, rss client will get default conf from coordinator eg,
+4. update <RSS_HOME>/conf/dynamic_client.conf; the rss client will fetch its default conf from the coordinator, e.g.,
    ```
-    # MEMORY_LOCALFILE_HDFS is recommandation for production environment
+    # MEMORY_LOCALFILE_HDFS is recommended for production environments
     rss.storage.type MEMORY_LOCALFILE_HDFS
     # multiple remote storages are supported, and client will get assignment from coordinator
     rss.coordinator.remote.storage.path hdfs://cluster1/path,hdfs://cluster2/path
@@ -147,18 +149,18 @@ rss-xxx.tgz will be generated for deployment
 ### Deploy Shuffle Server
 
 1. unzip package to RSS_HOME
-2. update RSS_HOME/bin/rss-env.sh, eg,
+2. update RSS_HOME/bin/rss-env.sh, e.g.,
    ```
      JAVA_HOME=<java_home>
      HADOOP_HOME=<hadoop home>
      XMX_SIZE="80g"
    ```
-3. update RSS_HOME/conf/server.conf, eg,
+3. update RSS_HOME/conf/server.conf, e.g.,
    ```
      rss.rpc.server.port 19999
      rss.jetty.http.port 19998
      rss.rpc.executor.size 2000
-     # it should be configed the same as in coordinator
+     # it should be configured the same as in coordinator
      rss.storage.type MEMORY_LOCALFILE_HDFS
      rss.coordinator.quorum <coordinatorIp1>:19999,<coordinatorIp2>:19999
      # local storage path for shuffle server
@@ -176,7 +178,7 @@ rss-xxx.tgz will be generated for deployment
      rss.server.app.expired.withoutHeartbeat 120000
      # note: the default value of rss.server.flush.cold.storage.threshold.size is 64m
      # there will be no data written to DFS if set it as 100g even rss.storage.type=MEMORY_LOCALFILE_HDFS
-     # please set proper value if DFS is used, eg, 64m, 128m.
+     # please set a proper value if DFS is used, e.g., 64m, 128m.
      rss.server.flush.cold.storage.threshold.size 100g
    ```
 4. start Shuffle Server
@@ -185,13 +187,13 @@ rss-xxx.tgz will be generated for deployment
    ```
 
 ### Deploy Spark Client
-1. Add client jar to Spark classpath, eg, SPARK_HOME/jars/
+1. Add the client jar to the Spark classpath, e.g., SPARK_HOME/jars/
 
    The jar for Spark2 is located in <RSS_HOME>/jars/client/spark2/rss-client-XXXXX-shaded.jar
 
    The jar for Spark3 is located in <RSS_HOME>/jars/client/spark3/rss-client-XXXXX-shaded.jar
 
-2. Update Spark conf to enable Uniffle, eg,
+2. Update Spark conf to enable Uniffle, e.g.,
 
    ```
    spark.shuffle.manager org.apache.spark.shuffle.RssShuffleManager
@@ -216,7 +218,7 @@ After apply the patch and rebuild spark, add following configuration in spark co
 
 The jar for MapReduce is located in <RSS_HOME>/jars/client/mr/rss-client-mr-XXXXX-shaded.jar
 
-2. Update MapReduce conf to enable Uniffle, eg,
+2. Update MapReduce conf to enable Uniffle, e.g.,
 
    ```
    -Dmapreduce.rss.coordinator.quorum=<coordinatorIp1>:19999,<coordinatorIp2>:19999
@@ -230,7 +232,7 @@ and job recovery (i.e., `yarn.app.mapreduce.am.job.recovery.enable=false`)
 
 ## Configuration
 
-The important configuration is listed as following.
+The important configuration is listed as follows.
 
 ### Coordinator