Posted to commits@celeborn.apache.org by et...@apache.org on 2023/03/15 12:13:27 UTC

[incubator-celeborn-website] branch main updated: Add docs about deploy.

This is an automated email from the ASF dual-hosted git repository.

ethanfeng pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn-website.git


The following commit(s) were added to refs/heads/main by this push:
     new 950b0fe  Add docs about deploy.
950b0fe is described below

commit 950b0fe5c44a829532fb3dac7f195e68be6bb120
Author: mingji <fe...@alibaba-inc.com>
AuthorDate: Wed Mar 15 20:13:17 2023 +0800

    Add docs about deploy.
---
 docs/user_guide/deploy.md | 136 ++++++++++++++++++++++++++++++++++++++++++++++
 mkdocs.yml                |   1 +
 2 files changed, 137 insertions(+)

diff --git a/docs/user_guide/deploy.md b/docs/user_guide/deploy.md
new file mode 100644
index 0000000..3107923
--- /dev/null
+++ b/docs/user_guide/deploy.md
@@ -0,0 +1,136 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+  http://www.apache.org/licenses/LICENSE-2.0
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# Deploy
+
+## Deploy Celeborn
+
+1. Unzip the tarball to `$CELEBORN_HOME`
+2. Modify environment variables in `$CELEBORN_HOME/conf/celeborn-env.sh`
+
+EXAMPLE:
+```bash
+#!/usr/bin/env bash
+CELEBORN_MASTER_MEMORY=4g
+CELEBORN_WORKER_MEMORY=2g
+CELEBORN_WORKER_OFFHEAP_MEMORY=4g
+```
+3. Modify configurations in `$CELEBORN_HOME/conf/celeborn-defaults.conf`
+
+EXAMPLE: single master cluster
+```properties
+# used by client and worker to connect to master
+celeborn.master.endpoints clb-master:9097
+
+# used by master to bootstrap
+celeborn.master.host clb-master
+celeborn.master.port 9097
+
+celeborn.metrics.enabled true
+celeborn.worker.flush.buffer.size 256k
+celeborn.worker.storage.dirs /mnt/disk1/,/mnt/disk2
+# If your hosts use disk RAID or LVM, set celeborn.worker.monitor.disk.enabled to false
+celeborn.worker.monitor.disk.enabled false
+```   
+
+EXAMPLE: HA cluster
+```properties
+# used by client and worker to connect to master
+celeborn.master.endpoints clb-1:9097,clb-2:9098,clb-3:9099
+
+# used by master nodes to bootstrap. Every node must know the topology of the whole cluster. Each node
+# must have a unique `celeborn.ha.master.node.id`, and `celeborn.ha.master.node.<id>.host` is required
+celeborn.ha.enabled true
+celeborn.ha.master.node.id 1
+celeborn.ha.master.node.1.host clb-1
+celeborn.ha.master.node.1.port 9097
+celeborn.ha.master.node.1.ratis.port 9872
+celeborn.ha.master.node.2.host clb-2
+celeborn.ha.master.node.2.port 9098
+celeborn.ha.master.node.2.ratis.port 9873
+celeborn.ha.master.node.3.host clb-3
+celeborn.ha.master.node.3.port 9099
+celeborn.ha.master.node.3.ratis.port 9874
+celeborn.ha.master.ratis.raft.server.storage.dir /mnt/disk1/rss_ratis/
+
+celeborn.metrics.enabled true
+# If you use HDFS as shuffle storage, set the flush buffer size to at least 4MB.
+celeborn.worker.flush.buffer.size 256k
+celeborn.worker.storage.dirs /mnt/disk1/,/mnt/disk2
+# If your hosts use disk RAID or LVM, set celeborn.worker.monitor.disk.enabled to false
+celeborn.worker.monitor.disk.enabled false
+```
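+
+The HA example above sets `celeborn.ha.master.node.id` to 1, so it describes the first master. As a minimal sketch, assuming the same file is copied to `clb-2` and nothing else needs to change, the second master's configuration would differ only in its node id:
+```properties
+# on clb-2 (assumption: every other setting is identical to the HA example above)
+celeborn.ha.master.node.id 2
+```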
+
+4. Copy Celeborn and configurations to all nodes
+5. Start all services. If the Celeborn distribution is installed in the same path on every node and
+   SSH login works between the nodes, fill in `$CELEBORN_HOME/conf/hosts` and
+   run `$CELEBORN_HOME/sbin/start-all.sh` to start all
+   services. If the installation paths are not identical, you will need to start each service manually.  
+   Start Celeborn master  
+   `$CELEBORN_HOME/sbin/start-master.sh`  
+   Start Celeborn worker  
+   `$CELEBORN_HOME/sbin/start-worker.sh`
+6. If Celeborn starts successfully, the Master's log should look like this:
+```text
+22/10/08 19:29:11,805 INFO [main] Dispatcher: Dispatcher numThreads: 64
+22/10/08 19:29:11,875 INFO [main] TransportClientFactory: mode NIO threads 64
+22/10/08 19:29:12,057 INFO [main] Utils: Successfully started service 'MasterSys' on port 9097.
+22/10/08 19:29:12,113 INFO [main] Master: Metrics system enabled.
+22/10/08 19:29:12,125 INFO [main] HttpServer: master: HttpServer started on port 9098.
+22/10/08 19:29:12,126 INFO [main] Master: Master started.
+22/10/08 19:29:57,842 INFO [dispatcher-event-loop-19] Master: Registered worker
+Host: 192.168.15.140
+RpcPort: 37359
+PushPort: 38303
+FetchPort: 37569
+ReplicatePort: 37093
+SlotsUsed: 0()
+LastHeartbeat: 0
+Disks: {/mnt/disk1=DiskInfo(maxSlots: 6679, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk1, usableSpace: 448284381184, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk3=DiskInfo(maxSlots: 6716, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/disk3, usableSpace: 450755608576, avgFlushTime: 0, activeSlots: 0) status: HEALTHY dirs , /mnt/disk2=DiskInfo(maxSlots: 6713, committed shuffles 0 shuffleAllocations: Map(), mountPoint: /mnt/d [...]
+WorkerRef: null
+```
+
+## Deploy Spark client
+Copy `$CELEBORN_HOME/spark/*.jar` to `$SPARK_HOME/jars/`.
+
+## Spark Configuration
+To use Celeborn, add the following Spark configurations.
+```properties
+spark.shuffle.manager org.apache.spark.shuffle.celeborn.RssShuffleManager
+# must use the Kryo serializer because the Java serializer does not support relocation
+spark.serializer org.apache.spark.serializer.KryoSerializer
+
+# celeborn master
+spark.celeborn.master.endpoints clb-1:9097,clb-2:9098,clb-3:9099
+spark.shuffle.service.enabled false
+
+# options: hash, sort
+# The hash shuffle writer uses (partition count) * (celeborn.push.buffer.size) * (spark.executor.cores) memory.
+# The sort shuffle writer uses less memory than the hash shuffle writer; if your shuffle partition count is large, try the sort shuffle writer.
+spark.celeborn.shuffle.writer hash
+
+# We recommend setting spark.celeborn.push.replicate.enabled to true to enable server-side data replication.
+# If you have only one worker, this setting must be false.
+spark.celeborn.push.replicate.enabled true
+
+# Support for Spark AQE has only been tested with Spark 3.
+# We recommend setting localShuffleReader to false for better performance with Celeborn.
+spark.sql.adaptive.localShuffleReader.enabled false
+
+# We recommend enabling AQE support for better performance.
+spark.sql.adaptive.enabled true
+spark.sql.adaptive.skewJoin.enabled true
+```
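+
+To make the hash writer memory formula above concrete, here is a rough worked estimate. The partition count, buffer size, and core count are illustrative assumptions, not recommended or default values:
+```properties
+# Assumed values (hypothetical): partition count = 2000,
+# celeborn.push.buffer.size = 64k, spark.executor.cores = 4
+# hash writer memory ~ 2000 * 64 KiB * 4 = 512000 KiB, about 500 MiB per executor
+# With a partition count this large, the sort writer may be the better fit:
+spark.celeborn.shuffle.writer sort
+```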
diff --git a/mkdocs.yml b/mkdocs.yml
index 83dd163..89a4911 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -77,6 +77,7 @@ nav:
   - Home: index.md
   - Download: download.md
   - User Guide:
+      - Deploy: user_guide/deploy.md
       - Upgrade: user_guide/upgrade.md
       - Ratis Shell: user_guide/celeborn_ratis_shell.md
   - Configuration: