Posted to commits@arrow.apache.org by ag...@apache.org on 2021/08/31 03:20:52 UTC

[arrow-datafusion] branch master updated: Improve User Guide (#954)

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git


The following commit(s) were added to refs/heads/master by this push:
     new e52abb3  Improve User Guide (#954)
e52abb3 is described below

commit e52abb310925bc3bfd84de84f410b43af6414413
Author: Andy Grove <ag...@apache.org>
AuthorDate: Mon Aug 30 21:20:47 2021 -0600

    Improve User Guide (#954)
---
 docs/user-guide/src/SUMMARY.md                     |   9 +--
 docs/user-guide/src/cli.md                         |  74 +++++++++++++++++++
 .../{library.md => distributed/cargo-install.md}   |  33 +++++++--
 docs/user-guide/src/distributed/client-rust.md     |  80 ++++++++++++++++++++-
 docs/user-guide/src/distributed/deployment.md      |   6 +-
 docs/user-guide/src/distributed/docker-compose.md  |  42 +++++++++--
 .../src/distributed/{standalone.md => docker.md}   |  51 ++++++++-----
 docs/user-guide/src/distributed/introduction.md    |  20 +++---
 docs/user-guide/src/distributed/kubernetes.md      |  62 +++++++++++++---
 docs/user-guide/src/faq.md                         |   3 +-
 docs/user-guide/src/img/ballista-architecture.png  | Bin 21225 -> 0 bytes
 docs/user-guide/src/library.md                     |   2 +-
 docs/user-guide/src/sql/ddl.md                     |  36 ++++++++++
 13 files changed, 358 insertions(+), 60 deletions(-)

diff --git a/docs/user-guide/src/SUMMARY.md b/docs/user-guide/src/SUMMARY.md
index 516fcce..3621031 100644
--- a/docs/user-guide/src/SUMMARY.md
+++ b/docs/user-guide/src/SUMMARY.md
@@ -22,16 +22,17 @@
 - [Introduction](introduction.md)
 - [Example Usage](example-usage.md)
 - [Use as a Library](library.md)
+- [DataFusion CLI](cli.md)
 - [SQL Reference](sql/introduction.md)
 
   - [SELECT](sql/select.md)
   - [DDL](sql/ddl.md)
-    - [CREATE EXTERNAL TABLE](sql/ddl.md)
   - [Datafusion Specific Functions](sql/datafusion-functions.md)
 
-- [Distributed](distributed/introduction.md)
-  - [Create a Ballista Cluster](distributed/deployment.md)
-    - [Docker](distributed/standalone.md)
+- [Ballista Distributed Compute](distributed/introduction.md)
+  - [Start a Ballista Cluster](distributed/deployment.md)
+    - [Cargo Install](distributed/cargo-install.md)
+    - [Docker](distributed/docker.md)
     - [Docker Compose](distributed/docker-compose.md)
     - [Kubernetes](distributed/kubernetes.md)
     - [Raspberry Pi](distributed/raspberrypi.md)
diff --git a/docs/user-guide/src/cli.md b/docs/user-guide/src/cli.md
new file mode 100644
index 0000000..28716b6
--- /dev/null
+++ b/docs/user-guide/src/cli.md
@@ -0,0 +1,74 @@
+<!---
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements.  See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership.  The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License.  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied.  See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# DataFusion Command-line Interface
+
+The DataFusion CLI allows SQL queries to be executed by an in-process DataFusion context, or by a distributed
+Ballista context.
+
+```ignore
+USAGE:
+    datafusion-cli [FLAGS] [OPTIONS]
+
+FLAGS:
+    -h, --help       Prints help information
+    -q, --quiet      Reduce printing other than the results and work quietly
+    -V, --version    Prints version information
+
+OPTIONS:
+    -c, --batch-size <batch-size>    The batch size of each query, or use DataFusion default
+    -p, --data-path <data-path>      Path to your data, default to current directory
+    -f, --file <file>...             Execute commands from file(s), then exit
+        --format <format>            Output format [default: table]  [possible values: csv, tsv, table, json, ndjson]
+        --host <host>                Ballista scheduler host
+        --port <port>                Ballista scheduler port
+```
+
+## Example
+
+Create a CSV file to query.
+
+```bash,ignore
+$ echo "1,2" > data.csv
+```
+
+```sql,ignore
+$ datafusion-cli
+
+DataFusion CLI v5.1.0-SNAPSHOT
+
+> CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
+0 rows in set. Query took 0.001 seconds.
+
+> SELECT * FROM foo;
++---+---+
+| a | b |
++---+---+
+| 1 | 2 |
++---+---+
+1 row in set. Query took 0.017 seconds.
+```
+
+## Ballista
+
+The DataFusion CLI can also connect to a Ballista scheduler for query execution.
+
+```bash,ignore
+datafusion-cli --host localhost --port 50050
+```
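As an illustration of the flags documented in cli.md above (this sketch is not part of the commit), the following Rust snippet assembles the argument list for a `datafusion-cli` run against a Ballista scheduler. The helper name `ballista_cli_args` and the file name `query.sql` are hypothetical. Only `--host`, `--port`, `--format` and `--file` from the usage text are used.

```rust
use std::process::Command;

/// Build the argument list for a `datafusion-cli` run that targets a
/// Ballista scheduler. `ballista_cli_args` is a hypothetical helper; the
/// flags come from the usage text in cli.md.
fn ballista_cli_args(host: &str, port: u16, format: &str, file: Option<&str>) -> Vec<String> {
    let mut args = vec![
        "--host".to_string(),
        host.to_string(),
        "--port".to_string(),
        port.to_string(),
        "--format".to_string(),
        format.to_string(),
    ];
    // Optionally execute the commands in a file and then exit.
    if let Some(path) = file {
        args.push("--file".to_string());
        args.push(path.to_string());
    }
    args
}

fn main() {
    // 50050 is the default scheduler port mentioned in the guide;
    // `query.sql` is a placeholder file name.
    let args = ballista_cli_args("localhost", 50050, "csv", Some("query.sql"));

    // Prepare (but do not spawn) the process, then print the command line.
    let mut cmd = Command::new("datafusion-cli");
    cmd.args(&args);
    println!("datafusion-cli {}", args.join(" "));
}
```

Calling `cmd.status()` would actually launch the CLI, which assumes `datafusion-cli` is on the `PATH` and a scheduler is reachable at the given host and port.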
diff --git a/docs/user-guide/src/library.md b/docs/user-guide/src/distributed/cargo-install.md
similarity index 50%
copy from docs/user-guide/src/library.md
copy to docs/user-guide/src/distributed/cargo-install.md
index d35a4b7..504154d 100644
--- a/docs/user-guide/src/library.md
+++ b/docs/user-guide/src/distributed/cargo-install.md
@@ -17,13 +17,34 @@
   under the License.
 -->
 
-# Using DataFusion as a library
+## Deploying a standalone Ballista cluster using cargo install
 
-DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+A simple way to start a local cluster for testing purposes is to use cargo to install
+the scheduler and executor crates.
 
-To get started, add the following to your `Cargo.toml` file:
+```bash
+cargo install ballista-scheduler
+cargo install ballista-executor
+```
+
+With these crates installed, it is now possible to start a scheduler process.
+
+```bash
+RUST_LOG=info ballista-scheduler
+```
+
+The scheduler will bind to port 50050 by default.
+
+Next, start an executor process in a new terminal session with the specified concurrency
+level.
+
+```bash
+RUST_LOG=info ballista-executor -c 4
+```
+
+The executor will bind to port 50051 by default. Additional executors can be started by
+manually specifying a bind port. For example:
 
-```toml
-[dependencies]
-datafusion = "4.0.0-SNAPSHOT"
+```bash
+RUST_LOG=info ballista-executor --bind-port 50052 -c 4
 ```
diff --git a/docs/user-guide/src/distributed/client-rust.md b/docs/user-guide/src/distributed/client-rust.md
index 7f7ffcb..4e6ecf5 100644
--- a/docs/user-guide/src/distributed/client-rust.md
+++ b/docs/user-guide/src/distributed/client-rust.md
@@ -19,5 +19,81 @@
 
 ## Ballista Rust Client
 
-The Rust client supports a `DataFrame` API as well as SQL. See the
-[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
+Ballista usage is very similar to DataFusion. The main difference is that the starting point is a `BallistaContext`
+instead of the DataFusion `ExecutionContext`. Ballista uses the same DataFrame API as DataFusion.
+
+The following code sample demonstrates how to create a `BallistaContext` to connect to a Ballista scheduler process.
+
+```rust
+let config = BallistaConfig::builder()
+    .set("ballista.shuffle.partitions", "4")
+    .build()?;
+
+// connect to Ballista scheduler
+let ctx = BallistaContext::remote("localhost", 50050, &config);
+```
+
+Here is a full example using the DataFrame API.
+
+```rust
+#[tokio::main]
+async fn main() -> Result<()> {
+    let config = BallistaConfig::builder()
+        .set("ballista.shuffle.partitions", "4")
+        .build()?;
+
+    // connect to Ballista scheduler
+    let ctx = BallistaContext::remote("localhost", 50050, &config);
+
+    let testdata = datafusion::arrow::util::test_util::parquet_test_data();
+
+    let filename = &format!("{}/alltypes_plain.parquet", testdata);
+
+    // define the query using the DataFrame trait
+    let df = ctx
+        .read_parquet(filename)?
+        .select_columns(&["id", "bool_col", "timestamp_col"])?
+        .filter(col("id").gt(lit(1)))?;
+
+    // print the results
+    df.show().await?;
+
+    Ok(())
+}
+```
+
+Here is a full example demonstrating SQL usage.
+
+```rust
+#[tokio::main]
+async fn main() -> Result<()> {
+    let config = BallistaConfig::builder()
+        .set("ballista.shuffle.partitions", "4")
+        .build()?;
+
+    // connect to Ballista scheduler
+    let ctx = BallistaContext::remote("localhost", 50050, &config);
+
+    let testdata = datafusion::arrow::util::test_util::arrow_test_data();
+
+    // register csv file with the execution context
+    ctx.register_csv(
+        "aggregate_test_100",
+        &format!("{}/csv/aggregate_test_100.csv", testdata),
+        CsvReadOptions::new(),
+    )?;
+
+    // execute the query
+    let df = ctx.sql(
+        "SELECT c1, MIN(c12), MAX(c12) \
+        FROM aggregate_test_100 \
+        WHERE c11 > 0.1 AND c11 < 0.9 \
+        GROUP BY c1",
+    )?;
+
+    // print the results
+    df.show().await?;
+
+    Ok(())
+}
+```
diff --git a/docs/user-guide/src/distributed/deployment.md b/docs/user-guide/src/distributed/deployment.md
index 3a00c96..fee020c 100644
--- a/docs/user-guide/src/distributed/deployment.md
+++ b/docs/user-guide/src/distributed/deployment.md
@@ -19,8 +19,10 @@
 
 # Deployment
 
-Ballista is packaged as Docker images. Refer to the following guides to create a Ballista cluster:
+There are multiple ways that a Ballista cluster can be deployed.
 
-- [Create a cluster using Docker](standalone.md)
+- [Create a cluster using Cargo install](cargo-install.md)
+- [Create a cluster using Docker](docker.md)
 - [Create a cluster using Docker Compose](docker-compose.md)
 - [Create a cluster using Kubernetes](kubernetes.md)
+- [Create a cluster on Raspberry Pi](raspberrypi.md)
diff --git a/docs/user-guide/src/distributed/docker-compose.md b/docs/user-guide/src/distributed/docker-compose.md
index fc24d89..1b010b5 100644
--- a/docs/user-guide/src/distributed/docker-compose.md
+++ b/docs/user-guide/src/distributed/docker-compose.md
@@ -17,11 +17,29 @@
   under the License.
 -->
 
-# Installing Ballista with Docker Compose
+# Starting a Ballista cluster using Docker Compose
 
-Docker Compose is a convenient way to launch a cluister when testing locally. The following Docker Compose example
-demonstrates how to start a cluster using a single process that acts as both a scheduler and an executor, with a data
-volume mounted into the container so that Ballista can access the host file system.
+Docker Compose is a convenient way to launch a cluster when testing locally.
+
+## Build Docker image
+
+There is no officially published Docker image so it is currently necessary to build the image from source instead.
+
+Run the following commands to clone the source repository and build the Docker image.
+
+```bash
+git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
+cd arrow-datafusion
+./dev/build-ballista-docker.sh
+```
+
+This will create an image with the tag `ballista:0.6.0`.
+
+## Start a cluster
+
+The following Docker Compose example demonstrates how to start a cluster using one scheduler process and one
+executor process, with the scheduler using etcd as a backing store. A data volume is mounted into each container
+so that Ballista can access the host file system.
 
 ```yaml
 version: "2.2"
@@ -60,4 +78,20 @@ node cluster.
 docker-compose up
 ```
 
+This should show output similar to the following:
+
+```bash
+$ docker-compose up
+Creating network "ballista-benchmarks_default" with the default driver
+Creating ballista-benchmarks_etcd_1 ... done
+Creating ballista-benchmarks_ballista-scheduler_1 ... done
+Creating ballista-benchmarks_ballista-executor_1  ... done
+Attaching to ballista-benchmarks_etcd_1, ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
+ballista-executor_1   | [2021-08-28T15:55:22Z INFO  ballista_executor] Running with config:
+ballista-executor_1   | [2021-08-28T15:55:22Z INFO  ballista_executor] work_dir: /tmp/.tmpLVx39c
+ballista-executor_1   | [2021-08-28T15:55:22Z INFO  ballista_executor] concurrent_tasks: 4
+ballista-scheduler_1  | [2021-08-28T15:55:22Z INFO  ballista_scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
+ballista-executor_1   | [2021-08-28T15:55:22Z INFO  ballista_executor] Ballista v0.6.0 Rust Executor listening on 0.0.0.0:50051
+```
+
 The scheduler listens on port 50050 and this is the port that clients will need to connect to.
diff --git a/docs/user-guide/src/distributed/standalone.md b/docs/user-guide/src/distributed/docker.md
similarity index 55%
rename from docs/user-guide/src/distributed/standalone.md
rename to docs/user-guide/src/distributed/docker.md
index 66b6bc8..4892ab8 100644
--- a/docs/user-guide/src/distributed/standalone.md
+++ b/docs/user-guide/src/distributed/docker.md
@@ -17,7 +17,21 @@
   under the License.
 -->
 
-## Deploying a standalone Ballista cluster
+## Starting a Ballista cluster using Docker
+
+### Build Docker image
+
+There is no officially published Docker image so it is currently necessary to build the image from source instead.
+
+Run the following commands to clone the source repository and build the Docker image.
+
+```bash
+git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
+cd arrow-datafusion
+./dev/build-ballista-docker.sh
+```
+
+This will create an image with the tag `ballista:0.6.0`.
 
 ### Start a Scheduler
 
@@ -25,7 +39,7 @@ Start a scheduler using the following syntax:
 
 ```bash
 docker run --network=host \
-  -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
+  -d ballista:0.6.0 \
   /scheduler --bind-port 50050
 ```
 
@@ -33,15 +47,15 @@ Run `docker ps` to check that the process is running:
 
 ```
 $ docker ps
-CONTAINER ID   IMAGE                                         COMMAND                  CREATED         STATUS         PORTS     NAMES
-59452ce72138   ballistacompute/ballista-rust:0.4.2-SNAPSHOT   "/scheduler --bind-p…"   6 seconds ago   Up 5 seconds             affectionate_hofstadter
+CONTAINER ID   IMAGE            COMMAND                  CREATED         STATUS        PORTS     NAMES
+1f3f8b5ed93a   ballista:0.6.0   "/scheduler --bind-p…"   2 seconds ago   Up 1 second             tender_archimedes
 ```
 
 Run `docker logs CONTAINER_ID` to check the output from the process:
 
 ```
-$ docker logs 59452ce72138
-[2021-02-14T18:32:20Z INFO  scheduler] Ballista v0.4.2-SNAPSHOT Scheduler listening on 0.0.0.0:50050
+$ docker logs 1f3f8b5ed93a
+[2021-08-28T15:45:11Z INFO  ballista_scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
 ```
 
 ### Start executors
@@ -50,7 +64,7 @@ Start one or more executor processes. Each executor process will need to listen
 
 ```bash
 docker run --network=host \
-  -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
+  -d ballista:0.6.0 \
   /executor --external-host localhost --bind-port 50051
 ```
 
@@ -58,32 +72,31 @@ Use `docker ps` to check that both the scheduer and executor(s) are now running:
 
 ```
 $ docker ps
-CONTAINER ID   IMAGE                                         COMMAND                  CREATED         STATUS         PORTS     NAMES
-0746ce262a19   ballistacompute/ballista-rust:0.4.2-SNAPSHOT   "/executor --externa…"   2 seconds ago   Up 1 second              naughty_mclean
-59452ce72138   ballistacompute/ballista-rust:0.4.2-SNAPSHOT   "/scheduler --bind-p…"   4 minutes ago   Up 4 minutes             affectionate_hofstadter
+CONTAINER ID   IMAGE            COMMAND                  CREATED          STATUS          PORTS     NAMES
+7c6941bb8dc0   ballista:0.6.0   "/executor --externa…"   3 seconds ago    Up 2 seconds              tender_goldberg
+1f3f8b5ed93a   ballista:0.6.0   "/scheduler --bind-p…"   50 seconds ago   Up 49 seconds             tender_archimedes
 ```
 
 Use `docker logs CONTAINER_ID` to check the output from the executor(s):
 
 ```
-$ docker logs 0746ce262a19
-[2021-02-14T18:36:25Z INFO  executor] Running with config: ExecutorConfig { host: "localhost", bind_port: 50051, work_dir: "/tmp/.tmpVRFSvn", concurrent_tasks: 4 }
-[2021-02-14T18:36:25Z INFO  executor] Ballista v0.4.2-SNAPSHOT Rust Executor listening on 0.0.0.0:50051
-[2021-02-14T18:36:25Z INFO  executor] Starting registration with scheduler
+$ docker logs 7c6941bb8dc0
+[2021-08-28T15:45:58Z INFO  ballista_executor] Running with config:
+[2021-08-28T15:45:58Z INFO  ballista_executor] work_dir: /tmp/.tmpeyEM76
+[2021-08-28T15:45:58Z INFO  ballista_executor] concurrent_tasks: 4
+[2021-08-28T15:45:58Z INFO  ballista_executor] Ballista v0.6.0 Rust Executor listening on 0.0.0.0:50051
 ```
 
-The external host and port will be registered with the scheduler. The executors will discover other executors by
-requesting a list of executors from the scheduler.
-
 ### Using etcd as backing store
 
 _NOTE: This functionality is currently experimental_
 
-Ballista can optionally use [etcd](https://etcd.io/) as a backing store for the scheduler.
+Ballista can optionally use [etcd](https://etcd.io/) as a backing store for the scheduler. Use the following commands
+to launch the scheduler with this option enabled.
 
 ```bash
 docker run --network=host \
-  -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
+  -d ballista:0.6.0 \
   /scheduler --bind-port 50050 \
   --config-backend etcd \
   --etcd-urls etcd:2379
diff --git a/docs/user-guide/src/distributed/introduction.md b/docs/user-guide/src/distributed/introduction.md
index 0d96c26..aebf700 100644
--- a/docs/user-guide/src/distributed/introduction.md
+++ b/docs/user-guide/src/distributed/introduction.md
@@ -28,25 +28,23 @@ The foundational technologies in Ballista are:
 - [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
 - [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
 - [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
-
-## Architecture
-
-The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.
-
-![Ballista Architecture Diagram](img/ballista-architecture.png)
+- [DataFusion](https://github.com/apache/arrow-datafusion/) for query execution.
 
 ## How does this compare to Apache Spark?
 
 Although Ballista is largely inspired by Apache Spark, there are some key differences.
 
-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
+- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead
+  of GC pauses.
 - Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
   processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
   largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
+  Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
+  distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
+  in any programming language with minimal serialization overhead.
 
 ## Status
 
-Ballista is at the proof-of-concept phase currently but is under active development by a growing community.
+Ballista is still in the early stages of development but is capable of executing complex analytical queries at scale.
diff --git a/docs/user-guide/src/distributed/kubernetes.md b/docs/user-guide/src/distributed/kubernetes.md
index ef4acca..9a8e29b 100644
--- a/docs/user-guide/src/distributed/kubernetes.md
+++ b/docs/user-guide/src/distributed/kubernetes.md
@@ -20,9 +20,9 @@
 # Deploying Ballista with Kubernetes
 
 Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that
-you are already comfortable with managing Kubernetes deployments.
+you are already comfortable managing Kubernetes deployments.
 
-The k8s deployment consists of:
+The Ballista deployment consists of:
 
 - k8s deployment for one or more scheduler processes
 - k8s deployment for one or more executor processes
@@ -30,6 +30,15 @@ The k8s deployment consists of:
 - k8s persistent volume and persistent volume claims to make local data accessible to Ballista
 - _(optional)_ a [keda](http://keda.sh) instance for autoscaling the number of executors
 
+## Testing locally
+
+[MicroK8s](https://microk8s.io/) is recommended for installing a local k8s cluster. Once MicroK8s is installed, DNS
+must be enabled using the following command.
+
+```bash
+microk8s enable dns
+```
+
 ## Limitations
 
 Ballista is at an early stage of development and therefore has some significant limitations:
@@ -39,13 +48,28 @@ Ballista is at an early stage of development and therefore has some significant
 - Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a
   backing store.
 
+## Build Docker image
+
+There is no officially published Docker image so it is currently necessary to build the image from source instead.
+
+Run the following commands to clone the source repository and build the Docker image.
+
+```bash
+git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
+cd arrow-datafusion
+./dev/build-ballista-docker.sh
+```
+
+This will create an image with the tag `ballista:0.6.0`.
+
 ## Publishing your images
 
-Currently there are no official Ballista images that work with the instructions in this guide. For the time being,
-you will need to build and publish your own images. You can do that by invoking the `dev/build-ballista-docker.sh`.
+Once the images have been built, you can retag them and push them to your favourite Docker registry.
 
-Once the images have been built, you can retag them with `docker tag ballista:0.5.0-SNAPSHOT <new-image-name>` so you
-can push them to your favourite docker registry.
+```bash
+docker tag ballista:0.6.0 <your-repo>/ballista:0.6.0
+docker push <your-repo>/ballista:0.6.0
+```
 
 ## Create Persistent Volume and Persistent Volume Claim
 
@@ -130,7 +154,7 @@ spec:
     spec:
       containers:
         - name: ballista-scheduler
-          image: <your-image>
+          image: <your-repo>/ballista:0.6.0
           command: ["/scheduler"]
           args: ["--bind-port=50050"]
           ports:
@@ -161,7 +185,7 @@ spec:
     spec:
       containers:
         - name: ballista-executor
-          image: <your-image>
+          image: <your-repo>/ballista:0.6.0
           command: ["/executor"]
           args:
             - "--bind-port=50051"
@@ -205,11 +229,31 @@ You can view the scheduler logs with `kubectl logs ballista-scheduler-0`:
 
 ```
 $ kubectl logs ballista-scheduler-0
-[2021-02-19T00:24:01Z INFO  scheduler] Ballista v0.4.2-SNAPSHOT Scheduler listening on 0.0.0.0:50050
+[2021-02-19T00:24:01Z INFO  scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
 [2021-02-19T00:24:16Z INFO  ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
 [2021-02-19T00:24:17Z INFO  ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
 ```
 
+## Port Forwarding
+
+If you want to run applications outside of the cluster and have them connect to the scheduler then it is necessary to
+set up port forwarding.
+
+First, check that the `ballista-scheduler` service is running.
+
+```bash
+$ kubectl get services
+NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
+kubernetes           ClusterIP   10.152.183.1    <none>        443/TCP     26h
+ballista-scheduler   ClusterIP   10.152.183.21   <none>        50050/TCP   24m
+```
+
+Use the following command to set up port forwarding.
+
+```bash
+kubectl port-forward service/ballista-scheduler 50050:50050
+```
+
 ## Deleting the Ballista cluster
 
 Run the following kubectl command to delete the cluster.
diff --git a/docs/user-guide/src/faq.md b/docs/user-guide/src/faq.md
index 5e1a72c..f96306c 100644
--- a/docs/user-guide/src/faq.md
+++ b/docs/user-guide/src/faq.md
@@ -28,5 +28,4 @@ DataFusion is a library for executing queries in-process using the Apache Arrow
 model and computational kernels. It is designed to run within a single process, using threads
 for parallel query execution.
 
-Ballista is a distributed compute platform design to leverage DataFusion and other query
-execution libraries.
+Ballista is a distributed compute platform built on DataFusion.
diff --git a/docs/user-guide/src/img/ballista-architecture.png b/docs/user-guide/src/img/ballista-architecture.png
deleted file mode 100644
index 2f78f29..0000000
Binary files a/docs/user-guide/src/img/ballista-architecture.png and /dev/null differ
diff --git a/docs/user-guide/src/library.md b/docs/user-guide/src/library.md
index d35a4b7..d255d78 100644
--- a/docs/user-guide/src/library.md
+++ b/docs/user-guide/src/library.md
@@ -25,5 +25,5 @@ To get started, add the following to your `Cargo.toml` file:
 
 ```toml
 [dependencies]
-datafusion = "4.0.0-SNAPSHOT"
+datafusion = "5.1.0"
 ```
diff --git a/docs/user-guide/src/sql/ddl.md b/docs/user-guide/src/sql/ddl.md
index cb16657..b48179d 100644
--- a/docs/user-guide/src/sql/ddl.md
+++ b/docs/user-guide/src/sql/ddl.md
@@ -18,3 +18,39 @@
 -->
 
 # DDL
+
+## CREATE EXTERNAL TABLE
+
+Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary
+to provide schema information for Parquet files.
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/tripdata.parquet';
+```
+
+CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is necessary to
+provide schema information for CSV files since DataFusion does not automatically infer the schema when using SQL
+to query CSV files.
+
+```sql
+CREATE EXTERNAL TABLE test (
+    c1  VARCHAR NOT NULL,
+    c2  INT NOT NULL,
+    c3  SMALLINT NOT NULL,
+    c4  SMALLINT NOT NULL,
+    c5  INT NOT NULL,
+    c6  BIGINT NOT NULL,
+    c7  SMALLINT NOT NULL,
+    c8  INT NOT NULL,
+    c9  BIGINT NOT NULL,
+    c10 VARCHAR NOT NULL,
+    c11 FLOAT NOT NULL,
+    c12 DOUBLE NOT NULL,
+    c13 VARCHAR NOT NULL
+)
+STORED AS CSV
+WITH HEADER ROW
+LOCATION '/path/to/aggregate_test_100.csv';
+```
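Because ddl.md above requires an explicit schema for CSV files, applications that register many CSV tables may want to generate the statement rather than write it by hand. The sketch below is not part of the commit. The `Column` struct and the `create_external_csv_table` function are hypothetical names, and the output follows the statement layout shown in ddl.md. It does no quoting or escaping of identifiers or paths, so it is only suitable for trusted input.

```rust
/// A column definition: name, SQL type, and whether NULLs are allowed.
/// Hypothetical type used only for this illustration.
struct Column<'a> {
    name: &'a str,
    sql_type: &'a str,
    nullable: bool,
}

/// Render a `CREATE EXTERNAL TABLE ... STORED AS CSV` statement.
/// No escaping is performed on the table name, column names, or location.
fn create_external_csv_table(
    table: &str,
    columns: &[Column<'_>],
    has_header: bool,
    location: &str,
) -> String {
    // One indented line per column, e.g. "    c1 VARCHAR NOT NULL".
    let cols: Vec<String> = columns
        .iter()
        .map(|c| {
            let null = if c.nullable { "" } else { " NOT NULL" };
            format!("    {} {}{}", c.name, c.sql_type, null)
        })
        .collect();
    let header = if has_header { "WITH HEADER ROW\n" } else { "" };
    format!(
        "CREATE EXTERNAL TABLE {} (\n{}\n)\nSTORED AS CSV\n{}LOCATION '{}';",
        table,
        cols.join(",\n"),
        header,
        location
    )
}

fn main() {
    // The first two columns of the example table from ddl.md.
    let columns = [
        Column { name: "c1", sql_type: "VARCHAR", nullable: false },
        Column { name: "c2", sql_type: "INT", nullable: false },
    ];
    let sql = create_external_csv_table(
        "test",
        &columns,
        true,
        "/path/to/aggregate_test_100.csv",
    );
    println!("{}", sql);
}
```

The resulting string could then be passed to a context's `sql` method or pasted into the CLI.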