Posted to commits@arrow.apache.org by ag...@apache.org on 2021/08/31 03:20:52 UTC
[arrow-datafusion] branch master updated: Improve User Guide (#954)
This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/master by this push:
new e52abb3 Improve User Guide (#954)
e52abb3 is described below
commit e52abb310925bc3bfd84de84f410b43af6414413
Author: Andy Grove <ag...@apache.org>
AuthorDate: Mon Aug 30 21:20:47 2021 -0600
Improve User Guide (#954)
---
docs/user-guide/src/SUMMARY.md | 9 +--
docs/user-guide/src/cli.md | 74 +++++++++++++++++++
.../{library.md => distributed/cargo-install.md} | 33 +++++++--
docs/user-guide/src/distributed/client-rust.md | 80 ++++++++++++++++++++-
docs/user-guide/src/distributed/deployment.md | 6 +-
docs/user-guide/src/distributed/docker-compose.md | 42 +++++++++--
.../src/distributed/{standalone.md => docker.md} | 51 ++++++++-----
docs/user-guide/src/distributed/introduction.md | 20 +++---
docs/user-guide/src/distributed/kubernetes.md | 62 +++++++++++++---
docs/user-guide/src/faq.md | 3 +-
docs/user-guide/src/img/ballista-architecture.png | Bin 21225 -> 0 bytes
docs/user-guide/src/library.md | 2 +-
docs/user-guide/src/sql/ddl.md | 36 ++++++++++
13 files changed, 358 insertions(+), 60 deletions(-)
diff --git a/docs/user-guide/src/SUMMARY.md b/docs/user-guide/src/SUMMARY.md
index 516fcce..3621031 100644
--- a/docs/user-guide/src/SUMMARY.md
+++ b/docs/user-guide/src/SUMMARY.md
@@ -22,16 +22,17 @@
- [Introduction](introduction.md)
- [Example Usage](example-usage.md)
- [Use as a Library](library.md)
+- [DataFusion CLI](cli.md)
- [SQL Reference](sql/introduction.md)
- [SELECT](sql/select.md)
- [DDL](sql/ddl.md)
- - [CREATE EXTERNAL TABLE](sql/ddl.md)
- [Datafusion Specific Functions](sql/datafusion-functions.md)
-- [Distributed](distributed/introduction.md)
- - [Create a Ballista Cluster](distributed/deployment.md)
- - [Docker](distributed/standalone.md)
+- [Ballista Distributed Compute](distributed/introduction.md)
+ - [Start a Ballista Cluster](distributed/deployment.md)
+ - [Cargo Install](distributed/cargo-install.md)
+ - [Docker](distributed/docker.md)
- [Docker Compose](distributed/docker-compose.md)
- [Kubernetes](distributed/kubernetes.md)
- [Raspberry Pi](distributed/raspberrypi.md)
diff --git a/docs/user-guide/src/cli.md b/docs/user-guide/src/cli.md
new file mode 100644
index 0000000..28716b6
--- /dev/null
+++ b/docs/user-guide/src/cli.md
@@ -0,0 +1,74 @@
+<!---
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+-->
+
+# DataFusion Command-line Interface
+
+The DataFusion CLI allows SQL queries to be executed by an in-process DataFusion context, or by a distributed
+Ballista context.
+
+```ignore
+USAGE:
+ datafusion-cli [FLAGS] [OPTIONS]
+
+FLAGS:
+ -h, --help Prints help information
+ -q, --quiet Reduce printing other than the results and work quietly
+ -V, --version Prints version information
+
+OPTIONS:
+ -c, --batch-size <batch-size> The batch size of each query, or use DataFusion default
+ -p, --data-path <data-path> Path to your data, default to current directory
+ -f, --file <file>... Execute commands from file(s), then exit
+ --format <format> Output format [default: table] [possible values: csv, tsv, table, json, ndjson]
+ --host <host> Ballista scheduler host
+ --port <port> Ballista scheduler port
+```
+
+## Example
+
+Create a CSV file to query.
+
+```bash,ignore
+$ echo "1,2" > data.csv
+```
+
+```sql,ignore
+$ datafusion-cli
+
+DataFusion CLI v5.1.0-SNAPSHOT
+
+> CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';
+0 rows in set. Query took 0.001 seconds.
+
+> SELECT * FROM foo;
++---+---+
+| a | b |
++---+---+
+| 1 | 2 |
++---+---+
+1 row in set. Query took 0.017 seconds.
+```
+
+## Ballista
+
+The DataFusion CLI can also connect to a Ballista scheduler for query execution.
+
+```bash,ignore
+datafusion-cli --host localhost --port 50050
+```
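
The `-f`, `--format` and `-q` options listed in the usage text above suggest that the CLI can also be driven non-interactively. The following is a minimal sketch of that idea, not part of the commit: it assumes `datafusion-cli` is on the `PATH` and that `data.csv` from the example exists in the working directory. The helper `cli_args` and the script file name are hypothetical.

```rust
use std::env;
use std::fs;
use std::io;
use std::process::Command;

/// Hypothetical helper: builds the arguments for a non-interactive run,
/// using only flags shown in the usage text (`--file`, `--format`, `--quiet`).
fn cli_args(sql_file: &str, format: &str) -> Vec<String> {
    vec![
        "--file".to_string(),
        sql_file.to_string(),
        "--format".to_string(),
        format.to_string(),
        "--quiet".to_string(),
    ]
}

fn main() -> io::Result<()> {
    // Write the statements from the example session into a script file.
    let script = env::temp_dir().join("datafusion_cli_example.sql");
    fs::write(
        &script,
        "CREATE EXTERNAL TABLE foo (a INT, b INT) STORED AS CSV LOCATION 'data.csv';\n\
         SELECT * FROM foo;\n",
    )?;

    // Assumption: the binary is installed and named `datafusion-cli`.
    let args = cli_args(&script.to_string_lossy(), "csv");
    match Command::new("datafusion-cli").args(&args).status() {
        Ok(status) => println!("datafusion-cli exited with {}", status),
        Err(err) => eprintln!("could not start datafusion-cli: {}", err),
    }
    Ok(())
}
```

Keeping the argument list in its own function makes it easy to add `--host` and `--port` when the same script should run against a Ballista scheduler.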
diff --git a/docs/user-guide/src/library.md b/docs/user-guide/src/distributed/cargo-install.md
similarity index 50%
copy from docs/user-guide/src/library.md
copy to docs/user-guide/src/distributed/cargo-install.md
index d35a4b7..504154d 100644
--- a/docs/user-guide/src/library.md
+++ b/docs/user-guide/src/distributed/cargo-install.md
@@ -17,13 +17,34 @@
under the License.
-->
-# Using DataFusion as a library
+## Deploying a standalone Ballista cluster using cargo install
-DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
+A simple way to start a local cluster for testing purposes is to use cargo to install
+the scheduler and executor crates.
-To get started, add the following to your `Cargo.toml` file:
+```bash
+cargo install ballista-scheduler
+cargo install ballista-executor
+```
+
+With these crates installed, it is now possible to start a scheduler process.
+
+```bash
+RUST_LOG=info ballista-scheduler
+```
+
+The scheduler will bind to port 50050 by default.
+
+Next, start an executor process in a new terminal session with the specified concurrency
+level.
+
+```bash
+RUST_LOG=info ballista-executor -c 4
+```
+
+The executor will bind to port 50051 by default. Additional executors can be started by
+manually specifying a bind port. For example:
-```toml
-[dependencies]
-datafusion = "4.0.0-SNAPSHOT"
+```bash
+RUST_LOG=info ballista-executor --bind-port 50052 -c 4
```
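
After starting the processes described above, it can be useful to confirm that they are accepting connections. The following is a minimal sketch using only the Rust standard library, not part of the commit: it assumes everything runs on the local machine with the default ports named in the text (50050 for the scheduler, 50051 and 50052 for the executors). The helper `port_is_open` is hypothetical and not part of Ballista.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Hypothetical helper: returns true if a TCP connection to `addr`
/// succeeds within `timeout`. It only checks that something is listening,
/// not that the listener is actually a Ballista process.
fn port_is_open(addr: &SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(addr, timeout).is_ok()
}

fn main() {
    // Ports taken from the text above; adjust if other bind ports were used.
    let targets = [
        ("scheduler", 50050u16),
        ("executor 1", 50051),
        ("executor 2", 50052),
    ];
    for (name, port) in targets {
        let addr = SocketAddr::from(([127, 0, 0, 1], port));
        let state = if port_is_open(&addr, Duration::from_millis(500)) {
            "listening"
        } else {
            "not reachable"
        };
        println!("{} on {}: {}", name, addr, state);
    }
}
```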
diff --git a/docs/user-guide/src/distributed/client-rust.md b/docs/user-guide/src/distributed/client-rust.md
index 7f7ffcb..4e6ecf5 100644
--- a/docs/user-guide/src/distributed/client-rust.md
+++ b/docs/user-guide/src/distributed/client-rust.md
@@ -19,5 +19,81 @@
## Ballista Rust Client
-The Rust client supports a `DataFrame` API as well as SQL. See the
-[TPC-H Benchmark Client](https://github.com/ballista-compute/ballista/tree/main/rust/benchmarks/tpch) for an example.
+Ballista usage is very similar to DataFusion. The main difference is that the starting point is a `BallistaContext`
+instead of the DataFusion `ExecutionContext`. Ballista uses the same DataFrame API as DataFusion.
+
+The following code sample demonstrates how to create a `BallistaContext` to connect to a Ballista scheduler process.
+
+```rust
+let config = BallistaConfig::builder()
+ .set("ballista.shuffle.partitions", "4")
+ .build()?;
+
+// connect to Ballista scheduler
+let ctx = BallistaContext::remote("localhost", 50050, &config);
+```
+
+Here is a full example using the DataFrame API.
+
+```rust
+#[tokio::main]
+async fn main() -> Result<()> {
+ let config = BallistaConfig::builder()
+ .set("ballista.shuffle.partitions", "4")
+ .build()?;
+
+ // connect to Ballista scheduler
+ let ctx = BallistaContext::remote("localhost", 50050, &config);
+
+ let testdata = datafusion::arrow::util::test_util::parquet_test_data();
+
+ let filename = &format!("{}/alltypes_plain.parquet", testdata);
+
+ // define the query using the DataFrame trait
+ let df = ctx
+ .read_parquet(filename)?
+ .select_columns(&["id", "bool_col", "timestamp_col"])?
+ .filter(col("id").gt(lit(1)))?;
+
+ // print the results
+ df.show().await?;
+
+ Ok(())
+}
+```
+
+Here is a full example demonstrating SQL usage.
+
+```rust
+#[tokio::main]
+async fn main() -> Result<()> {
+ let config = BallistaConfig::builder()
+ .set("ballista.shuffle.partitions", "4")
+ .build()?;
+
+ // connect to Ballista scheduler
+ let ctx = BallistaContext::remote("localhost", 50050, &config);
+
+ let testdata = datafusion::arrow::util::test_util::arrow_test_data();
+
+ // register csv file with the execution context
+ ctx.register_csv(
+ "aggregate_test_100",
+ &format!("{}/csv/aggregate_test_100.csv", testdata),
+ CsvReadOptions::new(),
+ )?;
+
+ // execute the query
+ let df = ctx.sql(
+ "SELECT c1, MIN(c12), MAX(c12) \
+ FROM aggregate_test_100 \
+ WHERE c11 > 0.1 AND c11 < 0.9 \
+ GROUP BY c1",
+ )?;
+
+ // print the results
+ df.show().await?;
+
+ Ok(())
+}
+```
diff --git a/docs/user-guide/src/distributed/deployment.md b/docs/user-guide/src/distributed/deployment.md
index 3a00c96..fee020c 100644
--- a/docs/user-guide/src/distributed/deployment.md
+++ b/docs/user-guide/src/distributed/deployment.md
@@ -19,8 +19,10 @@
# Deployment
-Ballista is packaged as Docker images. Refer to the following guides to create a Ballista cluster:
+There are multiple ways that a Ballista cluster can be deployed.
-- [Create a cluster using Docker](standalone.md)
+- [Create a cluster using Cargo install](cargo-install.md)
+- [Create a cluster using Docker](docker.md)
- [Create a cluster using Docker Compose](docker-compose.md)
- [Create a cluster using Kubernetes](kubernetes.md)
+- [Create a cluster on Raspberry Pi](raspberrypi.md)
diff --git a/docs/user-guide/src/distributed/docker-compose.md b/docs/user-guide/src/distributed/docker-compose.md
index fc24d89..1b010b5 100644
--- a/docs/user-guide/src/distributed/docker-compose.md
+++ b/docs/user-guide/src/distributed/docker-compose.md
@@ -17,11 +17,29 @@
under the License.
-->
-# Installing Ballista with Docker Compose
+# Starting a Ballista cluster using Docker Compose
-Docker Compose is a convenient way to launch a cluister when testing locally. The following Docker Compose example
-demonstrates how to start a cluster using a single process that acts as both a scheduler and an executor, with a data
-volume mounted into the container so that Ballista can access the host file system.
+Docker Compose is a convenient way to launch a cluster when testing locally.
+
+## Build Docker image
+
+There is no officially published Docker image so it is currently necessary to build the image from source instead.
+
+Run the following commands to clone the source repository and build the Docker image.
+
+```bash
+git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
+cd arrow-datafusion
+./dev/build-ballista-docker.sh
+```
+
+This will create an image with the tag `ballista:0.6.0`.
+
+## Start a cluster
+
+The following Docker Compose example demonstrates how to start a cluster using one scheduler process and one
+executor process, with the scheduler using etcd as a backing store. A data volume is mounted into each container
+so that Ballista can access the host file system.
```yaml
version: "2.2"
@@ -60,4 +78,20 @@ node cluster.
docker-compose up
```
+This should show output similar to the following:
+
+```bash
+$ docker-compose up
+Creating network "ballista-benchmarks_default" with the default driver
+Creating ballista-benchmarks_etcd_1 ... done
+Creating ballista-benchmarks_ballista-scheduler_1 ... done
+Creating ballista-benchmarks_ballista-executor_1 ... done
+Attaching to ballista-benchmarks_etcd_1, ballista-benchmarks_ballista-scheduler_1, ballista-benchmarks_ballista-executor_1
+ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Running with config:
+ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] work_dir: /tmp/.tmpLVx39c
+ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] concurrent_tasks: 4
+ballista-scheduler_1 | [2021-08-28T15:55:22Z INFO ballista_scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
+ballista-executor_1 | [2021-08-28T15:55:22Z INFO ballista_executor] Ballista v0.6.0 Rust Executor listening on 0.0.0.0:50051
+```
+
The scheduler listens on port 50050 and this is the port that clients will need to connect to.
diff --git a/docs/user-guide/src/distributed/standalone.md b/docs/user-guide/src/distributed/docker.md
similarity index 55%
rename from docs/user-guide/src/distributed/standalone.md
rename to docs/user-guide/src/distributed/docker.md
index 66b6bc8..4892ab8 100644
--- a/docs/user-guide/src/distributed/standalone.md
+++ b/docs/user-guide/src/distributed/docker.md
@@ -17,7 +17,21 @@
under the License.
-->
-## Deploying a standalone Ballista cluster
+## Starting a Ballista cluster using Docker
+
+## Build Docker image
+
+There is no officially published Docker image so it is currently necessary to build the image from source instead.
+
+Run the following commands to clone the source repository and build the Docker image.
+
+```bash
+git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
+cd arrow-datafusion
+./dev/build-ballista-docker.sh
+```
+
+This will create an image with the tag `ballista:0.6.0`.
### Start a Scheduler
@@ -25,7 +39,7 @@ Start a scheduler using the following syntax:
```bash
docker run --network=host \
- -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
+ -d ballista:0.6.0 \
/scheduler --bind-port 50050
```
@@ -33,15 +47,15 @@ Run `docker ps` to check that the process is running:
```
$ docker ps
-CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
-59452ce72138 ballistacompute/ballista-rust:0.4.2-SNAPSHOT "/scheduler --bind-p…" 6 seconds ago Up 5 seconds affectionate_hofstadter
+CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+1f3f8b5ed93a ballista:0.6.0 "/scheduler --bind-p…" 2 seconds ago Up 1 second tender_archimedes
```
Run `docker logs CONTAINER_ID` to check the output from the process:
```
-$ docker logs 59452ce72138
-[2021-02-14T18:32:20Z INFO scheduler] Ballista v0.4.2-SNAPSHOT Scheduler listening on 0.0.0.0:50050
+$ docker logs 1f3f8b5ed93a
+[2021-08-28T15:45:11Z INFO ballista_scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
```
### Start executors
@@ -50,7 +64,7 @@ Start one or more executor processes. Each executor process will need to listen
```bash
docker run --network=host \
- -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
+ -d ballista:0.6.0 \
/executor --external-host localhost --bind-port 50051
```
@@ -58,32 +72,31 @@ Use `docker ps` to check that both the scheduer and executor(s) are now running:
```
$ docker ps
-CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
-0746ce262a19 ballistacompute/ballista-rust:0.4.2-SNAPSHOT "/executor --externa…" 2 seconds ago Up 1 second naughty_mclean
-59452ce72138 ballistacompute/ballista-rust:0.4.2-SNAPSHOT "/scheduler --bind-p…" 4 minutes ago Up 4 minutes affectionate_hofstadter
+CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
+7c6941bb8dc0 ballista:0.6.0 "/executor --externa…" 3 seconds ago Up 2 seconds tender_goldberg
+1f3f8b5ed93a ballista:0.6.0 "/scheduler --bind-p…" 50 seconds ago Up 49 seconds tender_archimedes
```
Use `docker logs CONTAINER_ID` to check the output from the executor(s):
```
-$ docker logs 0746ce262a19
-[2021-02-14T18:36:25Z INFO executor] Running with config: ExecutorConfig { host: "localhost", bind_port: 50051, work_dir: "/tmp/.tmpVRFSvn", concurrent_tasks: 4 }
-[2021-02-14T18:36:25Z INFO executor] Ballista v0.4.2-SNAPSHOT Rust Executor listening on 0.0.0.0:50051
-[2021-02-14T18:36:25Z INFO executor] Starting registration with scheduler
+$ docker logs 7c6941bb8dc0
+[2021-08-28T15:45:58Z INFO ballista_executor] Running with config:
+[2021-08-28T15:45:58Z INFO ballista_executor] work_dir: /tmp/.tmpeyEM76
+[2021-08-28T15:45:58Z INFO ballista_executor] concurrent_tasks: 4
+[2021-08-28T15:45:58Z INFO ballista_executor] Ballista v0.6.0 Rust Executor listening on 0.0.0.0:50051
```
-The external host and port will be registered with the scheduler. The executors will discover other executors by
-requesting a list of executors from the scheduler.
-
### Using etcd as backing store
_NOTE: This functionality is currently experimental_
-Ballista can optionally use [etcd](https://etcd.io/) as a backing store for the scheduler.
+Ballista can optionally use [etcd](https://etcd.io/) as a backing store for the scheduler. Use the following commands
+to launch the scheduler with this option enabled.
```bash
docker run --network=host \
- -d ballistacompute/ballista-rust:0.4.2-SNAPSHOT \
+ -d ballista:0.6.0 \
/scheduler --bind-port 50050 \
--config-backend etcd \
--etcd-urls etcd:2379
diff --git a/docs/user-guide/src/distributed/introduction.md b/docs/user-guide/src/distributed/introduction.md
index 0d96c26..aebf700 100644
--- a/docs/user-guide/src/distributed/introduction.md
+++ b/docs/user-guide/src/distributed/introduction.md
@@ -28,25 +28,23 @@ The foundational technologies in Ballista are:
- [Apache Arrow](https://arrow.apache.org/) memory model and compute kernels for efficient processing of data.
- [Apache Arrow Flight Protocol](https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/) for efficient data transfer between processes.
- [Google Protocol Buffers](https://developers.google.com/protocol-buffers) for serializing query plans.
-- [Docker](https://www.docker.com/) for packaging up executors along with user-defined code.
-
-## Architecture
-
-The following diagram highlights some of the integrations that will be possible with this unique architecture. Note that not all components shown here are available yet.
-
-![Ballista Architecture Diagram](img/ballista-architecture.png)
+- [DataFusion](https://github.com/apache/arrow-datafusion/) for query execution.
## How does this compare to Apache Spark?
Although Ballista is largely inspired by Apache Spark, there are some key differences.
-- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead of GC pauses.
+- The choice of Rust as the main execution language means that memory usage is deterministic and avoids the overhead
+ of GC pauses.
- Ballista is designed from the ground up to use columnar data, enabling a number of efficiencies such as vectorized
processing (SIMD and GPU) and efficient compression. Although Spark does have some columnar support, it is still
largely row-based today.
-- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of distributed compute.
-- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors in any programming language with minimal serialization overhead.
+- The combination of Rust and Arrow provides excellent memory efficiency and memory usage can be 5x - 10x lower than
+ Apache Spark in some cases, which means that more processing can fit on a single node, reducing the overhead of
+ distributed compute.
+- The use of Apache Arrow as the memory model and network protocol means that data can be exchanged between executors
+ in any programming language with minimal serialization overhead.
## Status
-Ballista is at the proof-of-concept phase currently but is under active development by a growing community.
+Ballista is still in the early stages of development but is capable of executing complex analytical queries at scale.
diff --git a/docs/user-guide/src/distributed/kubernetes.md b/docs/user-guide/src/distributed/kubernetes.md
index ef4acca..9a8e29b 100644
--- a/docs/user-guide/src/distributed/kubernetes.md
+++ b/docs/user-guide/src/distributed/kubernetes.md
@@ -20,9 +20,9 @@
# Deploying Ballista with Kubernetes
Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that
-you are already comfortable with managing Kubernetes deployments.
+you are already comfortable managing Kubernetes deployments.
-The k8s deployment consists of:
+The Ballista deployment consists of:
- k8s deployment for one or more scheduler processes
- k8s deployment for one or more executor processes
@@ -30,6 +30,15 @@ The k8s deployment consists of:
- k8s persistent volume and persistent volume claims to make local data accessible to Ballista
- _(optional)_ a [keda](http://keda.sh) instance for autoscaling the number of executors
+## Testing locally
+
+[Microk8s](https://microk8s.io/) is recommended for installing a local k8s cluster. Once Microk8s is installed, DNS
+must be enabled using the following command.
+
+```bash
+microk8s enable dns
+```
+
## Limitations
Ballista is at an early stage of development and therefore has some significant limitations:
@@ -39,13 +48,28 @@ Ballista is at an early stage of development and therefore has some significant
- Only a single scheduler instance is currently supported unless the scheduler is configured to use `etcd` as a
backing store.
+## Build Docker image
+
+There is no officially published Docker image so it is currently necessary to build the image from source instead.
+
+Run the following commands to clone the source repository and build the Docker image.
+
+```bash
+git clone git@github.com:apache/arrow-datafusion.git -b 5.1.0
+cd arrow-datafusion
+./dev/build-ballista-docker.sh
+```
+
+This will create an image with the tag `ballista:0.6.0`.
+
## Publishing your images
-Currently there are no official Ballista images that work with the instructions in this guide. For the time being,
-you will need to build and publish your own images. You can do that by invoking the `dev/build-ballista-docker.sh`.
+Once the images have been built, you can retag them and push them to your favourite Docker registry.
-Once the images have been built, you can retag them with `docker tag ballista:0.5.0-SNAPSHOT <new-image-name>` so you
-can push them to your favourite docker registry.
+```bash
+docker tag ballista:0.6.0 <your-repo>/ballista:0.6.0
+docker push <your-repo>/ballista:0.6.0
+```
## Create Persistent Volume and Persistent Volume Claim
@@ -130,7 +154,7 @@ spec:
spec:
containers:
- name: ballista-scheduler
- image: <your-image>
+ image: <your-repo>/ballista:0.6.0
command: ["/scheduler"]
args: ["--bind-port=50050"]
ports:
@@ -161,7 +185,7 @@ spec:
spec:
containers:
- name: ballista-executor
- image: <your-image>
+ image: <your-repo>/ballista:0.6.0
command: ["/executor"]
args:
- "--bind-port=50051"
@@ -205,11 +229,31 @@ You can view the scheduler logs with `kubectl logs ballista-scheduler-0`:
```
$ kubectl logs ballista-scheduler-0
-[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.4.2-SNAPSHOT Scheduler listening on 0.0.0.0:50050
+[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.6.0 Scheduler listening on 0.0.0.0:50050
[2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
[2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
```
+## Port Forwarding
+
+If you want to run applications outside of the cluster and have them connect to the scheduler then it is necessary to
+set up port forwarding.
+
+First, check that the `ballista-scheduler` service is running.
+
+```bash
+$ kubectl get services
+NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
+kubernetes ClusterIP 10.152.183.1 <none> 443/TCP 26h
+ballista-scheduler ClusterIP 10.152.183.21 <none> 50050/TCP 24m
+```
+
+Use the following command to set up port-forwarding.
+
+```bash
+kubectl port-forward service/ballista-scheduler 50050:50050
+```
+
## Deleting the Ballista cluster
Run the following kubectl command to delete the cluster.
diff --git a/docs/user-guide/src/faq.md b/docs/user-guide/src/faq.md
index 5e1a72c..f96306c 100644
--- a/docs/user-guide/src/faq.md
+++ b/docs/user-guide/src/faq.md
@@ -28,5 +28,4 @@ DataFusion is a library for executing queries in-process using the Apache Arrow
model and computational kernels. It is designed to run within a single process, using threads
for parallel query execution.
-Ballista is a distributed compute platform design to leverage DataFusion and other query
-execution libraries.
+Ballista is a distributed compute platform built on DataFusion.
diff --git a/docs/user-guide/src/img/ballista-architecture.png b/docs/user-guide/src/img/ballista-architecture.png
deleted file mode 100644
index 2f78f29..0000000
Binary files a/docs/user-guide/src/img/ballista-architecture.png and /dev/null differ
diff --git a/docs/user-guide/src/library.md b/docs/user-guide/src/library.md
index d35a4b7..d255d78 100644
--- a/docs/user-guide/src/library.md
+++ b/docs/user-guide/src/library.md
@@ -25,5 +25,5 @@ To get started, add the following to your `Cargo.toml` file:
```toml
[dependencies]
-datafusion = "4.0.0-SNAPSHOT"
+datafusion = "5.1.0"
```
diff --git a/docs/user-guide/src/sql/ddl.md b/docs/user-guide/src/sql/ddl.md
index cb16657..b48179d 100644
--- a/docs/user-guide/src/sql/ddl.md
+++ b/docs/user-guide/src/sql/ddl.md
@@ -18,3 +18,39 @@
-->
# DDL
+
+## CREATE EXTERNAL TABLE
+
+Parquet data sources can be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is not necessary
+to provide schema information for Parquet files.
+
+```sql
+CREATE EXTERNAL TABLE taxi
+STORED AS PARQUET
+LOCATION '/mnt/nyctaxi/tripdata.parquet';
+```
+
+CSV data sources can also be registered by executing a `CREATE EXTERNAL TABLE` SQL statement. It is necessary to
+provide schema information for CSV files since DataFusion does not automatically infer the schema when using SQL
+to query CSV files.
+
+```sql
+CREATE EXTERNAL TABLE test (
+ c1 VARCHAR NOT NULL,
+ c2 INT NOT NULL,
+ c3 SMALLINT NOT NULL,
+ c4 SMALLINT NOT NULL,
+ c5 INT NOT NULL,
+ c6 BIGINT NOT NULL,
+ c7 SMALLINT NOT NULL,
+ c8 INT NOT NULL,
+ c9 BIGINT NOT NULL,
+ c10 VARCHAR NOT NULL,
+ c11 FLOAT NOT NULL,
+ c12 DOUBLE NOT NULL,
+ c13 VARCHAR NOT NULL
+)
+STORED AS CSV
+WITH HEADER ROW
+LOCATION '/path/to/aggregate_test_100.csv';
+```
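
Because CSV tables need an explicit schema, the statement is sometimes easier to generate than to write by hand. The following is a minimal sketch of that idea, not part of the commit: the helper `create_csv_table_sql` is hypothetical, marks every column `NOT NULL` as in the example above, and does no quoting or escaping of its inputs.

```rust
/// Hypothetical helper: renders a `CREATE EXTERNAL TABLE` statement for a
/// CSV file from (column name, SQL type) pairs. Inputs are inserted verbatim,
/// so they must not contain quotes or other characters that need escaping.
fn create_csv_table_sql(
    table: &str,
    columns: &[(&str, &str)],
    has_header: bool,
    location: &str,
) -> String {
    let cols: Vec<String> = columns
        .iter()
        .map(|(name, ty)| format!("    {} {} NOT NULL", name, ty))
        .collect();
    // The header clause is only emitted when the file has a header row.
    let header = if has_header { "WITH HEADER ROW\n" } else { "" };
    format!(
        "CREATE EXTERNAL TABLE {} (\n{}\n)\nSTORED AS CSV\n{}LOCATION '{}';",
        table,
        cols.join(",\n"),
        header,
        location
    )
}

fn main() {
    let sql = create_csv_table_sql(
        "test",
        &[("c1", "VARCHAR"), ("c2", "INT"), ("c12", "DOUBLE")],
        true,
        "/path/to/aggregate_test_100.csv",
    );
    println!("{}", sql);
}
```

The resulting string can be passed to any interface that accepts SQL text, such as the CLI.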