Posted to commits@arrow.apache.org by gi...@apache.org on 2023/06/15 19:48:12 UTC
[arrow-datafusion] branch asf-site updated: Publish built docs triggered by 9dfaf4249e31e9a08953955fe4837eb287b089bf
This is an automated email from the ASF dual-hosted git repository.
github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 518d407709 Publish built docs triggered by 9dfaf4249e31e9a08953955fe4837eb287b089bf
518d407709 is described below
commit 518d407709e343e4a3679202ea1b5442d7717c8c
Author: github-actions[bot] <gi...@users.noreply.github.com>
AuthorDate: Thu Jun 15 19:48:08 2023 +0000
Publish built docs triggered by 9dfaf4249e31e9a08953955fe4837eb287b089bf
---
_sources/contributor-guide/roadmap.md.txt | 121 ++++--------------
_sources/user-guide/introduction.md.txt | 58 ++++++---
contributor-guide/roadmap.html | 202 +++++-------------------------
searchindex.js | 2 +-
user-guide/introduction.html | 61 ++++++---
5 files changed, 135 insertions(+), 309 deletions(-)
diff --git a/_sources/contributor-guide/roadmap.md.txt b/_sources/contributor-guide/roadmap.md.txt
index 8413fef20d..a7e81555b7 100644
--- a/_sources/contributor-guide/roadmap.md.txt
+++ b/_sources/contributor-guide/roadmap.md.txt
@@ -19,100 +19,27 @@ under the License.
# Roadmap
-This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.
-
-DataFusion and Ballista are part of the [Apache
-Arrow](https://arrow.apache.org/) project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a github issue or the
-dev@arrow.apache.org mailing list to make review efficient and avoid
-surprises.
-
-## DataFusion
-
-DataFusion's goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/)
-to provide:
-
-1. Best-in-class single node query performance
-2. A Declarative SQL query interface compatible with PostgreSQL
-3. A Dataframe API, similar to those offered by Pandas and Spark
-4. A Procedural API for programmatically creating and running execution plans
-5. High performance, data race free, ergonomic extensibility points at at every layer
-
-### Additional SQL Language Features
-
-- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
-- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/main/README.md#status)
-- Timestamp Arithmetic [#194](https://github.com/apache/arrow-datafusion/issues/194)
-- SQL Parser extension point [#533](https://github.com/apache/arrow-datafusion/issues/533)
-- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
-- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)
-
-### Query Optimizer
-
-- More sophisticated cost based optimizer for join ordering
-- Implement advanced query optimization framework (Tokomak) [#440](https://github.com/apache/arrow-datafusion/issues/440)
-- Finer optimizations for group by and aggregate functions
-
-### Datasources
-
-- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
-- Improve performances of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)
-
-### Runtime / Infrastructure
-
-- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
-- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
-- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)
-
-### Resource Management
-
-- Finer grain control and limit of runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#54](https://github.com/apache/arrow-datafusion/issues/64)
-
-### Python Interface
-
-TBD
-
-### DataFusion CLI (`datafusion-cli`)
-
-Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
-
-- Better abstraction between REPL parsing and queries so that commands are separated and handled correctly
-- Connect to the `Statistics` subsystem and have the cli print out more stats for query debugging, etc.
-- Improved error handling for interactive use and shell scripting usage
-- publishing to apt, brew, and possible NuGet registry so that people can use it more easily
-- adopt a shorter name, like dfcli?
-
-## Ballista
-
-Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
-breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
-in the cluster.
-
-Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
-compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
-remain language-agnostic so that executors can be built in languages other than Rust.
-
-### Ballista Roadmap
-
-### Move query scheduler into DataFusion
-
-The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
-the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
-configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
-DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
-
-### Implement execution-time cost-based optimizations based on statistics
-
-After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
-could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
-it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
-be smaller until execution time.
+The [project introduction](../user-guide/introduction) explains the
+overview and goals of DataFusion, and our development efforts largely
+align to that vision.
+
+## Planning `EPIC`s
+
+DataFusion uses [GitHub
+issues](https://github.com/apache/arrow-datafusion/issues) to track
+planned work. We collect related tickets using tracking issues labeled
+with `[EPIC]` which contain discussion and links to more detailed items.
+
+Epics offer a high level roadmap of what the DataFusion
+community is thinking about. The epics are not meant to restrict
+possibilities, but rather help the community see where development is
+headed, align our work, and inspire additional contributions.
+
+As this project is entirely driven by volunteers, we welcome
+contributions for items not currently covered by epics. However,
+before submitting a large PR, we strongly suggest and request you
+start a conversation using a github issue or the
+[dev@arrow.apache.org](mailto:dev@arrow.apache.org) mailing list to
+make review efficient and avoid surprises.
+
+[The current list of `EPIC`s can be found here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic).
diff --git a/_sources/user-guide/introduction.md.txt b/_sources/user-guide/introduction.md.txt
index 23157d3f36..80195c6a3f 100644
--- a/_sources/user-guide/introduction.md.txt
+++ b/_sources/user-guide/introduction.md.txt
@@ -22,8 +22,20 @@
DataFusion is a very fast, extensible query engine for building
high-quality data-centric systems in [Rust](http://rustlang.org),
using the [Apache Arrow](https://arrow.apache.org) in-memory format.
+DataFusion is part of the [Apache Arrow](https://arrow.apache.org/)
+project.
-DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
+DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, [python bindings], extensive customization, a great community, and more.
+
+[python bindings]: https://github.com/apache/arrow-datafusion-python
+
+## Project Goals
+
+DataFusion aims to be the query engine of choice for new, fast
+data centric systems such as databases, dataframe libraries, machine
+learning and streaming applications by leveraging the unique features
+of [Rust](https://www.rust-lang.org/) and [Apache
+Arrow](https://arrow.apache.org/).
## Features
@@ -34,24 +46,34 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
other query languages, custom plan and execution nodes, optimizer passes, and more.
- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
- Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
- `ObjectStore` trait.
+ Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
+ `ObjectStore` trait).
- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
[welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
-- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
- automatic join reordering, expression coercion, and more.
-- Permissive Apache 2.0 License, Apache Software Foundation governance
-- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
- productivity similar to Java or Golang, the performance of C++, and
- [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
-- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
- with other projects, and to pass plans across language boundaries.
+- A state of the art query optimizer with expression coercion and
+ simplification, projection and filter pushdown, sort and distribution
+ aware optimizations, automatic join reordering, and more.
+- Permissive Apache 2.0 License, predictable and well understood
+ [Apache Software Foundation](https://www.apache.org/) governance.
+- Implementation in [Rust](https://www.rust-lang.org/), a modern
+ system language with development productivity similar to Java or
+ Golang, the performance of C++, and [loved by programmers
+ everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) query plans, to
+ easily pass plans across language and system boundaries.
## Use Cases
DataFusion can be used without modification as an embedded SQL
engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:
+building new systems.
+
+While most current use cases are "analytic" (throughput oriented), some
+components of DataFusion, such as the plan representations, are
+suitable for "streaming" and "transaction" style systems (low
+latency).
+
+Here are some example systems built using DataFusion:
- Specialized Analytical Database systems such as [CeresDB] and more general Apache Spark like systems such as [Ballista].
- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
@@ -59,12 +81,12 @@ building new systems. Here are some examples of systems built using DataFusion:
- SQL support to another library, such as [dask sql]
- Streaming data platforms such as [Synnada]
- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
-- A faster Spark runtime replacement [Blaze]
+- Native Spark runtime replacement such as [Blaze]
-By using DataFusion, the projects are freed to focus on their specific
+By using DataFusion, projects are freed to focus on their specific
features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
-execution plans, file format support, etc.
+parallelized streaming execution plans, file format support, etc.
## Known Users
@@ -119,7 +141,7 @@ Here are some of the projects known to use DataFusion:
## Integrations and Extensions
There are a number of community projects that extend DataFusion or
-provide integrations with other systems.
+provide integrations with other systems, some of which are described below:
### Language Bindings
@@ -137,5 +159,5 @@ provide integrations with other systems.
- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+- _Easy to Embed_: Allowing extension at almost any point in its design, and published regularly as a crate on [crates.io](http://crates.io), DataFusion can be integrated and tailored for your specific usecase.
+- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.
diff --git a/contributor-guide/roadmap.html b/contributor-guide/roadmap.html
index 285104171b..e3868089dc 100644
--- a/contributor-guide/roadmap.html
+++ b/contributor-guide/roadmap.html
@@ -283,74 +283,15 @@
<nav id="bd-toc-nav">
<ul class="visible nav section-nav flex-column">
<li class="toc-h2 nav-item toc-entry">
- <a class="reference internal nav-link" href="#datafusion">
- DataFusion
- </a>
- <ul class="nav section-nav flex-column">
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#additional-sql-language-features">
- Additional SQL Language Features
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#query-optimizer">
- Query Optimizer
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#datasources">
- Datasources
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#runtime-infrastructure">
- Runtime / Infrastructure
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#resource-management">
- Resource Management
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#python-interface">
- Python Interface
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#datafusion-cli-datafusion-cli">
- DataFusion CLI (
- <code class="docutils literal notranslate">
- <span class="pre">
- datafusion-cli
- </span>
- </code>
- )
- </a>
- </li>
- </ul>
- </li>
- <li class="toc-h2 nav-item toc-entry">
- <a class="reference internal nav-link" href="#ballista">
- Ballista
+ <a class="reference internal nav-link" href="#planning-epics">
+ Planning
+ <code class="docutils literal notranslate">
+ <span class="pre">
+ EPIC
+ </span>
+ </code>
+ s
</a>
- <ul class="nav section-nav flex-column">
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#ballista-roadmap">
- Ballista Roadmap
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#move-query-scheduler-into-datafusion">
- Move query scheduler into DataFusion
- </a>
- </li>
- <li class="toc-h3 nav-item toc-entry">
- <a class="reference internal nav-link" href="#implement-execution-time-cost-based-optimizations-based-on-statistics">
- Implement execution-time cost-based optimizations based on statistics
- </a>
- </li>
- </ul>
</li>
</ul>
@@ -400,113 +341,26 @@ under the License.
-->
<section id="roadmap">
<h1>Roadmap<a class="headerlink" href="#roadmap" title="Permalink to this heading">¶</a></h1>
-<p>This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.</p>
-<p>DataFusion and Ballista are part of the <a class="reference external" href="https://arrow.apache.org/">Apache
-Arrow</a> project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a github issue or the
-dev@arrow.apache.org mailing list to make review efficient and avoid
-surprises.</p>
-<section id="datafusion">
-<h2>DataFusion<a class="headerlink" href="#datafusion" title="Permalink to this heading">¶</a></h2>
-<p>DataFusion’s goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-<a class="reference external" href="https://www.rust-lang.org/">Rust</a> and <a class="reference external" href="https://arrow.apache.org/">Apache Arrow</a>
-to provide:</p>
-<ol class="arabic simple">
-<li><p>Best-in-class single node query performance</p></li>
-<li><p>A Declarative SQL query interface compatible with PostgreSQL</p></li>
-<li><p>A Dataframe API, similar to those offered by Pandas and Spark</p></li>
-<li><p>A Procedural API for programmatically creating and running execution plans</p></li>
-<li><p>High performance, data race free, ergonomic extensibility points at at every layer</p></li>
-</ol>
-<section id="additional-sql-language-features">
-<h3>Additional SQL Language Features<a class="headerlink" href="#additional-sql-language-features" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Decimal Support <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/122">#122</a></p></li>
-<li><p>Complete support list on <a class="reference external" href="https://github.com/apache/arrow-datafusion/blob/main/README.md#status">status</a></p></li>
-<li><p>Timestamp Arithmetic <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/194">#194</a></p></li>
-<li><p>SQL Parser extension point <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/533">#533</a></p></li>
-<li><p>Support for nested structures (fields, lists, structs) <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/119">#119</a></p></li>
-<li><p>Run all queries from the TPCH benchmark (see <a class="reference external" href="https://github.com/apache/arrow-datafusion/milestone/2">milestone</a> for more details)</p></li>
-</ul>
-</section>
-<section id="query-optimizer">
-<h3>Query Optimizer<a class="headerlink" href="#query-optimizer" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>More sophisticated cost based optimizer for join ordering</p></li>
-<li><p>Implement advanced query optimization framework (Tokomak) <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/440">#440</a></p></li>
-<li><p>Finer optimizations for group by and aggregate functions</p></li>
-</ul>
-</section>
-<section id="datasources">
-<h3>Datasources<a class="headerlink" href="#datasources" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Better support for reading data from remote filesystems (e.g. S3) without caching it locally <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/907">#907</a> <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/1060">#1060</a></p></li>
-<li><p>Improve performances of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability…)</p></li>
-</ul>
-</section>
-<section id="runtime-infrastructure">
-<h3>Runtime / Infrastructure<a class="headerlink" href="#runtime-infrastructure" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Migrate to some sort of arrow2 based implementation (see <a class="reference external" href="https://github.com/apache/arrow-datafusion/milestone/3">milestone</a> for more details)</p></li>
-<li><p>Add DataFusion to h2oai/db-benchmark <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/147">#147</a></p></li>
-<li><p>Improve build time <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/348">#348</a></p></li>
-</ul>
-</section>
-<section id="resource-management">
-<h3>Resource Management<a class="headerlink" href="#resource-management" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Finer grain control and limit of runtime memory <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/587">#587</a> and CPU usage <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/64">#54</a></p></li>
-</ul>
-</section>
-<section id="python-interface">
-<h3>Python Interface<a class="headerlink" href="#python-interface" title="Permalink to this heading">¶</a></h3>
-<p>TBD</p>
-</section>
-<section id="datafusion-cli-datafusion-cli">
-<h3>DataFusion CLI (<code class="docutils literal notranslate"><span class="pre">datafusion-cli</span></code>)<a class="headerlink" href="#datafusion-cli-datafusion-cli" title="Permalink to this heading">¶</a></h3>
-<p>Note: There are some additional thoughts on a datafusion-cli vision on <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770">#1096</a>.</p>
-<ul class="simple">
-<li><p>Better abstraction between REPL parsing and queries so that commands are separated and handled correctly</p></li>
-<li><p>Connect to the <code class="docutils literal notranslate"><span class="pre">Statistics</span></code> subsystem and have the cli print out more stats for query debugging, etc.</p></li>
-<li><p>Improved error handling for interactive use and shell scripting usage</p></li>
-<li><p>publishing to apt, brew, and possible NuGet registry so that people can use it more easily</p></li>
-<li><p>adopt a shorter name, like dfcli?</p></li>
-</ul>
-</section>
-</section>
-<section id="ballista">
-<h2>Ballista<a class="headerlink" href="#ballista" title="Permalink to this heading">¶</a></h2>
-<p>Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
-breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
-in the cluster.</p>
-<p>Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
-compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
-remain language-agnostic so that executors can be built in languages other than Rust.</p>
-<section id="ballista-roadmap">
-<h3>Ballista Roadmap<a class="headerlink" href="#ballista-roadmap" title="Permalink to this heading">¶</a></h3>
-</section>
-<section id="move-query-scheduler-into-datafusion">
-<h3>Move query scheduler into DataFusion<a class="headerlink" href="#move-query-scheduler-into-datafusion" title="Permalink to this heading">¶</a></h3>
-<p>The Ballista scheduler has some advantages over DataFusion query execution because it doesn’t try to eagerly execute
-the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
-configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
-DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.</p>
-</section>
-<section id="implement-execution-time-cost-based-optimizations-based-on-statistics">
-<h3>Implement execution-time cost-based optimizations based on statistics<a class="headerlink" href="#implement-execution-time-cost-based-optimizations-based-on-statistics" title="Permalink to this heading">¶</a></h3>
-<p>After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
-could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
-it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
-be smaller until execution time.</p>
-</section>
+<p>The <a class="reference internal" href="../user-guide/introduction.html"><span class="doc std std-doc">project introduction</span></a> explains the
+overview and goals of DataFusion, and our development efforts largely
+align to that vision.</p>
+<section id="planning-epics">
+<h2>Planning <code class="docutils literal notranslate"><span class="pre">EPIC</span></code>s<a class="headerlink" href="#planning-epics" title="Permalink to this heading">¶</a></h2>
+<p>DataFusion uses <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues">GitHub
+issues</a> to track
+planned work. We collect related tickets using tracking issues labeled
+with <code class="docutils literal notranslate"><span class="pre">[EPIC]</span></code> which contain discussion and links to more detailed items.</p>
+<p>Epics offer a high level roadmap of what the DataFusion
+community is thinking about. The epics are not meant to restrict
+possibilities, but rather help the community see where development is
+headed, align our work, and inspire additional contributions.</p>
+<p>As this project is entirely driven by volunteers, we welcome
+contributions for items not currently covered by epics. However,
+before submitting a large PR, we strongly suggest and request you
+start a conversation using a github issue or the
+<a class="reference external" href="mailto:dev%40arrow.apache.org">dev<span>@</span>arrow<span>.</span>apache<span>.</span>org</a> mailing list to
+make review efficient and avoid surprises.</p>
+<p><a class="reference external" href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic">The current list of <code class="docutils literal notranslate"><span class="pre">EPIC</span></code>s can be found here</a>.</p>
</section>
</section>
diff --git a/searchindex.js b/searchindex.js
index a07e14ade8..b350ff6bf6 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"docnames": ["contributor-guide/architecture", "contributor-guide/communication", "contributor-guide/index", "contributor-guide/quarterly_roadmap", "contributor-guide/roadmap", "contributor-guide/specification/index", "contributor-guide/specification/invariants", "contributor-guide/specification/output-field-name-semantic", "index", "user-guide/cli", "user-guide/configs", "user-guide/dataframe", "user-guide/example-usage", "user-guide/expressions", "user-guide/faq", "use [...]
\ No newline at end of file
+Search.setIndex({"docnames": ["contributor-guide/architecture", "contributor-guide/communication", "contributor-guide/index", "contributor-guide/quarterly_roadmap", "contributor-guide/roadmap", "contributor-guide/specification/index", "contributor-guide/specification/invariants", "contributor-guide/specification/output-field-name-semantic", "index", "user-guide/cli", "user-guide/configs", "user-guide/dataframe", "user-guide/example-usage", "user-guide/expressions", "user-guide/faq", "use [...]
\ No newline at end of file
diff --git a/user-guide/introduction.html b/user-guide/introduction.html
index 2f2b991c55..4fba7979b4 100644
--- a/user-guide/introduction.html
+++ b/user-guide/introduction.html
@@ -282,6 +282,11 @@
<nav id="bd-toc-nav">
<ul class="visible nav section-nav flex-column">
+ <li class="toc-h2 nav-item toc-entry">
+ <a class="reference internal nav-link" href="#project-goals">
+ Project Goals
+ </a>
+ </li>
<li class="toc-h2 nav-item toc-entry">
<a class="reference internal nav-link" href="#features">
Features
@@ -369,8 +374,18 @@
<h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this heading">¶</a></h1>
<p>DataFusion is a very fast, extensible query engine for building
high-quality data-centric systems in <a class="reference external" href="http://rustlang.org">Rust</a>,
-using the <a class="reference external" href="https://arrow.apache.org">Apache Arrow</a> in-memory format.</p>
-<p>DataFusion offers SQL and Dataframe APIs, excellent <a class="reference external" href="https://benchmark.clickhouse.com/">performance</a>, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.</p>
+using the <a class="reference external" href="https://arrow.apache.org">Apache Arrow</a> in-memory format.
+DataFusion is part of the <a class="reference external" href="https://arrow.apache.org/">Apache Arrow</a>
+project.</p>
+<p>DataFusion offers SQL and Dataframe APIs, excellent <a class="reference external" href="https://benchmark.clickhouse.com/">performance</a>, built-in support for CSV, Parquet, JSON, and Avro, <a class="reference external" href="https://github.com/apache/arrow-datafusion-python">python bindings</a>, extensive customization, a great community, and more.</p>
+<section id="project-goals">
+<h2>Project Goals<a class="headerlink" href="#project-goals" title="Permalink to this heading">¶</a></h2>
+<p>DataFusion aims to be the query engine of choice for new, fast
+data centric systems such as databases, dataframe libraries, machine
+learning and streaming applications by leveraging the unique features
+of <a class="reference external" href="https://www.rust-lang.org/">Rust</a> and <a class="reference external" href="https://arrow.apache.org/">Apache
+Arrow</a>.</p>
+</section>
<section id="features">
<h2>Features<a class="headerlink" href="#features" title="Permalink to this heading">¶</a></h2>
<ul class="simple">
@@ -381,25 +396,33 @@ for custom file formats and non file datasources via the <code class="docutils l
<li><p>Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
other query languages, custom plan and execution nodes, optimizer passes, and more.</p></li>
<li><p>Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-<code class="docutils literal notranslate"><span class="pre">ObjectStore</span></code> trait.</p></li>
+Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
+<code class="docutils literal notranslate"><span class="pre">ObjectStore</span></code> trait).</p></li>
<li><p><a class="reference external" href="https://docs.rs/datafusion/latest">Excellent Documentation</a> and a
<a class="reference external" href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">welcoming community</a>.</p></li>
-<li><p>A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-automatic join reordering, expression coercion, and more.</p></li>
-<li><p>Permissive Apache 2.0 License, Apache Software Foundation governance</p></li>
-<li><p>Written in <a class="reference external" href="https://www.rust-lang.org/">Rust</a>, a modern system language with development
-productivity similar to Java or Golang, the performance of C++, and
-<a class="reference external" href="https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted">loved by programmers everywhere</a>.</p></li>
-<li><p>Support for <a class="reference external" href="https://substrait.io/">Substrait</a> for query plan serialization, making it easier to integrate DataFusion
-with other projects, and to pass plans across language boundaries.</p></li>
+<li><p>A state-of-the-art query optimizer with expression coercion and
+simplification, projection and filter pushdown, sort- and distribution-aware
+optimizations, automatic join reordering, and more.</p></li>
+<li><p>Permissive Apache 2.0 License and predictable, well-understood
+<a class="reference external" href="https://www.apache.org/">Apache Software Foundation</a> governance.</p></li>
+<li><p>Implementation in <a class="reference external" href="https://www.rust-lang.org/">Rust</a>, a modern
+systems language that offers development productivity similar to Java or
+Golang and the performance of C++, and that is <a class="reference external" href="https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted">loved by programmers
+everywhere</a>.</p></li>
+<li><p>Support for <a class="reference external" href="https://substrait.io/">Substrait</a> query plans, to
+easily pass plans across language and system boundaries.</p></li>
</ul>
</section>
<section id="use-cases">
<h2>Use Cases<a class="headerlink" href="#use-cases" title="Permalink to this heading">¶</a></h2>
<p>DataFusion can be used without modification as an embedded SQL
engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:</p>
+building new systems.</p>
+<p>While most current use cases are “analytic” (throughput-oriented), some
+components of DataFusion, such as the plan representations, are also
+suitable for “streaming” and “transaction” style systems (low
+latency).</p>
+<p>Here are some example systems built using DataFusion:</p>
<ul class="simple">
 <li><p>Specialized analytical database systems such as <a class="reference external" href="https://github.com/CeresDB/ceresdb">CeresDB</a> and more general Apache Spark-like systems such as <a class="reference external" href="https://github.com/apache/arrow-ballista">Ballista</a>.</p></li>
<li><p>New query language engines such as <a class="reference external" href="https://github.com/prql/prql-query">prql-query</a> and accelerators such as <a class="reference external" href="https://vegafusion.io/">VegaFusion</a></p></li>
@@ -407,12 +430,12 @@ building new systems. Here are some examples of systems built using DataFusion:<
<li><p>SQL support to another library, such as <a class="reference external" href="https://github.com/dask-contrib/dask-sql">dask sql</a></p></li>
<li><p>Streaming data platforms such as <a class="reference external" href="https://synnada.ai/">Synnada</a></p></li>
<li><p>Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as <a class="reference external" href="https://github.com/timvw/qv">qv</a></p></li>
-<li><p>A faster Spark runtime replacement <a class="reference external" href="https://github.com/blaze-init/blaze">Blaze</a></p></li>
+<li><p>Native Spark runtime replacements such as <a class="reference external" href="https://github.com/blaze-init/blaze">Blaze</a></p></li>
</ul>
-<p>By using DataFusion, the projects are freed to focus on their specific
+<p>By using DataFusion, projects are freed to focus on their specific
features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
-execution plans, file format support, etc.</p>
+parallelized streaming execution plans, file format support, etc.</p>
</section>
<section id="known-users">
<h2>Known Users<a class="headerlink" href="#known-users" title="Permalink to this heading">¶</a></h2>
@@ -445,7 +468,7 @@ execution plans, file format support, etc.</p>
<section id="integrations-and-extensions">
<h2>Integrations and Extensions<a class="headerlink" href="#integrations-and-extensions" title="Permalink to this heading">¶</a></h2>
<p>There are a number of community projects that extend DataFusion or
-provide integrations with other systems.</p>
+provide integrations with other systems, some of which are described below:</p>
<section id="language-bindings">
<h3>Language Bindings<a class="headerlink" href="#language-bindings" title="Permalink to this heading">¶</a></h3>
<ul class="simple">
@@ -468,8 +491,8 @@ provide integrations with other systems.</p>
<ul class="simple">
<li><p><em>High Performance</em>: Leveraging Rust and Arrow’s memory model, DataFusion is very fast.</p></li>
<li><p><em>Easy to Connect</em>: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem</p></li>
-<li><p><em>Easy to Embed</em>: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase</p></li>
-<li><p><em>High Quality</em>: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.</p></li>
+<li><p><em>Easy to Embed</em>: Allowing extension at almost any point in its design, and published regularly as a crate on <a class="reference external" href="http://crates.io">crates.io</a>, DataFusion can be integrated and tailored for your specific use case.</p></li>
+<li><p><em>High Quality</em>: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.</p></li>
</ul>
</section>
</section>