Posted to commits@arrow.apache.org by gi...@apache.org on 2023/06/15 19:48:12 UTC

[arrow-datafusion] branch asf-site updated: Publish built docs triggered by 9dfaf4249e31e9a08953955fe4837eb287b089bf

This is an automated email from the ASF dual-hosted git repository.

github-bot pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/arrow-datafusion.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 518d407709 Publish built docs triggered by 9dfaf4249e31e9a08953955fe4837eb287b089bf
518d407709 is described below

commit 518d407709e343e4a3679202ea1b5442d7717c8c
Author: github-actions[bot] <gi...@users.noreply.github.com>
AuthorDate: Thu Jun 15 19:48:08 2023 +0000

    Publish built docs triggered by 9dfaf4249e31e9a08953955fe4837eb287b089bf
---
 _sources/contributor-guide/roadmap.md.txt | 121 ++++--------------
 _sources/user-guide/introduction.md.txt   |  58 ++++++---
 contributor-guide/roadmap.html            | 202 +++++-------------------------
 searchindex.js                            |   2 +-
 user-guide/introduction.html              |  61 ++++++---
 5 files changed, 135 insertions(+), 309 deletions(-)

diff --git a/_sources/contributor-guide/roadmap.md.txt b/_sources/contributor-guide/roadmap.md.txt
index 8413fef20d..a7e81555b7 100644
--- a/_sources/contributor-guide/roadmap.md.txt
+++ b/_sources/contributor-guide/roadmap.md.txt
@@ -19,100 +19,27 @@ under the License.
 
 # Roadmap
 
-This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.
-
-DataFusion and Ballista are part of the [Apache
-Arrow](https://arrow.apache.org/) project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a github issue or the
-dev@arrow.apache.org mailing list to make review efficient and avoid
-surprises.
-
-## DataFusion
-
-DataFusion's goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/)
-to provide:
-
-1. Best-in-class single node query performance
-2. A Declarative SQL query interface compatible with PostgreSQL
-3. A Dataframe API, similar to those offered by Pandas and Spark
-4. A Procedural API for programmatically creating and running execution plans
-5. High performance, data race free, ergonomic extensibility points at at every layer
-
-### Additional SQL Language Features
-
-- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
-- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/main/README.md#status)
-- Timestamp Arithmetic [#194](https://github.com/apache/arrow-datafusion/issues/194)
-- SQL Parser extension point [#533](https://github.com/apache/arrow-datafusion/issues/533)
-- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
-- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)
-
-### Query Optimizer
-
-- More sophisticated cost based optimizer for join ordering
-- Implement advanced query optimization framework (Tokomak) [#440](https://github.com/apache/arrow-datafusion/issues/440)
-- Finer optimizations for group by and aggregate functions
-
-### Datasources
-
-- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
-- Improve performances of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)
-
-### Runtime / Infrastructure
-
-- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
-- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
-- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)
-
-### Resource Management
-
-- Finer grain control and limit of runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#54](https://github.com/apache/arrow-datafusion/issues/64)
-
-### Python Interface
-
-TBD
-
-### DataFusion CLI (`datafusion-cli`)
-
-Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
-
-- Better abstraction between REPL parsing and queries so that commands are separated and handled correctly
-- Connect to the `Statistics` subsystem and have the cli print out more stats for query debugging, etc.
-- Improved error handling for interactive use and shell scripting usage
-- publishing to apt, brew, and possible NuGet registry so that people can use it more easily
-- adopt a shorter name, like dfcli?
-
-## Ballista
-
-Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
-breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
-in the cluster.
-
-Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
-compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
-remain language-agnostic so that executors can be built in languages other than Rust.
-
-### Ballista Roadmap
-
-### Move query scheduler into DataFusion
-
-The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
-the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
-configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
-DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
-
-### Implement execution-time cost-based optimizations based on statistics
-
-After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
-could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
-it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
-be smaller until execution time.
+The [project introduction](../user-guide/introduction) explains the
+overview and goals of DataFusion, and our development efforts largely
+align to that vision.
+
+## Planning `EPIC`s
+
+DataFusion uses [GitHub
+issues](https://github.com/apache/arrow-datafusion/issues) to track
+planned work. We collect related tickets using tracking issues labeled
+with `[EPIC]` which contain discussion and links to more detailed items.
+
+Epics offer a high level roadmap of what the DataFusion
+community is thinking about. The epics are not meant to restrict
+possibilities, but rather help the community see where development is
+headed, align our work, and inspire additional contributions.
+
+As this project is entirely driven by volunteers, we welcome
+contributions for items not currently covered by epics. However,
+before submitting a large PR, we strongly suggest and request you
+start a conversation using a github issue or the
+[dev@arrow.apache.org](mailto:dev@arrow.apache.org) mailing list to
+make review efficient and avoid surprises.
+
+[The current list of `EPIC`s can be found here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic).
diff --git a/_sources/user-guide/introduction.md.txt b/_sources/user-guide/introduction.md.txt
index 23157d3f36..80195c6a3f 100644
--- a/_sources/user-guide/introduction.md.txt
+++ b/_sources/user-guide/introduction.md.txt
@@ -22,8 +22,20 @@
 DataFusion is a very fast, extensible query engine for building
 high-quality data-centric systems in [Rust](http://rustlang.org),
 using the [Apache Arrow](https://arrow.apache.org) in-memory format.
+DataFusion is part of the [Apache Arrow](https://arrow.apache.org/)
+project.
 
-DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.
+DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, [python bindings], extensive customization, a great community, and more.
+
+[python bindings]: https://github.com/apache/arrow-datafusion-python
+
+## Project Goals
+
+DataFusion aims to be the query engine of choice for new, fast
+data-centric systems such as databases, dataframe libraries, machine
+learning and streaming applications by leveraging the unique features
+of [Rust](https://www.rust-lang.org/) and [Apache
+Arrow](https://arrow.apache.org/).
 
 ## Features
 
@@ -34,24 +46,34 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
 - Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
   other query languages, custom plan and execution nodes, optimizer passes, and more.
 - Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-  `ObjectStore` trait.
+  Azure Blob Storage, and Google Cloud Storage (other storage systems are supported via the
+  `ObjectStore` trait).
 - [Excellent Documentation](https://docs.rs/datafusion/latest) and a
   [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
-- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-  automatic join reordering, expression coercion, and more.
-- Permissive Apache 2.0 License, Apache Software Foundation governance
-- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
-  productivity similar to Java or Golang, the performance of C++, and
-  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
-- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
-  with other projects, and to pass plans across language boundaries.
+- A state of the art query optimizer with expression coercion and
+  simplification, projection and filter pushdown, sort and distribution
+  aware optimizations, automatic join reordering, and more.
+- Permissive Apache 2.0 License, predictable and well understood
+  [Apache Software Foundation](https://www.apache.org/) governance.
+- Implementation in [Rust](https://www.rust-lang.org/), a modern
+  system language with development productivity similar to Java or
+  Golang, the performance of C++, and [loved by programmers
+  everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) query plans, to
+  easily pass plans across language and system boundaries.
 
 ## Use Cases
 
 DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:
+building new systems.
+
+While most current use cases are "analytic" (throughput oriented), some
+components of DataFusion, such as the plan representations, are
+suitable for "streaming" and "transaction" style systems (low
+latency).
+
+Here are some example systems built using DataFusion:
 
 - Specialized Analytical Database systems such as [CeresDB] and more general Apache Spark-like systems such as [Ballista].
 - New query language engines such as [prql-query] and accelerators such as [VegaFusion]
@@ -59,12 +81,12 @@ building new systems. Here are some examples of systems built using DataFusion:
 - SQL support to another library, such as [dask sql]
 - Streaming data platforms such as [Synnada]
 - Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
-- A faster Spark runtime replacement [Blaze]
+- Native Spark runtime replacement such as [Blaze]
 
-By using DataFusion, the projects are freed to focus on their specific
+By using DataFusion, projects are freed to focus on their specific
 features, and avoid reimplementing general (but still necessary)
 features such as an expression representation, standard optimizations,
-execution plans, file format support, etc.
+parallelized streaming execution plans, file format support, etc.
 
 ## Known Users
 
@@ -119,7 +141,7 @@ Here are some of the projects known to use DataFusion:
 ## Integrations and Extensions
 
 There are a number of community projects that extend DataFusion or
-provide integrations with other systems.
+provide integrations with other systems, some of which are described below:
 
 ### Language Bindings
 
@@ -137,5 +159,5 @@ provide integrations with other systems.
 
 - _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion is very fast.
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+- _Easy to Embed_: Allowing extension at almost any point in its design, and published regularly as a crate on [crates.io](http://crates.io), DataFusion can be integrated and tailored for your specific use case.
+- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be, and is, used as the foundation for production systems.
diff --git a/contributor-guide/roadmap.html b/contributor-guide/roadmap.html
index 285104171b..e3868089dc 100644
--- a/contributor-guide/roadmap.html
+++ b/contributor-guide/roadmap.html
@@ -283,74 +283,15 @@
 <nav id="bd-toc-nav">
     <ul class="visible nav section-nav flex-column">
  <li class="toc-h2 nav-item toc-entry">
-  <a class="reference internal nav-link" href="#datafusion">
-   DataFusion
-  </a>
-  <ul class="nav section-nav flex-column">
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#additional-sql-language-features">
-     Additional SQL Language Features
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#query-optimizer">
-     Query Optimizer
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#datasources">
-     Datasources
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#runtime-infrastructure">
-     Runtime / Infrastructure
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#resource-management">
-     Resource Management
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#python-interface">
-     Python Interface
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#datafusion-cli-datafusion-cli">
-     DataFusion CLI (
-     <code class="docutils literal notranslate">
-      <span class="pre">
-       datafusion-cli
-      </span>
-     </code>
-     )
-    </a>
-   </li>
-  </ul>
- </li>
- <li class="toc-h2 nav-item toc-entry">
-  <a class="reference internal nav-link" href="#ballista">
-   Ballista
+  <a class="reference internal nav-link" href="#planning-epics">
+   Planning
+   <code class="docutils literal notranslate">
+    <span class="pre">
+     EPIC
+    </span>
+   </code>
+   s
   </a>
-  <ul class="nav section-nav flex-column">
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#ballista-roadmap">
-     Ballista Roadmap
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#move-query-scheduler-into-datafusion">
-     Move query scheduler into DataFusion
-    </a>
-   </li>
-   <li class="toc-h3 nav-item toc-entry">
-    <a class="reference internal nav-link" href="#implement-execution-time-cost-based-optimizations-based-on-statistics">
-     Implement execution-time cost-based optimizations based on statistics
-    </a>
-   </li>
-  </ul>
  </li>
 </ul>
 
@@ -400,113 +341,26 @@ under the License.
 -->
 <section id="roadmap">
 <h1>Roadmap<a class="headerlink" href="#roadmap" title="Permalink to this heading">¶</a></h1>
-<p>This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.</p>
-<p>DataFusion and Ballista are part of the <a class="reference external" href="https://arrow.apache.org/">Apache
-Arrow</a> project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a github issue or the
-dev&#64;arrow.apache.org mailing list to make review efficient and avoid
-surprises.</p>
-<section id="datafusion">
-<h2>DataFusion<a class="headerlink" href="#datafusion" title="Permalink to this heading">¶</a></h2>
-<p>DataFusion’s goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-<a class="reference external" href="https://www.rust-lang.org/">Rust</a> and <a class="reference external" href="https://arrow.apache.org/">Apache Arrow</a>
-to provide:</p>
-<ol class="arabic simple">
-<li><p>Best-in-class single node query performance</p></li>
-<li><p>A Declarative SQL query interface compatible with PostgreSQL</p></li>
-<li><p>A Dataframe API, similar to those offered by Pandas and Spark</p></li>
-<li><p>A Procedural API for programmatically creating and running execution plans</p></li>
-<li><p>High performance, data race free, ergonomic extensibility points at at every layer</p></li>
-</ol>
-<section id="additional-sql-language-features">
-<h3>Additional SQL Language Features<a class="headerlink" href="#additional-sql-language-features" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Decimal Support <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/122">#122</a></p></li>
-<li><p>Complete support list on <a class="reference external" href="https://github.com/apache/arrow-datafusion/blob/main/README.md#status">status</a></p></li>
-<li><p>Timestamp Arithmetic <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/194">#194</a></p></li>
-<li><p>SQL Parser extension point <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/533">#533</a></p></li>
-<li><p>Support for nested structures (fields, lists, structs) <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/119">#119</a></p></li>
-<li><p>Run all queries from the TPCH benchmark (see <a class="reference external" href="https://github.com/apache/arrow-datafusion/milestone/2">milestone</a> for more details)</p></li>
-</ul>
-</section>
-<section id="query-optimizer">
-<h3>Query Optimizer<a class="headerlink" href="#query-optimizer" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>More sophisticated cost based optimizer for join ordering</p></li>
-<li><p>Implement advanced query optimization framework (Tokomak) <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/440">#440</a></p></li>
-<li><p>Finer optimizations for group by and aggregate functions</p></li>
-</ul>
-</section>
-<section id="datasources">
-<h3>Datasources<a class="headerlink" href="#datasources" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Better support for reading data from remote filesystems (e.g. S3) without caching it locally <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/907">#907</a> <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/1060">#1060</a></p></li>
-<li><p>Improve performances of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability…)</p></li>
-</ul>
-</section>
-<section id="runtime-infrastructure">
-<h3>Runtime / Infrastructure<a class="headerlink" href="#runtime-infrastructure" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Migrate to some sort of arrow2 based implementation (see <a class="reference external" href="https://github.com/apache/arrow-datafusion/milestone/3">milestone</a> for more details)</p></li>
-<li><p>Add DataFusion to h2oai/db-benchmark <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/147">#147</a></p></li>
-<li><p>Improve build time <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/348">#348</a></p></li>
-</ul>
-</section>
-<section id="resource-management">
-<h3>Resource Management<a class="headerlink" href="#resource-management" title="Permalink to this heading">¶</a></h3>
-<ul class="simple">
-<li><p>Finer grain control and limit of runtime memory <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/587">#587</a> and CPU usage <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/64">#54</a></p></li>
-</ul>
-</section>
-<section id="python-interface">
-<h3>Python Interface<a class="headerlink" href="#python-interface" title="Permalink to this heading">¶</a></h3>
-<p>TBD</p>
-</section>
-<section id="datafusion-cli-datafusion-cli">
-<h3>DataFusion CLI (<code class="docutils literal notranslate"><span class="pre">datafusion-cli</span></code>)<a class="headerlink" href="#datafusion-cli-datafusion-cli" title="Permalink to this heading">¶</a></h3>
-<p>Note: There are some additional thoughts on a datafusion-cli vision on <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770">#1096</a>.</p>
-<ul class="simple">
-<li><p>Better abstraction between REPL parsing and queries so that commands are separated and handled correctly</p></li>
-<li><p>Connect to the <code class="docutils literal notranslate"><span class="pre">Statistics</span></code> subsystem and have the cli print out more stats for query debugging, etc.</p></li>
-<li><p>Improved error handling for interactive use and shell scripting usage</p></li>
-<li><p>publishing to apt, brew, and possible NuGet registry so that people can use it more easily</p></li>
-<li><p>adopt a shorter name, like dfcli?</p></li>
-</ul>
-</section>
-</section>
-<section id="ballista">
-<h2>Ballista<a class="headerlink" href="#ballista" title="Permalink to this heading">¶</a></h2>
-<p>Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
-breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
-in the cluster.</p>
-<p>Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
-compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
-remain language-agnostic so that executors can be built in languages other than Rust.</p>
-<section id="ballista-roadmap">
-<h3>Ballista Roadmap<a class="headerlink" href="#ballista-roadmap" title="Permalink to this heading">¶</a></h3>
-</section>
-<section id="move-query-scheduler-into-datafusion">
-<h3>Move query scheduler into DataFusion<a class="headerlink" href="#move-query-scheduler-into-datafusion" title="Permalink to this heading">¶</a></h3>
-<p>The Ballista scheduler has some advantages over DataFusion query execution because it doesn’t try to eagerly execute
-the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
-configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
-DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.</p>
-</section>
-<section id="implement-execution-time-cost-based-optimizations-based-on-statistics">
-<h3>Implement execution-time cost-based optimizations based on statistics<a class="headerlink" href="#implement-execution-time-cost-based-optimizations-based-on-statistics" title="Permalink to this heading">¶</a></h3>
-<p>After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
-could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
-it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
-be smaller until execution time.</p>
-</section>
+<p>The <a class="reference internal" href="../user-guide/introduction.html"><span class="doc std std-doc">project introduction</span></a> explains the
+overview and goals of DataFusion, and our development efforts largely
+align to that vision.</p>
+<section id="planning-epics">
+<h2>Planning <code class="docutils literal notranslate"><span class="pre">EPIC</span></code>s<a class="headerlink" href="#planning-epics" title="Permalink to this heading">¶</a></h2>
+<p>DataFusion uses <a class="reference external" href="https://github.com/apache/arrow-datafusion/issues">GitHub
+issues</a> to track
+planned work. We collect related tickets using tracking issues labeled
+with <code class="docutils literal notranslate"><span class="pre">[EPIC]</span></code> which contain discussion and links to more detailed items.</p>
+<p>Epics offer a high level roadmap of what the DataFusion
+community is thinking about. The epics are not meant to restrict
+possibilities, but rather help the community see where development is
+headed, align our work, and inspire additional contributions.</p>
+<p>As this project is entirely driven by volunteers, we welcome
+contributions for items not currently covered by epics. However,
+before submitting a large PR, we strongly suggest and request you
+start a conversation using a github issue or the
+<a class="reference external" href="mailto:dev&#37;&#52;&#48;arrow&#46;apache&#46;org">dev<span>&#64;</span>arrow<span>&#46;</span>apache<span>&#46;</span>org</a> mailing list to
+make review efficient and avoid surprises.</p>
+<p><a class="reference external" href="https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic">The current list of <code class="docutils literal notranslate"><span class="pre">EPIC</span></code>s can be found here</a>.</p>
 </section>
 </section>
 
diff --git a/searchindex.js b/searchindex.js
index a07e14ade8..b350ff6bf6 100644
--- a/searchindex.js
+++ b/searchindex.js
@@ -1 +1 @@
-Search.setIndex({"docnames": ["contributor-guide/architecture", "contributor-guide/communication", "contributor-guide/index", "contributor-guide/quarterly_roadmap", "contributor-guide/roadmap", "contributor-guide/specification/index", "contributor-guide/specification/invariants", "contributor-guide/specification/output-field-name-semantic", "index", "user-guide/cli", "user-guide/configs", "user-guide/dataframe", "user-guide/example-usage", "user-guide/expressions", "user-guide/faq", "use [...]
\ No newline at end of file
+Search.setIndex({"docnames": ["contributor-guide/architecture", "contributor-guide/communication", "contributor-guide/index", "contributor-guide/quarterly_roadmap", "contributor-guide/roadmap", "contributor-guide/specification/index", "contributor-guide/specification/invariants", "contributor-guide/specification/output-field-name-semantic", "index", "user-guide/cli", "user-guide/configs", "user-guide/dataframe", "user-guide/example-usage", "user-guide/expressions", "user-guide/faq", "use [...]
\ No newline at end of file
diff --git a/user-guide/introduction.html b/user-guide/introduction.html
index 2f2b991c55..4fba7979b4 100644
--- a/user-guide/introduction.html
+++ b/user-guide/introduction.html
@@ -282,6 +282,11 @@
 
 <nav id="bd-toc-nav">
     <ul class="visible nav section-nav flex-column">
+ <li class="toc-h2 nav-item toc-entry">
+  <a class="reference internal nav-link" href="#project-goals">
+   Project Goals
+  </a>
+ </li>
  <li class="toc-h2 nav-item toc-entry">
   <a class="reference internal nav-link" href="#features">
    Features
@@ -369,8 +374,18 @@
 <h1>Introduction<a class="headerlink" href="#introduction" title="Permalink to this heading">¶</a></h1>
 <p>DataFusion is a very fast, extensible query engine for building
 high-quality data-centric systems in <a class="reference external" href="http://rustlang.org">Rust</a>,
-using the <a class="reference external" href="https://arrow.apache.org">Apache Arrow</a> in-memory format.</p>
-<p>DataFusion offers SQL and Dataframe APIs, excellent <a class="reference external" href="https://benchmark.clickhouse.com/">performance</a>, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.</p>
+using the <a class="reference external" href="https://arrow.apache.org">Apache Arrow</a> in-memory format.
+DataFusion is part of the <a class="reference external" href="https://arrow.apache.org/">Apache Arrow</a>
+project.</p>
+<p>DataFusion offers SQL and Dataframe APIs, excellent <a class="reference external" href="https://benchmark.clickhouse.com/">performance</a>, built-in support for CSV, Parquet, JSON, and Avro, <a class="reference external" href="https://github.com/apache/arrow-datafusion-python">python bindings</a>, extensive customization, a great community, and more.</p>
+<section id="project-goals">
+<h2>Project Goals<a class="headerlink" href="#project-goals" title="Permalink to this heading">¶</a></h2>
+<p>DataFusion aims to be the query engine of choice for new, fast
+data-centric systems such as databases, dataframe libraries, machine
+learning and streaming applications by leveraging the unique features
+of <a class="reference external" href="https://www.rust-lang.org/">Rust</a> and <a class="reference external" href="https://arrow.apache.org/">Apache
+Arrow</a>.</p>
+</section>
 <section id="features">
 <h2>Features<a class="headerlink" href="#features" title="Permalink to this heading">¶</a></h2>
 <ul class="simple">
@@ -381,25 +396,33 @@ for custom file formats and non file datasources via the <code class="docutils l
 <li><p>Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
 other query languages, custom plan and execution nodes, optimizer passes, and more.</p></li>
 <li><p>Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-<code class="docutils literal notranslate"><span class="pre">ObjectStore</span></code> trait.</p></li>
+Azure Blob Storage, and Google Cloud Storage (other storage systems are supported via the
+<code class="docutils literal notranslate"><span class="pre">ObjectStore</span></code> trait).</p></li>
 <li><p><a class="reference external" href="https://docs.rs/datafusion/latest">Excellent Documentation</a> and a
 <a class="reference external" href="https://arrow.apache.org/datafusion/contributor-guide/communication.html">welcoming community</a>.</p></li>
-<li><p>A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-automatic join reordering, expression coercion, and more.</p></li>
-<li><p>Permissive Apache 2.0 License, Apache Software Foundation governance</p></li>
-<li><p>Written in <a class="reference external" href="https://www.rust-lang.org/">Rust</a>, a modern system language with development
-productivity similar to Java or Golang, the performance of C++, and
-<a class="reference external" href="https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted">loved by programmers everywhere</a>.</p></li>
-<li><p>Support for <a class="reference external" href="https://substrait.io/">Substrait</a> for query plan serialization, making it easier to integrate DataFusion
-with other projects, and to pass plans across language boundaries.</p></li>
+<li><p>A state-of-the-art query optimizer with expression coercion and
+simplification, projection and filter pushdown, sort- and distribution-aware
+optimizations, automatic join reordering, and more.</p></li>
+<li><p>Permissive Apache 2.0 License, predictable and well understood
+<a class="reference external" href="https://www.apache.org/">Apache Software Foundation</a> governance.</p></li>
+<li><p>Implementation in <a class="reference external" href="https://www.rust-lang.org/">Rust</a>, a modern
+systems language that offers development productivity similar to Java or
+Golang and the performance of C++, and that is <a class="reference external" href="https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted">loved by programmers
+everywhere</a>.</p></li>
+<li><p>Support for <a class="reference external" href="https://substrait.io/">Substrait</a> query plans, to
+easily pass plans across language and system boundaries.</p></li>
 </ul>
 </section>
 <section id="use-cases">
 <h2>Use Cases<a class="headerlink" href="#use-cases" title="Permalink to this heading">¶</a></h2>
 <p>DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:</p>
+building new systems.</p>
+<p>While most current use cases are “analytic” (throughput-oriented), some
+components of DataFusion, such as the plan representations, are
+suitable for “streaming” and “transaction” style (low latency)
+systems.</p>
+<p>Here are some example systems built using DataFusion:</p>
 <ul class="simple">
 <li><p>Specialized Analytical Database systems such as <a class="reference external" href="https://github.com/CeresDB/ceresdb">CeresDB</a> and more general Apache Spark-like systems such as <a class="reference external" href="https://github.com/apache/arrow-ballista">Ballista</a>.</p></li>
 <li><p>New query language engines such as <a class="reference external" href="https://github.com/prql/prql-query">prql-query</a> and accelerators such as <a class="reference external" href="https://vegafusion.io/">VegaFusion</a></p></li>
@@ -407,12 +430,12 @@ building new systems. Here are some examples of systems built using DataFusion:<
 <li><p>SQL support to another library, such as <a class="reference external" href="https://github.com/dask-contrib/dask-sql">dask sql</a></p></li>
 <li><p>Streaming data platforms such as <a class="reference external" href="https://synnada.ai/">Synnada</a></p></li>
 <li><p>Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as <a class="reference external" href="https://github.com/timvw/qv">qv</a></p></li>
-<li><p>A faster Spark runtime replacement <a class="reference external" href="https://github.com/blaze-init/blaze">Blaze</a></p></li>
+<li><p>Native Spark runtime replacements such as <a class="reference external" href="https://github.com/blaze-init/blaze">Blaze</a></p></li>
 </ul>
-<p>By using DataFusion, the projects are freed to focus on their specific
+<p>By using DataFusion, projects are freed to focus on their specific
 features, and avoid reimplementing general (but still necessary)
 features such as an expression representation, standard optimizations,
-execution plans, file format support, etc.</p>
+parallelized streaming execution plans, file format support, etc.</p>
 </section>
 <section id="known-users">
 <h2>Known Users<a class="headerlink" href="#known-users" title="Permalink to this heading">¶</a></h2>
@@ -445,7 +468,7 @@ execution plans, file format support, etc.</p>
 <section id="integrations-and-extensions">
 <h2>Integrations and Extensions<a class="headerlink" href="#integrations-and-extensions" title="Permalink to this heading">¶</a></h2>
 <p>There are a number of community projects that extend DataFusion or
-provide integrations with other systems.</p>
+provide integrations with other systems, some of which are described below:</p>
 <section id="language-bindings">
 <h3>Language Bindings<a class="headerlink" href="#language-bindings" title="Permalink to this heading">¶</a></h3>
 <ul class="simple">
@@ -468,8 +491,8 @@ provide integrations with other systems.</p>
 <ul class="simple">
 <li><p><em>High Performance</em>: Leveraging Rust and Arrow’s memory model, DataFusion is very fast.</p></li>
 <li><p><em>Easy to Connect</em>: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem</p></li>
-<li><p><em>Easy to Embed</em>: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase</p></li>
-<li><p><em>High Quality</em>: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.</p></li>
+<li><p><em>Easy to Embed</em>: Allowing extension at almost any point in its design, and published regularly as a crate on <a class="reference external" href="http://crates.io">crates.io</a>, DataFusion can be integrated and tailored for your specific use case.</p></li>
+<li><p><em>High Quality</em>: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be and is used as the foundation for production systems.</p></li>
 </ul>
 </section>
 </section>