Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/04/25 07:58:07 UTC

[GitHub] [arrow] westonpace opened a new pull request, #35320: GH-32335: [C++] Add design document for Acero

westonpace opened a new pull request, #35320:
URL: https://github.com/apache/arrow/pull/35320

   ### Rationale for this change
   
   The documentation for Acero was incomplete.  This PR refactors the existing documentation and adds several entirely new sections to form a complete design document for Acero.
   
   ### What changes are included in this PR?
   
   Some existing documentation is cleaned up.  Acero documentation is moved into its own folder and broken into several pages.
   
   ### Are these changes tested?
   
   The documentation is built as part of the CI but I wouldn't say it is fully tested.
   
   ### Are there any user-facing changes?
   
   There are no code changes (other than the removal of two legacy methods) but there are many user-facing documentation changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] bkietz commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "bkietz (via GitHub)" <gi...@apache.org>.
bkietz commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1179166745


##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.

Review Comment:
   ```suggestion
   Creating the best execution plan is very hard.  Small details can have a big impact on
   performance.  We do think an optimizer is important but we believe it should be
   implemented independent of acero, hopefully in a composable way through standards such
   as Substrait so that any backend could leverage it.
   ```





[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192738776


##########
docs/source/cpp/acero/substrait.rst:
##########
@@ -0,0 +1,248 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::engine::substrait
+
+.. _acero-substrait:
+
+==========================
+Using Acero with Substrait
+==========================
+
+In order to use Acero you will need to create an execution plan.  This is the
+model that describes the computation you want to apply to your data.  Acero has
+its own internal representation for execution plans but most users should not
+interact with this directly as it will couple their code to Acero.
+
+`Substrait <https://substrait.io>`_ is an open standard for execution plans.
+Acero implements the Substrait "consumer" interface.  This means that Acero can
+accept a Substrait plan and fulfill the plan, loading the requested data and
+applying the desired computation.  By using Substrait plans users can easily
+switch out to a different execution engine at a later time.
+
+Substrait Conformance
+---------------------
+
+Substrait defines a broad set of operators and functions for many different
+situations and it is unlikely that Acero will ever completely satisfy all
+defined Substrait operators and functions.  To help understand what features
+are available the following sections define which features have been currently
+implemented in Acero and any caveats that apply.
+
+Plans
+^^^^^
+
+ * A plan should have a single top-level relation.
+ * The consumer is currently based on version 0.20.0 of Substrait.
+   Any features added that are newer will not be supported.
+ * Due to a breaking change in 0.20.0 any Substrait plan older than 0.20.0
+   will be rejected.
+
+Extensions
+^^^^^^^^^^
+
+ * If a plan contains any extension type variations it will be rejected.
+ * Advanced extensions can be provided by supplying a custom implementation of
+   :class:`arrow::engine::ExtensionProvider`.
+
+Relations (in general)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ * Any relation not explicitly listed below will not be supported
+   and will cause the plan to be rejected.
+
+Read Relations
+^^^^^^^^^^^^^^
+
+ * The ``projection`` property is not supported and plans containing this
+   property will be rejected.
+ * The ``VirtualTable`` and ``ExtensionTable`` read types are not supported.
+   Plans containing these types will be rejected.
+ * Only the parquet and arrow file formats are currently supported.
+ * All URIs must use the ``file`` scheme
+ * ``partition_index``, ``start``, and ``length`` are not supported.  Plans containing
+   non-default values for these properties will be rejected.
+ * The Substrait spec requires that a ``filter`` be completely satisfied by a read
+   relation.  However, Acero only uses a read filter for pushdown projection and
+   it may not be fully satisfied.  Users should generally attach an additional
+   filter relation with the same filter expression after the read relation.
+
+Filter Relations
+^^^^^^^^^^^^^^^^
+
+ * No known caveats
+
+Project Relations
+^^^^^^^^^^^^^^^^^
+
+ * No known caveats
+
+Join Relations
+^^^^^^^^^^^^^^
+
+ * The join type ``JOIN_TYPE_SINGLE`` is not supported and plans containing this
+   will be rejected.
+ * The join expression must be a call to either the ``equal`` or ``is_not_distinct_from``
+   functions.  Both arguments to the call must be direct references.  Only a single
+   join key is supported.
+ * The ``post_join_filter`` property is not supported and will be ignored.
+
+Aggregate Relations
+^^^^^^^^^^^^^^^^^^^
+
+ * At most one grouping set is supported.
+ * Each grouping expression must be a direct reference.
+ * Each measure's arguments must be direct references.
+ * A measure may not have a filter
+ * A measure may not have sorts
+ * A measure's invocation must be AGGREGATION_INVOCATION_ALL or 
+   AGGREGATION_INVOCATION_UNSPECIFIED
+ * A measure's phase must be AGGREGATION_PHASE_INITIAL_TO_RESULT
+
+Expressions (general)
+^^^^^^^^^^^^^^^^^^^^^
+
+ * Various places in the Substrait spec allow for expressions to be used outside
+   of a filter or project relation.  For example, a join expression or an aggregate
+   grouping set.  Acero typically expects these expressions to be direct references.
+   Planners should extract the implicit projection into a formal project relation
+   before delivering the plan to Acero.
+
+Literals
+^^^^^^^^
+
+ * A literal with non-default nullability will cause a plan to be rejected.
+
+Types
+^^^^^
+
+ * Acero does not have full support for non-nullable types and may allow input
+   to have nulls without rejecting it.
+ * The table below shows the mapping between Arrow types and Substrait type
+   classes that are currently supported
+
+.. list-table:: Substrait / Arrow Type Mapping
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Substrait Type
+     - Arrow Type
+     - Caveat
+   * - boolean
+     - boolean
+     - 
+   * - i8
+     - int8
+     - 
+   * - i16
+     - int16
+     - 
+   * - i32
+     - int32
+     - 
+   * - i64
+     - int64
+     - 
+   * - fp32
+     - float32
+     - 
+   * - fp64
+     - float64
+     - 
+   * - string
+     - string
+     - 
+   * - binary
+     - binary
+     - 
+   * - timestamp
+     - timestamp<MICRO,"">
+     - 
+   * - timestamp_tz
+     - timestamp<MICRO,"UTC">
+     - 
+   * - date
+     - date32<DAY>
+     - 
+   * - time
+     - time64<MICRO>
+     - 
+   * - interval_year
+     - 
+     - Not currently supported
+   * - interval_day
+     - 
+     - Not currently supported
+   * - uuid
+     - 
+     - Not currently supported
+   * - FIXEDCHAR<L>
+     - 
+     - Not currently supported
+   * - VARCHAR<L>
+     - 
+     - Not currently supported
+   * - FIXEDBINARY<L>
+     - fixed_size_binary<L>
+     - 
+   * - DECIMAL<P,S>
+     - decimal128<P,S>
+     - 
+   * - STRUCT<T1...TN>
+     - struct<T1...TN>
+     - Arrow struct fields will have no name (empty string)
+   * - NSTRUCT<N:T1...N:Tn>
+     - 
+     - Not currently supported
+   * - LIST<T>
+     - list<T>
+     - 
+   * - MAP<K,V>
+     - map<K,V>
+     - K must not be nullable
+
+Functions
+^^^^^^^^^
+
+ * The following functions have caveats or are not supported at all.  Note that
+   this is not a comprehensive list.  Functions are being added to Substrait at
+   a rapid pace and new functions may be missing.
+
+   * Acero does not support the SATURATE option for overflow
+   * Acero does not support kernels that take more than two arguments
+     for the functions ``and``, ``or``, ``xor``
+   * Acero does not support temporal arithmetic

Review Comment:
   I think that's a fair point.  Let's address in a follow-up.





[GitHub] [arrow] github-actions[bot] commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521453827

   Revision: 62c725d1044911a7e7b0fa1bf0ea84ec76f6f15a
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-838f0ce05e](https://github.com/ursacomputing/crossbow/branches/all?query=actions-838f0ce05e)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-838f0ce05e-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4795782738/jobs/8530764569)|




[GitHub] [arrow] westonpace commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521847291

   C++ User Guide: http://crossbow.voltrondata.com/pr_docs/35320/cpp/streaming_execution.html
   C++ API Reference: http://crossbow.voltrondata.com/pr_docs/35320/cpp/api/acero.html




[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192743082


##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles

Review Comment:
   That's a good idea.  I added the link (it's not in the preview yet because I messed it up the first time but I've committed it)





[GitHub] [arrow] westonpace commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1546064922

   @github-actions crossbow submit preview-docs




[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192745329


##########
docs/source/cpp/acero/expression_ast.svg:
##########


Review Comment:
   I moved to one line and centered the image.





[GitHub] [arrow] westonpace commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1546223157

   @github-actions crossbow submit preview-docs




[GitHub] [arrow] github-actions[bot] commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521337490

   :warning: GitHub issue #32335 **has been automatically assigned in GitHub** to PR creator.




[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192745786


##########
docs/source/cpp/acero/layers.svg:
##########


Review Comment:
   I changed to "not used directly by Acero".  You are correct that it is used indirectly when Acero executes functions.



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles
+streams of data and Arrow Compute handles situations where all the data is in memory.
+
+Arrow Datasets
+^^^^^^^^^^^^^^
+
+The Arrow datasets library provides some basic routines for discovering, scanning, and
+writing collections of files.  The datasets module depends on Acero.  Both scanning and
+writing datasets use Acero.  The scan node and the write node are part of the datasets
+module.  This helps to keep the complexity of file formats and filesystems out of the core
+Acero logic.
+
+Substrait
+^^^^^^^^^
+
+Substrait is a project establishing standards for query plans.  Acero executes query plans
+and generates data.  This makes Acero a Substrait consumer.  There are more details on the
+Substrait capabilities below.
+
+DataFusion / DuckDB / Velox / Etc.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are many columnar data engines emerging. We view this as a good thing and encourage
+projects like Substrait to help allow switching between engines as needed.  We generally
+discourage comparative benchmarks as they are almost inevitably going to be workload-driven
+and rarely manage to capture an apples-vs-apples comparison.  A discussion of the pros and
+cons of each is beyond the scope of this guide.
+
+Relation to Arrow C++
+=====================
+
+The Acero module is part of the Arrow C++ implementation.  It is built as a separate
+module but it depends on core Arrow modules and does not stand alone.  Acero uses
+and extends the capabilities from the core Arrow module and the Arrow compute kernels.
+
+.. image:: layers.svg
+   :alt: A diagram of layers with core on the left, compute in the middle, and acero on the right
+
+The core Arrow library provides containers for buffers and arrays that are laid out according
+to the Arrow columnar format.  With few exceptions the core Arrow library does not examine
+or modify the contents of buffers.  For example, converting a string array from lowercase
+strings to uppercase strings would not be a part of the core Arrow library because that would
+require examining the contents of the array.
+
+The compute module expands on the core library and provides functions which analyze and
+transform data.  The compute module's capabilities are all exposed via a function registry.
+An Arrow "function" accepts zero or more arrays, batches, or tables, and produces an array,
+batch, or table.  In addition, function calls can be combined, along with field references
+and literals, to form an expression (a tree of function calls) which the compute module can
+evaluate.  For example, it can calculate ``x + (y * 3)`` given a table with columns ``x`` and ``y``.
+
+.. image:: expression_ast.svg
+   :alt: A sample expression tree
+
+Acero expands on these capabilities by adding compute operations for streams of data.  For
+example, a project node can apply a compute expression on a stream of batches.  This will
+create a new stream of batches with the result of the expression added as a new column.  These
+nodes can be combined into a graph to form a more complex execution plan.  This is very similar
+to the way functions are combined into a tree to form a complex expression.
+
+.. image:: simple_plan.svg
+   :alt: A simple plan that uses compute expressions
+
+.. note::
+   Acero does not use the :class:`arrow::Table` or :class:`arrow::ChunkedArray` containers
+   from the core Arrow library.  This is because Acero operates on streams of batches and
+   so there is no need for a multi-batch container of data.  This helps to reduce the
+   complexity of Acero and avoids tricky situations that can arise from tables whose
+   columns have different chunk sizes.  Acero will often use :class:`arrow::Datum`
+   which is a variant from the core module that can hold many different types.  Within
+   Acero, a datum will always hold either an :class:`arrow::Array` or a :class:`arrow::Scalar`.
+
+Core Concepts
+=============
+
+ExecNode
+--------
+
+The most basic concept in Acero is the ExecNode.  An ExecNode has zero or more inputs and
+zero or one outputs.  If an ExecNode has zero inputs we call it a source and if an ExecNode
+does not have an output then we call it a sink.  There are many different kinds of nodes and
+each one transforms its inputs in different ways.  For example:
+
+ * A scan node is a source node that reads data from files
+ * An aggregate node accumulates batches of data to compute summary statistics
+ * A filter node removes rows from the data according to a filter expression
+ * A table sink node accumulates data into a table

Review Comment:
   Fixed.
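
For readers following the quoted overview text, the idea of an expression as a tree of function calls, field references, and literals can be sketched with a small stand-alone model. This is not the Arrow compute API: `ToyExpr`, `FieldRef`, `Literal`, `Call`, and `Evaluate` are hypothetical names invented for this illustration, and columns are plain `std::vector<int>`. In Arrow itself the equivalent tree would be built with the helpers in `arrow/compute/expression.h`.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for an Arrow array and a batch of named columns.
using ToyColumn = std::vector<int>;
using ToyColumns = std::map<std::string, ToyColumn>;

struct ToyExpr;
using ToyExprPtr = std::shared_ptr<ToyExpr>;

// One node of the expression tree.
struct ToyExpr {
  enum Kind { kFieldRef, kLiteral, kCall };
  Kind kind;
  std::string name;              // column name (kFieldRef) or function name (kCall)
  int value;                     // constant (kLiteral)
  std::vector<ToyExprPtr> args;  // child expressions (kCall)
};

ToyExprPtr FieldRef(std::string name) {
  return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::kFieldRef, std::move(name), 0, {}});
}

ToyExprPtr Literal(int value) {
  return std::make_shared<ToyExpr>(ToyExpr{ToyExpr::kLiteral, "", value, {}});
}

ToyExprPtr Call(std::string function, std::vector<ToyExprPtr> args) {
  return std::make_shared<ToyExpr>(
      ToyExpr{ToyExpr::kCall, std::move(function), 0, std::move(args)});
}

// Recursively evaluates the tree against a set of columns with num_rows rows.
ToyColumn Evaluate(const ToyExpr& expr, const ToyColumns& columns, std::size_t num_rows) {
  switch (expr.kind) {
    case ToyExpr::kFieldRef:
      return columns.at(expr.name);  // throws if the column is missing
    case ToyExpr::kLiteral:
      return ToyColumn(num_rows, expr.value);  // broadcast the constant
    case ToyExpr::kCall: {
      if (expr.args.size() != 2) throw std::invalid_argument("expected two arguments");
      const bool is_add = expr.name == "add";
      if (!is_add && expr.name != "multiply") {
        throw std::invalid_argument("unknown function: " + expr.name);
      }
      ToyColumn lhs = Evaluate(*expr.args[0], columns, num_rows);
      ToyColumn rhs = Evaluate(*expr.args[1], columns, num_rows);
      ToyColumn out(num_rows);
      for (std::size_t i = 0; i < num_rows; ++i) {
        out[i] = is_add ? lhs.at(i) + rhs.at(i) : lhs.at(i) * rhs.at(i);
      }
      return out;
    }
  }
  throw std::logic_error("unreachable");
}
```

With this model, `x + (y * 3)` would be written as `Call("add", {FieldRef("x"), Call("multiply", {FieldRef("y"), Literal(3)})})` and evaluated one batch at a time, which is roughly the role a project node plays for a stream of batches.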





[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192740991


##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout

Review Comment:
   Fixed.





[GitHub] [arrow] ursabot commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "ursabot (via GitHub)" <gi...@apache.org>.
ursabot commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1556293793

   Benchmark runs are scheduled for baseline = 3f736ff1eb0ca750153fce7c851c2cbab6c75b6e and contender = 8a856c90f9b8fa33de636f98eb33123ccc304b25. 8a856c90f9b8fa33de636f98eb33123ccc304b25 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/1ba946c1ab2645da98e5db3fd129c08a...4ec0af40c406403b95dbc5d39e90c12c/)
   [Finished :arrow_down:0.36% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/0424f39ea42547d0933e9c82f9eddd46...5f6004d72b3a4ba6be69b057ade1259c/)
   [Finished :arrow_down:0.33% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/026b582011b24b02b2d484f9614be2d7...bb42ae27d3f14ab4859115557bbc969e/)
   [Finished :arrow_down:0.51% :arrow_up:0.12%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/52226271e8654363a80bbe209f9784f0...a8db8dbe44894d79bb00a2acd7b653c6/)
   Buildkite builds:
   [Finished] [`8a856c90` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2903)
   [Finished] [`8a856c90` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2939)
   [Finished] [`8a856c90` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2904)
   [Finished] [`8a856c90` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2929)
   [Finished] [`3f736ff1` ec2-t3-xlarge-us-east-2](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ec2-t3-xlarge-us-east-2/builds/2902)
   [Finished] [`3f736ff1` test-mac-arm](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-test-mac-arm/builds/2938)
   [Finished] [`3f736ff1` ursa-i9-9960x](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-i9-9960x/builds/2903)
   [Finished] [`3f736ff1` ursa-thinkcentre-m75q](https://buildkite.com/apache-arrow/arrow-bci-benchmark-on-ursa-thinkcentre-m75q/builds/2928)
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   




[GitHub] [arrow] westonpace commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521449510

   @github-actions crossbow submit preview-docs




[GitHub] [arrow] github-actions[bot] commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521337426

   * Closes: #32335




[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192750841


##########
docs/source/cpp/acero/developer_guide.rst:
##########
@@ -0,0 +1,692 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+=================
+Developer's Guide
+=================
+
+This page goes into more detail on the design of Acero.  It discusses how
+to create custom exec nodes and describes some of the philosophies behind Acero's
+design and implementation.  Finally, it gives an overview of how to extend Acero
+with new behaviors and how this new behavior can be upstreamed into the core Arrow
+repository.
+
+Understanding ExecNode
+======================
+
+ExecNode is an abstract class with several pure virtual methods that control how the node operates:
+
+:func:`ExecNode::StartProducing`
+--------------------------------
+
+This method is called once at the start of the plan.  Most nodes ignore this method (any
+necessary initialization should happen in the constructor or Init).  However, source nodes
+will typically provide a custom implementation.  Source nodes should schedule whatever tasks
+are needed to start reading and providing the data.  Source nodes are usually the primary
+creator of tasks in a plan.
+
+.. note::
+   The ExecPlan operates on a push-based model.  Sources are often pull-based.  For example,
+   your source may be an iterator.  The source node will typically then schedule tasks to pull one
+   item from the source and push it into the plan.

Review Comment:
   Yes.  This is a little imprecise.  I've updated to:
   
   ```
      The ExecPlan operates on a push-based model.  Sources are often pull-based.  For example,
      your source may be an iterator.  The source node will typically then schedule tasks to pull one
      item from the source and push that item into the source's output node (via ``InputReceived``).
      ```
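
The pull-to-push adaptation described in the updated note can be sketched with a small stand-alone model. This is only an illustration under stated assumptions: `ToyBatch`, `ToyOutputNode`, `ToyIterator`, `ToyScheduler`, and `SchedulePull` are hypothetical names invented here, not Acero classes, and the scheduler is single-threaded. Only the method name `InputReceived` is taken from the discussion above.

```cpp
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical stand-in for a batch of rows (Acero itself uses ExecBatch).
using ToyBatch = std::vector<int>;

// Hypothetical downstream node.  It never asks for data; batches are pushed
// into it through InputReceived.
struct ToyOutputNode {
  std::vector<ToyBatch> received;
  void InputReceived(ToyBatch batch) { received.push_back(std::move(batch)); }
};

// Hypothetical pull-based source: the caller asks for the next item.
struct ToyIterator {
  std::queue<ToyBatch> items;
  // Returns false once the iterator is exhausted.
  bool Next(ToyBatch* out) {
    if (items.empty()) return false;
    *out = std::move(items.front());
    items.pop();
    return true;
  }
};

// Toy single-threaded scheduler: tasks are queued now and run later.
struct ToyScheduler {
  std::queue<std::function<void()>> tasks;
  void Schedule(std::function<void()> task) { tasks.push(std::move(task)); }
  void RunAll() {
    while (!tasks.empty()) {
      std::function<void()> task = std::move(tasks.front());
      tasks.pop();
      task();  // a task may schedule further tasks
    }
  }
};

// Schedules a task that pulls one item from the source and pushes it into the
// output node, then schedules the next pull.  The chain stops when the source
// is exhausted.  All three objects must outlive the scheduled tasks.
void SchedulePull(ToyScheduler& scheduler, ToyIterator& source, ToyOutputNode& output) {
  scheduler.Schedule([&scheduler, &source, &output] {
    ToyBatch batch;
    if (!source.Next(&batch)) return;
    output.InputReceived(std::move(batch));
    SchedulePull(scheduler, source, output);
  });
}
```

In this model, calling `SchedulePull` once plays the role of the source node's `StartProducing`: nothing is delivered until the scheduler runs, and each task moves exactly one item from the pull side to the push side.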





[GitHub] [arrow] mapleFU commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "mapleFU (via GitHub)" <gi...@apache.org>.
mapleFU commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192667674


##########
docs/source/cpp/acero/developer_guide.rst:
##########
@@ -0,0 +1,692 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+=================
+Developer's Guide
+=================
+
+This page goes into more detail on the design of Acero.  It discusses how
+to create custom exec nodes and describes some of the philosophies behind Acero's
+design and implementation.  Finally, it gives an overview of how to extend Acero
+with new behaviors and how this new behavior can be upstreamed into the core Arrow
+repository.
+
+Understanding ExecNode
+======================
+
+ExecNode is an abstract class with several pure virtual methods that control how the node operates:
+
+:func:`ExecNode::StartProducing`
+--------------------------------
+
+This method is called once at the start of the plan.  Most nodes ignore this method (any
+necessary initialization should happen in the constructor or Init).  However, source nodes
+will typically provide a custom implementation.  Source nodes should schedule whatever tasks
+are needed to start reading and providing the data.  Source nodes are usually the primary
+creator of tasks in a plan.
+
+.. note::
+   The ExecPlan operates on a push-based model.  Sources are often pull-based.  For example,
+   your source may be an iterator.  The source node will typically then schedule tasks to pull one
+   item from the source and push it into the plan.

Review Comment:
   ```
      item from the source and push it into the plan.
   ```
   
   Should this be `into the node`?





[GitHub] [arrow] github-actions[bot] commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521360231

   Revision: 04e8cc0bf44fecc6a66fa648badab2c65ef6d661
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-22cf5a633d](https://github.com/ursacomputing/crossbow/branches/all?query=actions-22cf5a633d)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-22cf5a633d-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4795191042/jobs/8529446045)|




[GitHub] [arrow] github-actions[bot] commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1546225614

   Revision: 8707d9282cacb52c3529d53ebcc3ce57e4825739
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-62d908a2ed](https://github.com/ursacomputing/crossbow/branches/all?query=actions-62d908a2ed)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-62d908a2ed-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4962569384/jobs/8880788621)|




[GitHub] [arrow] westonpace merged pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace merged PR #35320:
URL: https://github.com/apache/arrow/pull/35320




[GitHub] [arrow] westonpace commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1554602444

   This has been out for a while and I think there is some other ongoing docs work I don't want to conflict with.  I'm going to merge this when CI passes.




[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192746315


##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of :class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those ``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much easier time finding an existing
+Substrait producer.  For example, you could use `ibis-substrait <https://github.com/ibis-project/ibis-substrait>`_
+to easily create Substrait plans from Python expressions.  There are a few different tools that are able to create
+Substrait plans from SQL.  Eventually, we hope that C++ based Substrait producers will emerge.  However, we
+are not aware of any at this time.
+
+Detailed instructions on creating an execution plan from Substrait can be found in
+:ref:`the Substrait page<acero-substrait>`.
+
+Programmatic Plan Creation
+--------------------------
+
+Creating an execution plan programmatically is simpler than creating a plan from Substrait, though it loses some of
+the flexibility and future-proofing guarantees.  The simplest way to create a Declaration is to simply instantiate
+one.  You will need the name of the declaration, a vector of inputs, and an options object.  For example:
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Example)
+  :end-before: (Doc section: Project Example)
+  :linenos:
+  :lineno-match:
+
+The above code creates a scan declaration (which has no inputs) and a project declaration (using the scan as
+input).  This is simple enough but we can make it slightly easier.  If you are creating a linear sequence of
+declarations (like in the above example) then you can also use the :func:`Declaration::Sequence` function.
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Sequence Example)
+  :end-before: (Doc section: Project Sequence Example)
+  :linenos:
+  :lineno-match:
+
+There are many more examples of programmatic plan creation later in this document.
+
+Executing a Plan
+================
+
+There are a number of different methods that can be used to execute a declaration.  Each one provides the
+data in a slightly different form.  Since all of these methods start with ``DeclarationTo...`` this guide
+will often refer to these methods as the ``DeclarationToXyz`` methods.
+
+DeclarationToTable
+------------------
+
+The :func:`DeclarationToTable` method will accumulate all of the results into a single :class:`arrow::Table`.
+This is perhaps the simplest way to collect results from Acero.  The main disadvantage to this approach is
+that it requires accumulating all results into memory.
+
+.. note::
+
+   Acero processes large datasets in small chunks.  This is described in more detail in the developer's guide.
+   As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked
+   differently than your input.  For example, your input might be a large table with a single chunk with 2
+   million rows.  Your output table might then have 64 chunks with 32Ki rows each.  There is a current request
+   to specify the chunk size for the output in `GH-15155 <https://github.com/apache/arrow/issues/15155>`_.
+
+DeclarationToReader
+-------------------
+
+The :func:`DeclarationToReader` method allows you to iteratively consume the results.  It will create an
+:class:`arrow::RecordBatchReader` which you can read from at your leisure.  If you do not read from the
+reader quickly enough then backpressure will be applied and the execution plan will pause.  Closing the
+reader will cancel the running execution plan and the reader's destructor will wait for the execution plan
+to finish whatever it is doing and so it may block.
+
+DeclarationToStatus
+-------------------
+
+The :func:`DeclarationToStatus` method is useful if you want to run the plan but do not actually want to
+consume the results.  For example, this is useful when benchmarking or when the plan has side effects such
+as a dataset write node.  If the plan generates any results then they will be immediately discarded.
+
+Running a Plan Directly
+-----------------------
+
+If one of the ``DeclarationToXyz`` methods is not sufficient for some reason then it is possible to run a plan
+directly.  This should only be needed if you are doing something unique.  For example, if you have created a
+custom sink node or if you need a plan that has multiple outputs.
+
+.. note::
+   In academic literature and many existing systems there is a general assumption that an execution plan has
+   at most one output.  There are some things in Acero, such as the DeclarationToXyz methods, which will expect
+   this.  However, there is nothing in the design that strictly prevents having multiple sink nodes.
+
+Detailed instructions on how to do this are out of scope for this guide but the rough steps are:
+
+1. Create a new :class:`ExecPlan` object.
+2. Add sink nodes to your graph of :class:`Declaration` objects (this is the only time you will need
+   to create declarations for sink nodes)
+3. Use :func:`Declaration::AddToPlan` to add your declaration to your plan (if you have more than one output
+   then you will not be able to use this method and will need to add your nodes one at a time)
+4. Validate the plan with :func:`ExecPlan::Validate`
+5. Start the plan with :func:`ExecPlan::StartProducing`
+6. Wait for the future returned by :func:`ExecPlan::finished` to complete.
+
+Providing Input
+===============
+
+Input data for an exec plan can come from a variety of sources.  It is often read from files stored on some
+kind of filesystem.  It is also common for input to come from in-memory data.  In-memory data is typical, for
+example, in a pandas-like frontend.  Input could also come from network streams like a Flight request.  Acero
+can support all of these cases and can even support unique and custom situations not mentioned here.
+
+There are pre-defined source nodes that cover the most common input scenarios.  These are listed below.  However,
+if your source data is unique then you will need to use the generic ``source`` node.  This node expects you to
+provide an asynchronous stream of batches and is covered in more detail :ref:`here <stream_execution_source_docs>`.
+
+.. _ExecNode List:
+
+Available ``ExecNode`` Implementations
+======================================
+
+The following tables quickly summarize the available operators.
+
+Sources
+-------
+
+These nodes can be used as sources of data
+
+.. list-table:: Source Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``source``
+     - :class:`SourceNodeOptions`
+     - A generic source node that wraps an asynchronous stream of data (:ref:`example <stream_execution_source_docs>`)
+   * - ``table_source``
+     - :class:`TableSourceNodeOptions`
+     - Generates data from an :class:`arrow::Table` (:ref:`example <stream_execution_table_source_docs>`)
+   * - ``record_batch_source``
+     - :class:`RecordBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::RecordBatch`
+   * - ``record_batch_reader_source``
+     - :class:`RecordBatchReaderSourceNodeOptions`
+     - Generates data from an :class:`arrow::RecordBatchReader`
+   * - ``exec_batch_source``
+     - :class:`ExecBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::compute::ExecBatch`
+   * - ``array_vector_source``
+     - :class:`ArrayVectorSourceNodeOptions`
+     - Generates data from an iterator of vectors of :class:`arrow::Array`
+   * - ``scan``
+     - :class:`arrow::dataset::ScanNodeOptions`
+     - Generates data from an :class:`arrow::dataset::Dataset` (requires the datasets module)
+       (:ref:`example <stream_execution_scan_docs>`)
+
+Compute Nodes
+-------------
+
+These nodes perform computations on data and may transform or reshape the data
+
+.. list-table:: Compute Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``filter``
+     - :class:`FilterNodeOptions`
+     - Removes rows that do not match a given filter expression
+   * - ``project``
+     - :class:`ProjectNodeOptions`
+     - Creates new columns by evaluating compute expressions.  Can also drop and reorder columns
+       (:ref:`example <stream_execution_project_docs>`)
+   * - ``aggregate``
+     - :class:`AggregateNodeOptions`
+     - Calculates summary statistics across the entire input stream or on groups of data
+       (:ref:`example <stream_execution_aggregate_docs>`)
+   * - ``pivot_longer``
+     - :class:`PivotLongerNodeOptions`
+     - Reshapes data by converting some columns into additional rows
+
+Arrangement Nodes
+-----------------
+
+These nodes reorder, combine, or slice streams of data
+
+.. list-table:: Arrangement Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``hash_join``
+     - :class:`HashJoinNodeOptions`
+     - Joins two inputs based on common columns (:ref:`example <stream_execution_hashjoin_docs>`)

Review Comment:
   I added asofjoin to the table.  Regrettably we don't have an example yet but I'll defer that for follow-up.





[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192739696


##########
docs/source/cpp/acero/substrait.rst:
##########
@@ -0,0 +1,248 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::engine::substrait
+
+.. _acero-substrait:
+
+==========================
+Using Acero with Substrait
+==========================
+
+In order to use Acero you will need to create an execution plan.  This is the
+model that describes the computation you want to apply to your data.  Acero has
+its own internal representation for execution plans but most users should not
+interact with this directly as it will couple their code to Acero.
+
+`Substrait <https://substrait.io>`_ is an open standard for execution plans.
+Acero implements the Substrait "consumer" interface.  This means that Acero can
+accept a Substrait plan and fulfill the plan, loading the requested data and
+applying the desired computation.  By using Substrait plans users can easily
+switch out to a different execution engine at a later time.
+
+Substrait Conformance
+---------------------
+
+Substrait defines a broad set of operators and functions for many different
+situations and it is unlikely that Acero will ever completely satisfy all
+defined Substrait operators and functions.  To help understand what features
+are available the following sections define which features have been currently
+implemented in Acero and any caveats that apply.
+
+Plans
+^^^^^
+
+ * A plan should have a single top-level relation.
+ * The consumer is currently based on version 0.20.0 of Substrait.
+   Any features added that are newer will not be supported.
+ * Due to a breaking change in 0.20.0 any Substrait plan older than 0.20.0
+   will be rejected.
+
+Extensions
+^^^^^^^^^^
+
+ * If a plan contains any extension type variations it will be rejected.
+ * Advanced extensions can be provided by supplying a custom implementation of
+   :class:`arrow::engine::ExtensionProvider`.
+
+Relations (in general)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ * Any relation not explicitly listed below will not be supported
+   and will cause the plan to be rejected.
+
+Read Relations
+^^^^^^^^^^^^^^
+
+ * The ``projection`` property is not supported and plans containing this
+   property will be rejected.
+ * The ``VirtualTable`` and ``ExtensionTable`` read types are not supported.
+   Plans containing these types will be rejected.
+ * Only the parquet and arrow file formats are currently supported.
+ * All URIs must use the ``file`` scheme
+ * ``partition_index``, ``start``, and ``length`` are not supported.  Plans containing
+   non-default values for these properties will be rejected.
+ * The Substrait spec requires that a ``filter`` be completely satisfied by a read
+   relation.  However, Acero only uses a read filter for pushdown projection and
+   it may not be fully satisfied.  Users should generally attach an additional
+   filter relation with the same filter expression after the read relation.
+
+Filter Relations
+^^^^^^^^^^^^^^^^
+
+ * No known caveats
+
+Project Relations
+^^^^^^^^^^^^^^^^^
+
+ * No known caveats
+
+Join Relations
+^^^^^^^^^^^^^^
+
+ * The join type ``JOIN_TYPE_SINGLE`` is not supported and plans containing this
+   will be rejected.
+ * The join expression must be a call to either the ``equal`` or ``is_not_distinct_from``
+   functions.  Both arguments to the call must be direct references.  Only a single
+   join key is supported.
+ * The ``post_join_filter`` property is not supported and will be ignored.
+
+Aggregate Relations
+^^^^^^^^^^^^^^^^^^^
+
+ * At most one grouping set is supported.
+ * Each grouping expression must be a direct reference.
+ * Each measure's arguments must be direct references.
+ * A measure may not have a filter
+ * A measure may not have sorts
+ * A measure's invocation must be AGGREGATION_INVOCATION_ALL or 
+   AGGREGATION_INVOCATION_UNSPECIFIED
+ * A measure's phase must be AGGREGATION_PHASE_INITIAL_TO_RESULT
+
+Expressions (general)
+^^^^^^^^^^^^^^^^^^^^^
+
+ * Various places in the Substrait spec allow for expressions to be used outside
+   of a filter or project relation.  For example, a join expression or an aggregate
+   grouping set.  Acero typically expects these expressions to be direct references.
+   Planners should extract the implicit projection into a formal project relation
+   before delivering the plan to Acero.
+
+Literals
+^^^^^^^^
+
+ * A literal with non-default nullability will cause a plan to be rejected.
+
+Types
+^^^^^
+
+ * Acero does not have full support for non-nullable types and may allow input
+   to have nulls without rejecting it.
+ * The table below shows the mapping between Arrow types and Substrait type
+   classes that are currently supported
+
+.. list-table:: Substrait / Arrow Type Mapping
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Substrait Type
+     - Arrow Type
+     - Caveat
+   * - boolean
+     - boolean
+     - 
+   * - i8
+     - int8
+     - 
+   * - i16
+     - int16
+     - 
+   * - i32
+     - int32
+     - 
+   * - i64
+     - int64
+     - 
+   * - fp32
+     - float32
+     - 
+   * - fp64
+     - float64
+     - 
+   * - string
+     - string
+     - 
+   * - binary
+     - binary
+     - 
+   * - timestamp
+     - timestamp<MICRO,"">
+     - 
+   * - timestamp_tz
+     - timestamp<MICRO,"UTC">
+     - 
+   * - date
+     - date32<DAY>
+     - 
+   * - time
+     - time64<MICRO>
+     - 
+   * - interval_year
+     - 
+     - Not currently supported
+   * - interval_day
+     - 
+     - Not currently supported
+   * - uuid
+     - 
+     - Not currently supported
+   * - FIXEDCHAR<L>
+     - 
+     - Not currently supported
+   * - VARCHAR<L>
+     - 
+     - Not currently supported
+   * - FIXEDBINARY<L>
+     - fixed_size_binary<L>
+     - 
+   * - DECIMAL<P,S>
+     - decimal128<P,S>
+     - 
+   * - STRUCT<T1...TN>
+     - struct<T1...TN>
+     - Arrow struct fields will have no name (empty string)
+   * - NSTRUCT<N:T1...N:Tn>
+     - 
+     - Not currently supported
+   * - LIST<T>
+     - list<T>
+     - 
+   * - MAP<K,V>
+     - map<K,V>
+     - K must not be nullable
+
+Functions
+^^^^^^^^^
+
+ * The following functions have caveats or are not supported at all.  Note that
+   this is not a comprehensive list.  Functions are being added to Substrait at
+   a rapid pace and new functions may be missing.
+
+   * Acero does not support the SATURATE option for overflow
+   * Acero does not support kernels that take more than two arguments
+     for the functions ``and``, ``or``, ``xor``
+   * Acero does not support temporal arithmetic

Review Comment:
   Ignore the comment above; I replied to the wrong thing.  What I meant to say was...
   
   I think we might actually support temporal arithmetic now.  Also, some of the items in this list were definitely wrong.  This section has always been a bit volatile, so I went ahead and removed it for now (there is already a blanket statement that new functions are added regularly and mappings may not be available immediately).



Review Comment:
   ~~I think that's a fair point.  Let's address in a follow-up.~~





[GitHub] [arrow] westonpace commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1192739068


##########
docs/source/cpp/acero/substrait.rst:
##########
@@ -0,0 +1,248 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::engine::substrait
+
+.. _acero-substrait:
+
+==========================
+Using Acero with Substrait
+==========================
+
+In order to use Acero you will need to create an execution plan.  This is the
+model that describes the computation you want to apply to your data.  Acero has
+its own internal representation for execution plans but most users should not
+interact with this directly as it will couple their code to Acero.
+
+`Substrait <https://substrait.io>`_ is an open standard for execution plans.
+Acero implements the Substrait "consumer" interface.  This means that Acero can
+accept a Substrait plan and fulfill the plan, loading the requested data and
+applying the desired computation.  By using Substrait plans users can easily
+switch out to a different execution engine at a later time.
+
+Substrait Conformance
+---------------------
+
+Substrait defines a broad set of operators and functions for many different
+situations and it is unlikely that Acero will ever completely satisfy all
+defined Substrait operators and functions.  To help understand what features
+are available the following sections define which features have been currently
+implemented in Acero and any caveats that apply.
+
+Plans
+^^^^^
+
+ * A plan should have a single top-level relation.
+ * The consumer is currently based on version 0.20.0 of Substrait.
+   Any features added that are newer will not be supported.
+ * Due to a breaking change in 0.20.0 any Substrait plan older than 0.20.0
+   will be rejected.
+
+Extensions
+^^^^^^^^^^
+
+ * If a plan contains any extension type variations it will be rejected.
+ * Advanced extensions can be provided by supplying a custom implementation of
+   :class:`arrow::engine::ExtensionProvider`.
+
+Relations (in general)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ * Any relation not explicitly listed below will not be supported
+   and will cause the plan to be rejected.
+
+Read Relations
+^^^^^^^^^^^^^^
+
+ * The ``projection`` property is not supported and plans containing this
+   property will be rejected.
+ * The ``VirtualTable`` and ``ExtensionTable`` read types are not supported.
+   Plans containing these types will be rejected.
+ * Only the parquet and arrow file formats are currently supported.
+ * All URIs must use the ``file`` scheme
+ * ``partition_index``, ``start``, and ``length`` are not supported.  Plans containing
+   non-default values for these properties will be rejected.
+ * The Substrait spec requires that a ``filter`` be completely satisfied by a read
+   relation.  However, Acero only uses a read filter for pushdown projection and
+   it may not be fully satisfied.  Users should generally attach an additional
+   filter relation with the same filter expression after the read relation.

Review Comment:
   I think that's a fair point.  Let's address in a follow-up.





[GitHub] [arrow] github-actions[bot] commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "github-actions[bot] (via GitHub)" <gi...@apache.org>.
github-actions[bot] commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1546067513

   Revision: e54a5463a3b6ed17b9a2916518bb051b55a46a97
   
   Submitted crossbow builds: [ursacomputing/crossbow @ actions-cb2af88888](https://github.com/ursacomputing/crossbow/branches/all?query=actions-cb2af88888)
   
   |Task|Status|
   |----|------|
   |preview-docs|[![Github Actions](https://github.com/ursacomputing/crossbow/workflows/Crossbow/badge.svg?branch=actions-cb2af88888-github-preview-docs)](https://github.com/ursacomputing/crossbow/actions/runs/4961502342/jobs/8878359495)|




[GitHub] [arrow] westonpace commented on pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "westonpace (via GitHub)" <gi...@apache.org>.
westonpace commented on PR #35320:
URL: https://github.com/apache/arrow/pull/35320#issuecomment-1521356554

   @github-actions crossbow submit preview-docs




[GitHub] [arrow] edponce commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "edponce (via GitHub)" <gi...@apache.org>.
edponce commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1177298441


##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does except to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,

Review Comment:
   except -> expect



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does except to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles
+streams of data and Arrow Compute handles situations where all the data is in memory.
+
+Arrow Datasets
+^^^^^^^^^^^^^^
+
+The Arrow datasets library provides some basic routines for discovering, scanning, and
+writing collections of files.  The datasets module depends on Acero.  Both scanning and
+writing datasets use Acero.  The scan node and the write node are part of the datasets
+module.  This helps to keep the complexity of file formats and filesystems out of the core
+Acero logic.
+
+Substrait
+^^^^^^^^^
+
+Substrait is a project establishing standards for query plans.  Acero executes query plans
+and generates data.  This makes Acero a Substrait consumer.  There are more details on the
+Substrait capabilities below.
+
+Datafusion / DuckDb / Velox / Etc.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are many columnar data engines emerging. We view this as a good thing and encourage
+projects like Substrait to help allow switching between engines as needed.  We generally
+discourage comparative benchmarks as they are almost inevitably going to be workload-driven
+and rarely manage to capture an apples-vs-apples comparison.  Discussions of the pros and
+cons of each is beyond the scope of this guide.
+
+Relation to Arrow C++
+=====================
+
+The Acero module is part of the Arrow C++ implementation.  It is built as a separate
+module but it depends on core Arrow modules and does not stand alone.  Acero uses
+and extends the capabilities from the core Arrow module and the Arrow compute kernels.
+
+.. image:: layers.svg
+   :alt: A diagram of layers with core on the left, compute in the middle, and acero on the right
+
+The core Arrow library provides containers for buffers and arrays that are laid out according
+to the Arrow columnar format.  With few exceptions the core Arrow library does not examine
+or modify the contents of buffers.  For example, converting a string array from lowercase
+strings to uppercase strings would not be a part of the core Arrow library because that would
+require examining the contents of the array.
+
+The compute module expands on the core library and provides functions which analyze and
+transform data.  The compute module's capabilites are all exposed via a function registry.
+An Arrow "function" accepts zero or more arrays, batches, or tables, and produces an array,
+batch, or table.  In addition, function calls can be combined, along with field references
+and literals, to form an expression (a tree of function calls) which the compute module can
+evaluate.  For example, calculating ``x + (y * 3)`` given a table with columns ``x`` and ``y``.
+
+.. image:: expression_ast.svg
+   :alt: A sample expression tree
+
+Acero expands on these capabilities by adding compute operations for streams of data.  For
+example, a project node can apply a compute expression on a stream of batches.  This will
+create a new stream of batches with the result of the expression added as a new column.  These
+nodes can be combined into a graph to form a more complex execution plan.  This is very similar
+to the way functions are combined into a tree to form a complex expression.
+
+.. image:: simple_plan.svg
+   :alt: A simple plan that uses compute expressions
+
+.. note::
+   Acero does not use the :class:`arrow::Table` or :class:`arrow::ChunkedArray` containers
+   from the core Arrow library.  This is because Acero operates on streams of batches and
+   so there is no need for a multi-batch container of data.  This helps to reduce the
+   complexity of Acero and avoids tricky situations that can arise from tables whose
+   columns have different chunk sizes.  Acero will often use :class:`arrow::Datum`
+   which is a variant from the core module that can hold many different types.  Within
+   Acero, a datum will always hold either an :class:`arrow::Array` or a :class:`arrow::Scalar`.
+
+Core Concepts
+=============
+
+ExecNode
+--------
+
+The most basic concept in Acero is the ExecNode.  An ExecNode has zero or more inputs and
+zero or one outputs.  If an ExecNode has zero inputs we call it a source and if an ExecNode
+does not have an output then we call it a sink.  There are many different kinds of nodes and
+each one transforms is inputs in different ways.  For example:
+
+ * A scan node is a source node that reads data from files
+ * An aggregate node accumulates batches of data to compute summary statistics
+ * A filter node removes rows from the data according to a filter expression
+ * A table sink node accumulates data into a table
+
+.. note::
+   A full list of the available compute modules is included in the :ref:`user's guide<ExecNode List>`
+
+.. _exec-batch:
+
+ExecBatch
+---------
+
+Batches of data are represented by the ExecBatch class.  An ExecBatch is a 2D structure that
+is very similar to a RecordBatch.  It can have zero or more columns and all of the columns
+must have the same length.  There are a few key differences from ExecBatch:
+
+.. figure:: rb_vs_eb.svg
+   
+   Both the record batch and the exec batch have strong ownership of the arrays & buffers
+
+* An `ExecBatch` does not have a schema.  This is because an `ExecBatch` is assumed to be
+  part of a stream of batches and the stream is assumed to have a consistent schema.  So
+  the schema for an `ExecBatch` is typically stored in the ExecNode.
+* Columns in an `ExecBatch` are either an `Array` or a `Scalar`.  When a column is a `Scalar`
+  this means that the column has a single value for every row in the batch.  An `ExecBatch`
+  also has a length property which describes how many rows are in a batch.  So another way to
+  view a `Scalar` is a constant array with `length` elements.
+* An `ExecBatch` contains additional information used by the exec plan.  For example, an
+  `index` can be used to describe a batche's position in an ordered stream.  We expect 
+  that `ExecBatch` will also evolve to contain additional fields such as a selection vector.
+
+.. figure:: scalar_vs_array.svg
+
+   There are four different ways to represent the given batch of data using different combinations
+   of arrays and scalars.  All four exec batches should be considered semantically equivalent.
+
+Converting from a record batch to an exec batch is is always zero copy.  Both RecordBatch and ExecBatch

Review Comment:
   is ~~is~~



##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of :class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those ``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much easier time finding an existing
+Susbstrait producer.  For example, you could use `ibis-substrait <https://github.com/ibis-project/ibis-substrait>`_

Review Comment:
   Susbstrait -> Substrait



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+Converting from a record batch to an exec batch is is always zero copy.  Both RecordBatch and ExecBatch
+refer to the exact same underlying arrays.  Converting from an exec batch to a record batch is
+only zero copy if there are no scalars in the exec batch.
+
+.. note::
+   Both Acero and the compute module have "lightweight" versions of batches and arrays.
+   In the compute module these are called `BatchSpan`, `ArraySpan`, and `BufferSpan`.  In
+   Acero the concept is called `KeyColumnArray`.  These types were developed concurrently
+   and serve the same purpose.  They aim to provide an array container that can be completely
+   stack allocated (provided the data type is non-nested) in order to avoid heap allocation
+   overhead.  Ideally these two concepts will be merged someday.
+
+ExecPlan
+--------
+
+An ExecPlan represents a graph of ExecNode objects.  A valid ExecPlan must always have at
+least one source node but it does not technically need to have a sink node.  The ExecPlan contains
+resources shared by all of the nodes and has utility functions to control starting and stopping
+execution of the nodes.  Both ExecPlan and ExecNode are tied to the lifecycle of a single execution.
+They have state and are not expected to be restartable.
+
+.. warning::
+   The structures within Acero, including `ExecBatch`, are still experimental.  The `ExecBatch`
+   class should not be used outside of Acero.  Instead, an `ExecBatch` should be converted to
+   a more standard structure such as a `RecordBatch`.
+
+   Similarly, an ExecPlan is an internal concept.  Users creating plans should be using Declaration
+   objects.  APIs for consuming and executing plans should abstract away the details of the underlying
+   plan and not expose the object itself.
+
+Declaration
+-----------
+
+A Declaration is a blueprint for an ExecNode.  Declarations can be combined into a graph to
+form the blueprint for an ExecPlan.  A Declaration describes the computation that needs to be
+done but is not actually responsible for carrying out the computation.  In this way, a Declaration is
+analgous to an expression.  It is expected that Declarations will need to be converted to and from

Review Comment:
   analgous -> analogous



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+ExecNode
+--------
+
+The most basic concept in Acero is the ExecNode.  An ExecNode has zero or more inputs and
+zero or one outputs.  If an ExecNode has zero inputs we call it a source and if an ExecNode
+does not have an output then we call it a sink.  There are many different kinds of nodes and
+each one transforms is inputs in different ways.  For example:

Review Comment:
   is -> its





[GitHub] [arrow] jorisvandenbossche commented on a diff in pull request #35320: GH-32335: [C++][Docs] Add design document for Acero

Posted by "jorisvandenbossche (via GitHub)" <gi...@apache.org>.
jorisvandenbossche commented on code in PR #35320:
URL: https://github.com/apache/arrow/pull/35320#discussion_r1177813084


##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout

Review Comment:
   ```suggestion
   * Merge two streams of data using a common column
   * Create additional columns by evaluating expressions against the existing columns
   * Consume a stream of data by writing it to disk in a partitioned layout
   ```
   
   (lists don't need to be indented in RST, otherwise it's interpreted as both quoted and a list)
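
   As a minimal sketch of that point (written for illustration, not taken from the PR): in reStructuredText a bullet list that is flush with the surrounding paragraph is a plain list, while the same list indented by one space is parsed as a block quote that contains a list.

   ```rst
   .. Flush with the paragraph: parsed as a plain bullet list

   For example, a plan might:

   * Merge two streams of data using a common column
   * Create additional columns by evaluating expressions

   .. Indented by one space: parsed as a block quote wrapping a bullet list

   For example, a plan might:

    * Merge two streams of data using a common column
    * Create additional columns by evaluating expressions
   ```

   With most themes the second form renders with extra left margin, which is usually not what was intended.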



##########
docs/source/cpp/acero/layers.svg:
##########


Review Comment:
   This figure says that ExecSpan is not used by Acero, but I would have expected it to be used in practice (although maybe indirectly), since the implementations of the individual compute functions are based on ExecSpans. So even if you pass it an ExecBatch or Arrays, under the hood it might convert those to spans? (I'm not familiar with the details here, though.)



##########
docs/source/cpp/acero/expression_ast.svg:
##########


Review Comment:
   Very minor nit, but in case you would still be updating those figures: the sentence at the bottom ("Expressions are a part of the compute module") could fit on a single line (and it's also not centered right now)



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles

Review Comment:
   It's not super clear to me what "the overview" refers to (this whole page is called overview?)



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles
+streams of data and Arrow Compute handles situations where all the data is in memory.
+
+Arrow Datasets
+^^^^^^^^^^^^^^
+
+The Arrow datasets library provides some basic routines for discovering, scanning, and
+writing collections of files.  The datasets module depends on Acero.  Both scanning and
+writing datasets uses Acero.  The scan node and the write node are part of the datasets
+module.  This helps to keep the complexity of file formats and filesystems out of the core
+Acero logic.
+
+Substrait
+^^^^^^^^^
+
+Substrait is a project establishing standards for query plans.  Acero executes query plans
+and generates data.  This makes Acero a Substrait consumer.  There are more details on the
+Substrait capabilities below.
+
+Datafusion / DuckDb / Velox / Etc.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are many columnar data engines emerging. We view this as a good thing and encourage
+projects like Substrait to help allow switching between engines as needed.  We generally
+discourage comparative benchmarks as they are almost inevitably going to be workload-driven
+and rarely manage to capture an apples-vs-apples comparison.  Discussions of the pros and
+cons of each is beyond the scope of this guide.
+
+Relation to Arrow C++
+=====================
+
+The Acero module is part of the Arrow C++ implementation.  It is built as a separate
+module but it depends on core Arrow modules and does not stand alone.  Acero uses
+and extends the capabilities from the core Arrow module and the Arrow compute kernels.
+
+.. image:: layers.svg
+   :alt: A diagram of layers with core on the left, compute in the middle, and acero on the right
+
+The core Arrow library provides containers for buffers and arrays that are laid out according
+to the Arrow columnar format.  With few exceptions the core Arrow library does not examine
+or modify the contents of buffers.  For example, converting a string array from lowercase
+strings to uppercase strings would not be a part of the core Arrow library because that would
+require examining the contents of the array.
+
+The compute module expands on the core library and provides functions which analyze and
+transform data.  The compute module's capabilities are all exposed via a function registry.
+An Arrow "function" accepts zero or more arrays, batches, or tables, and produces an array,
+batch, or table.  In addition, function calls can be combined, along with field references
+and literals, to form an expression (a tree of function calls) which the compute module can
+evaluate.  For example, calculating ``x + (y * 3)`` given a table with columns ``x`` and ``y``.
+
+.. image:: expression_ast.svg
+   :alt: A sample expression tree
+
+Acero expands on these capabilities by adding compute operations for streams of data.  For
+example, a project node can apply a compute expression on a stream of batches.  This will
+create a new stream of batches with the result of the expression added as a new column.  These
+nodes can be combined into a graph to form a more complex execution plan.  This is very similar
+to the way functions are combined into a tree to form a complex expression.
+
+.. image:: simple_plan.svg
+   :alt: A simple plan that uses compute expressions
+
+.. note::
+   Acero does not use the :class:`arrow::Table` or :class:`arrow::ChunkedArray` containers
+   from the core Arrow library.  This is because Acero operates on streams of batches and
+   so there is no need for a multi-batch container of data.  This helps to reduce the
+   complexity of Acero and avoids tricky situations that can arise from tables whose
+   columns have different chunk sizes.  Acero will often use :class:`arrow::Datum`
+   which is a variant from the core module that can hold many different types.  Within
+   Acero, a datum will always hold either an :class:`arrow::Array` or a :class:`arrow::Scalar`.
+
+Core Concepts
+=============
+
+ExecNode
+--------
+
+The most basic concept in Acero is the ExecNode.  An ExecNode has zero or more inputs and
+zero or one outputs.  If an ExecNode has zero inputs we call it a source and if an ExecNode
+does not have an output then we call it a sink.  There are many different kinds of nodes and
+each one transforms is inputs in different ways.  For example:
+
+ * A scan node is a source node that reads data from files
+ * An aggregate node accumulates batches of data to compute summary statistics
+ * A filter node removes rows from the data according to a filter expression
+ * A table sink node accumulates data into a table

Review Comment:
   ```suggestion
   * A scan node is a source node that reads data from files
   * An aggregate node accumulates batches of data to compute summary statistics
   * A filter node removes rows from the data according to a filter expression
   * A table sink node accumulates data into a table
   ```



##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of :class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those ``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much easier time finding an existing
+Substrait producer.  For example, you could use `ibis-substrait <https://github.com/ibis-project/ibis-substrait>`_
+to easily create Substrait plans from python expressions.  There are a few different tools that are able to create
+Substrait plans from SQL.  Eventually, we hope that C++ based Substrait producers will emerge.  However, we
+are not aware of any at this time.
+
+Detailed instructions on creating an execution plan from Substrait can be found in
+:ref:`the Substrait page<acero-substrait>`
+
+Programmatic Plan Creation
+--------------------------
+
+Creating an execution plan programmatically is simpler than creating a plan from Substrait, though loses some of
+the flexibility and future-proofing guarantees.  The simplest way to create a Declaration is to simply instantiate
+one.  You will need the name of the declaration, a vector of inputs, and an options object.  For example:
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Example)
+  :end-before: (Doc section: Project Example)
+  :linenos:
+  :lineno-match:
+
+The above code creates a scan declaration (which has no inputs) and a project declaration (using the scan as
+input).  This is simple enough but we can make it slightly easier.  If you are creating a linear sequence of
+declarations (like in the above example) then you can also use the :func:`Declaration::Sequence` function.
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Sequence Example)
+  :end-before: (Doc section: Project Sequence Example)
+  :linenos:
+  :lineno-match:
+
+There are many more examples of programmatic plan creation later in this document.
+
+Executing a Plan
+================
+
+There are a number of different methods that can be used to execute a declaration.  Each one provides the
+data in a slightly different form.  Since all of these methods start with ``DeclarationTo...`` this guide
+will often refer to these methods as the ``DeclarationToXyz`` methods.
+
+DeclarationToTable
+------------------
+
+The :func:`DeclarationToTable` method will accumulate all of the results into a single :class:`arrow::Table`.
+This is perhaps the simplest way to collect results from Acero.  The main disadvantage to this approach is
+that it requires accumulating all results into memory.
+
+.. note::
+
+   Acero processes large datasets in small chunks.  This is described in more detail in the developer's guide.
+   As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked
+   differently than your input.  For example, your input might be a large table with a single chunk with 2
+   million rows.  Your output table might then have 64 chunks with 32Ki rows each.  There is a current request
+   to specify the chunk size for the output in `GH-15155 <https://github.com/apache/arrow/issues/15155>`_.
+
+DeclarationToReader
+-------------------
+
+The :func:`DeclarationToReader` method allows you to iteratively consume the results.  It will create an
+:class:`arrow::RecordBatchReader` which you can read from at your leisure.  If you do not read from the
+reader quickly enough then backpressure will be applied and the execution plan will pause.  Closing the
+reader will cancel the running execution plan and the reader's destructor will wait for the execution plan
+to finish whatever it is doing and so it may block.
+
+DeclarationToStatus
+-------------------
+
+The :func:`DeclarationToStatus` method is useful if you want to run the plan but do not actually want to
+consume the results.  For example, this is useful when benchmarking or when the plan has side effects such
+as a dataset write node.  If the plan generates any results then they will be immediately discarded.
+
+Running a Plan Directly
+-----------------------
+
+If one of the ``DeclarationToXyz`` methods is not sufficient for some reason then it is possible to run a plan
+directly.  This should only be needed if you are doing something unique.  For example, if you have created a
+custom sink node or if you need a plan that has multiple outputs.
+
+.. note::
+   In academic literature and many existing systems there is a general assumption that an execution plan has
+   at most one output.  There are some things in Acero, such as the DeclarationToXyz methods, which will expect
+   this.  However, there is nothing in the design that strictly prevents having multiple sink nodes.
+
+Detailed instructions on how to do this are out of scope for this guide but the rough steps are:
+
+1. Create a new :class:`ExecPlan` object.
+2. Add sink nodes to your graph of :class:`Declaration` objects (this is the only type you will need
+   to create declarations for sink nodes)
+3. Use :func:`Declaration::AddToPlan` to add your declaration to your plan (if you have more than one output
+   then you will not be able to use this method and will need to add your nodes one at a time)
+4. Validate the plan with :func:`ExecPlan::Validate`
+5. Start the plan with :func:`ExecPlan::StartProducing`
+6. Wait for the future returned by :func:`ExecPlan::finished` to complete.
+
+Providing Input
+===============
+
+Input data for an exec plan can come from a variety of sources.  It is often read from files stored on some
+kind of filesystem.  It is also common for input to come from in-memory data.  In-memory data is typical, for
+example, in a pandas-like frontend.  Input could also come from network streams like a Flight request.  Acero
+can support all of these cases and can even support unique and custom situations not mentioned here.
+
+There are pre-defined source nodes that cover the most common input scenarios.  These are listed below.  However,
+if your source data is unique then you will need to use the generic ``source`` node.  This node expects you to
+provide an asynchronous stream of batches and is covered in more detail :ref:`here <stream_execution_source_docs>`.
+
+.. _ExecNode List:
+
+Available ``ExecNode`` Implementations
+======================================
+
+The following tables quickly summarize the available operators.
+
+Sources
+-------
+
+These nodes can be used as sources of data
+
+.. list-table:: Source Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``source``
+     - :class:`SourceNodeOptions`
+     - A generic source node that wraps an asynchronous stream of data (:ref:`example <stream_execution_source_docs>`)
+   * - ``table_source``
+     - :class:`TableSourceNodeOptions`
+     - Generates data from an :class:`arrow::Table` (:ref:`example <stream_execution_table_source_docs>`)
+   * - ``record_batch_source``
+     - :class:`RecordBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::RecordBatch`
+   * - ``record_batch_reader_source``
+     - :class:`RecordBatchReaderSourceNodeOptions`
+     - Generates data from an :class:`arrow::RecordBatchReader`
+   * - ``exec_batch_source``
+     - :class:`ExecBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::compute::ExecBatch`
+   * - ``array_vector_source``
+     - :class:`ArrayVectorSourceNodeOptions`
+     - Generates data from an iterator of vectors of :class:`arrow::Array`
+   * - ``scan``
+     - :class:`arrow::dataset::ScanNodeOptions`
+     - Generates data from an `arrow::dataset::Dataset` (requires the datasets module)
+       (:ref:`example <stream_execution_scan_docs>`)
+
+Compute Nodes
+-------------
+
+These nodes perform computations on data and may transform or reshape the data
+
+.. list-table:: Compute Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``filter``
+     - :class:`FilterNodeOptions`
+     - Removes rows that do not match a given filter expression

Review Comment:
   ```suggestion
        - Removes rows that do not match a given filter expression
          (:ref:`example <stream_execution_filter_docs>`)
   ```



##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of :class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those ``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much easier time finding an existing
+Substrait producer.  For example, you could use `ibis-substrait <https://github.com/ibis-project/ibis-substrait>`_
+to easily create Substrait plans from python expressions.  There are a few different tools that are able to create
+Substrait plans from SQL.  Eventually, we hope that C++ based Substrait producers will emerge.  However, we
+are not aware of any at this time.
+
+Detailed instructions on creating an execution plan from Substrait can be found in
+:ref:`the Substrait page<acero-substrait>`
+
+Programmatic Plan Creation
+--------------------------
+
+Creating an execution plan programmatically is simpler than creating a plan from Substrait, though loses some of
+the flexibility and future-proofing guarantees.  The simplest way to create a Declaration is to simply instantiate
+one.  You will need the name of the declaration, a vector of inputs, and an options object.  For example:
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Example)
+  :end-before: (Doc section: Project Example)
+  :linenos:
+  :lineno-match:
+
+The above code creates a scan declaration (which has no inputs) and a project declaration (using the scan as
+input).  This is simple enough but we can make it slightly easier.  If you are creating a linear sequence of
+declarations (like in the above example) then you can also use the :func:`Declaration::Sequence` function.
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Sequence Example)
+  :end-before: (Doc section: Project Sequence Example)
+  :linenos:
+  :lineno-match:
+
+There are many more examples of programmatic plan creation later in this document.
+
+Executing a Plan
+================
+
+There are a number of different methods that can be used to execute a declaration.  Each one provides the
+data in a slightly different form.  Since all of these methods start with ``DeclarationTo...`` this guide
+will often refer to these methods as the ``DeclarationToXyz`` methods.
+
+DeclarationToTable
+------------------
+
+The :func:`DeclarationToTable` method will accumulate all of the results into a single :class:`arrow::Table`.
+This is perhaps the simplest way to collect results from Acero.  The main disadvantage to this approach is
+that it requires accumulating all results into memory.
+
+.. note::
+
+   Acero processes large datasets in small chunks.  This is described in more detail in the developer's guide.
+   As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked
+   differently than your input.  For example, your input might be a large table with a single chunk with 2
+   million rows.  Your output table might then have 64 chunks with 32Ki rows each.  There is a current request
+   to specify the chunk size for the output in `GH-15155 <https://github.com/apache/arrow/issues/15155>`_.
+
+DeclarationToReader
+-------------------
+
+The :func:`DeclarationToReader` method allows you to iteratively consume the results.  It will create an
+:class:`arrow::RecordBatchReader` which you can read from at your liesure.  If you do not read from the

Review Comment:
   ```suggestion
   :class:`arrow::RecordBatchReader` which you can read from at your leisure.  If you do not read from the
   ```
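As a side note on the `DeclarationToTable` chunking note quoted in the hunk above: the "64 chunks with 32Ki rows each" figure matches an input of exactly 2Mi (2,097,152) rows. The sketch below works through that arithmetic. It assumes output batches are filled to a fixed maximum of 32Ki rows, which is a simplification for illustration; the batch sizes Acero actually emits depend on the nodes in the plan. `NumChunks` and `LastChunkRows` are hypothetical helper names, not Arrow functions.

```cpp
#include <cstdint>

// Number of batches needed to hold `num_rows` rows when every batch holds at
// most `batch_size` rows (ceiling division). Hypothetical helper, not part of
// Arrow; assumes batch_size > 0.
constexpr std::int64_t NumChunks(std::int64_t num_rows, std::int64_t batch_size) {
  return (num_rows + batch_size - 1) / batch_size;
}

// Number of rows in the final (possibly partial) batch.
constexpr std::int64_t LastChunkRows(std::int64_t num_rows, std::int64_t batch_size) {
  return num_rows == 0
             ? 0
             : num_rows - (NumChunks(num_rows, batch_size) - 1) * batch_size;
}
```

Under these assumptions `NumChunks(2 * 1024 * 1024, 32 * 1024)` is 64, while a literal 2,000,000 rows gives `NumChunks(2000000, 32 * 1024) == 62` with a final chunk of 1,152 rows.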



##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of :class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those ``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much easier time finding an existing
+Substrait producer.  For example, you could use `ibis-substrait <https://github.com/ibis-project/ibis-substrait>`_
+to easily create Substrait plans from Python expressions.  There are a few different tools that are able to create
+Substrait plans from SQL.  Eventually, we hope that C++ based Substrait producers will emerge.  However, we
+are not aware of any at this time.
+
+Detailed instructions on creating an execution plan from Substrait can be found in
+:ref:`the Substrait page<acero-substrait>`.
+
+Programmatic Plan Creation
+--------------------------
+
+Creating an execution plan programmatically is simpler than creating a plan from Substrait, though it loses some of
+the flexibility and future-proofing guarantees.  The simplest way to create a Declaration is to simply instantiate
+one.  You will need the name of the declaration, a vector of inputs, and an options object.  For example:
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Example)
+  :end-before: (Doc section: Project Example)
+  :linenos:
+  :lineno-match:
+
+The above code creates a scan declaration (which has no inputs) and a project declaration (using the scan as
+input).  This is simple enough but we can make it slightly easier.  If you are creating a linear sequence of
+declarations (like in the above example) then you can also use the :func:`Declaration::Sequence` function.
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Sequence Example)
+  :end-before: (Doc section: Project Sequence Example)
+  :linenos:
+  :lineno-match:
+
+There are many more examples of programmatic plan creation later in this document.
+
+Executing a Plan
+================
+
+There are a number of different methods that can be used to execute a declaration.  Each one provides the
+data in a slightly different form.  Since all of these methods start with ``DeclarationTo...`` this guide
+will often refer to these methods as the ``DeclarationToXyz`` methods.
+
+DeclarationToTable
+------------------
+
+The :func:`DeclarationToTable` method will accumulate all of the results into a single :class:`arrow::Table`.
+This is perhaps the simplest way to collect results from Acero.  The main disadvantage to this approach is
+that it requires accumulating all results into memory.
+
+.. note::
+
+   Acero processes large datasets in small chunks.  This is described in more detail in the developer's guide.
+   As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked
+   differently than your input.  For example, your input might be a large table with a single chunk with 2
+   million rows.  Your output table might then have 64 chunks with 32Ki rows each.  There is a current request
+   to specify the chunk size for the output in `GH-15155 <https://github.com/apache/arrow/issues/15155>`_.
+
+DeclarationToReader
+-------------------
+
+The :func:`DeclarationToReader` method allows you to iteratively consume the results.  It will create an
+:class:`arrow::RecordBatchReader` which you can read from at your leisure.  If you do not read from the
+reader quickly enough then backpressure will be applied and the execution plan will pause.  Closing the
+reader will cancel the running execution plan and the reader's destructor will wait for the execution plan
+to finish whatever it is doing and so it may block.
+
+DeclarationToStatus
+-------------------
+
+The :func:`DeclarationToStatus` method is useful if you want to run the plan but do not actually want to
+consume the results.  For example, this is useful when benchmarking or when the plan has side effects such
+as a dataset write node.  If the plan generates any results then they will be immediately discarded.
+
+Running a Plan Directly
+-----------------------
+
+If one of the ``DeclarationToXyz`` methods is not sufficient for some reason then it is possible to run a plan
+directly.  This should only be needed if you are doing something unique.  For example, if you have created a
+custom sink node or if you need a plan that has multiple outputs.
+
+.. note::
+   In academic literature and many existing systems there is a general assumption that an execution plan has
+   at most one output.  There are some things in Acero, such as the DeclarationToXyz methods, which will expect
+   this.  However, there is nothing in the design that strictly prevents having multiple sink nodes.
+
+Detailed instructions on how to do this are out of scope for this guide but the rough steps are:
+
+1. Create a new :class:`ExecPlan` object.
+2. Add sink nodes to your graph of :class:`Declaration` objects (this is the only time you will need
+   to create declarations for sink nodes)
+3. Use :func:`Declaration::AddToPlan` to add your declaration to your plan (if you have more than one output
+   then you will not be able to use this method and will need to add your nodes one at a time)
+4. Validate the plan with :func:`ExecPlan::Validate`
+5. Start the plan with :func:`ExecPlan::StartProducing`
+6. Wait for the future returned by :func:`ExecPlan::finished` to complete.
+
+Providing Input
+===============
+
+Input data for an exec plan can come from a variety of sources.  It is often read from files stored on some
+kind of filesystem.  It is also common for input to come from in-memory data.  In-memory data is typical, for
+example, in a pandas-like frontend.  Input could also come from network streams like a Flight request.  Acero
+can support all of these cases and can even support unique and custom situations not mentioned here.
+
+There are pre-defined source nodes that cover the most common input scenarios.  These are listed below.  However,
+if your source data is unique then you will need to use the generic ``source`` node.  This node expects you to
+provide an asynchronous stream of batches and is covered in more detail :ref:`here <stream_execution_source_docs>`.
+
+.. _ExecNode List:
+
+Available ``ExecNode`` Implementations
+======================================
+
+The following tables quickly summarize the available operators.
+
+Sources
+-------
+
+These nodes can be used as sources of data.
+
+.. list-table:: Source Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``source``
+     - :class:`SourceNodeOptions`
+     - A generic source node that wraps an asynchronous stream of data (:ref:`example <stream_execution_source_docs>`)
+   * - ``table_source``
+     - :class:`TableSourceNodeOptions`
+     - Generates data from an :class:`arrow::Table` (:ref:`example <stream_execution_table_source_docs>`)
+   * - ``record_batch_source``
+     - :class:`RecordBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::RecordBatch`
+   * - ``record_batch_reader_source``
+     - :class:`RecordBatchReaderSourceNodeOptions`
+     - Generates data from an :class:`arrow::RecordBatchReader`
+   * - ``exec_batch_source``
+     - :class:`ExecBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::compute::ExecBatch`
+   * - ``array_vector_source``
+     - :class:`ArrayVectorSourceNodeOptions`
+     - Generates data from an iterator of vectors of :class:`arrow::Array`
+   * - ``scan``
+     - :class:`arrow::dataset::ScanNodeOptions`
+     - Generates data from an `arrow::dataset::Dataset` (requires the datasets module)

Review Comment:
   ```suggestion
        - Generates data from an :class:`arrow::dataset::Dataset` (requires the datasets module)
   ```
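For readers following the "Programmatic Plan Creation" passage quoted in the hunk above, here is a toy model of what chaining a linear list of declarations means: each declaration becomes the only input of the one after it, so the last element ends up as the root of the graph. `ToyDeclaration` and this free function `Sequence` are hypothetical stand-ins written for illustration only. They omit the options object and are not the `arrow::acero::Declaration` API.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a plan declaration: a factory name plus its inputs.
// Options are omitted. This is NOT arrow::acero::Declaration.
struct ToyDeclaration {
  std::string factory_name;
  std::vector<ToyDeclaration> inputs;
};

// Chain a linear list of declarations so that each one becomes the sole input
// of the next. The last element becomes the root (closest to the output).
// Returns an empty declaration if the list is empty.
ToyDeclaration Sequence(std::vector<ToyDeclaration> decls) {
  if (decls.empty()) return ToyDeclaration{};
  ToyDeclaration root = std::move(decls.front());
  for (std::size_t i = 1; i < decls.size(); ++i) {
    ToyDeclaration next = std::move(decls[i]);
    next.inputs.push_back(std::move(root));  // previous chain feeds `next`
    root = std::move(next);
  }
  return root;
}
```

In this model, passing `scan`, `filter`, `project` in that order yields a `project` root whose single input is `filter`, whose single input is `scan`. That is the same shape one would otherwise build by hand-nesting the `inputs` vectors.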



##########
docs/source/cpp/acero/substrait.rst:
##########
@@ -0,0 +1,248 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::engine::substrait
+
+.. _acero-substrait:
+
+==========================
+Using Acero with Substrait
+==========================
+
+In order to use Acero you will need to create an execution plan.  This is the
+model that describes the computation you want to apply to your data.  Acero has
+its own internal representation for execution plans but most users should not
+interact with this directly as it will couple their code to Acero.
+
+`Substrait <https://substrait.io>`_ is an open standard for execution plans.
+Acero implements the Substrait "consumer" interface.  This means that Acero can
+accept a Substrait plan and fulfill the plan, loading the requested data and
+applying the desired computation.  By using Substrait plans users can easily
+switch out to a different execution engine at a later time.
+
+Substrait Conformance
+---------------------
+
+Substrait defines a broad set of operators and functions for many different
+situations and it is unlikely that Acero will ever completely satisfy all
+defined Substrait operators and functions.  To help understand what features
+are available the following sections define which features have been currently
+implemented in Acero and any caveats that apply.
+
+Plans
+^^^^^
+
+ * A plan should have a single top-level relation.
+ * The consumer is currently based on version 0.20.0 of Substrait.
+   Any features added that are newer will not be supported.
+ * Due to a breaking change in 0.20.0 any Substrait plan older than 0.20.0
+   will be rejected.
+
+Extensions
+^^^^^^^^^^
+
+ * If a plan contains any extension type variations it will be rejected.
+ * Advanced extensions can be provided by supplying a custom implementation of
+   :class:`arrow::engine::ExtensionProvider`.
+
+Relations (in general)
+^^^^^^^^^^^^^^^^^^^^^^
+
+ * Any relation not explicitly listed below will not be supported
+   and will cause the plan to be rejected.
+
+Read Relations
+^^^^^^^^^^^^^^
+
+ * The ``projection`` property is not supported and plans containing this
+   property will be rejected.
+ * The ``VirtualTable`` and ``ExtensionTable`` read types are not supported.
+   Plans containing these types will be rejected.
+ * Only the parquet and arrow file formats are currently supported.
+ * All URIs must use the ``file`` scheme
+ * ``partition_index``, ``start``, and ``length`` are not supported.  Plans containing
+   non-default values for these properties will be rejected.
+ * The Substrait spec requires that a ``filter`` be completely satisfied by a read
+   relation.  However, Acero only uses a read filter for pushdown projection and
+   it may not be fully satisfied.  Users should generally attach an additional
+   filter relation with the same filter expression after the read relation.

Review Comment:
   Side question, but wondering while reading this: if that's what the spec of substrait requires, shouldn't the translation of a substrait plan to Acero ExecPlan inject such an additional filter node to ensure we actually fulfill that requirement? (instead of requiring the user to do that explicitly in their Substrait plan)
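To make the point under discussion concrete, here is a toy model of why a filter used only for pushdown may not be "fully satisfied". It assumes pruning happens at the granularity of whole chunks using stored min/max statistics. That assumption is for illustration only and this is not Acero's scan code. All names (`Chunk`, `ReadWithPushdown`, `FilterRows`) are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// A chunk of rows plus the min/max statistics a file format might store for it.
struct Chunk {
  std::int64_t min;
  std::int64_t max;
  std::vector<std::int64_t> values;
};

// Pushdown stage for the predicate `value > threshold`: skip whole chunks
// whose statistics prove that no row can match. Rows inside surviving chunks
// are NOT checked individually.
std::vector<std::int64_t> ReadWithPushdown(const std::vector<Chunk>& chunks,
                                           std::int64_t threshold) {
  std::vector<std::int64_t> out;
  for (const Chunk& chunk : chunks) {
    if (chunk.max <= threshold) continue;  // provably no matching rows
    out.insert(out.end(), chunk.values.begin(), chunk.values.end());
  }
  return out;
}

// Filter stage: the exact, row-level application of the same predicate.
std::vector<std::int64_t> FilterRows(const std::vector<std::int64_t>& rows,
                                     std::int64_t threshold) {
  std::vector<std::int64_t> out;
  for (std::int64_t value : rows) {
    if (value > threshold) out.push_back(value);
  }
  return out;
}
```

With chunks `{1, 2, 3}` and `{2, 9, 5}` and a threshold of 4, the pushdown stage drops the first chunk but still returns `2` from the second. Only the follow-up row-level filter reduces the result to `{9, 5}`, which is the role of the additional filter relation mentioned in the quoted text.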



##########
docs/source/cpp/acero/substrait.rst:
##########
@@ -0,0 +1,248 @@
+ * A plan should have a single top-level relation.
+ * The consumer is currently based on version 0.20.0 of Substrait.
+   Any features added that are newer will not be supported.
+ * Due to a breaking change in 0.20.0 any Substrait plan older than 0.20.0
+   will be rejected.

Review Comment:
   ```suggestion
   * A plan should have a single top-level relation.
   * The consumer is currently based on version 0.20.0 of Substrait.
     Any features added that are newer will not be supported.
   * Due to a breaking change in 0.20.0 any Substrait plan older than 0.20.0
     will be rejected.
   ```
   
   (and the other lists in this file as well)



##########
docs/source/cpp/acero/substrait.rst:
##########
@@ -0,0 +1,248 @@
+
+Filter Relations
+^^^^^^^^^^^^^^^^
+
+ * No known caveats
+
+Project Relations
+^^^^^^^^^^^^^^^^^
+
+ * No known caveats
+
+Join Relations
+^^^^^^^^^^^^^^
+
+ * The join type ``JOIN_TYPE_SINGLE`` is not supported and plans containing this
+   will be rejected.
+ * The join expression must be a call to either the ``equal`` or ``is_not_distinct_from``
+   functions.  Both arguments to the call must be direct references.  Only a single
+   join key is supported.
+ * The ``post_join_filter`` property is not supported and will be ignored.
+
+Aggregate Relations
+^^^^^^^^^^^^^^^^^^^
+
+ * At most one grouping set is supported.
+ * Each grouping expression must be a direct reference.
+ * Each measure's arguments must be direct references.
+ * A measure may not have a filter
+ * A measure may not have sorts
+ * A measure's invocation must be AGGREGATION_INVOCATION_ALL or 
+   AGGREGATION_INVOCATION_UNSPECIFIED
+ * A measure's phase must be AGGREGATION_PHASE_INITIAL_TO_RESULT
+
+Expressions (general)
+^^^^^^^^^^^^^^^^^^^^^
+
+ * Various places in the Substrait spec allow for expressions to be used outside
+   of a filter or project relation.  For example, a join expression or an aggregate
+   grouping set.  Acero typically expects these expressions to be direct references.
+   Planners should extract the implicit projection into a formal project relation
+   before delivering the plan to Acero.
+
+Literals
+^^^^^^^^
+
+ * A literal with non-default nullability will cause a plan to be rejected.
+
+Types
+^^^^^
+
+ * Acero does not have full support for non-nullable types and may allow input
+   to have nulls without rejecting it.
+ * The table below shows the mapping between Arrow types and Substrait type
+   classes that are currently supported
+
+.. list-table:: Substrait / Arrow Type Mapping
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Substrait Type
+     - Arrow Type
+     - Caveat
+   * - boolean
+     - boolean
+     - 
+   * - i8
+     - int8
+     - 
+   * - i16
+     - int16
+     - 
+   * - i32
+     - int32
+     - 
+   * - i64
+     - int64
+     - 
+   * - fp32
+     - float32
+     - 
+   * - fp64
+     - float64
+     - 
+   * - string
+     - string
+     - 
+   * - binary
+     - binary
+     - 
+   * - timestamp
+     - timestamp<MICRO,"">
+     - 
+   * - timestamp_tz
+     - timestamp<MICRO,"UTC">
+     - 
+   * - date
+     - date32<DAY>
+     - 
+   * - time
+     - time64<MICRO>
+     - 
+   * - interval_year
+     - 
+     - Not currently supported
+   * - interval_day
+     - 
+     - Not currently supported
+   * - uuid
+     - 
+     - Not currently supported
+   * - FIXEDCHAR<L>
+     - 
+     - Not currently supported
+   * - VARCHAR<L>
+     - 
+     - Not currently supported
+   * - FIXEDBINARY<L>
+     - fixed_size_binary<L>
+     - 
+   * - DECIMAL<P,S>
+     - decimal128<P,S>
+     - 
+   * - STRUCT<T1...TN>
+     - struct<T1...TN>
+     - Arrow struct fields will have no name (empty string)
+   * - NSTRUCT<N:T1...N:Tn>
+     - 
+     - Not currently supported
+   * - LIST<T>
+     - list<T>
+     - 
+   * - MAP<K,V>
+     - map<K,V>
+     - K must not be nullable
+
+Functions
+^^^^^^^^^
+
+ * The following functions have caveats or are not supported at all.  Note that
+   this is not a comprehensive list.  Functions are being added to Substrait at
+   a rapid pace and new functions may be missing.
+
+   * Acero does not support the SATURATE option for overflow
+   * Acero does not support kernels that take more than two arguments
+     for the functions ``and``, ``or``, ``xor``
+   * Acero does not support temporal arithmetic

Review Comment:
   I am wondering if it is clear enough that it is the Acero Substrait->Acero::ExecPlan conversion that doesn't support this, because Acero itself does support temporal arithmetic, as far as I know? (since we have scalar compute functions for temporal types, you should be able to use those in Acero)



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles
+streams of data and Arrow Compute handles situations where all the data is in memory.
+
+Arrow Datasets
+^^^^^^^^^^^^^^
+
+The Arrow datasets library provides some basic routines for discovering, scanning, and
+writing collections of files.  The datasets module depends on Acero.  Both scanning and
+writing datasets use Acero.  The scan node and the write node are part of the datasets
+module.  This helps to keep the complexity of file formats and filesystems out of the core
+Acero logic.
+
+Substrait
+^^^^^^^^^
+
+Substrait is a project establishing standards for query plans.  Acero executes query plans
+and generates data.  This makes Acero a Substrait consumer.  There are more details on the
+Substrait capabilities below.

Review Comment:
   ```suggestion
   Substrait capabilities in :ref:`acero-substrait`.
   ```



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles

Review Comment:
   Maybe it should point to the "Relation to Arrow C++" section below?



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles
+streams of data and Arrow Compute handles situations where all the data is in memory.
+
+Arrow Datasets
+^^^^^^^^^^^^^^
+
+The Arrow datasets library provides some basic routines for discovering, scanning, and
+writing collections of files.  The datasets module depends on Acero.  Both scanning and
+writing datasets use Acero.  The scan node and the write node are part of the datasets
+module.  This helps to keep the complexity of file formats and filesystems out of the core
+Acero logic.
+
+Substrait
+^^^^^^^^^
+
+Substrait is a project establishing standards for query plans.  Acero executes query plans
+and generates data.  This makes Acero a Substrait consumer.  There are more details on the
+Substrait capabilities below.
+
+DataFusion / DuckDB / Velox / Etc.
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+There are many columnar data engines emerging. We view this as a good thing and encourage
+projects like Substrait to help allow switching between engines as needed.  We generally
+discourage comparative benchmarks as they are almost inevitably going to be workload-driven
+and rarely manage to capture an apples-vs-apples comparison.  Discussions of the pros and
+cons of each are beyond the scope of this guide.
+
+Relation to Arrow C++
+=====================
+
+The Acero module is part of the Arrow C++ implementation.  It is built as a separate
+module but it depends on core Arrow modules and does not stand alone.  Acero uses
+and extends the capabilities from the core Arrow module and the Arrow compute kernels.
+
+.. image:: layers.svg
+   :alt: A diagram of layers with core on the left, compute in the middle, and acero on the right
+
+The core Arrow library provides containers for buffers and arrays that are laid out according
+to the Arrow columnar format.  With few exceptions the core Arrow library does not examine
+or modify the contents of buffers.  For example, converting a string array from lowercase
+strings to uppercase strings would not be a part of the core Arrow library because that would
+require examining the contents of the array.
+
+The compute module expands on the core library and provides functions which analyze and
+transform data.  The compute module's capabilities are all exposed via a function registry.
+An Arrow "function" accepts zero or more arrays, batches, or tables, and produces an array,
+batch, or table.  In addition, function calls can be combined, along with field references
+and literals, to form an expression (a tree of function calls) which the compute module can
+evaluate.  For example, calculating ``x + (y * 3)`` given a table with columns ``x`` and ``y``.
+
+.. image:: expression_ast.svg
+   :alt: A sample expression tree
+
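As a rough illustration of how a tree of function calls, field references, and literals is evaluated, the sketch below walks `x + (y * 3)` over a toy in-memory table. It uses only the C++ standard library: `Expr`, `ToyTable`, `FieldRef`, `Literal`, `Call`, and `Evaluate` are hypothetical names invented for this example and are not part of the Arrow compute API.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for Arrow types (assumptions for this sketch): a column is a
// vector of int64 values and a table maps column names to columns.
using Column = std::vector<std::int64_t>;
using ToyTable = std::map<std::string, Column>;

// One node of the expression tree: a field reference, a literal, or a call.
struct Expr {
  enum class Kind { kFieldRef, kLiteral, kCall };
  Kind kind;
  std::string name;                         // column name or function name
  std::int64_t value;                       // used only by literals
  std::vector<std::shared_ptr<Expr>> args;  // used only by calls
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr FieldRef(std::string name) {
  return std::make_shared<Expr>(Expr{Expr::Kind::kFieldRef, std::move(name), 0, {}});
}
ExprPtr Literal(std::int64_t value) {
  return std::make_shared<Expr>(Expr{Expr::Kind::kLiteral, "", value, {}});
}
ExprPtr Call(std::string function, std::vector<ExprPtr> args) {
  return std::make_shared<Expr>(
      Expr{Expr::Kind::kCall, std::move(function), 0, std::move(args)});
}

// Recursively evaluates the tree, producing one output value per row.
Column Evaluate(const Expr& expr, const ToyTable& table, std::size_t num_rows) {
  switch (expr.kind) {
    case Expr::Kind::kFieldRef: {
      const Column& column = table.at(expr.name);  // throws if the column is missing
      if (column.size() != num_rows) throw std::invalid_argument("length mismatch");
      return column;
    }
    case Expr::Kind::kLiteral:
      return Column(num_rows, expr.value);  // broadcast the literal to every row
    case Expr::Kind::kCall: {
      if (expr.args.size() != 2) throw std::invalid_argument("binary calls only");
      const Column lhs = Evaluate(*expr.args[0], table, num_rows);
      const Column rhs = Evaluate(*expr.args[1], table, num_rows);
      Column out(num_rows);
      for (std::size_t i = 0; i < num_rows; ++i) {
        if (expr.name == "add") {
          out[i] = lhs[i] + rhs[i];
        } else if (expr.name == "multiply") {
          out[i] = lhs[i] * rhs[i];
        } else {
          throw std::invalid_argument("unknown function: " + expr.name);
        }
      }
      return out;
    }
  }
  throw std::logic_error("unreachable");
}
```

Building `Call("add", {FieldRef("x"), Call("multiply", {FieldRef("y"), Literal(3)})})` and evaluating it against columns `x = [1, 2, 3]` and `y = [10, 20, 30]` yields `[31, 62, 93]`.
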
+Acero expands on these capabilities by adding compute operations for streams of data.  For
+example, a project node can apply a compute expression on a stream of batches.  This will
+create a new stream of batches with the result of the expression added as a new column.  These
+nodes can be combined into a graph to form a more complex execution plan.  This is very similar
+to the way functions are combined into a tree to form a complex expression.
+
+.. image:: simple_plan.svg
+   :alt: A simple plan that uses compute expressions
+
+.. note::
+   Acero does not use the :class:`arrow::Table` or :class:`arrow::ChunkedArray` containers
+   from the core Arrow library.  This is because Acero operates on streams of batches and
+   so there is no need for a multi-batch container of data.  This helps to reduce the
+   complexity of Acero and avoids tricky situations that can arise from tables whose
+   columns have different chunk sizes.  Acero will often use :class:`arrow::Datum`
+   which is a variant from the core module that can hold many different types.  Within
+   Acero, a datum will always hold either an :class:`arrow::Array` or a :class:`arrow::Scalar`.
+
+Core Concepts
+=============
+
+ExecNode
+--------
+
+The most basic concept in Acero is the ExecNode.  An ExecNode has zero or more inputs and
+zero or one outputs.  If an ExecNode has zero inputs we call it a source and if an ExecNode
+does not have an output then we call it a sink.  There are many different kinds of nodes and
+each one transforms its inputs in different ways.  For example:
+
+ * A scan node is a source node that reads data from files
+ * An aggregate node accumulates batches of data to compute summary statistics
+ * A filter node removes rows from the data according to a filter expression
+ * A table sink node accumulates data into a table
+
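To make the node roles above concrete, here is a toy, standard-library-only sketch of a source feeding a filter node and then an aggregate node, one batch at a time. `Batch`, `Predicate`, `FilterBatch`, `SumAggregate`, and `RunToyPlan` are hypothetical names for this illustration; real Acero nodes are `ExecNode` subclasses and exchange `ExecBatch` objects, which this sketch does not attempt to model.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Toy stand-in for a batch: a single column of integers.
using Batch = std::vector<std::int64_t>;
using Predicate = std::function<bool(std::int64_t)>;

// Filter "node": one input, one output.  Keeps only the rows that match.
Batch FilterBatch(const Batch& input, const Predicate& keep) {
  Batch output;
  for (std::int64_t value : input) {
    if (keep(value)) output.push_back(value);
  }
  return output;
}

// Aggregate "node": accumulates state across batches and produces a single
// result once the last batch has been consumed.
class SumAggregate {
 public:
  void Consume(const Batch& batch) {
    for (std::int64_t value : batch) sum_ += value;
  }
  std::int64_t Finish() const { return sum_; }

 private:
  std::int64_t sum_ = 0;
};

// Drives a source (here, a vector of batches) through filter -> aggregate.
// Only one batch is processed at a time, so the whole stream never needs to
// be held by the filter or the aggregate.
std::int64_t RunToyPlan(const std::vector<Batch>& source, const Predicate& keep) {
  SumAggregate aggregate;
  for (const Batch& batch : source) {
    aggregate.Consume(FilterBatch(batch, keep));
  }
  return aggregate.Finish();
}
```

For example, running the batches `[1, 2, 3]` and `[4, 5, 6]` through a predicate that keeps even values gives a sum of 12.
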
+.. note::
+   A full list of the available compute modules is included in the :ref:`user's guide<ExecNode List>`
+
+.. _exec-batch:
+
+ExecBatch
+---------
+
+Batches of data are represented by the ExecBatch class.  An ExecBatch is a 2D structure that
+is very similar to a RecordBatch.  It can have zero or more columns and all of the columns
+must have the same length.  There are a few key differences from RecordBatch:
+
+.. figure:: rb_vs_eb.svg
+   
+   Both the record batch and the exec batch have strong ownership of the arrays & buffers
+
+* An `ExecBatch` does not have a schema.  This is because an `ExecBatch` is assumed to be
+  part of a stream of batches and the stream is assumed to have a consistent schema.  So
+  the schema for an `ExecBatch` is typically stored in the ExecNode.
+* Columns in an `ExecBatch` are either an `Array` or a `Scalar`.  When a column is a `Scalar`
+  this means that the column has a single value for every row in the batch.  An `ExecBatch`
+  also has a length property which describes how many rows are in a batch.  So another way to
+  view a `Scalar` is a constant array with `length` elements.
+* An `ExecBatch` contains additional information used by the exec plan.  For example, an
+  `index` can be used to describe a batche's position in an ordered stream.  We expect 

Review Comment:
   ```suggestion
     `index` can be used to describe a batch's position in an ordered stream.  We expect 
   ```



##########
docs/source/cpp/acero/overview.rst:
##########
@@ -0,0 +1,262 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==============
+Acero Overview
+==============
+
+This page gives an overview of the basic Acero concepts and helps distinguish Acero
+from other modules in the Arrow code base.  It's intended for users, developers,
+potential contributors, and for those that would like to extend Acero, either for
+research or for business use.  This page assumes the reader is already familiar with
+core Arrow concepts.  This page does not expect any existing knowledge in relational
+algebra.
+
+What is Acero?
+==============
+
+Acero is a C++ library that can be used to analyze large (potentially infinite) streams
+of data.  Acero allows computation to be expressed as an "execution plan" (:class:`ExecPlan`).
+An execution plan takes in zero or more streams of input data and emits a single
+stream of output data.  The plan describes how the data will be transformed as it
+passes through.  For example, a plan might:
+
+ * Merge two streams of data using a common column
+ * Create additional columns by evaluating expressions against the existing columns
+ * Consume a stream of data by writing it to disk in a partitioned layout
+
+.. image:: simple_graph.svg
+   :alt: A sample execution plan that joins three streams of data and writes to disk
+
+Acero is not...
+---------------
+
+A Library for Data Scientists
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Acero is not intended to be used directly by data scientists.  It is expected that
+end users will typically be using some kind of frontend.  For example, Pandas, Ibis,
+or SQL.  The API for Acero is focused around capabilities and available algorithms.
+However, such users may be interested in knowing more about how Acero works so that
+they can better understand how the backend processing for their libraries operates.
+
+A Database
+^^^^^^^^^^
+
+A database (or DBMS) is typically a much more expansive application and often packaged
+as a standalone service.  Acero could be a component in a database (most databases have
+some kind of execution engine) or could be a component in some other data processing
+application that hardly resembles a database.  Acero does not concern itself with
+user management, external communication, isolation, durability, or consistency.  In
+addition, Acero is focused primarily on the read path, and the write utilities lack
+any sort of transaction support.
+
+An Optimizer
+^^^^^^^^^^^^
+
+Acero does not have an SQL parser.  It does not have a query planner.  It does not have
+any sort of optimizer.  Acero expects to be given very detailed and low-level instructions
+on how to manipulate data and then it will perform that manipulation exactly as described.
+
+Creating the best execution plan is very hard.  Small details can have a big impact on
+performance.  We do think optimizers are important and we hope that tools will emerge
+someday which can provide these capabilities using standards such as Substrait.
+
+Distributed
+^^^^^^^^^^^
+
+Acero does not provide distributed execution.  However, Acero aims to be usable by a distributed
+query execution engine.  In other words, Acero will not configure and coordinate workers but
+it does expect to be used as a worker.  Sometimes, the distinction is a bit fuzzy.  For example,
+an Acero source may be a smart storage device that is capable of performing filtering or other
+advanced analytics.  One might consider this a distributed plan.  The key distinction is Acero
+does not have the capability of transforming a logical plan into a distributed execution plan.
+That step will need to be done elsewhere.
+
+Acero vs...
+-----------
+
+Arrow Compute
+^^^^^^^^^^^^^
+
+This is described in more detail in the overview but the key difference is that Acero handles
+streams of data and Arrow Compute handles situations where all the data is in memory.
+
+Arrow Datasets
+^^^^^^^^^^^^^^
+
+The Arrow datasets library provides some basic routines for discovering, scanning, and
+writing collections of files.  The datasets module depends on Acero.  Both scanning and
+writing datasets use Acero.  The scan node and the write node are part of the datasets
+module.  This helps to keep the complexity of file formats and filesystems out of the core
+Acero logic.
+
+Substrait
+^^^^^^^^^
+
+Substrait is a project establishing standards for query plans.  Acero executes query plans
+and generates data.  This makes Acero a Substrait consumer.  There are more details on the
+Substrait capabilities below.

Review Comment:
   The "below" now refers to a separate page, I think, and so no longer below on this page. You can probably use a link



##########
docs/source/cpp/acero/user_guide.rst:
##########
@@ -0,0 +1,735 @@
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements.  See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership.  The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License.  You may obtain a copy of the License at
+
+..   http://www.apache.org/licenses/LICENSE-2.0
+
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied.  See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+
+.. default-domain:: cpp
+.. highlight:: cpp
+.. cpp:namespace:: arrow::acero
+
+==================
+Acero User's Guide
+==================
+
+This page describes how to use Acero.  It's recommended that you read the
+overview first and familiarize yourself with the basic concepts.
+
+Using Acero
+===========
+
+The basic workflow for Acero is this:
+
+#. First, create a graph of :class:`Declaration` objects describing the plan
+ 
+#. Call one of the DeclarationToXyz methods to execute the Declaration.
+
+   a. A new ExecPlan is created from the graph of Declarations.  Each Declaration will correspond to one
+      ExecNode in the plan.  In addition, a sink node will be added, depending on which DeclarationToXyz method
+      was used.
+
+   b. The ExecPlan is executed.  Typically this happens as part of the DeclarationToXyz call but in 
+      DeclarationToReader the reader is returned before the plan is finished executing.
+
+   c. Once the plan is finished it is destroyed
+
+Creating a Plan
+===============
+
+Using Substrait
+---------------
+
+Substrait is the preferred mechanism for creating a plan (graph of :class:`Declaration`).  There are a few
+reasons for this:
+
+* Substrait producers spend a lot of time and energy in creating user-friendly APIs for producing complex
+  execution plans in a simple way.  For example, the ``pivot_wider`` operation can be achieved using a complex
+  series of ``aggregate`` nodes.  Rather than create all of those ``aggregate`` nodes by hand a producer will
+  give you a much simpler API.
+
+* If you are using Substrait then you can easily switch out to any other Substrait-consuming engine should you
+  at some point find that it serves your needs better than Acero.
+
+* We hope that tools will eventually emerge for Substrait-based optimizers and planners.  By using Substrait
+  you will be making it much easier to use these tools in the future.
+
+You could create the Substrait plan yourself but you'll probably have a much easier time finding an existing
+Substrait producer.  For example, you could use `ibis-substrait <https://github.com/ibis-project/ibis-substrait>`_
+to easily create Substrait plans from Python expressions.  There are a few different tools that are able to create
+Substrait plans from SQL.  Eventually, we hope that C++ based Substrait producers will emerge.  However, we
+are not aware of any at this time.
+
+Detailed instructions on creating an execution plan from Substrait can be found in
+:ref:`the Substrait page<acero-substrait>`.
+
+Programmatic Plan Creation
+--------------------------
+
+Creating an execution plan programmatically is simpler than creating a plan from Substrait, though it loses some of
+the flexibility and future-proofing guarantees.  The simplest way to create a Declaration is to simply instantiate
+one.  You will need the name of the declaration, a vector of inputs, and an options object.  For example:
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Example)
+  :end-before: (Doc section: Project Example)
+  :linenos:
+  :lineno-match:
+
+The above code creates a scan declaration (which has no inputs) and a project declaration (using the scan as
+input).  This is simple enough but we can make it slightly easier.  If you are creating a linear sequence of
+declarations (like in the above example) then you can also use the :func:`Declaration::Sequence` function.
+
+.. literalinclude:: ../../../../cpp/examples/arrow/execution_plan_documentation_examples.cc
+  :language: cpp
+  :start-after: (Doc section: Project Sequence Example)
+  :end-before: (Doc section: Project Sequence Example)
+  :linenos:
+  :lineno-match:
+
+There are many more examples of programmatic plan creation later in this document.
+
+Executing a Plan
+================
+
+There are a number of different methods that can be used to execute a declaration.  Each one provides the
+data in a slightly different form.  Since all of these methods start with ``DeclarationTo...`` this guide
+will often refer to these methods as the ``DeclarationToXyz`` methods.
+
+DeclarationToTable
+------------------
+
+The :func:`DeclarationToTable` method will accumulate all of the results into a single :class:`arrow::Table`.
+This is perhaps the simplest way to collect results from Acero.  The main disadvantage to this approach is
+that it requires accumulating all results into memory.
+
+.. note::
+
+   Acero processes large datasets in small chunks.  This is described in more detail in the developer's guide.
+   As a result, you may be surprised to find that a table collected with DeclarationToTable is chunked
+   differently than your input.  For example, your input might be a large table with a single chunk with 2
+   million rows.  Your output table might then have 64 chunks with 32Ki rows each.  There is a current request
+   to specify the chunk size for the output in `GH-15155 <https://github.com/apache/arrow/issues/15155>`_.
+
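The chunk counts in the note above follow from simple arithmetic. The sketch below is a standard-library-only model that assumes the output is cut into consecutive chunks of at most 32Ki (32,768) rows; `ChunkLengths` is a hypothetical helper and the real chunking in Acero may differ depending on the source node and plan.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Splits `total_rows` into consecutive chunks of at most `max_chunk_rows`
// rows and returns the length of each chunk.  Returns an empty vector when
// `max_chunk_rows` is not positive.
std::vector<std::int64_t> ChunkLengths(std::int64_t total_rows,
                                       std::int64_t max_chunk_rows) {
  std::vector<std::int64_t> lengths;
  if (max_chunk_rows <= 0) return lengths;
  for (std::int64_t offset = 0; offset < total_rows; offset += max_chunk_rows) {
    lengths.push_back(std::min(max_chunk_rows, total_rows - offset));
  }
  return lengths;
}
```

Under that assumption, an input of 2Mi (2,097,152) rows yields exactly 64 chunks of 32,768 rows, while an input of exactly 2,000,000 rows yields 62 chunks, the last of which holds the remaining 1,152 rows.
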
+DeclarationToReader
+-------------------
+
+The :func:`DeclarationToReader` method allows you to iteratively consume the results.  It will create an
+:class:`arrow::RecordBatchReader` which you can read from at your leisure.  If you do not read from the
+reader quickly enough then backpressure will be applied and the execution plan will pause.  Closing the
+reader will cancel the running execution plan.  The reader's destructor waits for the execution plan
+to finish whatever it is doing, so it may block.
+
+DeclarationToStatus
+-------------------
+
+The :func:`DeclarationToStatus` method is useful if you want to run the plan but do not actually want to
+consume the results.  For example, this is useful when benchmarking or when the plan has side effects such
+as a dataset write node.  If the plan generates any results then they will be immediately discarded.
+
+Running a Plan Directly
+-----------------------
+
+If one of the ``DeclarationToXyz`` methods is not sufficient for some reason then it is possible to run a plan
+directly.  This should only be needed if you are doing something unique.  For example, if you have created a
+custom sink node or if you need a plan that has multiple outputs.
+
+.. note::
+   In academic literature and many existing systems there is a general assumption that an execution plan has
+   at most one output.  There are some things in Acero, such as the DeclarationToXyz methods, which will expect
+   this.  However, there is nothing in the design that strictly prevents having multiple sink nodes.
+
+Detailed instructions on how to do this are out of scope for this guide but the rough steps are:
+
+1. Create a new :class:`ExecPlan` object.
+2. Add sink nodes to your graph of :class:`Declaration` objects (this is the only case in which you will
+   need to create declarations for sink nodes)
+3. Use :func:`Declaration::AddToPlan` to add your declaration to your plan (if you have more than one output
+   then you will not be able to use this method and will need to add your nodes one at a time)
+4. Validate the plan with :func:`ExecPlan::Validate`
+5. Start the plan with :func:`ExecPlan::StartProducing`
+6. Wait for the future returned by :func:`ExecPlan::finished` to complete.
+
+Providing Input
+===============
+
+Input data for an exec plan can come from a variety of sources.  It is often read from files stored on some
+kind of filesystem.  It is also common for input to come from in-memory data.  In-memory data is typical, for
+example, in a pandas-like frontend.  Input could also come from network streams like a Flight request.  Acero
+can support all of these cases and can even support unique and custom situations not mentioned here.
+
+There are pre-defined source nodes that cover the most common input scenarios.  These are listed below.  However,
+if your source data is unique then you will need to use the generic ``source`` node.  This node expects you to
+provide an asynchronous stream of batches and is covered in more detail :ref:`here <stream_execution_source_docs>`.
+
+.. _ExecNode List:
+
+Available ``ExecNode`` Implementations
+======================================
+
+The following tables quickly summarize the available operators.
+
+Sources
+-------
+
+These nodes can be used as sources of data
+
+.. list-table:: Source Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``source``
+     - :class:`SourceNodeOptions`
+     - A generic source node that wraps an asynchronous stream of data (:ref:`example <stream_execution_source_docs>`)
+   * - ``table_source``
+     - :class:`TableSourceNodeOptions`
+     - Generates data from an :class:`arrow::Table` (:ref:`example <stream_execution_table_source_docs>`)
+   * - ``record_batch_source``
+     - :class:`RecordBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::RecordBatch`
+   * - ``record_batch_reader_source``
+     - :class:`RecordBatchReaderSourceNodeOptions`
+     - Generates data from an :class:`arrow::RecordBatchReader`
+   * - ``exec_batch_source``
+     - :class:`ExecBatchSourceNodeOptions`
+     - Generates data from an iterator of :class:`arrow::compute::ExecBatch`
+   * - ``array_vector_source``
+     - :class:`ArrayVectorSourceNodeOptions`
+     - Generates data from an iterator of vectors of :class:`arrow::Array`
+   * - ``scan``
+     - :class:`arrow::dataset::ScanNodeOptions`
+     - Generates data from an :class:`arrow::dataset::Dataset` (requires the datasets module)
+       (:ref:`example <stream_execution_scan_docs>`)
+
+Compute Nodes
+-------------
+
+These nodes perform computations on data and may transform or reshape the data.
+
+.. list-table:: Compute Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``filter``
+     - :class:`FilterNodeOptions`
+     - Removes rows that do not match a given filter expression
+   * - ``project``
+     - :class:`ProjectNodeOptions`
+     - Creates new columns by evaluating compute expressions.  Can also drop and reorder columns
+       (:ref:`example <stream_execution_project_docs>`)
+   * - ``aggregate``
+     - :class:`AggregateNodeOptions`
+     - Calculates summary statistics across the entire input stream or on groups of data
+       (:ref:`example <stream_execution_aggregate_docs>`)
+   * - ``pivot_longer``
+     - :class:`PivotLongerNodeOptions`
+     - Reshapes data by converting some columns into additional rows
+
+Arrangement Nodes
+-----------------
+
+These nodes reorder, combine, or slice streams of data.
+
+.. list-table:: Arrangement Nodes
+   :widths: 25 25 50
+   :header-rows: 1
+
+   * - Factory Name
+     - Options
+     - Brief Description
+   * - ``hash_join``
+     - :class:`HashJoinNodeOptions`
+     - Joins two inputs based on common columns (:ref:`example <stream_execution_hashjoin_docs>`)

Review Comment:
   AsOfJoin is missing here?


