Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/19 11:00:39 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #6056: Update DataFusion architecture documentation

alamb opened a new pull request, #6056:
URL: https://github.com/apache/arrow-datafusion/pull/6056

   # Which issue does this PR close?
   
   Part of https://github.com/apache/arrow-datafusion/issues/5501
   
   # Rationale for this change
   I am trying to make the DataFusion project easier to work on and use.
   
   
   # What changes are included in this PR?
   
   1. Update the architecture guide 
   
   # Are these changes tested?
   
   No (though I hope the links will be verified by the changes in https://github.com/apache/arrow-datafusion/pull/6044) 
   
   # Are there any user-facing changes?
   
   Better docs




[GitHub] [arrow-datafusion] alamb merged pull request #6056: Update DataFusion architecture documentation

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6056:
URL: https://github.com/apache/arrow-datafusion/pull/6056




[GitHub] [arrow-datafusion] Ted-Jiang commented on a diff in pull request #6056: Update DataFusion architecture documentation

Posted by "Ted-Jiang (via GitHub)" <gi...@apache.org>.
Ted-Jiang commented on code in PR #6056:
URL: https://github.com/apache/arrow-datafusion/pull/6056#discussion_r1173494979


##########
datafusion/core/src/lib.rs:
##########
@@ -163,104 +168,238 @@
 //!   - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
 //!   - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
 //!   - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
-//! - [February 2021]: How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
 //! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
 //! - [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-//! - [February 2021]: How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+//! - [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
 //!
-//! ## Architecture
+//! ## Query Planning and Execution Overview
 //!
-//! DataFusion is a fully fledged query engine capable of performing complex operations.
-//! Specifically, when DataFusion receives an SQL query, there are different steps
-//! that it passes through until a result is obtained. Broadly, they are:
+//! ### SQL
 //!
-//! 1. The string is parsed to an Abstract syntax tree (AST) using [sqlparser].
-//! 2. The planner [`SqlToRel`] converts logical expressions on the AST to logical expressions [`Expr`]s.
-//! 3. The planner [`SqlToRel`] converts logical nodes on the AST to a [`LogicalPlan`].
-//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
-//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a [`PhysicalPlanner`]
-//! 6. The [`ExecutionPlan`]is executed against data through the [`SessionContext`]
+//! ```text
+//!                 Parsed with            SqlToRel creates
+//!                 sqlparser              initial plan
+//! ┌───────────────┐           ┌─────────┐             ┌─────────────┐
+//! │   SELECT *    │           │Query {  │             │Project      │
+//! │   FROM ...    │──────────▶│..       │────────────▶│  TableScan  │
+//! │               │           │}        │             │    ...      │
+//! └───────────────┘           └─────────┘             └─────────────┘
 //!
-//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`] directly.
+//!   SQL String                 sqlparser               LogicalPlan
+//!                              AST nodes
+//! ```
 //!
-//! Phases 1-5 are typically cheap when compared to phase 6, and thus DataFusion puts a
-//! lot of effort to ensure that phase 6 runs efficiently and without errors.
+//! 1. The query string is parsed to an Abstract Syntax Tree (AST)
+//! [`Statement`] using [sqlparser].
 //!
-//! DataFusion's planning is divided in two main parts: logical planning and physical planning.
+//! 2. The AST is converted to a [`LogicalPlan`] and logical
+//! expressions [`Expr`]s to compute the desired result by the
+//! [`SqlToRel`] planner.
 //!
-//! ### Logical planning
+//! [`Statement`]: https://docs.rs/sqlparser/latest/sqlparser/ast/enum.Statement.html
 //!
-//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
-//! expressions which are [`Schema`]aware and represent statements
-//! whose result is independent of how it should physically be
-//! executed.
+//! ### DataFrame
 //!
-//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
-//! [`LogicalPlan`]s, and each node contains [`Expr`]s.  All of these
-//! are located in [`datafusion_expr`] module.
+//! When executing plans using the [`DataFrame`] API, the process is
+//! identical as with SQL, except the DataFrame API builds the
+//! [`LogicalPlan`] directly using [`LogicalPlanBuilder`]. Systems
+//! that have their own custom query languages typically also build
+//! [`LogicalPlan`] directly.
+//!
+//! ### Planning
+//!
+//! ```text
+//!             AnalyzerRules and      PhysicalPlanner          PhysicalOptimizerRules
+//!             OptimizerRules         creates ExecutionPlan    improve performance
+//!             rewrite plan
+//! ┌─────────────┐        ┌─────────────┐      ┌───────────────┐        ┌───────────────┐
+//! │Project      │        │Project(x, y)│      │ProjectExec    │        │ProjectExec    │
+//! │  TableScan  │──...──▶│  TableScan  │─────▶│  ...          │──...──▶│  ...          │
+//! │    ...      │        │    ...      │      │    ParquetExec│        │    ParquetExec│
+//! └─────────────┘        └─────────────┘      └───────────────┘        └───────────────┘
+//!
+//!  LogicalPlan            LogicalPlan         ExecutionPlan             ExecutionPlan
+//! ```
+//!
+//! To process large datasets with many rows as efficiently as
+//! possible, significant effort is spent planning and
+//! optimizing, in the following manner:
+//!
+//! 1. The [`LogicalPlan`] is checked and rewritten to enforce
+//! semantic rules, such as type coercion, by [`AnalyzerRule`]s
+//!
+//! 2. The [`LogicalPlan`] is rewritten by [`OptimizerRule`]s, such as
+//! projection and filter pushdown, to improve its efficiency.
+//!
+//! 3. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a
+//! [`PhysicalPlanner`]
+//!
+//! 4. The [`ExecutionPlan`] is rewrittten by

Review Comment:
   ```suggestion
   //! 4. The [`ExecutionPlan`] is rewritten by
   ```



##########
datafusion/core/src/lib.rs:
##########
@@ -163,104 +168,238 @@
 //!   - _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
 //!   - _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
 //!   - _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
-//! - [February 2021]: How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
 //! - [July 2022]: DataFusion and Arrow: Supercharge Your Data Analytical Tool with a Rusty Query Engine: [recording](https://www.youtube.com/watch?v=Rii1VTn3seQ) and [slides](https://docs.google.com/presentation/d/1q1bPibvu64k2b7LPi7Yyb0k3gA1BiUYiUbEklqW1Ckc/view#slide=id.g11054eeab4c_0_1165)
 //! - [March 2021]: The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts [~ 15 minutes in](https://www.youtube.com/watch?v=K6eCAVEk4kU&t=875s)) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-//! - [February 2021]: How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
+//! - [February 2021]: How DataFusion is used within the Ballista Project is described in _Ballista: Distributed Compute with Rust and Apache Arrow_: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
 //!
-//! ## Architecture
+//! ## Query Planning and Execution Overview
 //!
-//! DataFusion is a fully fledged query engine capable of performing complex operations.
-//! Specifically, when DataFusion receives an SQL query, there are different steps
-//! that it passes through until a result is obtained. Broadly, they are:
+//! ### SQL
 //!
-//! 1. The string is parsed to an Abstract syntax tree (AST) using [sqlparser].
-//! 2. The planner [`SqlToRel`] converts logical expressions on the AST to logical expressions [`Expr`]s.
-//! 3. The planner [`SqlToRel`] converts logical nodes on the AST to a [`LogicalPlan`].
-//! 4. [`OptimizerRule`]s are applied to the [`LogicalPlan`] to optimize it.
-//! 5. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a [`PhysicalPlanner`]
-//! 6. The [`ExecutionPlan`]is executed against data through the [`SessionContext`]
+//! ```text
+//!                 Parsed with            SqlToRel creates
+//!                 sqlparser              initial plan
+//! ┌───────────────┐           ┌─────────┐             ┌─────────────┐
+//! │   SELECT *    │           │Query {  │             │Project      │
+//! │   FROM ...    │──────────▶│..       │────────────▶│  TableScan  │
+//! │               │           │}        │             │    ...      │
+//! └───────────────┘           └─────────┘             └─────────────┘
 //!
-//! With the [`DataFrame`] API, steps 1-3 are not used as the DataFrame builds the [`LogicalPlan`] directly.
+//!   SQL String                 sqlparser               LogicalPlan
+//!                              AST nodes
+//! ```
 //!
-//! Phases 1-5 are typically cheap when compared to phase 6, and thus DataFusion puts a
-//! lot of effort to ensure that phase 6 runs efficiently and without errors.
+//! 1. The query string is parsed to an Abstract Syntax Tree (AST)
+//! [`Statement`] using [sqlparser].
 //!
-//! DataFusion's planning is divided in two main parts: logical planning and physical planning.
+//! 2. The AST is converted to a [`LogicalPlan`] and logical
+//! expressions [`Expr`]s to compute the desired result by the
+//! [`SqlToRel`] planner.
 //!
-//! ### Logical planning
+//! [`Statement`]: https://docs.rs/sqlparser/latest/sqlparser/ast/enum.Statement.html
 //!
-//! Logical planning yields [`LogicalPlan`]s and logical [`Expr`]
-//! expressions which are [`Schema`]aware and represent statements
-//! whose result is independent of how it should physically be
-//! executed.
+//! ### DataFrame
 //!
-//! A [`LogicalPlan`] is a Directed Acyclic Graph (DAG) of other
-//! [`LogicalPlan`]s, and each node contains [`Expr`]s.  All of these
-//! are located in [`datafusion_expr`] module.
+//! When executing plans using the [`DataFrame`] API, the process is
+//! identical as with SQL, except the DataFrame API builds the
+//! [`LogicalPlan`] directly using [`LogicalPlanBuilder`]. Systems
+//! that have their own custom query languages typically also build
+//! [`LogicalPlan`] directly.
+//!
+//! ### Planning
+//!
+//! ```text
+//!             AnalyzerRules and      PhysicalPlanner          PhysicalOptimizerRules
+//!             OptimizerRules         creates ExecutionPlan    improve performance
+//!             rewrite plan
+//! ┌─────────────┐        ┌─────────────┐      ┌───────────────┐        ┌───────────────┐
+//! │Project      │        │Project(x, y)│      │ProjectExec    │        │ProjectExec    │
+//! │  TableScan  │──...──▶│  TableScan  │─────▶│  ...          │──...──▶│  ...          │
+//! │    ...      │        │    ...      │      │    ParquetExec│        │    ParquetExec│
+//! └─────────────┘        └─────────────┘      └───────────────┘        └───────────────┘
+//!
+//!  LogicalPlan            LogicalPlan         ExecutionPlan             ExecutionPlan
+//! ```
+//!
+//! To process large datasets with many rows as efficiently as
+//! possible, significant effort is spent planning and
+//! optimizing, in the following manner:
+//!
+//! 1. The [`LogicalPlan`] is checked and rewritten to enforce
+//! semantic rules, such as type coercion, by [`AnalyzerRule`]s
+//!
+//! 2. The [`LogicalPlan`] is rewritten by [`OptimizerRule`]s, such as
+//! projection and filter pushdown, to improve its efficiency.
+//!
+//! 3. The [`LogicalPlan`] is converted to an [`ExecutionPlan`] by a
+//! [`PhysicalPlanner`]
+//!
+//! 4. The [`ExecutionPlan`] is rewrittten by
+//! [`PhysicalOptimizerRule`]s, such as sort and join selection, to
+//! improve its efficiency.
+//!
+//! ## Data Sources
+//!
+//! ```text
+//! Planning       │
+//! requests       │            TableProvider::scan
+//! information    │            creates an
+//! such as schema │            ExecutionPlan
+//!                │
+//!                ▼
+//!   ┌─────────────────────────┐         ┌──────────────┐
+//!   │                         │         │              │
+//!   │impl TableProvider       │────────▶│ParquetExec   │
+//!   │                         │         │              │
+//!   └─────────────────────────┘         └──────────────┘
+//!         TableProvider
+//!         (built in or user provided)    ExecutionPlan
+//! ```
+//!
+//! DataFusion includes several built in data sources for common use
+//! cases, and can be extended by implementing the [`TableProvider`]
+//! trait. A [`TableProvider`] provides information for planning and
+//! an [`ExecutionPlan`]s for execution.
 //!
-//! ### Physical planning
+//! 1. [`ListingTable`]: Reads data from Parquet, JSON, CSV, or AVRO
+//! files.  Supports single files or multiple files with HIVE style
+//! partitoning, optional compression, directly reading from remote

Review Comment:
   ```suggestion
   //! partitioning, optional compression, directly reading from remote
   ```
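   To make the quoted pipeline concrete, here is a minimal sketch of driving the SQL → `LogicalPlan` → `ExecutionPlan` flow through `SessionContext`, assuming a `tokio` runtime; the CSV path and table name are illustrative, not taken from the PR.
   
   ```rust
   use datafusion::error::Result;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       // SessionContext drives the pipeline described in the docs above.
       let ctx = SessionContext::new();
   
       // Register a data source ("example.csv" is a placeholder path).
       ctx.register_csv("example", "example.csv", CsvReadOptions::new())
           .await?;
   
       // `sql` parses the string with sqlparser and builds a LogicalPlan
       // via SqlToRel; `collect` then runs the optimizer passes, physical
       // planning, and execution, returning Arrow RecordBatches.
       let df = ctx.sql("SELECT * FROM example LIMIT 10").await?;
       let batches = df.collect().await?;
       println!("got {} record batches", batches.len());
       Ok(())
   }
   ```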





[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #6056: Update DataFusion architecture documentation

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #6056:
URL: https://github.com/apache/arrow-datafusion/pull/6056#discussion_r1172720822


##########
datafusion/core/src/datasource/empty.rs:
##########
@@ -15,7 +15,7 @@
 // specific language governing permissions and limitations
 // under the License.
 
-//! An empty plan that is usefull for testing and generating plans without mapping them to actual data.
+//! [`EmptyTable`] usefull for testing.

Review Comment:
   ```suggestion
   //! [`EmptyTable`] useful for testing.
   ```



##########
datafusion/core/src/datasource/empty.rs:
##########
@@ -30,7 +30,8 @@ use crate::logical_expr::Expr;
 use crate::physical_plan::project_schema;
 use crate::physical_plan::{empty::EmptyExec, ExecutionPlan};
 
-/// A table with a schema but no data.
+/// An empty plan that is usefull for testing and generating plans

Review Comment:
   ```suggestion
   /// An empty plan that is useful for testing and generating plans
   ```
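   A minimal sketch of that testing use case, assuming a `tokio` runtime (the schema and table name are illustrative): registering an `EmptyTable` lets a query be planned and explained without any backing data.
   
   ```rust
   use std::sync::Arc;
   
   use datafusion::arrow::datatypes::{DataType, Field, Schema};
   use datafusion::datasource::empty::EmptyTable;
   use datafusion::error::Result;
   use datafusion::prelude::*;
   
   #[tokio::main]
   async fn main() -> Result<()> {
       // A table with a schema but no rows: enough for planning queries
       // without pointing at any real data.
       let schema = Arc::new(Schema::new(vec![
           Field::new("a", DataType::Int64, false),
           Field::new("b", DataType::Utf8, true),
       ]));
   
       let ctx = SessionContext::new();
       ctx.register_table("t", Arc::new(EmptyTable::new(schema)))?;
   
       // Planning works as usual; execution simply produces zero rows.
       let df = ctx.sql("SELECT a FROM t WHERE b = 'x'").await?;
       df.explain(false, false)?.show().await?;
       Ok(())
   }
   ```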


