Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/06/12 01:08:57 UTC

[GitHub] [arrow-datafusion] alamb opened a new pull request, #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

alamb opened a new pull request, #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639

   # Which issue does this PR close?
   
   Closes https://github.com/apache/arrow-datafusion/issues/3935
   
   Related to https://github.com/apache/arrow-datafusion/discussions/6441
   
   # Rationale for this change
   
   Our roadmap is somewhat out of date, as it refers to several projects that seem to have been completed (the relevant tickets have been closed).
   
   Also, I am working on https://github.com/apache/arrow-datafusion/issues/5812 and wanted to have an up-to-date roadmap to discuss.
   
   Also, we recently had a discussion (https://github.com/apache/arrow-datafusion/discussions/6441) about the vision of DataFusion, which should be reflected in the user-facing documentation.
   
   # What changes are included in this PR?
   
   1. Update the roadmap section of the docs to point at GitHub epics
   2. Incorporate feedback from https://github.com/apache/arrow-datafusion/discussions/6441 (turns out what was on the site was already pretty close)
   
   # Are these changes tested?
   N/A
   <!--
   We typically require tests for all PRs in order to:
   1. Prevent the code from being accidentally broken by subsequent changes
   2. Serve as another way to document the expected behavior of the code
   
   If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?
   -->
   
   # Are there any user-facing changes?
   
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] avantgardnerio commented on a diff in pull request #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.

avantgardnerio commented on code in PR #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639#discussion_r1226962856


##########
docs/source/user-guide/introduction.md:
##########
@@ -34,37 +46,47 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
 - Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
   other query languages, custom plan and execution nodes, optimizer passes, and more.
 - Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-  `ObjectStore` trait.
+  Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
+  `ObjectStore` trait).
 - [Excellent Documentation](https://docs.rs/datafusion/latest) and a
   [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
-- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-  automatic join reordering, expression coercion, and more.
-- Permissive Apache 2.0 License, Apache Software Foundation governance
-- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
-  productivity similar to Java or Golang, the performance of C++, and
-  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
-- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
-  with other projects, and to pass plans across language boundaries.
+- A state of the art query optimizer with expression coercion and
+  simplification, projection and filter pushdown, sort and distribution
+  aware optimizations, automatic join reordering, and more.
+- Permissive Apache 2.0 License, predictable and well understood
+  [Apache Software Foundation](https://www.apache.org/) governance.
+- Implementation in [Rust](https://www.rust-lang.org/), a modern
+  system language with development productivity similar to Java or
+  Golang, the performance of C++, and [loved by programmers
+  everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) query plans, to
+  easily pass plans across language and system boundaries.
 
 ## Use Cases
 
 DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:
+building new systems.
+
+While most current usecases are "analytic" or (throughput) some

Review Comment:
   I'm not sure I could say it any better.




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639#discussion_r1225986515


##########
datafusion/core/src/lib.rs:
##########
@@ -132,7 +132,13 @@
 //!
 //! ## Customization and Extension
 //!
-//! DataFusion supports extension at many points:
+//! DataFusion is designed to be a "disaggregated" query engine.  This

Review Comment:
   This is trying to address @boazberman 's comments in https://github.com/apache/arrow-datafusion/discussions/6441#discussioncomment-6051065



##########
docs/source/contributor-guide/roadmap.md:
##########
@@ -19,100 +19,27 @@ under the License.
 
 # Roadmap
 
-This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.
-
-DataFusion and Ballista are part of the [Apache
-Arrow](https://arrow.apache.org/) project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a github issue or the
-dev@arrow.apache.org mailing list to make review efficient and avoid
-surprises.
-
-## DataFusion
-
-DataFusion's goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/)
-to provide:
-
-1. Best-in-class single node query performance
-2. A Declarative SQL query interface compatible with PostgreSQL
-3. A Dataframe API, similar to those offered by Pandas and Spark
-4. A Procedural API for programmatically creating and running execution plans
-5. High performance, data race free, ergonomic extensibility points at at every layer
-
-### Additional SQL Language Features
-
-- Decimal Support [#122](https://github.com/apache/arrow-datafusion/issues/122)
-- Complete support list on [status](https://github.com/apache/arrow-datafusion/blob/main/README.md#status)
-- Timestamp Arithmetic [#194](https://github.com/apache/arrow-datafusion/issues/194)
-- SQL Parser extension point [#533](https://github.com/apache/arrow-datafusion/issues/533)
-- Support for nested structures (fields, lists, structs) [#119](https://github.com/apache/arrow-datafusion/issues/119)
-- Run all queries from the TPCH benchmark (see [milestone](https://github.com/apache/arrow-datafusion/milestone/2) for more details)
-
-### Query Optimizer
-
-- More sophisticated cost based optimizer for join ordering
-- Implement advanced query optimization framework (Tokomak) [#440](https://github.com/apache/arrow-datafusion/issues/440)
-- Finer optimizations for group by and aggregate functions
-
-### Datasources
-
-- Better support for reading data from remote filesystems (e.g. S3) without caching it locally [#907](https://github.com/apache/arrow-datafusion/issues/907) [#1060](https://github.com/apache/arrow-datafusion/issues/1060)
-- Improve performances of file format datasources (parallelize file listings, async Arrow readers, file chunk prefetching capability...)
-
-### Runtime / Infrastructure
-
-- Migrate to some sort of arrow2 based implementation (see [milestone](https://github.com/apache/arrow-datafusion/milestone/3) for more details)
-- Add DataFusion to h2oai/db-benchmark [#147](https://github.com/apache/arrow-datafusion/issues/147)
-- Improve build time [#348](https://github.com/apache/arrow-datafusion/issues/348)
-
-### Resource Management
-
-- Finer grain control and limit of runtime memory [#587](https://github.com/apache/arrow-datafusion/issues/587) and CPU usage [#54](https://github.com/apache/arrow-datafusion/issues/64)
-
-### Python Interface
-
-TBD
-
-### DataFusion CLI (`datafusion-cli`)
-
-Note: There are some additional thoughts on a datafusion-cli vision on [#1096](https://github.com/apache/arrow-datafusion/issues/1096#issuecomment-939418770).
-
-- Better abstraction between REPL parsing and queries so that commands are separated and handled correctly
-- Connect to the `Statistics` subsystem and have the cli print out more stats for query debugging, etc.
-- Improved error handling for interactive use and shell scripting usage
-- publishing to apt, brew, and possible NuGet registry so that people can use it more easily
-- adopt a shorter name, like dfcli?
-
-## Ballista
-
-Ballista is a distributed compute platform based on Apache Arrow and DataFusion. It provides a query scheduler that
-breaks a physical plan into stages and tasks and then schedules tasks for execution across the available executors
-in the cluster.
-
-Having Ballista as part of the DataFusion codebase helps ensure that DataFusion remains suitable for distributed
-compute. For example, it helps ensure that physical query plans can be serialized to protobuf format and that they
-remain language-agnostic so that executors can be built in languages other than Rust.
-
-### Ballista Roadmap
-
-### Move query scheduler into DataFusion
-
-The Ballista scheduler has some advantages over DataFusion query execution because it doesn't try to eagerly execute
-the entire query at once but breaks it down into a directionally-acyclic graph (DAG) of stages and executes a
-configurable number of stages and tasks concurrently. It should be possible to push some of this logic down to
-DataFusion so that the same scheduler can be used to scale across cores in-process and across nodes in a cluster.
-
-### Implement execution-time cost-based optimizations based on statistics
-
-After the execution of a query stage, accurate statistics are available for the resulting data. These statistics
-could be leveraged by the scheduler to optimize the query during execution. For example, when performing a hash join
-it is desirable to load the smaller side of the join into memory and in some cases we cannot predict which side will
-be smaller until execution time.
+The [project introduction](../user-guide/introduction) explains the
+overview and goals of DataFusion, and our development efforts largely
+align to that vision.
+
+## Planning `EPIC`s
+
+DataFusion uses [GitHub

Review Comment:
   I began this PR by trying to summarize the outstanding work, and to do so I looked at the `EPIC`s -- pretty soon I found that I was just replicating https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+epic in a markdown document that would end up out of date.
   
   While a more free-form version of the roadmap in text (rather than a GitHub issue list) is probably easier to consume, unless we have a volunteer to commit to maintaining it, keeping our efforts focused on keeping GitHub updated seemed better.



##########
docs/source/contributor-guide/roadmap.md:
##########
@@ -19,100 +19,27 @@ under the License.
 
 # Roadmap
 
-This document describes high level goals of the DataFusion and
-Ballista development community. It is not meant to restrict
-possibilities, but rather help newcomers understand the broader
-context of where the community is headed, and inspire
-additional contributions.
-
-DataFusion and Ballista are part of the [Apache
-Arrow](https://arrow.apache.org/) project and governed by the Apache
-Software Foundation governance model. These projects are entirely
-driven by volunteers, and we welcome contributions for items not on
-this roadmap. However, before submitting a large PR, we strongly
-suggest you start a conversation using a github issue or the
-dev@arrow.apache.org mailing list to make review efficient and avoid
-surprises.
-
-## DataFusion
-
-DataFusion's goal is to become the embedded query engine of choice
-for new analytic applications, by leveraging the unique features of
-[Rust](https://www.rust-lang.org/) and [Apache Arrow](https://arrow.apache.org/)
-to provide:
-
-1. Best-in-class single node query performance

Review Comment:
   These goals are largely redundant with the introduction, so I figured it would be better to leave a link and direct people back there rather than partially replicate the content



##########
docs/source/user-guide/introduction.md:
##########
@@ -34,37 +46,47 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
 - Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
   other query languages, custom plan and execution nodes, optimizer passes, and more.
 - Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-  `ObjectStore` trait.
+  Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
+  `ObjectStore` trait).
 - [Excellent Documentation](https://docs.rs/datafusion/latest) and a
   [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
-- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-  automatic join reordering, expression coercion, and more.
-- Permissive Apache 2.0 License, Apache Software Foundation governance
-- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
-  productivity similar to Java or Golang, the performance of C++, and
-  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
-- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
-  with other projects, and to pass plans across language boundaries.
+- A state of the art query optimizer with expression coercion and
+  simplification, projection and filter pushdown, sort and distribution
+  aware optimizations, automatic join reordering, and more.
+- Permissive Apache 2.0 License, predictable and well understood
+  [Apache Software Foundation](https://www.apache.org/) governance.
+- Implementation in [Rust](https://www.rust-lang.org/), a modern
+  system language with development productivity similar to Java or
+  Golang, the performance of C++, and [loved by programmers
+  everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) query plans, to
+  easily pass plans across language and system boundaries.
 
 ## Use Cases
 
 DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:
+building new systems.
+
+While most current usecases are "analytic" or (throughput) some

Review Comment:
   This is trying to channel @avantgardnerio 's suggestion on https://github.com/apache/arrow-datafusion/discussions/6441#discussioncomment-6001862 though I am not sure how faithfully I have done so




[GitHub] [arrow-datafusion] paddyhoran commented on a diff in pull request #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

Posted by "paddyhoran (via GitHub)" <gi...@apache.org>.

paddyhoran commented on code in PR #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639#discussion_r1227170826


##########
docs/source/user-guide/introduction.md:
##########
@@ -34,37 +46,47 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
 - Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
   other query languages, custom plan and execution nodes, optimizer passes, and more.
 - Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-  `ObjectStore` trait.
+  Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
+  `ObjectStore` trait).
 - [Excellent Documentation](https://docs.rs/datafusion/latest) and a
   [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
-- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-  automatic join reordering, expression coercion, and more.
-- Permissive Apache 2.0 License, Apache Software Foundation governance
-- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
-  productivity similar to Java or Golang, the performance of C++, and
-  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
-- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
-  with other projects, and to pass plans across language boundaries.
+- A state of the art query optimizer with expression coercion and
+  simplification, projection and filter pushdown, sort and distribution
+  aware optimizations, automatic join reordering, and more.
+- Permissive Apache 2.0 License, predictable and well understood
+  [Apache Software Foundation](https://www.apache.org/) governance.
+- Implementation in [Rust](https://www.rust-lang.org/), a modern
+  system language with development productivity similar to Java or
+  Golang, the performance of C++, and [loved by programmers
+  everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) query plans, to
+  easily pass plans across language and system boundaries.
 
 ## Use Cases
 
 DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:
+building new systems.
+
+While most current usecases are "analytic" or (throughput) some

Review Comment:
   I don't think you need the `or` after "analytic" though?




[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on code in PR #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639#discussion_r1227338033


##########
docs/source/user-guide/introduction.md:
##########
@@ -34,37 +46,47 @@ DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchm
 - Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
   other query languages, custom plan and execution nodes, optimizer passes, and more.
 - Streaming, asynchronous IO directly from popular object stores, including AWS S3,
-  Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
-  `ObjectStore` trait.
+  Azure Blob Storage, and Google Cloud Storage (Other storage systems are supported via the
+  `ObjectStore` trait).
 - [Excellent Documentation](https://docs.rs/datafusion/latest) and a
   [welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
-- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
-  automatic join reordering, expression coercion, and more.
-- Permissive Apache 2.0 License, Apache Software Foundation governance
-- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
-  productivity similar to Java or Golang, the performance of C++, and
-  [loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
-- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
-  with other projects, and to pass plans across language boundaries.
+- A state of the art query optimizer with expression coercion and
+  simplification, projection and filter pushdown, sort and distribution
+  aware optimizations, automatic join reordering, and more.
+- Permissive Apache 2.0 License, predictable and well understood
+  [Apache Software Foundation](https://www.apache.org/) governance.
+- Implementation in [Rust](https://www.rust-lang.org/), a modern
+  system language with development productivity similar to Java or
+  Golang, the performance of C++, and [loved by programmers
+  everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
+- Support for [Substrait](https://substrait.io/) query plans, to
+  easily pass plans across language and system boundaries.
 
 ## Use Cases
 
 DataFusion can be used without modification as an embedded SQL
 engine or can be customized and used as a foundation for
-building new systems. Here are some examples of systems built using DataFusion:
+building new systems.
+
+While most current usecases are "analytic" or (throughput) some

Review Comment:
   ```suggestion
   While most current usecases are "analytic" (throughput) some
   ```
   
   Nice catch -- 🦅 👁️ 




[GitHub] [arrow-datafusion] alamb commented on pull request #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb commented on PR #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639#issuecomment-1590002646

   I plan to leave this PR open for a few more days to make sure anyone who is interested gets a chance to reply / comment. I'll try and merge it in towards the end of the week



[GitHub] [arrow-datafusion] alamb merged pull request #6639: Docs: Update roadmap to point at EPIC's, clarify project goals

Posted by "alamb (via GitHub)" <gi...@apache.org>.

alamb merged PR #6639:
URL: https://github.com/apache/arrow-datafusion/pull/6639

