Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/07/08 04:22:19 UTC

[GitHub] [arrow-datafusion] andygrove opened a new pull request, #2854: Various updates to top-level README

andygrove opened a new pull request, #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854

# Which issue does this PR close?

Closes https://github.com/apache/arrow-datafusion/issues/2850

# Rationale for this change

The project has matured a lot, and we need to promote all the great work and encourage more people to try DataFusion out.

# What changes are included in this PR?

- More information on features and use cases
- Update list of projects using DataFusion

# Are there any user-facing changes?

No

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #2854: Various updates to top-level README

Posted by GitBox <gi...@apache.org>.

tustvold commented on code in PR #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854#discussion_r916720045


##########
README.md:
##########
@@ -21,52 +21,70 @@
 
 <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256"/>
 
-DataFusion is an extensible query execution framework, written in
+DataFusion is an extensible query planning, optimization, and execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+## Features
 
-DataFusion also supports distributed query execution via the
-[Ballista](https://github.com/apache/arrow-ballista/) crate.
+- SQL query planner with support for multiple SQL dialects
+- DataFrame API
+- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
+  file formats can be supported by implementing a `TableProvider` trait.
+- Supports popular object stores, including AWS S3, Azure Blob

Review Comment:
   Yes, it also supports GCS




[GitHub] [arrow-datafusion] timvw commented on a diff in pull request #2854: Various updates to top-level README

Posted by GitBox <gi...@apache.org>.

timvw commented on code in PR #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854#discussion_r918323739


##########
README.md:
##########
@@ -21,52 +21,70 @@
 
 <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256"/>
 
-DataFusion is an extensible query execution framework, written in
+DataFusion is an extensible query planning, optimization, and execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+## Features
 
-DataFusion also supports distributed query execution via the
-[Ballista](https://github.com/apache/arrow-ballista/) crate.
+- SQL query planner with support for multiple SQL dialects
+- DataFrame API
+- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
+  file formats can be supported by implementing a `TableProvider` trait.
+- Supports popular object stores, including AWS S3, Azure Blob
+  Storage, and Google Cloud Storage. There are extension points for implementing
+  custom object stores.
 
 ## Use Cases
 
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+DataFusion is modular in design with many extension points and can be
+used without modification as an embedded query engine and can also provide
+a foundation for building new systems. Here are some example use cases:
+
+- DataFusion can be used as a SQL query planner and query optimizer, providing
+  optimized logical plans that can then be mapped to other execution engines.
+- DataFusion is used to create modern, fast and efficient data
+  pipelines, ETL processes, and database systems, which need the
+  performance of Rust and Apache Arrow and want to provide their users
+  the convenience of an SQL interface or a DataFrame API.
 
 ## Why DataFusion?
 
 - _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
 
-## Known Uses
+## DataFusion Community Extensions
 
-Projects that adapt to or serve as plugins to DataFusion:
+There are a number of community projects that extend DataFusion or provide integrations with other systems.
 
+### Language Bindings
+
+- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c)
 - [datafusion-python](https://github.com/datafusion-contrib/datafusion-python)
+- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby)
 - [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
-- [datafusion-objectstore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
-- [datafusion-objectstore-hdfs](https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)
+
+### Integrations
+
 - [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
-- [datafusion-objectstore-azure](https://github.com/datafusion-contrib/datafusion-objectstore-azure)
+- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue)
+- [datafusion-substrait](https://github.com/datafusion-contrib/datafusion-substrait)
+
+## Known Uses
 
 Here are some of the projects known to use DataFusion:
 
-- [Ballista](https://github.com/apache/arrow-ballista) Distributed Compute Platform
+- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine
 - [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core
 - [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
 - [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [delta-rs](https://github.com/delta-io/delta-rs)
+- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion
+- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake
 - [Flock](https://github.com/flock-lab/flock)
 - [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+- [qv](https://github.com/timvw/qv) Quickly view your data

Review Comment:
   Yes. (Sorry for the late replies, enjoying holidays this month ;)




[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2854: Various updates to top-level README

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854#discussion_r916452705


##########
README.md:
##########
@@ -21,52 +21,70 @@
 
 <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256"/>
 
-DataFusion is an extensible query execution framework, written in
+DataFusion is an extensible query planning, optimization, and execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+## Features
 
-DataFusion also supports distributed query execution via the
-[Ballista](https://github.com/apache/arrow-ballista/) crate.
+- SQL query planner with support for multiple SQL dialects
+- DataFrame API
+- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
+  file formats can be supported by implementing a `TableProvider` trait.
+- Supports popular object stores, including AWS S3, Azure Blob

Review Comment:
   @tustvold Have I stated this correctly? I am not familiar with the recent changes.




[GitHub] [arrow-datafusion] codecov-commenter commented on pull request #2854: Various updates to top-level README

Posted by GitBox <gi...@apache.org>.

codecov-commenter commented on PR #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854#issuecomment-1178547698

   # [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2854?src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) Report
   > Merging [#2854](https://codecov.io/gh/apache/arrow-datafusion/pull/2854?src=pr&el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (d0eaca4) into [master](https://codecov.io/gh/apache/arrow-datafusion/commit/0ce6f1b1fd8fbf94238db913f8bc884e3c3c6aeb?el=desc&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) (0ce6f1b) will **increase** coverage by `0.01%`.
   > The diff coverage is `n/a`.
   
   ```diff
   @@            Coverage Diff             @@
   ##           master    #2854      +/-   ##
   ==========================================
   + Coverage   85.23%   85.25%   +0.01%     
   ==========================================
     Files         275      275              
     Lines       48962    49002      +40     
   ==========================================
   + Hits        41735    41775      +40     
     Misses       7227     7227              
   ```
   
   
   | [Impacted Files](https://codecov.io/gh/apache/arrow-datafusion/pull/2854?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation) | Coverage Δ | |
   |---|---|---|
   | [datafusion/expr/src/logical\_plan/plan.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9sb2dpY2FsX3BsYW4vcGxhbi5ycw==) | `74.31% <0.00%> (-0.20%)` | :arrow_down: |
   | [datafusion/expr/src/binary\_rule.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9leHByL3NyYy9iaW5hcnlfcnVsZS5ycw==) | `85.04% <0.00%> (+0.27%)` | :arrow_up: |
   | [datafusion/core/src/physical\_plan/metrics/value.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9waHlzaWNhbF9wbGFuL21ldHJpY3MvdmFsdWUucnM=) | `87.43% <0.00%> (+0.50%)` | :arrow_up: |
   | [datafusion/core/src/dataframe.rs](https://codecov.io/gh/apache/arrow-datafusion/pull/2854/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation#diff-ZGF0YWZ1c2lvbi9jb3JlL3NyYy9kYXRhZnJhbWUucnM=) | `87.65% <0.00%> (+1.42%)` | :arrow_up: |
   
   ------
   
   [Continue to review full report at Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2854?src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation)
   > `Δ = absolute <relative> (impact)`, `ø = not affected`, `? = missing data`
   > Powered by [Codecov](https://codecov.io/gh/apache/arrow-datafusion/pull/2854?src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Last update [0ce6f1b...d0eaca4](https://codecov.io/gh/apache/arrow-datafusion/pull/2854?src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=The+Apache+Software+Foundation).
   



[GitHub] [arrow-datafusion] andygrove commented on a diff in pull request #2854: Various updates to top-level README

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854#discussion_r916451408


##########
README.md:
##########
@@ -21,52 +21,70 @@
 
 <img src="docs/source/_static/images/DataFusion-Logo-Background-White.svg" width="256"/>
 
-DataFusion is an extensible query execution framework, written in
+DataFusion is an extensible query planning, optimization, and execution framework, written in
 Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
 in-memory format.
 
-DataFusion supports both an SQL and a DataFrame API for building
-logical query plans as well as a query optimizer and execution engine
-capable of parallel execution against partitioned data sources (CSV
-and Parquet) using threads.
+## Features
 
-DataFusion also supports distributed query execution via the
-[Ballista](https://github.com/apache/arrow-ballista/) crate.
+- SQL query planner with support for multiple SQL dialects
+- DataFrame API
+- Parquet, CSV, JSON, and Avro file formats are supported natively. Custom
+  file formats can be supported by implementing a `TableProvider` trait.
+- Supports popular object stores, including AWS S3, Azure Blob
+  Storage, and Google Cloud Storage. There are extension points for implementing
+  custom object stores.
 
 ## Use Cases
 
-DataFusion is used to create modern, fast and efficient data
-pipelines, ETL processes, and database systems, which need the
-performance of Rust and Apache Arrow and want to provide their users
-the convenience of an SQL interface or a DataFrame API.
+DataFusion is modular in design with many extension points and can be
+used without modification as an embedded query engine and can also provide
+a foundation for building new systems. Here are some example use cases:
+
+- DataFusion can be used as a SQL query planner and query optimizer, providing
+  optimized logical plans that can then be mapped to other execution engines.
+- DataFusion is used to create modern, fast and efficient data
+  pipelines, ETL processes, and database systems, which need the
+  performance of Rust and Apache Arrow and want to provide their users
+  the convenience of an SQL interface or a DataFrame API.
 
 ## Why DataFusion?
 
 - _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
 - _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific use case
 - _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
 
-## Known Uses
+## DataFusion Community Extensions
 
-Projects that adapt to or serve as plugins to DataFusion:
+There are a number of community projects that extend DataFusion or provide integrations with other systems.
 
+### Language Bindings
+
+- [datafusion-c](https://github.com/datafusion-contrib/datafusion-c)
 - [datafusion-python](https://github.com/datafusion-contrib/datafusion-python)
+- [datafusion-ruby](https://github.com/datafusion-contrib/datafusion-ruby)
 - [datafusion-java](https://github.com/datafusion-contrib/datafusion-java)
-- [datafusion-objectstore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
-- [datafusion-objectstore-hdfs](https://github.com/datafusion-contrib/datafusion-objectstore-hdfs)
+
+### Integrations
+
 - [datafusion-bigtable](https://github.com/datafusion-contrib/datafusion-bigtable)
-- [datafusion-objectstore-azure](https://github.com/datafusion-contrib/datafusion-objectstore-azure)
+- [datafusion-catalogprovider-glue](https://github.com/datafusion-contrib/datafusion-catalogprovider-glue)
+- [datafusion-substrait](https://github.com/datafusion-contrib/datafusion-substrait)
+
+## Known Uses
 
 Here are some of the projects known to use DataFusion:
 
-- [Ballista](https://github.com/apache/arrow-ballista) Distributed Compute Platform
+- [Ballista](https://github.com/apache/arrow-ballista) Distributed SQL Query Engine
 - [Blaze](https://github.com/blaze-init/blaze) Spark accelerator with DataFusion at its core
 - [Cloudfuse Buzz](https://github.com/cloudfuse-io/buzz-rust)
 - [Cube Store](https://github.com/cube-js/cube.js/tree/master/rust)
-- [delta-rs](https://github.com/delta-io/delta-rs)
+- [datafusion-tui](https://github.com/datafusion-contrib/datafusion-tui) Text UI for DataFusion
+- [delta-rs](https://github.com/delta-io/delta-rs) Native Rust implementation of Delta Lake
 - [Flock](https://github.com/flock-lab/flock)
 - [InfluxDB IOx](https://github.com/influxdata/influxdb_iox) Time Series Database
+- [qv](https://github.com/timvw/qv) Quickly view your data

Review Comment:
   @timvw I assume you are ok with having this listed here?




[GitHub] [arrow-datafusion] andygrove merged pull request #2854: Various updates to top-level README

Posted by GitBox <gi...@apache.org>.

andygrove merged PR #2854:
URL: https://github.com/apache/arrow-datafusion/pull/2854

