You are viewing a plain text version of this content. The canonical link for it is here.

Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/11 15:56:59 UTC

[GitHub] [arrow-site] alamb opened a new pull request, #254: [WEBSITE] Blog post about DataFusion 13.0.0

alamb opened a new pull request, #254:
URL: https://github.com/apache/arrow-site/pull/254

   re https://github.com/apache/arrow-datafusion/issues/3671
   
   This blog describes what has been going on in DataFusion for the last 5 months


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] kou commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

kou commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002774075


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,284 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include

Review Comment:
   ```suggestion
   DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in Rust. This include
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] andygrove commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1003273001


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,284 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. [dask-sql](https://dask-sql.readthedocs.io/en/latest/) or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in Rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+# Improved Support for Cloud Object Stores
+
+DataFusion now has improved support for the major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) via the [object_store](https://crates.io/crates/object_store) crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.
+
+
+## Advanced SQL
+
+One major advance is that DataFusion now supports correlated subqueries, by rewriting them as joins. See the [Subquery](https://arrow.apache.org/datafusion/user-guide/sql/subqueries.html) page in the User Guide for more information.
+
+In addition to numerous other small improvements, here is a list of additional SQL that is now supported:
+
+- Support for `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` in `OVER` clauses [#3570]
+- `ROLLUP` and `CUBE` grouping set expressions  [#2446]
+- SUM DISTINCT aggregate support Sum distinct support [#2405]
+- `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` [#2421] [#2885]
+- Non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins [#2591]
+- Exact `MEDIAN` [#3009]
+- Support for GROUPING SETS/CUBE/ROLLUP [#2716]
+
+# More DDL Support
+Just as it is important to query, it is also important to defining data sources. The following features help users do so:
+
+ - `CREATE VIEW` [#2279]
+ - `DESCRIBE <table>` [#2642]
+ - Custom / Dynamic table provider factories [#3311]
+ - `SHOW CREATE TABLE` for support for views [#2830]
+
+# Faster Execution
+Performance is always an important goal for DataFusion, and there are a number of significant new optimizations such as
+
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  [#3527], [#2521]
+ - Reduce `left`/`right`/`full` joins to `inner` join [#2750]
+ - Convert  cross joins to inner joins when possible [#3482]
+ - Sort preserving `SortMergeJoin` [#2699]
+ - Improvements in group by and sort performance [#2375]
+ - Adaptive `regex_replace` implementation [#3518]
+
+# Optimizer Enhancements
+Internally the optimizer has been significantly enhanced as well.
+
+- Casting / coercion now happens logical planning [#3185] [#3396] [#3636]

Review Comment:
   ```suggestion
   - Casting / coercion now happens during logical planning [#3185] [#3396] [#3636]
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] waitingkuo commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

waitingkuo commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001363547


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+

Review Comment:
   - Support TimestampTz ( apache/arrow-datafusion#3660 )



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002434601


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+
+
+
+## Upcoming Work
+With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
+
+- [Complete Parquet Pushdown](https://github.com/apache/arrow-datafusion/issues/3462)
+- [Additional date/time support](https://github.com/apache/arrow-datafusion/issues/3148)
+- improve / make it easier to implement FlightSQL using DataFusion (TODO LINK)
+

Review Comment:
   Absolutely -- done in 824648efdf



##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+
+
+
+## Upcoming Work
+With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
+
+- [Complete Parquet Pushdown](https://github.com/apache/arrow-datafusion/issues/3462)
+- [Additional date/time support](https://github.com/apache/arrow-datafusion/issues/3148)
+- improve / make it easier to implement FlightSQL using DataFusion (TODO LINK)
+

Review Comment:
   Absolutely -- done in 824648efdf



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] js8544 commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

js8544 commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1006746462


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.

Review Comment:
   @alamb The google slides link seems to have protected access. I can't open it with my personal GMail account. The Slideshare link works for me though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002132743


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support

Review Comment:
   Yes, you are right -- I will figure out how to get the urls working



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] andygrove commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1003271890


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,284 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. [dask-sql](https://dask-sql.readthedocs.io/en/latest/) or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in Rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)

Review Comment:
   Perhaps we should just link to the Ballista repo here, rather than an old doc about the migration?
   
   ```suggestion
   We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://github.com/apache/arrow-ballista)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #254:
URL: https://github.com/apache/arrow-site/pull/254#issuecomment-1286039190

   @andygrove  mentions there is a draft blog for datafusion 11 that was not published that we can use for additional content: https://docs.google.com/document/d/1tPCgeB6iQPVvbRyaXft7nKqorrhv-XDqVXuMp4pG9ns/edit?usp=sharing


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] andygrove commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1003272622


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,284 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. [dask-sql](https://dask-sql.readthedocs.io/en/latest/) or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in Rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+# Improved Support for Cloud Object Stores
+
+DataFusion now has improved support for the major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) via the [object_store](https://crates.io/crates/object_store) crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.
+
+
+## Advanced SQL
+
+One major advance is that DataFusion now supports correlated subqueries, by rewriting them as joins. See the [Subquery](https://arrow.apache.org/datafusion/user-guide/sql/subqueries.html) page in the User Guide for more information.
+
+In addition to numerous other small improvements, here is a list of additional SQL that is now supported:
+
+- Support for `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` in `OVER` clauses [#3570]
+- `ROLLUP` and `CUBE` grouping set expressions  [#2446]
+- SUM DISTINCT aggregate support Sum distinct support [#2405]
+- `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` [#2421] [#2885]
+- Non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins [#2591]
+- Exact `MEDIAN` [#3009]
+- Support for GROUPING SETS/CUBE/ROLLUP [#2716]
+
+# More DDL Support
+Just as it is important to query, it is also important to defining data sources. The following features help users do so:

Review Comment:
   ```suggestion
   Just as it is important to query, it is also important to define data sources. The following features help users do so:
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] isidentical commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

isidentical commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001044813


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+

Review Comment:
   - `IS TRUE/FALSE` and `IS [NOT] UNKNOWN` (#3235), (#3246) 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] waitingkuo commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

waitingkuo commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001361473


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->

Review Comment:
   it was the result from 12.0. i'll update it soon with the latest version



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1003748517


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.

Review Comment:
   updating to use google slides link



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1006888176


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.

Review Comment:
   Thanks for the report -- I will add the slideshare link as an alternate one: https://github.com/apache/arrow-site/pull/263



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] github-actions[bot] commented on pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

github-actions[bot] commented on PR #254:
URL: https://github.com/apache/arrow-site/pull/254#issuecomment-1274924595

   <!--
     Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at
   
       http://www.apache.org/licenses/LICENSE-2.0
   
     Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.
   -->
   
   Thanks for opening a pull request!
   
   Could you open an issue for this pull request on JIRA?
   https://issues.apache.org/jira/browse/ARROW
   
   Then could you also rename pull request title in the following format?
   
       ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}
   
   See also:
   
     * [Other pull requests](https://github.com/apache/arrow-site/pulls/)
     * [Contribution Guidelines - How to contribute patches](https://arrow.apache.org/docs/developers/contributing.html#how-to-contribute-patches)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] isidentical commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

isidentical commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001036554


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->

Review Comment:
   > I wonder if we should link to clickbench??
   
   Did the clickbench results got updated with 13.0? AFAIK we should be much faster than we were compared to the initial integration time (there were a lot of slowness coming from SelectK queries and a few other optimizations like `regex_replace`, which we should handle much better now). CC: @waitingkuo 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] isidentical commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

isidentical commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001049154


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+
+
+
+## Upcoming Work
+With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
+
+- [Complete Parquet Pushdown](https://github.com/apache/arrow-datafusion/issues/3462)
+- [Additional date/time support](https://github.com/apache/arrow-datafusion/issues/3148)
+- improve / make it easier to implement FlightSQL using DataFusion (TODO LINK)
+

Review Comment:
   Not sure if it's worth including, but maybe `Improved cost estimations & Nested Join Optimizations`. (#128, #3843, #3845)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb merged pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb merged PR #254:
URL: https://github.com/apache/arrow-site/pull/254


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002132354


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.

Review Comment:
   https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf
   workf for me:
   
   ![Screen Shot 2022-10-21 at 3 41 22 PM](https://user-images.githubusercontent.com/490673/197276447-47ba73e1-d3cb-4690-be6b-f60b91fe5573.png)
   
   
   However, perhaps the google slides are better; https://docs.google.com/presentation/d/1iNX_35sWUakee2q3zMFPyHE4IV2nC3lkCK_H6Y2qK84/edit#slide=id.p ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] HaoYang670 commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

HaoYang670 commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001408407


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.

Review Comment:
   I can't access this link: `https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf`. Could anyone access it?



##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support

Review Comment:
   Also, is it better to append the URLs of the PRs directly?



##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support

Review Comment:
   ```suggestion
   - SUM DISTINCT aggregate support Sum distinct support (#2405)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002434880


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+
+
+
+## Upcoming Work
+With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
+
+- [Complete Parquet Pushdown](https://github.com/apache/arrow-datafusion/issues/3462)
+- [Additional date/time support](https://github.com/apache/arrow-datafusion/issues/3148)
+- improve / make it easier to implement FlightSQL using DataFusion (TODO LINK)
+
+
+# Community Growth
+
+The DataFusion 9.0.0 and 13.0.0 releases consists of 433 PRs from 64 distinct contributors. This does not count all the great works that goes into the [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store) libraries that the same community burtures.
+
+<!--
+$ git log --pretty=oneline 9.0.0..13.0.0 . | wc -l
+433
+
+$ git shortlog -sn 9.0.0..13.0.0 . | wc -l
+65
+-->
+
+
+# How to Get Involved
+
+
+Again, kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!
+
+If you are interested in contributing to DataFusion, we would love to
+have you join us on our journey of learning about state-of-the-art
+query processing!
+
+You can help by trying out DataFusion on some of your own data and
+projects and let us know how it goes or contribute a PR with
+documentation, tests or code. A list of open issues suitable for
+beginners is
+[here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+
+Check out our new [Communication Doc](https://arrow.apache.org/datafusion/community/communication.html) on more
+ways to engage with the community.
+
+## Appendix: Contributor Shoutout
+
+To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from `git shortlog -sn 9.0.0..13.0.0 .` Thank you all again!
+
+```
+    87	Andy Grove
+    71	Andrew Lamb
+    29	Kun Liu
+    18	kmitchener
+    17	Wei-Ting Kuo
+    14	Yang Jiang
+    13	dependabot[bot]
+    12	Raphael Taylor-Davies
+    11	Batuhan Taskaya
+    11	Kirk Mitchener

Review Comment:
   done



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002434198


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+

Review Comment:
   In d02bb30657 👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on PR #254:
URL: https://github.com/apache/arrow-site/pull/254#issuecomment-1289609544

   I plan to update the dates on this PR and publish it tomorrow unless anyone needs more time to review. Please just let me know if you do so


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] kou commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

kou commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002773849


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,284 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).

Review Comment:
   ```suggestion
   While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. [dask-sql](https://dask-sql.readthedocs.io/en/latest/) or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] andygrove commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002479373


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,284 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+# Improved Support for Cloud Object Stores
+
+DataFusion now has improved support for the major cloud object stores (Amazon S3, Azure Blog Storage, and Google Cloud Storage) via the [object_store](https://crates.io/crates/object_store) crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.

Review Comment:
   ```suggestion
   DataFusion now has improved support for the major cloud object stores (Amazon S3, Azure Blob Storage, and Google Cloud Storage) via the [object_store](https://crates.io/crates/object_store) crate. Using this integration, DataFusion optimizes reading parquet files by reading only the parts of the files that are needed.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] andygrove commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

andygrove commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001079316


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+
+# Optimizer Enhancements
+- Casting / coercion now happens logical planning (#3185) (#3396) (#3636)
+- More sophisticated expression analysis and simplification
+
+# Parquet
+ - Parquet reader can now read directly from remote parquet files on object storage (#2489) (#2677) (#3051)
+ - Experimental support for “predicate pushdown” that implements late materialization after filtering during the scan (we plan another blog post on this soon).
+ - Support reading directly from S3 via `datafusion-cli ` (#3631)
+
+# DataType Support
+- Support for `IN` LIST,  support, casting, etc for `Decimal`
+- Timestamps: `date_bin` built-in function (#3034), timestamp plus minus interval (#3110), TIME literal values (#3010)
+- Binary operations (AND, XOR, etc):  (#3037) (#1619) (#3420) (#3430)
+
+
+
+## Upcoming Work
+With the community growing and code accelerating, there is so much great stuff on the horizon. Some features we expect to land in the next few months:
+
+- [Complete Parquet Pushdown](https://github.com/apache/arrow-datafusion/issues/3462)
+- [Additional date/time support](https://github.com/apache/arrow-datafusion/issues/3148)
+- improve / make it easier to implement FlightSQL using DataFusion (TODO LINK)
+
+
+# Community Growth
+
+The DataFusion 9.0.0 and 13.0.0 releases consists of 433 PRs from 64 distinct contributors. This does not count all the great works that goes into the [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store) libraries that the same community burtures.
+
+<!--
+$ git log --pretty=oneline 9.0.0..13.0.0 . | wc -l
+433
+
+$ git shortlog -sn 9.0.0..13.0.0 . | wc -l
+65
+-->
+
+
+# How to Get Involved
+
+
+Again, kudos to everyone in the community who contributed ideas, discussions, bug reports, documentation and code. It is exciting to be building something so cool together!
+
+If you are interested in contributing to DataFusion, we would love to
+have you join us on our journey of learning about state-of-the-art
+query processing!
+
+You can help by trying out DataFusion on some of your own data and
+projects and let us know how it goes or contribute a PR with
+documentation, tests or code. A list of open issues suitable for
+beginners is
+[here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+
+Check out our new [Communication Doc](https://arrow.apache.org/datafusion/community/communication.html) on more
+ways to engage with the community.
+
+## Appendix: Contributor Shoutout
+
+To give a sense of the number of people who contribute to this project regularly, we present for your consideration the following list derived from `git shortlog -sn 9.0.0..13.0.0 .` Thank you all again!
+
+```
+    87	Andy Grove
+    71	Andrew Lamb
+    29	Kun Liu
+    18	kmitchener
+    17	Wei-Ting Kuo
+    14	Yang Jiang
+    13	dependabot[bot]
+    12	Raphael Taylor-Davies
+    11	Batuhan Taskaya
+    11	Kirk Mitchener

Review Comment:
   We should combine stats for `kmitchener` and `Kirk Mitchener`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] isidentical commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

isidentical commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1001039605


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+

Review Comment:
   Adaptive `regex_replace` implementation: https://github.com/apache/arrow-datafusion/issues/3518



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [arrow-site] alamb commented on a diff in pull request #254: [WEBSITE] Blog post about DataFusion 13.0.0

Posted by GitBox <gi...@apache.org>.

alamb commented on code in PR #254:
URL: https://github.com/apache/arrow-site/pull/254#discussion_r1002199113


##########
_posts/2022-10-20-datafusion-13.0.0.md:
##########
@@ -0,0 +1,226 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 13.0.0 Project Update"
+date: "2022-10-20 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[Apache Arrow DataFusion](https://arrow.apache.org/datafusion/) [`13.0.0`](https://crates.io/crates/datafusion) is released, and this blog contains an update on the project for the 5 months since our [last update in May 2022](https://arrow.apache.org/blog/2022/05/16/datafusion-8.0.0/).
+
+DataFusion is an extensible and embeddable query engine, written in Rust used to create modern, fast and efficient data pipelines, ETL processes, and database systems. You may want to check out DataFusion to extend your Rust project with:
+
+- [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html),
+- [DataFrame API](https://docs.rs/datafusion/13.0.0/datafusion/dataframe/struct.DataFrame.html),
+- A custom Domain Specific Query Language
+- The ability to easily and quickly read and process Parquet, JSON, Avro or CSV data.
+- To read from remote object stores such as AWS S3, Azure Blob Storage, GCP.
+
+Even though DataFusion is 4 years "young," it has seen significant community growth in the last few months and the momentum continues to accelerate.
+
+# Background
+
+
+DataFusion is used as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability. 2022 has validated our belief in the need for such a ["LLVM for database and AI systems"](https://www.slideshare.net/AndrewLamb32/20220623-apache-arrow-and-datafusion-changing-the-game-for-implementing-database-systemspdf) with announcements such as the [release of FaceBook's Velox](https://engineering.fb.com/2022/08/31/open-source/velox/) engine, the major investments in [Acero](https://arrow.apache.org/docs/cpp/streaming_execution.html) as well as the continued popularity of [Apache Calcite](https://calcite.apache.org/) and other similar technologies.
+
+While Velox and Acero focus on execution engines, DataFusion provides the entire suite of components needed to build most analytic systems, including a SQL frontend, a dataframe API, and  extension points for just about everything. Some [DataFusion users](https://github.com/apache/arrow-datafusion#known-uses) use a subset of the features such as the frontend (e.g. (dask-sql)[https://dask-sql.readthedocs.io/en/latest/] or the execution engine, such as [Blaze](https://github.com/blaze-init/blaze), and some users use many different components to build both SQL based and customized DSL based systems such as [InfluxDB IOx](https://github.com/influxdata/influxdb_iox/pulls) and [VegaFusion](https://github.com/vegafusion/vegafusion).
+
+One of DataFusion’s advantages is its implementation in [Rust](https://www.rust-lang.org/) and thus its easy integration with the broader Rust ecosystem. Rust continues to be a major source of benefit, from the [ease of parallelization with the high quality and standardized `async` ecosystem](https://www.influxdata.com/blog/using-rustlangs-async-tokio-runtime-for-cpu-bound-tasks/) , as well as its modern dependency management system and wonderful performance. <!-- I wonder if we should link to clickbench?? -->
+<!--While we haven’t invested in the benchmarking ratings game datafusion continues to be quite speedy (todo quantity this, with some evidence) – maybe clickbench?-->
+
+<!-- I am not sure this section needed -->
+DataFusion belongs to and builds on the Apache Arrow ecosystem and its family of related technologies and projects in rust. This include
+- [Arrow](https://arrow.apache.org/): low level array representation
+- [Parquet](https://parquet.apache.org/): high performance, full featured support for the parquet file format
+
+
+
+# DataFusion in Action
+
+While DataFusion really shines as an embeddable query engine, if you want to try it out and get a feel for its power, you can use the basic[`datafusion-cli`](https://docs.rs/datafusion-cli/13.0.0/datafusion_cli/) tool to get a sense for what is possible to add in your application
+
+(TODO example here of using datafusion-cli to query from local parquet files on disk)
+
+TODO: also mention you can use the same thing to query data from S3
+
+
+
+# Summary
+
+We have increased the frequency of DataFusion releases to monthly instead of quarterly. This
+makes it easier for the increasing number of projects that now depend on DataFusion.
+
+We have also completed the "graduation" of [Ballista to its own top-level arrow-ballista repository](https://docs.google.com/document/d/1jNRbadyStSrV5kifwn0khufAwq6OnzGczG4z8oTQJP4/edit?usp=sharing)
+which decouples the two projects and allows each project to move even faster.
+
+Along with numerous other bug fixes and smaller improvements, here are some of the major advances:
+
+## Advanced SQL
+- Support for GROUPING SETS/CUBE/ROLLUP (#2716)
+- Custom window frame logic (support `ROWS`, `RANGE`, `PRECEDING` and `FOLLOWING` for window functions) (#3570)
+- Add support for correlated subqueries (#2885)
+- `ROLLUP` and `CUBE` grouping set expressions (#2446)
+- #2405 SUM DISTINCT aggregate support Sum distinct support
+- Support for `IN` and `NOT IN` Subqueries by rewriting them to `SEMI` / `ANTI` (#2421)
+- Support for non equality predicates in  `ON` clause of  `LEFT`, `RIGHT, `and `FULL` joins (#2591)
+- Exact `MEDIAN` (#3009)
+
+# More DDL Support
+ - `CREATE VIEW` (#2279)
+ - `DESCRIBE <table>` (#2642)
+ - Custom / Dynamic table provider factories (#3311)
+ - `SHOW CREATE TABLE` for support for views (#2830)
+
+# Faster Execution
+ - Optimizations of TopK (queries with a `LIMIT` or `OFFSET` clause):  (#3527), (#2521)
+ - Reduce `left`/`right`/`full` joins to `inner` join (#2750)
+ - Convert  cross joins to inner joins when possible (#3482)
+ - Sort preserving `SortMergeJoin` (#2699)
+ - Improvements in group by and sort performance (#2375)
+

Review Comment:
   added -- thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org