You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2023/01/07 16:12:34 UTC
[GitHub] [arrow-site] alamb commented on a diff in pull request #294: [WEBSITE] DataFusion 16.0.0 blog post (WIP)

alamb commented on code in PR #294:
URL: https://github.com/apache/arrow-site/pull/294#discussion_r1064019922


##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)

Review Comment:
   @tustvold  do you have any suggstions about what numbers to use here?



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)
+* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage <!-- Andrew nots: we should really get this turned on by default -->
+* Improved `IN` expressions significantly faster  Simplify InListExpr ~20-70% Faster ([#4057])
+* Sort and partition aware optimizations such as  #3969 and  #4691,  skipping potentially expensive operations
+* Basic filter selectivity analysis (#3868)
+
+
+In the coming few months, we plan work on:
+* Improved grouping performance (TODO link)
+* bloom filtering
+* investigate RLE (Run End Encoding support) (todo Arrow link)
+* Enable predicate pushdown by default for all cases
+* OTHERS?
+
+## Runtime Resource Limits
+
+Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
+
+In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail.
+
+
+## SQL Window Function
+[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL))  are useful for a variety of analysis and DataFusion's implementation is close to complete now.
+
+- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)`
+- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING)`
+- Support for the `NTILE` window function (#4676)
+- Support for `GROUPS` mode (#4155)
+
+
+# Improved Join Support
+
+Joins are often the most complicated operations to handle well in
+analytics systems and DataFusion 16.0.0 offers significant improvements
+such as
+
+- Cost based optimizer (CBO) that can automatically reorder join evaluations, algorithm (Merge / Hash), and pick build side based on available statistics join type (`INNER`, `LEFT`, etc) (#4219)
+- Fast non `column=column` equijoins such as `JOIN ON a.x + 5 = b.y`
+- Better optimized non-equijoins (nested loops rather than cross join and filter) (#4562) <!-- TODO is this a good thing to mention as any time this is usd the query is going to go slow or the data size is small -->
+
+# Streaming Execution
+
+TODO: write up some supporting text introducing this concept and explaining that DataFusion can now be used for stream processing, much like [Spark Streaming](https://www.databricks.com/glossary/what-is-spark-streaming)
+
+Support for executing infinite files and boundedness-aware join reordering rule (#4694)
+
+# Better Support for Distributed catalogs
+
+With the rise of distributed catalog / metadata stores such as
+delta.io and iceberg (TODO get links to rust projects), it is
+increasingly important to be enable asynchronous I/O to those remote
+catalogs during query planning rather than requiring up front fast
+local access to all relevant catalog information.
+
+
+DataFusion has been enhanced with better support for such asynchronous catalogs (#4607) , which also resulted in improved configuration management APIs.
+
+
+
+# Additional SQL support
+SQL support continues to improve, including some of these highlights:
+
+- Add TPC-DS query planning regression tests (#4719)
+- feat: support prepare statement (#4490)
+- Implement cast between Date and Timestamp (#4726)
+- Support type coercion for timestamp and utf8 (#4312)
+- Full support for time32 and time64 literal values (`ScalarValue`) (#4156)
+- add uuid() function to return a unique uuid per row (#4041)
+- Implement `current_time` scalar function (#4054)
+- Implement current_date scalar function (#4022)
+- Compressed CSV/JSON support (#3642)
+
+The community has also been investing in sqllogic based tests to help keep DataFusion's quality high with less work (TODO add some more detail / lnks)
+
+
+# Substrait
+
+TODO motivating introduction of substrait and why this is interesting
+
+We moved physical plan serde from Ballista to DataFusion (#4390) and we are in the proceess of adding support substrait
+
+
+<!--
+Andrew Lamb: I tried to work in a mention of the python bindings and encouraging a champion for them to step forward, however it felt more like it should be a separate post :thinking:
+-->
+
+## Looking for help:
+
+DataFusion is currently targeted at developers building other systems, not at end users (e.g. data scientists) themselves. Given the currently growing number of systems built using DataFusion, it seems well suited for this purpose.
+
+DataFusion has basic python bindings which has the potential to expand datafusion  to more end users a major missing piece are the python bindings
+
+
+# python bindings and growing the community and ecosystem

Review Comment:
   I tried to work in a mention of the python bindings and encouraging a champion for them to step forward, however it felt more like it should be a separate post 🤔  -- @andygrove  what do you think about a post describing the python bindings, why they are cool, and trying to find people to help drive that project?



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)
+* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage <!-- Andrew nots: we should really get this turned on by default -->
+* Improved `IN` expressions significantly faster  Simplify InListExpr ~20-70% Faster ([#4057])
+* Sort and partition aware optimizations such as  #3969 and  #4691,  skipping potentially expensive operations
+* Basic filter selectivity analysis (#3868)
+
+
+In the coming few months, we plan work on:

Review Comment:
   Feedback from the rest of the community would be great



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.

Review Comment:
   It would be great if someone could help clean this section up and clearly explain the growth of the community; There is a wonderful story there to tell



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)
+* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage <!-- Andrew nots: we should really get this turned on by default -->
+* Improved `IN` expressions significantly faster  Simplify InListExpr ~20-70% Faster ([#4057])
+* Sort and partition aware optimizations such as  #3969 and  #4691,  skipping potentially expensive operations
+* Basic filter selectivity analysis (#3868)
+
+
+In the coming few months, we plan work on:
+* Improved grouping performance (TODO link)
+* bloom filtering
+* investigate RLE (Run End Encoding support) (todo Arrow link)
+* Enable predicate pushdown by default for all cases
+* OTHERS?
+
+## Runtime Resource Limits
+
+Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
+
+In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail.
+
+
+## SQL Window Function
+[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL))  are useful for a variety of analysis and DataFusion's implementation is close to complete now.
+
+- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)`
+- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING)`
+- Support for the `NTILE` window function (#4676)
+- Support for `GROUPS` mode (#4155)
+
+
+# Improved Join Support
+
+Joins are often the most complicated operations to handle well in
+analytics systems and DataFusion 16.0.0 offers significant improvements
+such as
+
+- Cost based optimizer (CBO) that can automatically reorder join evaluations, algorithm (Merge / Hash), and pick build side based on available statistics join type (`INNER`, `LEFT`, etc) (#4219)
+- Fast non `column=column` equijoins such as `JOIN ON a.x + 5 = b.y`
+- Better optimized non-equijoins (nested loops rather than cross join and filter) (#4562) <!-- TODO is this a good thing to mention as any time this is usd the query is going to go slow or the data size is small -->
+
+# Streaming Execution
+
+TODO: write up some supporting text introducing this concept and explaining that DataFusion can now be used for stream processing, much like [Spark Streaming](https://www.databricks.com/glossary/what-is-spark-streaming)

Review Comment:
   @ozankabak perhaps you can help fill in this section -



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)
+* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage <!-- Andrew nots: we should really get this turned on by default -->
+* Improved `IN` expressions significantly faster  Simplify InListExpr ~20-70% Faster ([#4057])
+* Sort and partition aware optimizations such as  #3969 and  #4691,  skipping potentially expensive operations
+* Basic filter selectivity analysis (#3868)
+
+
+In the coming few months, we plan work on:
+* Improved grouping performance (TODO link)
+* bloom filtering
+* investigate RLE (Run End Encoding support) (todo Arrow link)
+* Enable predicate pushdown by default for all cases
+* OTHERS?
+
+## Runtime Resource Limits
+
+Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
+
+In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail.
+
+
+## SQL Window Function
+[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL))  are useful for a variety of analysis and DataFusion's implementation is close to complete now.
+
+- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)`
+- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING)`
+- Support for the `NTILE` window function (#4676)
+- Support for `GROUPS` mode (#4155)
+
+
+# Improved Join Support
+
+Joins are often the most complicated operations to handle well in
+analytics systems and DataFusion 16.0.0 offers significant improvements
+such as
+
+- Cost based optimizer (CBO) that can automatically reorder join evaluations, algorithm (Merge / Hash), and pick build side based on available statistics join type (`INNER`, `LEFT`, etc) (#4219)
+- Fast non `column=column` equijoins such as `JOIN ON a.x + 5 = b.y`
+- Better optimized non-equijoins (nested loops rather than cross join and filter) (#4562) <!-- TODO is this a good thing to mention as any time this is usd the query is going to go slow or the data size is small -->
+
+# Streaming Execution
+
+TODO: write up some supporting text introducing this concept and explaining that DataFusion can now be used for stream processing, much like [Spark Streaming](https://www.databricks.com/glossary/what-is-spark-streaming)
+
+Support for executing infinite files and boundedness-aware join reordering rule (#4694)
+
+# Better Support for Distributed catalogs
+
+With the rise of distributed catalog / metadata stores such as
+delta.io and iceberg (TODO get links to rust projects), it is
+increasingly important to be enable asynchronous I/O to those remote
+catalogs during query planning rather than requiring up front fast
+local access to all relevant catalog information.
+
+
+DataFusion has been enhanced with better support for such asynchronous catalogs (#4607) , which also resulted in improved configuration management APIs.
+
+
+
+# Additional SQL support
+SQL support continues to improve, including some of these highlights:
+
+- Add TPC-DS query planning regression tests (#4719)
+- feat: support prepare statement (#4490)
+- Implement cast between Date and Timestamp (#4726)
+- Support type coercion for timestamp and utf8 (#4312)
+- Full support for time32 and time64 literal values (`ScalarValue`) (#4156)
+- add uuid() function to return a unique uuid per row (#4041)
+- Implement `current_time` scalar function (#4054)
+- Implement current_date scalar function (#4022)
+- Compressed CSV/JSON support (#3642)
+
+The community has also been investing in sqllogic based tests to help keep DataFusion's quality high with less work (TODO add some more detail / lnks)

Review Comment:
   @xudong963  I wonder if you have any thoughts on how to word this better



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)
+* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage <!-- Andrew nots: we should really get this turned on by default -->
+* Improved `IN` expressions significantly faster  Simplify InListExpr ~20-70% Faster ([#4057])
+* Sort and partition aware optimizations such as  #3969 and  #4691,  skipping potentially expensive operations
+* Basic filter selectivity analysis (#3868)
+
+
+In the coming few months, we plan work on:
+* Improved grouping performance (TODO link)
+* bloom filtering
+* investigate RLE (Run End Encoding support) (todo Arrow link)
+* Enable predicate pushdown by default for all cases
+* OTHERS?
+
+## Runtime Resource Limits
+
+Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
+
+In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail.
+
+
+## SQL Window Function
+[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL))  are useful for a variety of analysis and DataFusion's implementation is close to complete now.
+
+- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)`
+- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING)`
+- Support for the `NTILE` window function (#4676)
+- Support for `GROUPS` mode (#4155)
+
+
+# Improved Join Support
+
+Joins are often the most complicated operations to handle well in
+analytics systems and DataFusion 16.0.0 offers significant improvements
+such as
+
+- Cost based optimizer (CBO) that can automatically reorder join evaluations, algorithm (Merge / Hash), and pick build side based on available statistics join type (`INNER`, `LEFT`, etc) (#4219)
+- Fast non `column=column` equijoins such as `JOIN ON a.x + 5 = b.y`
+- Better optimized non-equijoins (nested loops rather than cross join and filter) (#4562) <!-- TODO is this a good thing to mention as any time this is usd the query is going to go slow or the data size is small -->
+
+# Streaming Execution
+
+TODO: write up some supporting text introducing this concept and explaining that DataFusion can now be used for stream processing, much like [Spark Streaming](https://www.databricks.com/glossary/what-is-spark-streaming)
+
+Support for executing infinite files and boundedness-aware join reordering rule (#4694)
+
+# Better Support for Distributed catalogs
+
+With the rise of distributed catalog / metadata stores such as
+delta.io and iceberg (TODO get links to rust projects), it is
+increasingly important to be enable asynchronous I/O to those remote
+catalogs during query planning rather than requiring up front fast
+local access to all relevant catalog information.
+
+
+DataFusion has been enhanced with better support for such asynchronous catalogs (#4607) , which also resulted in improved configuration management APIs.
+
+
+
+# Additional SQL support
+SQL support continues to improve, including some of these highlights:

Review Comment:
   are there other things to highlight? @Ted-Jiang / @liukun4515 ?



##########
_posts/2023-01-07-datafusion-16.0.0.md:
##########
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 16.0.0 Project Update"
+date: "2023-01-07 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible
+query execution framework, written in [Rust](https://www.rust-lang.org/),
+that uses [Apache Arrow](https://arrow.apache.org) as its
+in-memory format. It is targeted primarily at developers creating data
+intensive analytics, and offers mature
+[SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html),
+a DataFrame API, and many extension points.
+
+DataFusion based systems perform very well on performance
+benchmarks, especially considering they operate on data in parquet
+files directly rather than first loading into a specialized format.
+Some recent highlights include [clickbench](https://benchmark.clickhouse.com/)
+and the [Cloudfuse.io standalone query engines](https://www.cloudfuse.io/dashboards/standalone-engines) page.
+
+DataFusion is part of a longer term trend, articulated clearly by [Andy Pavlo](http://www.cs.cmu.edu/~pavlo/) in his
+[2022 Databases Retrospective](https://ottertune.com/blog/2022-databases-retrospective/).
+Database frameworks are proliferating and it is likely that all OLAP DBMSs and other many data heavy applications such as machine learning, will require a vectorized, highly performant query
+engine in the next 5 years to remain relevant.
+The only practical way to make such technology so widely available
+without many millions of dollars of investment is
+though open source engine such as DataFusion or [Velox](https://github.com/facebookincubator/velox).
+
+The rest of this post describes the improvements made to DataFusion
+over the last three months and some hints if where we are heading.
+
+## Community Growth
+
+The three months since [our last update](https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0/) again saw significant growth in the DataFusion.
+TODO quantify the growth -- e.g. XXX new contributors to the project and regularly merge YYY PRs a day.
+
+Growth of new systems based on as the engine in [many open source and commercial projects](https://github.com/apache/arrow-datafusion#known-uses) and was one of the early open source projects to provide this capability.
+
+Several new databases built on datafusion (synnada.ai, greptimedb, probably others)
+GA of InfluxDB IOx
+
+
+The DataFusion 16.0.0 release consists of 520 PRs from 70 distinct contributors. This does not count all the work that goes into our dependencies such as [arrow](https://crates.io/crates/arrow),  [parquet](https://crates.io/crates/parquet), and [object_store](https://crates.io/crates/object_store), that much of the same community helps nurture.
+
+<!--
+$ git log --pretty=oneline 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+$ git shortlog -sn 13.0.0..16.0.0 . | wc -l
+TBD
+
+
+At time of this writing:
+$ git log --pretty=oneline 13.0.0..apache/master . | wc -l
+     520
+
+(arrow_dev) alamb@MacBook-Pro-8:~/Software/arrow-datafusion2$  git shortlog -sn 13.0.0..apache/master . | wc -l
+      70
+-->
+
+## Performance 🚀
+
+Performance and efficiency are core value propositions for
+DataFusion. While there is still a performance gap between DataFusion best of
+breed tightly, integrated systems such as [DuckDB](https://duckdb.org)
+and [Polars](https://www.pola.rs/)https://www.pola.rs/), DataFusion is
+closing the gap quickly. Performance highlights from the last three
+months:
+
+* XX% Faster Sorting and Merging using the new [Row Format](https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/)
+* [Advanced predicate pushdown](https://arrow.apache.org/blog/2022/12/26/querying-parquet-with-millisecond-latency/), directly on parquet, optionally directly from object storage, enabling sub millisecond filtering, directly from object storage <!-- Andrew nots: we should really get this turned on by default -->
+* Improved `IN` expressions significantly faster  Simplify InListExpr ~20-70% Faster ([#4057])
+* Sort and partition aware optimizations such as  #3969 and  #4691,  skipping potentially expensive operations
+* Basic filter selectivity analysis (#3868)
+
+
+In the coming few months, we plan work on:
+* Improved grouping performance (TODO link)
+* bloom filtering
+* investigate RLE (Run End Encoding support) (todo Arrow link)
+* Enable predicate pushdown by default for all cases
+* OTHERS?
+
+## Runtime Resource Limits
+
+Initially, DataFusion could potentially use unbounded amounts of memory for certain queries that included Sorts, Grouping or Joins.
+
+In version 16.0.0, it is possible to limit DataFusion's memory usage for Sorting and Grouping. We are looking for help adding similar limiting for Joins as well as expanding our algorithms to spill to secondary storage, if available. See #3941 fore more detail.
+
+
+## SQL Window Function
+[SQL window functions](https://en.wikipedia.org/wiki/Window_function_(SQL))  are useful for a variety of analysis and DataFusion's implementation is close to complete now.
+
+- Custom window frames such as `... OVER (ORDER BY ... RANGE BETWEEN 0.2 PRECEDING AND 0.2 FOLLOWING)`
+- Unbounded window frames such as `... OVER (ORDER BY ... RANGE UNBOUNDED ROWS PRECEDING)`
+- Support for the `NTILE` window function (#4676)
+- Support for `GROUPS` mode (#4155)
+
+
+# Improved Join Support
+
+Joins are often the most complicated operations to handle well in
+analytics systems and DataFusion 16.0.0 offers significant improvements
+such as
+
+- Cost based optimizer (CBO) that can automatically reorder join evaluations, algorithm (Merge / Hash), and pick build side based on available statistics join type (`INNER`, `LEFT`, etc) (#4219)
+- Fast non `column=column` equijoins such as `JOIN ON a.x + 5 = b.y`
+- Better optimized non-equijoins (nested loops rather than cross join and filter) (#4562) <!-- TODO is this a good thing to mention as any time this is usd the query is going to go slow or the data size is small -->
+
+# Streaming Execution
+
+TODO: write up some supporting text introducing this concept and explaining that DataFusion can now be used for stream processing, much like [Spark Streaming](https://www.databricks.com/glossary/what-is-spark-streaming)
+
+Support for executing infinite files and boundedness-aware join reordering rule (#4694)
+
+# Better Support for Distributed catalogs
+
+With the rise of distributed catalog / metadata stores such as
+delta.io and iceberg (TODO get links to rust projects), it is
+increasingly important to be enable asynchronous I/O to those remote
+catalogs during query planning rather than requiring up front fast
+local access to all relevant catalog information.
+
+
+DataFusion has been enhanced with better support for such asynchronous catalogs (#4607) , which also resulted in improved configuration management APIs.
+
+
+
+# Additional SQL support
+SQL support continues to improve, including some of these highlights:
+
+- Add TPC-DS query planning regression tests (#4719)
+- feat: support prepare statement (#4490)
+- Implement cast between Date and Timestamp (#4726)
+- Support type coercion for timestamp and utf8 (#4312)
+- Full support for time32 and time64 literal values (`ScalarValue`) (#4156)
+- add uuid() function to return a unique uuid per row (#4041)
+- Implement `current_time` scalar function (#4054)
+- Implement current_date scalar function (#4022)
+- Compressed CSV/JSON support (#3642)
+
+The community has also been investing in sqllogic based tests to help keep DataFusion's quality high with less work (TODO add some more detail / lnks)
+
+
+# Substrait
+
+TODO motivating introduction of substrait and why this is interesting

Review Comment:
   @andygrove  perhaps you can help with content for the substrait area



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org