You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by "paddyhoran (via GitHub)" <gi...@apache.org> on 2023/06/18 12:30:39 UTC

[GitHub] [arrow-site] paddyhoran commented on a diff in pull request #370: DataFusion 26.0.0 Blog Post

paddyhoran commented on code in PR #370:
URL: https://github.com/apache/arrow-site/pull/370#discussion_r1233271145


##########
_posts/2023-06-01-datafusion-25.0.0.md:
##########
@@ -0,0 +1,303 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 25.0.0"
+date: "2023-06-02 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+It has been a whirlwind 6 months of DataFusion development since [our
+last update]: the community has grown, many features have been added,
+performance improved and we are [discussing] branching out to our own
+top level Apache Project.
+
+## Background
+
+[Apache Arrow DataFusion] is an extensible query engine and database
+toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory
+format.
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion, along with [Apache Calcite], Facebook's [Velox] and
+similar technology are part of the next generation "[Deconstructed
+Database]" architectures, where new systems are built on a foundation
+of fast, modular components, rather as a single tightly integrated
+system.
+
+[apache calcite]: https://calcite.apache.org
+[velox]: https://github.com/facebookincubator/velox
+[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana
+[spark]: https://spark.apache.org/
+[duckdb]: https://duckdb.org
+[pola.rs]: https://www.pola.rs/
+
+
+While single tightly integrated systems such as [Spark], [DuckDB] and
+[Pola.rs] are great pieces of technology, our community believes that
+anyone developing new data heavy application, such as those common in
+machine learning in the next 5 years, will **require** a high
+performance, vectorized, query engine to remain relevant. The only
+practical way to gain access to such technology without investing many
+millions of dollars to build a new tightly integrated engine, is
+though open source projects such as DataFusion and the underlying
+technologies of [Apache Arrow] and [Rust].
+
+DataFusion is targeted primarily at developers creating other data
+intensive analytics, and offers:
+
+- High performance, native, parallel streaming execution engine
+- Mature [SQL support], featuring  subqueries, window functions, grouping sets, and more
+- Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
+- Native DataFrame API and [python bindings]
+- [Well documented] source code and architecture, designed to be customized to suit downstream project needs
+- High quality, easy to use code [released every 2 weeks to crates.io]
+- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation]
+
+The rest of this post highlights some of the improvements we have made
+to DataFusion over the last 6 months and a preview of where we are
+heading. You can see a list of all changes in the detailed
+[CHANGELOG].
+
+[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+[apache software foundation]: https://www.apache.org/
+[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html
+[python bindings]: https://arrow.apache.org/datafusion-python/
+[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions
+
+## (Even) Better Performance
+
+[Various] benchmarks show DataFusion to be quite close or [even
+faster] to the state of the art in analytic performance (at the moment
+this seems to be DuckDB). We continually work on improving performance
+(see [#5546] for a list) and would love additional help in this area.
+
+[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter
+[even faster]: https://github.com/tustvold/access-log-bench
+[#5546]: https://github.com/apache/arrow-datafusion/issues/5546
+
+DataFusion now read single large Parquet files significantly faster by

Review Comment:
   ```suggestion
   DataFusion now reads single large Parquet files significantly faster by
   ```



##########
_posts/2023-06-01-datafusion-25.0.0.md:
##########
@@ -0,0 +1,303 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 25.0.0"
+date: "2023-06-02 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+It has been a whirlwind 6 months of DataFusion development since [our
+last update]: the community has grown, many features have been added,
+performance improved and we are [discussing] branching out to our own
+top level Apache Project.
+
+## Background
+
+[Apache Arrow DataFusion] is an extensible query engine and database
+toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory
+format.
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion, along with [Apache Calcite], Facebook's [Velox] and
+similar technology are part of the next generation "[Deconstructed
+Database]" architectures, where new systems are built on a foundation
+of fast, modular components, rather as a single tightly integrated
+system.
+
+[apache calcite]: https://calcite.apache.org
+[velox]: https://github.com/facebookincubator/velox
+[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana
+[spark]: https://spark.apache.org/
+[duckdb]: https://duckdb.org
+[pola.rs]: https://www.pola.rs/
+
+
+While single tightly integrated systems such as [Spark], [DuckDB] and
+[Pola.rs] are great pieces of technology, our community believes that
+anyone developing new data heavy application, such as those common in
+machine learning in the next 5 years, will **require** a high
+performance, vectorized, query engine to remain relevant. The only
+practical way to gain access to such technology without investing many
+millions of dollars to build a new tightly integrated engine, is
+though open source projects such as DataFusion and the underlying
+technologies of [Apache Arrow] and [Rust].
+
+DataFusion is targeted primarily at developers creating other data
+intensive analytics, and offers:
+
+- High performance, native, parallel streaming execution engine
+- Mature [SQL support], featuring  subqueries, window functions, grouping sets, and more
+- Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
+- Native DataFrame API and [python bindings]
+- [Well documented] source code and architecture, designed to be customized to suit downstream project needs
+- High quality, easy to use code [released every 2 weeks to crates.io]
+- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation]
+
+The rest of this post highlights some of the improvements we have made
+to DataFusion over the last 6 months and a preview of where we are
+heading. You can see a list of all changes in the detailed
+[CHANGELOG].
+
+[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+[apache software foundation]: https://www.apache.org/
+[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html
+[python bindings]: https://arrow.apache.org/datafusion-python/
+[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions
+
+## (Even) Better Performance
+
+[Various] benchmarks show DataFusion to be quite close or [even
+faster] to the state of the art in analytic performance (at the moment
+this seems to be DuckDB). We continually work on improving performance
+(see [#5546] for a list) and would love additional help in this area.
+
+[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter
+[even faster]: https://github.com/tustvold/access-log-bench
+[#5546]: https://github.com/apache/arrow-datafusion/issues/5546
+
+DataFusion now read single large Parquet files significantly faster by
+[parallelizing across multiple cores]. Native speeds for reading JSON
+and CSV files are also up to 2.5x faster thanks to improvements
+upstream in arrow-rs [JSON reader] and [CSV reader].
+
+[parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057
+[json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159
+[csv reader]: https://github.com/apache/arrow-rs/pull/3365
+
+Also, we have integrated the [arrow-rs Row Format] into DataFusion resulting in up to [2-3x faster sorting and merging].
+
+[arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
+[2-3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163
+
+## Improved Documentation and Website
+
+Part of growing the DataFusion community is ensuring that DataFusion's
+features are understood and that it is easy to contribute and
+participate. To that end the [website] has been cleaned up, [the
+architecture guide] expanded, the [roadmap] updated, and several
+overview talks created:
+
+- Apr 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+- April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+- April 2023 _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+
+[website]: https://arrow.apache.org/datafusion/
+[the architecture guide]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
+[roadmap]: https://arrow.apache.org/datafusion/contributor-guide/roadmap.html
+
+## New Features
+
+### More Streaming, Less Memory
+
+We have made significany progress on the [streaming execution roadmap]
+such as [unbounded datasources], [streaming group by], sophisticated
+[sort] and [repartitioning] improvements in the optimizer, and support
+for [symmetric hash join] (read more about that in the great [Synnada
+Blog Post] on the topic). Together, these features both 1) make it
+easier to build streaming systems using DataFusion that can
+incrementally generate output before (or ever) seeing the end of the
+input and 2) allow general queries to use less memory and generate their
+results faster.
+
+We have also improved the runtime [memory management] system so that
+DataFusion now stays within its declared memory budget [generate
+runtime errors].
+
+[sort]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html
+[repartitioning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html
+[streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285
+[unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output
+[streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html
+[symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html
+[synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285
+[memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html
+[generate runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941
+
+### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc)
+
+Part of building high performance data systems includes writing data,
+and DataFusion supports several features for creating new files:
+
+- `INSERT INTO` and `SELECT ... INTO ` support for memory backed and CSV tables
+- New [API for writing data into TableProviders]
+
+We are working on easier to use [COPY INTO] syntax, better support
+for writing parquet, JSON, and AVRO, and more -- see our [tracking epic]
+for more details.
+
+[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
+[api for writing data into tableproviders]: https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html
+[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
+[copy into]: https://github.com/apache/arrow-datafusion/issues/5654
+
+### Timestamp and Intervals
+
+One mark of the maturity of a SQL engine is how it handles the tricky
+world of timestamp, date, times and interval arithmetic. DataFusion is
+feature complete in this area and behaves as you would expect,
+supporting queries such as
+
+```sql
+SELECT now() + '1 month' FROM my_table;
+```
+
+We still have a long tail of [date and time improvements], which we are working on as well.
+
+[date and time improvements]: https://github.com/apache/arrow-datafusion/issues/3148
+
+### Querying Structured Types (`List` and `Struct`s)
+
+Arrow and Parquet [support nested data] well and DataFusion lets you
+easily query such `Struct` and `List`. For example, you can use
+DataFusion to read and query the [JSON Datasets for Exploratory OLAP -
+Mendeley Data] like this:
+
+[support nested data]: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
+[json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97
+
+```sql
+----------
+-- Explore structured data using SQL
+----------
+SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10;
++---------------------------------------------------------------------------------------------------------------------------+
+| delete                                                                                                                    |
++---------------------------------------------------------------------------------------------------------------------------+
+| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
+| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
+| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
+| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
+| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
+| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
+| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
+| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
+| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
+| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
++---------------------------------------------------------------------------------------------------------------------------+
+
+----------
+-- Select some deeply nested fields
+----------
+SELECT
+  delete['status']['id']['$numberLong'] as delete_id,
+  delete['status']['user_id'] as delete_user_id
+FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;
+
++--------------------+----------------+
+| delete_id          | delete_user_id |
++--------------------+----------------+
+| 135037425050320896 | 334902461      |
+| 134703982051463168 | 405383453      |
+| 134773741740765184 | 64823441       |
+| 132543659655704576 | 45917834       |
+| 133786431926697984 | 67229952       |
+| 134619093570560002 | 182430773      |
+| 134019857527214080 | 257396311      |
+| 133931546469076993 | 124539548      |
+| 134397743350296576 | 139836391      |
+| 127833661767823360 | 244442687      |
++--------------------+----------------+
+```
+
+### Subqueries All the Way Down
+
+DataFusion can run many different subqueries by rewriting them to
+joins. It has been able to run the full suite of TPC-H queries for at
+least the last year, but recently we have implemented significant
+improvements to this logic, sufficient to run almost all queries in
+the TPC-DS benchmark as well.
+
+## Community and Project Growth
+
+The six months since [our last update] saw significant growth in
+the DataFusion community .Between versions `17.0.0` and `26.0.0`,

Review Comment:
   ```suggestion
   the DataFusion community. Between versions `17.0.0` and `26.0.0`,
   ```



##########
_posts/2023-06-01-datafusion-25.0.0.md:
##########
@@ -0,0 +1,303 @@
+---
+layout: post
+title: "Apache Arrow DataFusion 25.0.0"
+date: "2023-06-02 00:00:00"
+author: pmc
+categories: [release]
+---
+
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+It has been a whirlwind 6 months of DataFusion development since [our
+last update]: the community has grown, many features have been added,
+performance improved and we are [discussing] branching out to our own
+top level Apache Project.
+
+## Background
+
+[Apache Arrow DataFusion] is an extensible query engine and database
+toolkit, written in [Rust], that uses [Apache Arrow] as its in-memory
+format.
+
+[apache arrow datafusion]: https://arrow.apache.org/datafusion/
+[apache arrow]: https://arrow.apache.org
+[rust]: https://www.rust-lang.org/
+
+DataFusion, along with [Apache Calcite], Facebook's [Velox] and
+similar technology are part of the next generation "[Deconstructed
+Database]" architectures, where new systems are built on a foundation
+of fast, modular components, rather as a single tightly integrated
+system.
+
+[apache calcite]: https://calcite.apache.org
+[velox]: https://github.com/facebookincubator/velox
+[deconstructed database]: https://www.usenix.org/publications/login/winter2018/khurana
+[spark]: https://spark.apache.org/
+[duckdb]: https://duckdb.org
+[pola.rs]: https://www.pola.rs/
+
+
+While single tightly integrated systems such as [Spark], [DuckDB] and
+[Pola.rs] are great pieces of technology, our community believes that
+anyone developing new data heavy application, such as those common in
+machine learning in the next 5 years, will **require** a high
+performance, vectorized, query engine to remain relevant. The only
+practical way to gain access to such technology without investing many
+millions of dollars to build a new tightly integrated engine, is
+though open source projects such as DataFusion and the underlying
+technologies of [Apache Arrow] and [Rust].
+
+DataFusion is targeted primarily at developers creating other data
+intensive analytics, and offers:
+
+- High performance, native, parallel streaming execution engine
+- Mature [SQL support], featuring  subqueries, window functions, grouping sets, and more
+- Built in support for Parquet, Avro, CSV, JSON and Arrow formats and easy extension for others
+- Native DataFrame API and [python bindings]
+- [Well documented] source code and architecture, designed to be customized to suit downstream project needs
+- High quality, easy to use code [released every 2 weeks to crates.io]
+- Welcoming, open community, governed by the highly regarded and well understood [Apache Software Foundation]
+
+The rest of this post highlights some of the improvements we have made
+to DataFusion over the last 6 months and a preview of where we are
+heading. You can see a list of all changes in the detailed
+[CHANGELOG].
+
+[SQL support]: https://arrow.apache.org/datafusion/user-guide/sql/index.html
+[apache software foundation]: https://www.apache.org/
+[well documented]: https://docs.rs/datafusion/latest/datafusion/index.html
+[python bindings]: https://arrow.apache.org/datafusion-python/
+[changelog]: https://github.com/apache/arrow-datafusion/blob/main/datafusion/CHANGELOG.md
+[released every 2 weeks to crates.io]: https://crates.io/crates/datafusion/versions
+
+## (Even) Better Performance
+
+[Various] benchmarks show DataFusion to be quite close or [even
+faster] to the state of the art in analytic performance (at the moment
+this seems to be DuckDB). We continually work on improving performance
+(see [#5546] for a list) and would love additional help in this area.
+
+[various]: https://voltrondata.com/resources/speeds-and-feeds-hardware-and-software-matter
+[even faster]: https://github.com/tustvold/access-log-bench
+[#5546]: https://github.com/apache/arrow-datafusion/issues/5546
+
+DataFusion now read single large Parquet files significantly faster by
+[parallelizing across multiple cores]. Native speeds for reading JSON
+and CSV files are also up to 2.5x faster thanks to improvements
+upstream in arrow-rs [JSON reader] and [CSV reader].
+
+[parallelizing across multiple cores]: https://github.com/apache/arrow-datafusion/pull/5057
+[json reader]: https://github.com/apache/arrow-rs/pull/3479#issuecomment-1384353159
+[csv reader]: https://github.com/apache/arrow-rs/pull/3365
+
+Also, we have integrated the [arrow-rs Row Format] into DataFusion resulting in up to [2-3x faster sorting and merging].
+
+[arrow-rs row format]: https://arrow.apache.org/blog/2022/11/07/multi-column-sorts-in-arrow-rust-part-1/
+[2-3x faster sorting and merging]: https://github.com/apache/arrow-datafusion/pull/6163
+
+## Improved Documentation and Website
+
+Part of growing the DataFusion community is ensuring that DataFusion's
+features are understood and that it is easy to contribute and
+participate. To that end the [website] has been cleaned up, [the
+architecture guide] expanded, the [roadmap] updated, and several
+overview talks created:
+
+- Apr 2023 _Query Engine_: [recording](https://youtu.be/NVKujPxwSBA) and [slides](https://docs.google.com/presentation/d/1D3GDVas-8y0sA4c8EOgdCvEjVND4s2E7I6zfs67Y4j8/edit#slide=id.p)
+- April 2023 _Logical Plan and Expressions_: [recording](https://youtu.be/EzZTLiSJnhY) and [slides](https://docs.google.com/presentation/d/1ypylM3-w60kVDW7Q6S99AHzvlBgciTdjsAfqNP85K30)
+- April 2023 _Physical Plan and Execution_: [recording](https://youtu.be/2jkWU3_w6z0) and [slides](https://docs.google.com/presentation/d/1cA2WQJ2qg6tx6y4Wf8FH2WVSm9JQ5UgmBWATHdik0hg)
+
+[website]: https://arrow.apache.org/datafusion/
+[the architecture guide]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
+[roadmap]: https://arrow.apache.org/datafusion/contributor-guide/roadmap.html
+
+## New Features
+
+### More Streaming, Less Memory
+
+We have made significany progress on the [streaming execution roadmap]
+such as [unbounded datasources], [streaming group by], sophisticated
+[sort] and [repartitioning] improvements in the optimizer, and support
+for [symmetric hash join] (read more about that in the great [Synnada
+Blog Post] on the topic). Together, these features both 1) make it
+easier to build streaming systems using DataFusion that can
+incrementally generate output before (or ever) seeing the end of the
+input and 2) allow general queries to use less memory and generate their
+results faster.
+
+We have also improved the runtime [memory management] system so that
+DataFusion now stays within its declared memory budget [generate
+runtime errors].
+
+[sort]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/global_sort_selection/index.html
+[repartitioning]: https://docs.rs/datafusion/latest/datafusion/physical_optimizer/repartition/index.html
+[streaming execution roadmap]: https://github.com/apache/arrow-datafusion/issues/4285
+[unbounded datasources]: https://docs.rs/datafusion/latest/datafusion/physical_plan/trait.ExecutionPlan.html#method.unbounded_output
+[streaming group by]: https://docs.rs/datafusion/latest/datafusion/physical_plan/aggregates/enum.GroupByOrderMode.html
+[symmetric hash join]: https://docs.rs/datafusion/latest/datafusion/physical_plan/joins/struct.SymmetricHashJoinExec.html
+[synnada blog post]: https://github.com/apache/arrow-datafusion/issues/4285
+[memory management]: https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/index.html
+[generate runtime errors]: https://github.com/apache/arrow-datafusion/issues/3941
+
+### DML Support (`INSERT`, `DELETE`, `UPDATE`, etc)
+
+Part of building high performance data systems includes writing data,
+and DataFusion supports several features for creating new files:
+
+- `INSERT INTO` and `SELECT ... INTO ` support for memory backed and CSV tables
+- New [API for writing data into TableProviders]
+
+We are working on easier to use [COPY INTO] syntax, better support
+for writing parquet, JSON, and AVRO, and more -- see our [tracking epic]
+for more details.
+
+[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
+[api for writing data into tableproviders]: https://docs.rs/datafusion/latest/datafusion/physical_plan/insert/trait.DataSink.html
+[tracking epic]: https://github.com/apache/arrow-datafusion/issues/6569
+[copy into]: https://github.com/apache/arrow-datafusion/issues/5654
+
+### Timestamp and Intervals
+
+One mark of the maturity of a SQL engine is how it handles the tricky
+world of timestamp, date, times and interval arithmetic. DataFusion is
+feature complete in this area and behaves as you would expect,
+supporting queries such as
+
+```sql
+SELECT now() + '1 month' FROM my_table;
+```
+
+We still have a long tail of [date and time improvements], which we are working on as well.
+
+[date and time improvements]: https://github.com/apache/arrow-datafusion/issues/3148
+
+### Querying Structured Types (`List` and `Struct`s)
+
+Arrow and Parquet [support nested data] well and DataFusion lets you
+easily query such `Struct` and `List`. For example, you can use
+DataFusion to read and query the [JSON Datasets for Exploratory OLAP -
+Mendeley Data] like this:
+
+[support nested data]: https://arrow.apache.org/blog/2022/10/08/arrow-parquet-encoding-part-2/
+[json datasets for exploratory olap - mendeley data]: https://data.mendeley.com/datasets/ct8f9skv97
+
+```sql
+----------
+-- Explore structured data using SQL
+----------
+SELECT delete FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL limit 10;
++---------------------------------------------------------------------------------------------------------------------------+
+| delete                                                                                                                    |
++---------------------------------------------------------------------------------------------------------------------------+
+| {status: {id: {$numberLong: 135037425050320896}, id_str: 135037425050320896, user_id: 334902461, user_id_str: 334902461}} |
+| {status: {id: {$numberLong: 134703982051463168}, id_str: 134703982051463168, user_id: 405383453, user_id_str: 405383453}} |
+| {status: {id: {$numberLong: 134773741740765184}, id_str: 134773741740765184, user_id: 64823441, user_id_str: 64823441}}   |
+| {status: {id: {$numberLong: 132543659655704576}, id_str: 132543659655704576, user_id: 45917834, user_id_str: 45917834}}   |
+| {status: {id: {$numberLong: 133786431926697984}, id_str: 133786431926697984, user_id: 67229952, user_id_str: 67229952}}   |
+| {status: {id: {$numberLong: 134619093570560002}, id_str: 134619093570560002, user_id: 182430773, user_id_str: 182430773}} |
+| {status: {id: {$numberLong: 134019857527214080}, id_str: 134019857527214080, user_id: 257396311, user_id_str: 257396311}} |
+| {status: {id: {$numberLong: 133931546469076993}, id_str: 133931546469076993, user_id: 124539548, user_id_str: 124539548}} |
+| {status: {id: {$numberLong: 134397743350296576}, id_str: 134397743350296576, user_id: 139836391, user_id_str: 139836391}} |
+| {status: {id: {$numberLong: 127833661767823360}, id_str: 127833661767823360, user_id: 244442687, user_id_str: 244442687}} |
++---------------------------------------------------------------------------------------------------------------------------+
+
+----------
+-- Select some deeply nested fields
+----------
+SELECT
+  delete['status']['id']['$numberLong'] as delete_id,
+  delete['status']['user_id'] as delete_user_id
+FROM 'twitter-sample-head-100000.parquet' WHERE delete IS NOT NULL LIMIT 10;
+
++--------------------+----------------+
+| delete_id          | delete_user_id |
++--------------------+----------------+
+| 135037425050320896 | 334902461      |
+| 134703982051463168 | 405383453      |
+| 134773741740765184 | 64823441       |
+| 132543659655704576 | 45917834       |
+| 133786431926697984 | 67229952       |
+| 134619093570560002 | 182430773      |
+| 134019857527214080 | 257396311      |
+| 133931546469076993 | 124539548      |
+| 134397743350296576 | 139836391      |
+| 127833661767823360 | 244442687      |
++--------------------+----------------+
+```
+
+### Subqueries All the Way Down
+
+DataFusion can run many different subqueries by rewriting them to
+joins. It has been able to run the full suite of TPC-H queries for at
+least the last year, but recently we have implemented significant
+improvements to this logic, sufficient to run almost all queries in
+the TPC-DS benchmark as well.
+
+## Community and Project Growth
+
+The six months since [our last update] saw significant growth in
+the DataFusion community .Between versions `17.0.0` and `26.0.0`,
+DataFusion merged 711 PRs from 107 distinct contributors, not
+including all the work that goes into our core dependencies such as
+[arrow](https://crates.io/crates/arrow),
+[parquet](https://crates.io/crates/parquet), and
+[object_store](https://crates.io/crates/object_store), that much of
+the same community helps support.
+
+In addition, we have added 7 new committers and 1 new PMC member to
+the Apache Arrow project, largely focused on DataFusion, and we
+learned about some of the cool [new systems] which are using
+DataFusion. Given the growth of the community and interest in the
+project, we also clarified the [mission statement] and are
+[discussing] "graduate"ing DataFusion to a new top level
+Apache Software Foundation project.
+
+[our last update]: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0
+[new systems]: https://arrow.apache.org/datafusion/user-guide/introduction.html#known-users
+[mission statement]: https://github.com/apache/arrow-datafusion/discussions/6441
+[discussing]: https://github.com/apache/arrow-datafusion/discussions/6475
+
+<!--
+$ git log --pretty=oneline 17.0.0..26.0.0 . | wc -l
+     711
+
+$ git shortlog -sn 17.0.0..26.0.0 . | wc -l
+      107
+-->
+
+# How to Get Involved
+
+Kudos to everyone in the community who has contributed ideas,
+discussions, bug reports, documentation and code. It is exciting to be
+innvating on the next generation of database architectures together!

Review Comment:
   ```suggestion
   innovating on the next generation of database architectures together!
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org