You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by al...@apache.org on 2022/02/28 14:18:45 UTC
[arrow-site] branch master updated: ARROW-15683: [Website] [DataFusion] DataFusion 7.0.0 blog post (#193)
This is an automated email from the ASF dual-hosted git repository.
alamb pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/master by this push:
new cc0b0b8 ARROW-15683: [Website] [DataFusion] DataFusion 7.0.0 blog post (#193)
cc0b0b8 is described below
commit cc0b0b8521ba11189d0118be7de3d99d5915eb1b
Author: Matthew Turner <ma...@outlook.com>
AuthorDate: Mon Feb 28 09:18:32 2022 -0500
ARROW-15683: [Website] [DataFusion] DataFusion 7.0.0 blog post (#193)
* v1 blog
* Fix intro
* Update for feedback
* More feedback
* Parquet schema evolution comment
* Update date
* Update wording
* Update _posts/2022-02-17-datafusion-7.0.0.md
Co-authored-by: Andrew Lamb <an...@nerdnetworks.org>
---
_posts/2022-02-17-datafusion-7.0.0.md | 153 ++++++++++++++++++++++++++++++++++
1 file changed, 153 insertions(+)
diff --git a/_posts/2022-02-17-datafusion-7.0.0.md b/_posts/2022-02-17-datafusion-7.0.0.md
new file mode 100644
index 0000000..3ddd721
--- /dev/null
+++ b/_posts/2022-02-17-datafusion-7.0.0.md
@@ -0,0 +1,153 @@
+---
+layout: post
+title: Apache Arrow DataFusion 7.0.0 Release
+date: "2022-02-28 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+# Introduction
+
+[DataFusion](https://arrow.apache.org/datafusion/) is an extensible query execution framework, written in Rust, that uses Apache Arrow as its in-memory format.
+
+When you want to extend your Rust project with [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/sql_status.html), a DataFrame API, or the ability to read and process Parquet, JSON, Avro or CSV data, DataFusion is definitely worth checking out.
+
+DataFusion's SQL, `DataFrame`, and manual `PlanBuilder` API let users access a sophisticated query optimizer and execution engine capable of fast, resource efficient, and parallel execution that takes optimal advantage of todays multicore hardware. Being written in Rust means DataFusion can offer *both* the safety of dynamic languages as well as the resource efficiency of a compiled language.
+
+The Apache Arrow team is pleased to announce the DataFusion 7.0.0 release. This covers 4 months of development work
+and includes 195 commits from the following 37 distinct contributors.
+
+<!--
+git log --pretty=oneline 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples | wc -l
+ 134
+
+git shortlog -sn 5.0.0..6.0.0 datafusion datafusion-cli datafusion-examples | wc -l
+ 29
+
+ Carlos and xudong963 are same individual
+-->
+
+```
+ 44 Andrew Lamb
+ 24 Kun Liu
+ 23 Jiayu Liu
+ 17 xudong.w
+ 11 Yijie Shen
+ 9 Matthew Turner
+ 7 Liang-Chi Hsieh
+ 5 Lin Ma
+ 4 Stephen Carman
+ 4 James Katz
+ 4 Dmitry Patsura
+ 4 QP Hou
+ 3 dependabot[bot]
+ 3 Remzi Yang
+ 3 Yang
+ 3 ic4y
+ 3 Daniël Heres
+ 2 Andy Grove
+ 2 Raphael Taylor-Davies
+ 2 Jason Tianyi Wang
+ 2 Dan Harris
+ 2 Sergey Melnychuk
+ 1 Nitish Tiwari
+ 1 Dom
+ 1 Eduard Karacharov
+ 1 Javier Goday
+ 1 Boaz
+ 1 Marko Mikulicic
+ 1 Max Burke
+ 1 Carol (Nichols || Goulding)
+ 1 Phillip Cloud
+ 1 Rich
+ 1 Toby Hede
+ 1 Will Jones
+ 1 r.4ntix
+ 1 rdettai
+```
+
+The following section highlights some of the improvements in this release. Of course, many other bug fixes and improvements have also been made and we refer you to the complete [changelog](https://github.com/apache/arrow-datafusion/blob/7.0.0/datafusion/CHANGELOG.md) for the full detail.
+
+# Summary
+
+- DataFusion Crate
+ - The DataFusion crate is being split into multiple crates to decrease compilation times and improve the development experience. Initially, `datafusion-common` (the core DataFusion components) and `datafusion-expr` (DataFusion expressions, functions, and operators) have been split out. There will be additional splits after the 7.0 release.
+- Performance Improvements and Optimizations
+ - Arrow’s dyn scalar kernels are now used to enable efficient operations on `DictionaryArray`s [#1685](https://github.com/apache/arrow-datafusion/pull/1685)
+ - Switch from `std::sync::Mutex` to `parking_lot::Mutex` [#1720](https://github.com/apache/arrow-datafusion/pull/1720)
+- New Features
+ - Support for memory tracking and spilling to disk
+ - MemoryMananger and DiskManager [#1526](https://github.com/apache/arrow-datafusion/pull/1526)
+ - Out of core sort [#1526](https://github.com/apache/arrow-datafusion/pull/1526)
+ - New metrics
+ - `Gauge` and `CurrentMemoryUsage` [#1682](https://github.com/apache/arrow-datafusion/pull/1682)
+ - `Spill_count` and `spilled_bytes` [#1641](https://github.com/apache/arrow-datafusion/pull/1641)
+ - New math functions
+ - `Approx_quantile` [#1529](https://github.com/apache/arrow-datafusion/pull/1539)
+ - `stddev` and `variance` (sample and population) [#1525](https://github.com/apache/arrow-datafusion/pull/1525)
+ - `corr` [#1561](https://github.com/apache/arrow-datafusion/pull/1561)
+ - Support decimal type [#1394](https://github.com/apache/arrow-datafusion/pull/1394)[#1407](https://github.com/apache/arrow-datafusion/pull/1407)[#1408](https://github.com/apache/arrow-datafusion/pull/1408)[#1431](https://github.com/apache/arrow-datafusion/pull/1431)[#1483](https://github.com/apache/arrow-datafusion/pull/1483)[#1554](https://github.com/apache/arrow-datafusion/pull/1554)[#1640](https://github.com/apache/arrow-datafusion/pull/1640)
+ - Support for reading Parquet files with evolved schemas [#1622](https://github.com/apache/arrow-datafusion/pull/1622)[#1709](https://github.com/apache/arrow-datafusion/pull/1709)
+ - Support for registering `DataFrame` as table [#1699](https://github.com/apache/arrow-datafusion/pull/1699)
+ - Support for the `substring` function [#1621](https://github.com/apache/arrow-datafusion/pull/1621)
+ - Support `array_agg(distinct ...)` [#1579](https://github.com/apache/arrow-datafusion/pull/1579)
+ - Support `sort` on unprojected columns [#1415](https://github.com/apache/arrow-datafusion/pull/1415)
+- Additional Integration Points
+ - A new public Expression simplification API [#1717](https://github.com/apache/arrow-datafusion/pull/1717)
+- [DataFusion-Contrib](https://github.com/datafusion-contrib)
+ - A new GitHub organization created as a home for both `DataFusion` extensions and as a testing ground for new features.
+ - Extensions
+ - [DataFusion-Python](https://github.com/datafusion-contrib/datafusion-python)
+ - [DataFusion-Java](https://github.com/datafusion-contrib/datafusion-java)
+ - [DataFusion-hdsfs-native](https://github.com/datafusion-contrib/datafusion-hdfs-native)
+ - [DataFusion-ObjectStore-s3](https://github.com/datafusion-contrib/datafusion-objectstore-s3)
+ - New Features
+ - [DataFusion-Streams](https://github.com/datafusion-contrib/datafusion-streams)
+- [Arrow2](https://github.com/jorgecarleitao/arrow2)
+ - An [Arrow2 Branch](https://github.com/apache/arrow-datafusion/tree/arrow2) has been created. There are ongoing discussions in [DataFusion](https://github.com/apache/arrow-datafusion/issues/1532) and [arrow-rs](https://github.com/apache/arrow-rs/issues/1176) about migrating `DataFusion` to `Arrow2`
+
+# Documentation and Roadmap
+
+We are working to consolidate the documentation into the [official site](https://arrow.apache.org/datafusion). You can find more details there on topics such as the [SQL status](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and a [user guide](https://arrow.apache.org/datafusion/user-guide/introduction.html#introduction). This is also an area we would love to get help from the broader community [#1821](https://github.com/apache/arrow-datafusion/issues/1821).
+
+To provide transparency on DataFusion’s priorities to users and developers a three month roadmap will be published at the beginning of each quarter. This can be found here [here](https://arrow.apache.org/datafusion/specification/roadmap.html).
+
+# Upcoming Attractions
+
+- Ballista is gaining momentum, and several groups are now evaluating and contributing to the project.
+ - Some of the proposed improvements
+ - [Improvements Overview](https://github.com/apache/arrow-datafusion/issues/1701)
+ - [Extensibility](https://github.com/apache/arrow-datafusion/issues/1675)
+ - [File system access](https://github.com/apache/arrow-datafusion/issues/1702)
+ - [Cluster state](https://github.com/apache/arrow-datafusion/issues/1704)
+- Continued improvements for working with limited resources and large datasets
+ - Memory limited joins[#1599](https://github.com/apache/arrow-datafusion/issues/1599)
+ - Sort-merge join[#141](https://github.com/apache/arrow-datafusion/issues/141)[#1776](https://github.com/apache/arrow-datafusion/pull/1776)
+ - Introduce row based bytes representation [#1708](https://github.com/apache/arrow-datafusion/pull/1708)
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion, and learning about state of
+the art query processing, we would love to have you join us on the journey! You
+can help by trying out DataFusion on some of your own data and projects and let us know how it goes or contribute a PR with documentation, tests or code. A list of open issues suitable for beginners is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+
+Check out our new [Communication Doc](https://arrow.apache.org/datafusion/community/communication.html) on more
+ways to engage with the community.