Posted to commits@arrow.apache.org by ag...@apache.org on 2021/08/20 13:57:46 UTC
[arrow-site] branch master updated: ARROW-13656: [Website] Blog
posts for DataFusion 5.0.0 and Ballista 0.5.0 (#140)
This is an automated email from the ASF dual-hosted git repository.
agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git
The following commit(s) were added to refs/heads/master by this push:
new 343e8d6 ARROW-13656: [Website] Blog posts for DataFusion 5.0.0 and Ballista 0.5.0 (#140)
343e8d6 is described below
commit 343e8d6dcfb923ebfb53cf6b22c2e2ab0ef9dad4
Author: Andy Grove <ag...@apache.org>
AuthorDate: Fri Aug 20 07:56:04 2021 -0600
ARROW-13656: [Website] Blog posts for DataFusion 5.0.0 and Ballista 0.5.0 (#140)
---
_posts/2021-08-18-ballista-0.5.0.md | 83 +++++++++++++++++++++++++
_posts/2021-08-18-datafusion-5.0.0.md | 114 ++++++++++++++++++++++++++++++++++
img/2021-08-18-datafusion500perf.png | Bin 0 -> 21245 bytes
3 files changed, 197 insertions(+)
diff --git a/_posts/2021-08-18-ballista-0.5.0.md b/_posts/2021-08-18-ballista-0.5.0.md
new file mode 100644
index 0000000..b131086
--- /dev/null
+++ b/_posts/2021-08-18-ballista-0.5.0.md
@@ -0,0 +1,83 @@
+---
+layout: post
+title: Apache Arrow Ballista 0.5.0 Release
+date: "2021-08-18 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since
+the project was [donated](https://arrow.apache.org/blog/2021/04/12/ballista-donation/) to the Apache Arrow project
+and includes 80 commits from 11 contributors.
+
+```
+$ git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
+ 27 Andy Grove
+ 15 Jiayu Liu
+ 12 Andrew Lamb
+ 8 Ximo Guanter
+ 6 Daniël Heres
+ 5 QP Hou
+ 2 Jorge Leitao
+ 1 Javier Goday
+ 1 K.I. (Dennis) Jung
+ 1 Mike Seddon
+ 1 sathis
+```
+
+<!--
+$ git log --pretty=oneline 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler ballista-examples/ | wc -l
+80
+-->
+
+The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes
+and improvements have been made; see the [complete changelog](https://github.com/apache/arrow-datafusion/blob/5.0.0/ballista/CHANGELOG.md).
+
+# Performance and Scalability
+
+Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been
+benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against
+CSV files, performance is generally very close to DataFusion, and significantly faster in some cases because
+the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet
+datasets is currently not ideal due to some issues ([#867](https://github.com/apache/arrow-datafusion/issues/867),
+[#868](https://github.com/apache/arrow-datafusion/issues/868)) that we hope to resolve for the next release.
+
+# New Features
+
+The main new features in this release are:
+
+- Ballista queries can now be executed by calling DataFrame.collect()
+- The shuffle mechanism has been re-implemented
+- Distributed hash-partitioned joins are now supported
+- KEDA autoscaling is supported
+
+To get started with Ballista, refer to the [crate documentation](https://docs.rs/ballista/0.5.0/ballista/).
+
+Now that the basic functionality is in place, the focus for the next release will be on improving performance,
+scalability, and the documentation.
+
+# How to Get Involved
+
+If you are interested in contributing to Ballista, we would love to have you! You
+can help by trying out Ballista on some of your own data and projects and filing bug reports, or by
+contributing to the documentation, tests, or code. A list of open issues suitable for
+beginners is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+and the full list is [here](https://github.com/apache/arrow-datafusion/issues).
\ No newline at end of file
diff --git a/_posts/2021-08-18-datafusion-5.0.0.md b/_posts/2021-08-18-datafusion-5.0.0.md
new file mode 100644
index 0000000..53c8f58
--- /dev/null
+++ b/_posts/2021-08-18-datafusion-5.0.0.md
@@ -0,0 +1,114 @@
+---
+layout: post
+title: Apache Arrow DataFusion 5.0.0 Release
+date: "2021-08-18 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers four months of development work
+and includes 211 commits from the following 31 contributors:
+
+```
+$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
+ 61 Jiayu Liu
+ 47 Andrew Lamb
+ 27 Daniël Heres
+ 13 QP Hou
+ 13 Andy Grove
+ 4 Javier Goday
+ 4 sathis
+ 3 Ruan Pearce-Authers
+ 3 Raphael Taylor-Davies
+ 3 Jorge Leitao
+ 3 Cui Wenzheng
+ 3 Mike Seddon
+ 3 Edd Robinson
+ 2 思维
+ 2 Liang-Chi Hsieh
+ 2 Michael Lu
+ 2 Parth Sarthy
+ 2 Patrick More
+ 2 Rich
+ 1 Charlie Evans
+ 1 Gang Liao
+ 1 Agata Naomichi
+ 1 Ritchie Vink
+ 1 Evan Chan
+ 1 Ruihang Xia
+ 1 Todd Treece
+ 1 Yichen Wang
+ 1 baishen
+ 1 Nga Tran
+ 1 rdettai
+ 1 Marco Neumann
+```
+
+<!--
+$ git log --pretty=oneline 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples | wc -l
+ 211
+-->
+
+The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes
+and improvements have been made; see the complete
+[changelog](https://github.com/apache/arrow-datafusion/blob/5.0.0/datafusion/CHANGELOG.md).
+
+# Performance
+
+There have been numerous performance improvements in this release. The following chart shows the relative
+performance of individual TPC-H queries compared to the previous release.
+
+<i>TPC-H @ scale factor 100, in Parquet format. Concurrency 24.</i>
+
+<img src="{{ site.baseurl }}/img/2021-08-18-datafusion500perf.png" />
+
+We also extended support for more TPC-H queries: q7, q8, q9 and q13 now run successfully in DataFusion 5.0.
+
+# New Features
+
+- Initial support for SQL-99 Analytics (WINDOW functions)
+- Improved JOIN support: cross join, semi-join, anti join, and fixes to null handling
+- Improved EXPLAIN support
+- Initial implementation of metrics in the physical plan
+- Support for SELECT DISTINCT
+- Support for JSON and NDJSON formatted inputs
+- Support for querying columns qualified with their relation (table) name
+- Added more datetime related functions: `now`, `date_trunc`, `to_timestamp_millis`, `to_timestamp_micros`, `to_timestamp_seconds`
+- Streaming `DataFrame.collect`
+- Support table column aliases
+- Answer count(*), min() and max() queries using only statistics
+- Non-equi-join filters in JOIN conditions
+- Modulus operation
+- Support group by column positions
+- Added constant folding query optimizer
+- Hash partitioned aggregation
+- Added `random` SQL function
+- Implemented count distinct for floats and dictionary types
+- Re-exported arrow and parquet crates in DataFusion
+- General row group pruning logic that’s agnostic to storage format
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion, we would love to have you! You
+can help by trying out DataFusion on some of your own data and projects and filing bug reports, or by
+contributing to the documentation, tests, or code. A list of open issues suitable for
+beginners is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+and the full list is [here](https://github.com/apache/arrow-datafusion/issues).
\ No newline at end of file
diff --git a/img/2021-08-18-datafusion500perf.png b/img/2021-08-18-datafusion500perf.png
new file mode 100644
index 0000000..4406473
Binary files /dev/null and b/img/2021-08-18-datafusion500perf.png differ