Posted to commits@arrow.apache.org by ag...@apache.org on 2021/08/20 13:57:46 UTC

[arrow-site] branch master updated: ARROW-13656: [Website] Blog posts for DataFusion 5.0.0 and Ballista 0.5.0 (#140)

This is an automated email from the ASF dual-hosted git repository.

agrove pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 343e8d6  ARROW-13656: [Website] Blog posts for DataFusion 5.0.0 and Ballista 0.5.0 (#140)
343e8d6 is described below

commit 343e8d6dcfb923ebfb53cf6b22c2e2ab0ef9dad4
Author: Andy Grove <ag...@apache.org>
AuthorDate: Fri Aug 20 07:56:04 2021 -0600

    ARROW-13656: [Website] Blog posts for DataFusion 5.0.0 and Ballista 0.5.0 (#140)
---
 _posts/2021-08-18-ballista-0.5.0.md   |  83 +++++++++++++++++++++++++
 _posts/2021-08-18-datafusion-5.0.0.md | 114 ++++++++++++++++++++++++++++++++++
 img/2021-08-18-datafusion500perf.png  | Bin 0 -> 21245 bytes
 3 files changed, 197 insertions(+)

diff --git a/_posts/2021-08-18-ballista-0.5.0.md b/_posts/2021-08-18-ballista-0.5.0.md
new file mode 100644
index 0000000..b131086
--- /dev/null
+++ b/_posts/2021-08-18-ballista-0.5.0.md
@@ -0,0 +1,83 @@
+---
+layout: post
+title: Apache Arrow Ballista 0.5.0 Release
+date: "2021-08-18 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+Ballista extends DataFusion to provide support for distributed queries. This is the first release of Ballista since 
+the project was [donated](https://arrow.apache.org/blog/2021/04/12/ballista-donation/) to the Apache Arrow project 
+and includes 80 commits from 11 contributors.
+
+```
+git shortlog -sn 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler
+  27  Andy Grove
+  15  Jiayu Liu
+  12  Andrew Lamb
+   8  Ximo Guanter
+   6  Daniël Heres
+   5  QP Hou
+   2  Jorge Leitao
+   1  Javier Goday
+   1  K.I. (Dennis) Jung
+   1  Mike Seddon
+   1  sathis
+```
+
+<!--
+$ git log --pretty=oneline 4.0.0..5.0.0 ballista/rust/client ballista/rust/core ballista/rust/executor ballista/rust/scheduler ballista-examples/ | wc -l
+80
+-->
+
+The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes 
+and improvements have been made: we refer you to the [complete changelog](https://github.com/apache/arrow-datafusion/blob/5.0.0/ballista/CHANGELOG.md).
+
+# Performance and Scalability
+
+Ballista is now capable of running complex SQL queries at scale and supports scalable distributed joins. We have been 
+benchmarking using individual queries from the TPC-H benchmark at scale factors up to 1000 (1 TB). When running against 
+CSV files, performance is generally very close to DataFusion, and significantly faster in some cases because
+the scheduler limits the number of concurrent tasks that run at any given time. Performance against large Parquet
+datasets is currently not ideal due to some issues ([#867](https://github.com/apache/arrow-datafusion/issues/867),
+[#868](https://github.com/apache/arrow-datafusion/issues/868)) that we hope to resolve for the next release. 
+
+# New Features
+
+The main new features in this release are:
+
+- Ballista queries can now be executed by calling DataFrame.collect()
+- The shuffle mechanism has been re-implemented
+- Distributed hash-partitioned joins are now supported
+- KEDA autoscaling is supported
+
+To get started with Ballista, refer to the [crate documentation](https://docs.rs/ballista/0.5.0/ballista/).
+
+Now that the basic functionality is in place, the focus for the next release will be on improving performance,
+scalability, and documentation.
+
+# How to Get Involved
+
+If you are interested in contributing to Ballista, we would love to have you! You
+can help by trying out Ballista on some of your own data and projects and filing bug reports, or by
+contributing to the documentation, tests, or code. A list of open issues suitable for
+beginners is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22)
+and the full list is [here](https://github.com/apache/arrow-datafusion/issues).
\ No newline at end of file
diff --git a/_posts/2021-08-18-datafusion-5.0.0.md b/_posts/2021-08-18-datafusion-5.0.0.md
new file mode 100644
index 0000000..53c8f58
--- /dev/null
+++ b/_posts/2021-08-18-datafusion-5.0.0.md
@@ -0,0 +1,114 @@
+---
+layout: post
+title: Apache Arrow DataFusion 5.0.0 Release
+date: "2021-08-18 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the DataFusion 5.0.0 release. This covers 4 months of development work 
+and includes 211 commits from the following 31 distinct contributors.
+
+```
+$ git shortlog -sn 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples
+    61  Jiayu Liu
+    47  Andrew Lamb
+    27  Daniël Heres
+    13  QP Hou
+    13  Andy Grove
+     4  Javier Goday
+     4  sathis
+     3  Ruan Pearce-Authers
+     3  Raphael Taylor-Davies
+     3  Jorge Leitao
+     3  Cui Wenzheng
+     3  Mike Seddon
+     3  Edd Robinson
+     2  思维
+     2  Liang-Chi Hsieh
+     2  Michael Lu
+     2  Parth Sarthy
+     2  Patrick More
+     2  Rich
+     1  Charlie Evans
+     1  Gang Liao
+     1  Agata Naomichi
+     1  Ritchie Vink
+     1  Evan Chan
+     1  Ruihang Xia
+     1  Todd Treece
+     1  Yichen Wang
+     1  baishen
+     1  Nga Tran
+     1  rdettai
+     1  Marco Neumann
+```
+
+<!--
+$ git log --pretty=oneline 4.0.0..5.0.0 datafusion datafusion-cli datafusion-examples | wc -l
+     211
+-->
+
+The release notes below are not exhaustive and cover only selected highlights of the release. Many other bug fixes 
+and improvements have been made: we refer you to the complete 
+[changelog](https://github.com/apache/arrow-datafusion/blob/5.0.0/datafusion/CHANGELOG.md).
+
+# Performance
+
+There have been numerous performance improvements in this release. The following chart shows the relative 
+performance of individual TPC-H queries compared to the previous release.
+
+<i>TPC-H @ scale factor 100, in Parquet format. Concurrency 24.</i>
+
+<img src="{{ site.baseurl }}/img/2021-08-18-datafusion500perf.png" />
+
+We also extended support for more TPC-H queries: q7, q8, q9 and q13 now run successfully in DataFusion 5.0.
+
+# New Features
+
+- Initial support for SQL-99 Analytics (WINDOW functions)
+- Improved JOIN support: cross join, semi-join, anti join, and fixes to null handling
+- Improved EXPLAIN support
+- Initial implementation of metrics in the physical plan
+- Support for SELECT DISTINCT
+- Support for JSON and NDJSON formatted inputs
+- Query column with relations
+- Added more datetime related functions: `now`, `date_trunc`, `to_timestamp_millis`, `to_timestamp_micros`, `to_timestamp_seconds`
+- Streaming DataFrame.collect
+- Support table column aliases
+- Answer count(*), min() and max() queries using only statistics
+- Non-equi-join filters in JOIN conditions
+- Modulus operation
+- Support group by column positions
+- Added constant folding query optimizer
+- Hash partitioned aggregation
+- Added `random` SQL function
+- Implemented count distinct for floats and dictionary types
+- Re-exported arrow and parquet crates in DataFusion
+- General row group pruning logic that’s agnostic to storage format
+
+# How to Get Involved
+
+If you are interested in contributing to DataFusion, we would love to have you! You 
+can help by trying out DataFusion on some of your own data and projects and filing bug reports, or by
+contributing to the documentation, tests, or code. A list of open issues suitable for
+beginners is [here](https://github.com/apache/arrow-datafusion/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) 
+and the full list is [here](https://github.com/apache/arrow-datafusion/issues).
\ No newline at end of file
diff --git a/img/2021-08-18-datafusion500perf.png b/img/2021-08-18-datafusion500perf.png
new file mode 100644
index 0000000..4406473
Binary files /dev/null and b/img/2021-08-18-datafusion500perf.png differ