Posted to commits@arrow.apache.org by np...@apache.org on 2021/05/04 17:16:30 UTC

[arrow-site] branch master updated: 4.0 release blog post (#104)

This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new c6c092e  4.0 release blog post (#104)
c6c092e is described below

commit c6c092e2b9ae0e384c1416ebc91feb28aceb4c42
Author: Neal Richardson <ne...@gmail.com>
AuthorDate: Tue May 4 10:16:24 2021 -0700

    4.0 release blog post (#104)
    
    Co-authored-by: David Li <li...@gmail.com>
    Co-authored-by: Ian Cook <ia...@gmail.com>
    Co-authored-by: Sutou Kouhei <ko...@clear-code.com>
    Co-authored-by: emkornfield <em...@gmail.com>
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    Co-authored-by: Ying Zhou <yi...@gmail.com>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Andy Grove <an...@users.noreply.github.com>
    Co-authored-by: Kenta Murata <mr...@users.noreply.github.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
---
 _posts/2021-04-26-4.0.0-release.md | 326 +++++++++++++++++++++++++++++++++++++
 1 file changed, 326 insertions(+)

diff --git a/_posts/2021-04-26-4.0.0-release.md b/_posts/2021-04-26-4.0.0-release.md
new file mode 100644
index 0000000..5d7b73a
--- /dev/null
+++ b/_posts/2021-04-26-4.0.0-release.md
@@ -0,0 +1,326 @@
+---
+layout: post
+title: "Apache Arrow 4.0.0 Release"
+date: "2021-05-03 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 4.0.0 release. This covers
+3 months of development work and includes [**711 resolved issues**][1]
+from [**114 distinct contributors**][2]. See the [install page][6] to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 3.0.0 release, Yibo Cai, Ian Cook, and Jonathan Keane
+have been invited as committers to Arrow,
+and Andrew Lamb and Jorge Leitão have joined the Project Management Committee
+(PMC). Thank you for all of your contributions!
+
+## Arrow Flight RPC notes
+
+In Java, applications can now enable zero-copy optimizations when writing
+data (ARROW-11066). This potentially breaks source compatibility, so it is
+not enabled by default.
+
+Arrow Flight is now packaged for C#/.NET.
+
+## Linux packages notes
+
+Linux packages for C++ and C GLib were previously distributed via Bintray,
+but [Bintray is no longer available as of 2021-05-01][5]. They are now
+distributed via Artifactory. Users need to update their install instructions
+because the download URL has changed. See [the install page][6] for the new
+instructions. Here is a summary of the needed changes.
+
+For Debian GNU/Linux and Ubuntu users:
+
+  * Users need to update the `apache-arrow-archive-keyring` install instructions:
+    * The package name has changed to `apache-arrow-apt-source`.
+    * The download URL has changed from `https://apache.bintray.com/arrow/...` to `https://apache.jfrog.io/artifactory/arrow/...`.
+
+For CentOS and Red Hat Enterprise Linux users:
+
+  * Users need to update the `apache-arrow-release` install instructions:
+    * The download URL has changed from `https://apache.bintray.com/arrow/...` to `https://apache.jfrog.io/artifactory/arrow/...`.
+
+## C++ notes
+
+The Arrow C++ library now includes a [`vcpkg.json`](https://github.com/apache/arrow/blob/master/cpp/vcpkg.json)
+manifest file and a new CMake option `-DARROW_DEPENDENCY_SOURCE=VCPKG` to
+simplify installation of dependencies using the [vcpkg](https://github.com/microsoft/vcpkg)
+package manager. This provides an alternative means of installing C++ library
+dependencies on Linux, macOS, and Windows. See the
+[Building Arrow C++]({{ site.baseurl }}/docs/developers/cpp/building.html)
+and [Developing on Windows]({{ site.baseurl }}/docs/developers/cpp/windows.html)
+docs pages for details.
+
+The default memory allocator on macOS has been changed from jemalloc to mimalloc,
+yielding performance benefits on a range of macro-benchmarks (ARROW-12316).
+
+Non-monotonic dense union offsets are now disallowed as per the Arrow format
+specification, and return an error in `Array::ValidateFull` (ARROW-10580).
+
+### Compute layer
+
+Compute kernels now perform automatic implicit casting (ARROW-8919). For
+example, when adding two arrays of different numeric types, the arrays are
+first cast to a common numeric type instead of raising an error.
+
+Compute functions `quantile` (ARROW-10831) and `power` (ARROW-11070) have been
+added for numeric data.
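+
+A minimal sketch of these kernels through the Python bindings, assuming they
+are exposed in `pyarrow.compute` under the same names as their C++
+counterparts and accept option fields as keyword arguments:
+
+```python
+import pyarrow as pa
+import pyarrow.compute as pc
+
+ints = pa.array([1, 2, 3], type=pa.int32())
+floats = pa.array([0.5, 1.5, 2.5], type=pa.float64())
+
+# Mixed-type addition: the int32 array is implicitly cast to the common
+# numeric type (float64) instead of raising an error.
+pc.add(ints, floats)                                   # [1.5, 3.5, 5.5]
+
+# Newly added numeric kernels.
+pc.power(ints, pa.array([2, 2, 2], type=pa.int32()))   # [1, 4, 9]
+pc.quantile(floats, q=0.5)                             # median -> [1.5]
+```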
+
+Compute functions for string processing have been added for:
+
+* Trimming characters (ARROW-9128).
+* Extracting substrings captured by a regex pattern (`extract_regex`, ARROW-10195).
+* Computing UTF8 string lengths (`utf8_length`, ARROW-11693).
+* Matching strings against a regex pattern (`match_substring_regex`, ARROW-12134).
+* Replacing non-overlapping substrings that match a literal pattern or regular
+  expression (`replace_substring` and `replace_substring_regex`, ARROW-10306).
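+
+A minimal sketch of a few of these kernels through the Python bindings,
+assuming they are exposed in `pyarrow.compute` under the names shown, with
+option fields accepted as keyword arguments:
+
+```python
+import pyarrow as pa
+import pyarrow.compute as pc
+
+s = pa.array(["  Arrow  ", "apache arrow", None])
+
+pc.utf8_trim_whitespace(s)                       # ["Arrow", "apache arrow", null]
+pc.utf8_length(s)                                # [9, 12, null]
+pc.match_substring_regex(s, pattern="^a.*w$")    # [false, true, null]
+pc.replace_substring(s, pattern="arrow", replacement="Arrow")
+```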
+
+It is now possible to sort decimal and fixed-width binary data (ARROW-11662).
+
+The precision of the `sum` kernel was improved (ARROW-11758).
+
+### CSV
+
+A CSV writer has been added (ARROW-2229).
+
+The CSV reader can now infer timestamp columns with fractional seconds (ARROW-12031).
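+
+This inference is also visible through the Python bindings; a minimal sketch
+(the exact inferred timestamp unit may differ):
+
+```python
+import io
+import pyarrow.csv as csv
+
+data = io.BytesIO(b"ts\n2021-05-04 10:16:24.123\n")
+table = csv.read_csv(data)
+# The "ts" column is now inferred as a timestamp type rather than a string.
+print(table.schema)
+```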
+
+### Dataset
+
+Arrow Datasets received various performance improvements and new
+features. Some highlights:
+
+* New columns can be projected from arbitrary expressions at scan time
+  (ARROW-11174)
+* Read performance was improved for Parquet on high-latency
+  filesystems (ARROW-11601) and in general when there are thousands of
+  files or more (ARROW-8658)
+* Null partition keys can be written (ARROW-10438)
+* Compressed CSV files can be read (ARROW-10372)
+* Filesystems support async operations (ARROW-10846)
+* Usage and API documentation were added (ARROW-11677)
+
+### Files and filesystems
+
+Fixed some rare instances where GZip files could not be read properly (ARROW-12169).
+
+Support for setting S3 proxy parameters has been added (ARROW-8900).
+
+The HDFS filesystem is now able to write more than 2GB of data at a time
+(ARROW-11391).
+
+### IPC
+
+The IPC reader now supports reading data with dictionaries shared between
+different schema fields (ARROW-11838).
+
+The IPC reader now supports optional endian conversion when receiving IPC
+data represented with a different endianness. It is therefore possible to
+exchange Arrow data between systems with different endiannesses (ARROW-8797).
+
+The IPC file writer now optionally unifies dictionaries when writing a
+file in a single shot, instead of erroring out if unequal dictionaries are
+encountered (ARROW-10406).
+
+An interoperability issue with the C# implementation was fixed (ARROW-12100).
+
+### JSON
+
+A possible crash when reading a line-separated JSON file has been fixed (ARROW-12065).
+
+### ORC
+
+The Arrow C++ library now includes an ORC file writer, making it possible
+to both read and write ORC files from/to Arrow data.
+
+### Parquet
+
+The Parquet C++ library version is now synced with the Arrow version (ARROW-7830).
+
+Parquet DECIMAL statistics were previously calculated incorrectly; this
+has now been fixed (PARQUET-1655).
+
+Initial support for high-level Parquet encryption APIs similar to those
+in parquet-mr is available (ARROW-9318).
+
+## C# notes
+
+Arrow Flight is now packaged for C#/.NET.
+
+## Go notes
+
+The Go implementation now supports IPC buffer compression.
+
+## Java notes
+
+Java now supports IPC buffer compression (ZSTD is recommended, as the current LZ4 implementation is quite slow).
+
+## JavaScript notes
+
+* The Arrow JS module is now tree-shakeable.
+* Iterating over Tables or Vectors is ~2X faster. [Demo](https://observablehq.com/@domoritz/arrow-js-3-vs-4-iterator)
+* The default bundles use modern JS.
+
+## Python notes
+
+- Limited support for writing CSV files is now available (only for column
+  types that have a cast implementation to string); see the example after
+  this list.
+- When writing Parquet list types, there is now an option to enable the
+  canonical group naming defined by the Parquet specification.
+- The ORC writer is now available.
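+
+A minimal sketch of CSV writing from Python (the output path is
+illustrative):
+
+```python
+import pyarrow as pa
+from pyarrow import csv
+
+table = pa.table({"id": [1, 2, 3], "name": ["a", "b", "c"]})
+
+# Columns must have types that can be cast to strings.
+csv.write_csv(table, "example.csv")
+```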
+
+Creating a dataset with `pyarrow.dataset.write_dataset` is now possible from a
+Python iterator of record batches (ARROW-10882).
+The Dataset interface now supports custom projections with expressions when
+scanning (ARROW-11750), and expressions gained basic support for arithmetic
+operations (e.g. `ds.field('a') / ds.field('b')`) (ARROW-12058). See
+the [Dataset docs][7] for more details.
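+
+A minimal sketch of both features, assuming a local output directory named
+`my_dataset` (the path is illustrative):
+
+```python
+import pyarrow as pa
+import pyarrow.dataset as ds
+
+schema = pa.schema([("a", pa.int64()), ("b", pa.int64())])
+
+def batches():
+    # Any Python iterator of record batches can be consumed lazily.
+    for i in range(3):
+        yield pa.record_batch([pa.array([i, i + 1]), pa.array([10, 20])],
+                              schema=schema)
+
+# The schema must be given explicitly when writing from an iterator.
+ds.write_dataset(batches(), "my_dataset", schema=schema, format="parquet")
+
+# Expressions now support basic arithmetic and can be used when filtering
+# or projecting during a scan.
+ratio = ds.field("a") / ds.field("b")
+```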
+
+See the C++ notes above for additional details.
+
+## R notes
+
+The `dplyr` interface to Arrow data gained many new features in this release, including support for `mutate()`, `relocate()`, and more. You can also call over 100 functions supported by the Arrow C++ library inside `filter()` or `mutate()`, and many string functions are available under both their base R (`grepl()`, `gsub()`, etc.) and `stringr` (`str_detect()`, `str_replace()`) spellings.
+
+Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use `write_dataset()` to partition a very large file without having to read the whole file into memory.
+
+For more on what’s in the 4.0.0 R package, see the [R changelog][4].
+
+## C GLib and Ruby notes
+
+### C GLib
+
+In Arrow GLib version 4.0.0, the following changes are introduced in addition to the changes inherited from Arrow C++.
+
+- gandiva-glib supports filtering by using the newly introduced `GGandivaFilter`, `GGandivaCondition`, and `GGandivaSelectableProjector`
+- The `input` property is added in `GArrowCSVReader` and `GArrowJSONReader`
+- GNU Autotools (i.e. the `configure` script) support is dropped
+- `GADScanContext` is removed, and `use_threads` property is moved to `GADScanOptions`
+- `garrow_chunked_array_combine` function is added
+- `garrow_array_concatenate` function is added
+- `GADFragment` and its subclass `GADInMemoryFragment` are added
+- `GADScanTask` now holds the corresponding `GADFragment`
+- `gad_scan_options_replace_schema` function is removed
+- The name of `Decimal128DataType` is changed to `decimal128`
+
+### Ruby
+
+In Red Arrow version 4.0.0, the following changes are introduced in addition to the changes inherited from Arrow GLib.
+
+- `ArrowDataset::ScanContext` is removed, and `use_threads` attribute is moved to `ArrowDataset::ScanOptions`
+- `Arrow::Array#concatenate` is added; it can concatenate not only an `Arrow::Array` but also a normal `Array`
+- `Arrow::SortKey` and `Arrow::SortOptions` are added for accepting Ruby objects as sort key and options
+- `ArrowDataset::InMemoryFragment` is added
+
+## Rust notes
+
+This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the details needed to publish the Rust crates to crates.io at a more regular cadence. In addition, we officially welcomed the [Ballista distributed compute project](https://ballistacompute.org/) to the fold.
+
+### Arrow
+
+- Improved LargeUtf8 support
+- Improved null handling in AND/OR kernels
+- Added JSON writer support (ARROW-11310)
+- JSON reader improvements
+  - LargeUTF8 support
+  - Improved schema inference for nested list and struct types
+- Various performance improvements
+- IPC writer no longer calls finish() implicitly on drop
+- Compute kernels
+  - Support for optional `limit` in sort kernel
+  - Divide by a single scalar
+  - Support for casting to timestamps
+  - Cast: Improved support between casting List, LargeList, Int32, Int64, Date64
+  - Kernel to combine two arrays based on boolean mask
+  - Pow kernel
+- `new_null_array` for creating Arrays full of nulls.
+
+### Parquet
+
+- Added support for filtering row groups (used by DataFusion to implement filter pushdown)
+- Added support for Parquet v2.0 logical types
+
+### DataFusion
+
+New Features
+
+- SQL Support
+  - CTEs
+  - UNION
+  - HAVING
+  - EXTRACT
+  - SHOW TABLES
+  - SHOW COLUMNS
+  - INTERVAL
+  - SQL Information schema
+  - Support GROUP BY for more data types, including dictionary columns, boolean, Date32
+- Extensibility API
+  - Catalogs and schemas support
+  - Table deregistration
+  - Better support for multiple optimizers
+  - User defined functions can now provide specialized implementations for scalar values
+- Physical Plans
+  - Hash Repartitioning
+  - SQL Metrics
+- Additional Postgres-compatible function library:
+  - Length functions
+  - Pad/trim functions
+  - Concat functions
+  - Ascii/Unicode functions
+  - Regex
+- Proper identifier case identification (e.g. `"Foo"` vs `Foo` vs `foo`)
+- Upgraded to Tokio 1.x
+
+Performance Improvements
+
+- LIMIT pushdown
+- Constant folding
+- Partitioned hash join
+- Create hashes vectorized in hash join
+- Improve parallelism using repartitioning pass
+- Improved hash aggregate performance with large number of grouping values
+- Predicate pushdown support for table scans
+- Predicate pushdown to Parquet enables DataFusion to quickly eliminate entire Parquet row groups based on query filter expressions and Parquet row group min/max statistics
+
+API Changes
+
+- DataFrame methods now take `Vec<Expr>` rather than `&[Expr]`
+- TableProvider now consistently uses `Arc<TableProvider>` rather than `Box<TableProvider>`
+
+### Ballista
+
+Ballista was donated shortly before the Arrow 4.0.0 release, so there is no new release of Ballista as part of Arrow 4.0.0.
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%204.0.0
+[2]: {{ site.baseurl }}/release/4.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/4.0.0.html
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://jfrog.com/blog/into-the-sunset-bintray-jcenter-gocenter-and-chartcenter/
+[6]: {{ site.baseurl }}/install/
+[7]: https://arrow.apache.org/docs/python/dataset.html#projecting-columns