Posted to commits@arrow.apache.org by np...@apache.org on 2021/02/03 19:34:45 UTC

[arrow-site] branch master updated: 3.0 release blog post (#92)

This is an automated email from the ASF dual-hosted git repository.

npr pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 593109a  3.0 release blog post (#92)
593109a is described below

commit 593109a5807dc3eb35e52d0a5201581e02f13a10
Author: Neal Richardson <ne...@gmail.com>
AuthorDate: Wed Feb 3 11:34:34 2021 -0800

    3.0 release blog post (#92)
    
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Jacob Quinn <qu...@gmail.com>
    Co-authored-by: Kenta Murata <mr...@users.noreply.github.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Co-authored-by: David Li <li...@gmail.com>
    Co-authored-by: Eric Erhardt <er...@microsoft.com>
    Co-authored-by: Benjamin Kietzman <be...@gmail.com>
    Co-authored-by: Andy Grove <an...@users.noreply.github.com>
---
 .github/workflows/deploy.yml       |   2 +
 _posts/2021-01-25-3.0.0-release.md | 339 +++++++++++++++++++++++++++++++++++++
 2 files changed, 341 insertions(+)

diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
index 374846a..0d113fa 100644
--- a/.github/workflows/deploy.yml
+++ b/.github/workflows/deploy.yml
@@ -30,6 +30,8 @@ jobs:
     steps:
       - uses: actions/checkout@master
       - uses: actions/setup-ruby@v1
+        with:
+          ruby-version: "2.7"
       - name: Install dependencies
         run: |
           bundle install
diff --git a/_posts/2021-01-25-3.0.0-release.md b/_posts/2021-01-25-3.0.0-release.md
new file mode 100644
index 0000000..1004b69
--- /dev/null
+++ b/_posts/2021-01-25-3.0.0-release.md
@@ -0,0 +1,339 @@
+---
+layout: post
+title: "Apache Arrow 3.0.0 Release"
+date: "2021-01-25 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 3.0.0 release. This covers
+over 3 months of development work and includes [**666 resolved issues**][1]
+from [**106 distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Columnar Format Notes
+
+The Decimal256 data type, which was already supported by the Arrow columnar
+format specification, is now implemented in C++ and Java (ARROW-9747).
+
+## Arrow Flight RPC notes
+
+Authentication in C++/Java/Python has been overhauled, allowing more flexible authentication methods and use of standard headers.
+Support for cookies has also been added.
+The C++/Java implementations are now more permissive when parsing messages in order to interoperate better with other Flight implementations.
+
+A basic Flight implementation for C#/.NET has been added.
+See the [implementation status matrix](https://arrow.apache.org/docs/status.html#flight-rpc) for details.
+
+## C++ notes
+
+The default memory pool can now be changed at runtime using the environment
+variable `ARROW_DEFAULT_MEMORY_POOL` (ARROW-11009).  The environment variable
+is inspected at process startup.  This is useful when trying to diagnose memory
+consumption issues with Arrow.
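+
+For example, from Python (a minimal sketch: the variable is read before the
+Arrow library is loaded, and the available allocator names depend on how the
+build was configured):
+
+```python
+import os
+
+# Must be set before the Arrow library is loaded; it is read at process startup.
+os.environ["ARROW_DEFAULT_MEMORY_POOL"] = "system"
+
+import pyarrow as pa
+
+# Reports which allocator backs the default pool, e.g. "system" or "jemalloc".
+print(pa.default_memory_pool().backend_name)
+```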
+
+STL-like iterators are now provided over concrete arrays. These are useful for
+non-performance-critical tasks such as testing (ARROW-10776).
+
+It is now possible to concatenate dictionary arrays with unequal dictionaries.
+The dictionaries are unified when concatenating, for supported data types
+(ARROW-5336).
+
+Threads in a thread pool are now spawned lazily as needed for enqueued
+tasks, up to the configured capacity. They used to be spawned upfront on
+creation of the thread pool (ARROW-10038).
+
+### Compute layer
+
+Comprehensive documentation for compute functions is now available:
+https://arrow.apache.org/docs/cpp/compute.html
+
+Compute functions for string processing have been added for:
+
+* splitting on whitespace (ASCII and Unicode flavors) and splitting on a
+  pattern (ARROW-9991);
+* trimming characters (ARROW-9128).
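+
+A minimal sketch of the whitespace-splitting kernel as exposed through
+`pyarrow.compute` (assuming the auto-exported name `utf8_split_whitespace`;
+the pattern-splitting and trimming kernels take additional options):
+
+```python
+import pyarrow as pa
+import pyarrow.compute as pc
+
+arr = pa.array(["hello world", "  several   words  ", None])
+
+# Unicode-aware whitespace splitting; each element becomes a list of strings.
+print(pc.utf8_split_whitespace(arr))
+```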
+
+Behavior of the `index_in` and `is_in` compute functions with nulls has been
+changed for consistency (ARROW-10663).
+
+Multiple-column sort kernels are now available for tables and record batches
+(ARROW-8199, ARROW-10796, ARROW-10790).
+
+Performance of table filtering has been vastly improved (ARROW-10569).
+
+Scalar arguments are now accepted for more compute functions.
+
+Compute functions `quantile` (ARROW-10831) and `is_nan` (ARROW-11043) have been
+added for numeric data.
+
+Aggregation functions `any` (ARROW-1846) and `all` (ARROW-10301) have been
+added for boolean data.
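+
+A minimal sketch of these kernels from Python, via the auto-exported
+`pyarrow.compute` wrappers (scalar arguments, as mentioned above, are
+illustrated with `add`; the exact option defaults are assumptions):
+
+```python
+import pyarrow as pa
+import pyarrow.compute as pc
+
+nums = pa.array([1.0, 2.0, float("nan"), 4.0])
+flags = pa.array([True, False, None])
+
+print(pc.is_nan(nums))     # element-wise NaN check
+print(pc.quantile(nums))   # 0.5 quantile (median) with default options
+print(pc.any(flags))       # true if any element is true
+print(pc.all(flags))       # true only if all elements are true
+print(pc.add(nums, 1.0))   # scalar argument broadcast against the array
+```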
+
+### Dataset
+
+The `Expression` hierarchy has been simplified to a wrapper around literals, field references,
+or calls to named functions. This makes it possible to use any compute function when filtering,
+with no boilerplate.
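+
+For example, in Python (a sketch with a hypothetical dataset path; field
+references and compute calls compose directly into filter expressions):
+
+```python
+import pyarrow.dataset as ds
+
+dataset = ds.dataset("/path/to/parquet_dir", format="parquet")
+
+# Only rows with x > 0 are materialized.
+table = dataset.to_table(filter=ds.field("x") > 0)
+```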
+
+Parquet statistics are lazily parsed in `ParquetDatasetFactory` and
+`ParquetFileFragment` for shorter construction time.
+
+### CSV
+
+Conversion of string columns is now faster thanks to faster UTF-8 validation
+of small strings (ARROW-10313).
+
+Conversion of floating-point columns is now faster thanks to optimized
+string-to-double conversion routines (ARROW-10328).
+
+Parsing of ISO8601 timestamps is now more liberal: trailing zeros can
+be omitted in the fractional part (ARROW-10337).
+
+Fixed a bug where null detection could give the wrong results on some platforms
+(ARROW-11067).
+
+Added type inference for Date32 columns for values in the form `YYYY-MM-DD`
+(ARROW-11247).
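+
+A small example of the new Date32 inference (a sketch reading from an
+in-memory buffer):
+
+```python
+import io
+import pyarrow.csv as csv
+
+data = b"day,value\n2021-01-25,1\n2021-01-26,2\n"
+table = csv.read_csv(io.BytesIO(data))
+print(table.schema)  # the "day" column should now be inferred as date32
+```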
+
+### Feather
+
+Fixed reading of compressed Feather files written with Arrow 0.17 (ARROW-11163).
+
+### Filesystem layer
+
+S3 recursive tree walks now benefit from a parallel implementation: reads of
+multiple child directories are issued concurrently (ARROW-10788).
+
+Improved empty directory detection to be mindful of differences between the
+Amazon and MinIO S3 implementations (ARROW-10942).
+
+### Flight RPC
+
+IPv6 host addresses are now supported (ARROW-10475).
+
+### IPC
+
+The IPC stream writer can now emit dictionary deltas where possible. This is
+governed by a new variable in the `IpcWriteOptions` class (ARROW-6883).
+
+It is now possible to read wider tables, which used to fail due to reaching a
+limit during Flatbuffers verification (ARROW-10056).
+
+### Parquet
+
+Fixed reading of LZ4-compressed Parquet columns emitted by the Java Parquet
+implementation (ARROW-11301).
+
+Fixed a bug where writing multiple batches of nullable nested strings to Parquet
+would not write any data in batches after the first one (ARROW-10493).
+
+The Decimal256 data type can be read from and written to Parquet (ARROW-10607).
+
+LargeString and LargeBinary data can now be written to Parquet (ARROW-10426).
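+
+For example, from Python (a sketch with a hypothetical file name):
+
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+table = pa.table({
+    "big_strings": pa.array(["a", "b"], type=pa.large_string()),
+})
+pq.write_table(table, "large_types.parquet")
+print(pq.read_table("large_types.parquet").schema)
+```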
+
+## C# notes
+
+The .NET package added initial support for Arrow Flight clients and servers. Support is enabled through two new NuGet packages [Apache.Arrow.Flight](https://www.nuget.org/packages/Apache.Arrow.Flight/) (client) and [Apache.Arrow.Flight.AspNetCore](https://www.nuget.org/packages/Apache.Arrow.Flight.AspNetCore/) (server).
+
+Also fixed an issue where ArrowStreamWriter wasn't writing schema metadata to Arrow streams.
+
+## Julia notes
+
+This is the first release to officially include
+[an implementation](https://github.com/apache/arrow/tree/master/julia/Arrow)
+for the Julia language. The pure Julia implementation includes support
+for [wide coverage of the format specification](https://arrow.apache.org/docs/status.html).
+Additional details can be found in the
+[julialang.org blog post](https://julialang.org/blog/2021/01/arrow/).
+
+## Python notes
+
+Support for Python 3.9 was added (ARROW-10224), and support for Python 3.5
+was removed (ARROW-5679).
+
+Support for building manylinux1 packages has been removed (ARROW-11212).
+PyArrow continues to be available as manylinux2010 and manylinux2014 wheels.
+
+The minimum required version of NumPy is now 1.16.6. Note that when upgrading
+NumPy to 1.20, you also need to upgrade pyarrow to 3.0.0 to ensure compatibility,
+as this pyarrow release fixed a compatibility issue with NumPy 1.20 (ARROW-10833).
+
+Compute functions are now automatically exported from C++ to the `pyarrow.compute`
+module, and they have docstrings matching their C++ definition.
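+
+For example (a sketch using the existing `min_max` function):
+
+```python
+import pyarrow.compute as pc
+
+# Each exported compute function carries a docstring generated from C++.
+help(pc.min_max)
+```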
+
+An `iter_batches()` method is now available for reading a Parquet file iteratively
+(ARROW-7800).
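+
+A minimal sketch (the file name is hypothetical):
+
+```python
+import pyarrow as pa
+import pyarrow.parquet as pq
+
+pq.write_table(pa.table({"x": range(1_000_000)}), "big.parquet")
+
+pf = pq.ParquetFile("big.parquet")
+for batch in pf.iter_batches(batch_size=65_536):
+    print(batch.num_rows)  # process one RecordBatch at a time
+```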
+
+Alternate memory pools (such as mimalloc, jemalloc or the C malloc-based memory
+pool) are now available from Python (ARROW-11049).
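+
+For example (a sketch; whether jemalloc or mimalloc is available depends on
+how the package was built):
+
+```python
+import pyarrow as pa
+
+# Switch to the plain C malloc-based pool for subsequent allocations;
+# jemalloc_memory_pool() and mimalloc_memory_pool() work the same way when
+# those allocators are compiled in.
+pa.set_memory_pool(pa.system_memory_pool())
+```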
+
+Fixed a potential deadlock when importing pandas from several threads (ARROW-10519).
+
+See the C++ notes above for additional details.
+
+## R notes
+
+This release contains new features for the Flight RPC wrapper, better
+support for saving R metadata (including `sf` spatial data) to Feather and
+Parquet files, several significant improvements to speed and memory management,
+and many other enhancements.
+
+For more on what’s in the 3.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+### Ruby
+
+In the Ruby bindings, 256-bit decimal support and `Arrow::FixedBinaryArrayBuilder` have been added, mirroring the C GLib additions below.
+
+### C GLib
+
+Version 3.0.0 of C GLib includes many new features.
+
+Chunked arrays, record batches, and tables now support a `sort_indices` function, as arrays already did.
+These functions, including the array version, now accept sort options.
+`garrow_array_sort_to_indices` has been renamed to `garrow_array_sort_indices`, and the previous name has been deprecated.
+
+`GArrowField` supports functions to handle metadata.
+`GArrowSchema` supports `garrow_schema_has_metadata()` function.
+
+`GArrowArrayBuilder` now supports appending a single null, multiple nulls, a single empty value, and multiple empty values.
+`GArrowFixedSizedBinaryArrayBuilder` is newly supported.
+
+256-bit decimal and extension types are newly supported.
+The filesystem module now supports the Mock, HDFS, and S3 file systems.
+The dataset module now supports the CSV, IPC, and Parquet file formats.
+
+## Rust notes
+
+### Core Arrow Crate
+
+The development of the arrow crate was focused on four main aspects:
+
+- Make the crate usable in stable Rust
+- Bug fixing and removal of `unsafe` code
+- Extend functionality to keep up with the specification
+- Increase performance of existing kernels
+
+#### Stable Rust
+
+Possibly the biggest news for this release is that all project crates, including arrow, parquet, and datafusion,
+now build with stable Rust by default. Nightly / unstable Rust is still required when enabling the SIMD feature.
+
+#### Parquet Arrow writer
+
+The Parquet writer for Arrow arrays is now available, allowing Rust programs to easily read and write Parquet
+files and making it easier to integrate with the overall Arrow ecosystem. The reader and writer include both basic
+and nested type support (List, Dictionary, Struct).
+
+#### First Class Arrow Flight IPC Support
+
+In this release, the Arrow Flight IPC implementation in Rust became fully featured enough to participate in the regular
+cross-language integration tests, ensuring that Rust applications written using Arrow can interoperate with the rest
+of the ecosystem.
+
+#### Performance
+
+There have been numerous performance improvements in this release across the board. This includes both kernel
+operations, such as `take`, `filter`, and `cast`, as well as more fundamental parts such as bitwise comparison
+and reading and writing to CSV.
+
+#### Increased Data Type Support
+
+New DataTypes:
+- Decimal data type for fixed-width decimal values
+
+Improved operation support for nested structures, dictionaries, and lists (filter, take, etc.).
+
+#### Other improvements
+
+- Added support for Date and time on FFI
+- Added support for Binary type on FFI
+- Added support for i64-sized arrays to the `take` kernel
+- Support for the i128 Decimal Type
+- Added support to cast string to date
+- Added API to create arrays out of existing arrays (e.g. for join, merge-sort, concatenate)
+- The simd feature is now also available on aarch64
+
+#### API Changes
+
+- BooleanArray is no longer a PrimitiveArray
+- ArrowNativeType no longer includes bool since Arrow's boolean type is represented using bitpacking
+- Several Buffer methods are now infallible instead of returning a Result
+- DataType::List now contains a Field to track metadata about the contained elements
+- The PrimitiveArray::raw_values, values_slice and values methods were replaced by a single values method returning a slice
+- Buffer::data and raw_data were renamed to as_slice and as_ptr
+- MutableBuffer::data_mut and freeze were renamed to as_slice_mut and into, respectively, to be more consistent with the stdlib naming conventions
+- The generic type parameter for BufferBuilder was changed from ArrowPrimitiveType to ArrowNativeType
+
+### DataFusion
+
+#### SQL
+
+In this release, we clarified that DataFusion will standardize on the PostgreSQL SQL dialect.
+
+New SQL support:
+- JOIN, LEFT JOIN, RIGHT JOIN
+- COUNT DISTINCT
+- CASE WHEN
+- USING
+- BETWEEN
+- IS IN
+- Nested SELECT statements
+- Nested expressions in aggregations
+- LOWER(), UPPER(), TRIM()
+- NULLIF()
+- SHA224(), SHA256(), SHA384(), SHA512()
+- DATE_TRUNC()
+
+#### Performance
+
+There have been numerous performance improvements in this release:
+- Optimizations for JOINs such as using vectorized hashing.
+- We have started adding statistics and cost-based optimizations; for example, the smaller side of a join is chosen
+  as the build side when possible.
+- Improved parallelism when reading partitioned Parquet data sources
+- Concurrent writes of CSV and Parquet partitions to file
+
+### Parquet Crate
+
+The Parquet crate has the following improvements:
+
+- Nested reading
+- Support to write booleans
+- Add support to write temporal types
+
+### Roadmap for 4.0.0
+
+We have also started building up a shared community roadmap for 4.0: [Apache Arrow: Crowd Sourced Rust Roadmap for
+Arrow 4.0, January 2021][6].
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%203.0.0
+[2]: {{ site.baseurl }}/release/3.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/3.0.0.html
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://issues.apache.org/jira/browse/ARROW-11329?jql=project%20%3D%20ARROW%20AND%20status%20not%20in%20%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20and%20fixVersion%20%3D%203.0.0%20AND%20text%20~%20rust%20ORDER%20BY%20created%20DESC
+[6]: https://docs.google.com/document/d/1qspsOM_dknOxJKdGvKbC1aoVoO0M3i6x1CIo58mmN2Y/edit#heading=h.kstb571j5g5j