You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by ko...@apache.org on 2022/05/16 01:12:44 UTC

[arrow-site] branch master updated: Version 8.0.0 release blog post (#207)

This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new ac19846be9 Version 8.0.0 release blog post (#207)
ac19846be9 is described below

commit ac19846be9f1ffed886d81033e9a4312c50322af
Author: Raúl Cumplido <ra...@gmail.com>
AuthorDate: Mon May 16 03:12:40 2022 +0200

    Version 8.0.0 release blog post (#207)
    
    Co-authored-by: Raúl Cumplido <ra...@gmail.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Matt Topol <zo...@gmail.com>
    Co-authored-by: Kenta Murata <39...@users.noreply.github.com>
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    Co-authored-by: Eric Erhardt <er...@microsoft.com>
    Co-authored-by: Sutou Kouhei <ko...@clear-code.com>
---
 _posts/2022-05-15-8.0.0-release.md | 264 +++++++++++++++++++++++++++++++++++++
 1 file changed, 264 insertions(+)

diff --git a/_posts/2022-05-15-8.0.0-release.md b/_posts/2022-05-15-8.0.0-release.md
new file mode 100644
index 0000000000..e85b87aed7
--- /dev/null
+++ b/_posts/2022-05-15-8.0.0-release.md
@@ -0,0 +1,264 @@
+---
+layout: post
+title: "Apache Arrow 8.0.0 Release"
+date: "2022-05-15 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 8.0.0 release. This covers
+over 3 months of development work and includes [**586 resolved issues**][1]
+from [**127 distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 7.0.0 release, Kun Liu, Raphael Taylor-Davies Xudong Wang, Yijie Shen
+and Liang-Chi Hsieh have been invited to be committers.
+Thanks for your contributions and participation in the project!
+
+## Arrow Flight RPC notes
+
+Flight SQL has been extended with a method to get type metadata ([ARROW-15313](https://issues.apache.org/jira/browse/ARROW-15313)) and column metadata in returned schemas ([ARROW-15314](https://issues.apache.org/jira/browse/ARROW-15314), [ARROW-16064](https://issues.apache.org/jira/browse/ARROW-16064)) New documentation is available describing Flight and Flight SQL, along with several Cookbook recipes ([ARROW-14698](https://issues.apache.org/jira/browse/ARROW-14698), [ARROW-16065](https: [...]
+
+The C++ libraries now support UCX as a network transport ([ARROW-15706](https://issues.apache.org/jira/browse/ARROW-15706)), and the APIs have been refactored to allow other transports to be implemented ([ARROW-15282](https://issues.apache.org/jira/browse/ARROW-15282)). UCX support is experimental and still subject to change. Many of the APIs have been refactored to use the `arrow::Result` type, and the original variants have been deprecated ([ARROW-16032](https://issues.apache.org/jira/ [...]
+
+## C++ notes
+
+### Compute
+
+Arrow C++ can now optionally build with support for the experimental
+[Substrait](https://substrait.io/) query representation format ([ARROW-15238](https://issues.apache.org/jira/browse/ARROW-15238)).
+
+A number of compute kernels operating on temporal data have been added:
+
+* addition, subtraction and multiplication between various temporal types;
+* a new "is_dst" function to compute whether the input timestamps fall within
+daylight saving time (DST);
+* a new "is_leap_year" function to compute whether the input timestamps fall
+within a leap year.
+
+It is possible to enable a timezone database on Windows at runtime
+by calling the `arrow::Initialize()` function ([ARROW-13168](https://issues.apache.org/jira/browse/ARROW-13168)).
+
+New hash aggregations are available: "hash_one" to return one value from each
+group ([ARROW-13993](https://issues.apache.org/jira/browse/ARROW-13993)), and "hash_list" to return all values from each group
+([ARROW-15152](https://issues.apache.org/jira/browse/ARROW-15152)).  Null columns are now supported on the sum, mean and product
+hash aggregates ([ARROW-15506](https://issues.apache.org/jira/browse/ARROW-15506)).  Also, it is now possible to execute hash
+"aggregations" with only key columns ([ARROW-15609](https://issues.apache.org/jira/browse/ARROW-15609)).
+
+A new compute function "map_lookup" allows looking up a given key in a map
+array ([ARROW-15089](https://issues.apache.org/jira/browse/ARROW-15089)).
+
+New compute functions "sqrt" and "sqrt_checked" allow extracting the square
+root of their input ([ARROW-15614](https://issues.apache.org/jira/browse/ARROW-15614)).
+
+Casting between two struct types is now possible, assuming the destination field
+names all exist in the source struct type ([ARROW-1888](https://issues.apache.org/jira/browse/ARROW-1888), [ARROW-15643](https://issues.apache.org/jira/browse/ARROW-15643)).
+
+Optional OpenTelemetry tracing has been added to kernel functions and execution
+plan nodes ([ARROW-15061](https://issues.apache.org/jira/browse/ARROW-15061)).
+
+The CMake build option `ARROW_ENGINE` has been renamed to `ARROW_SUBSTRAIT`,
+to better reflect its actual effect ([ARROW-16158](https://issues.apache.org/jira/browse/ARROW-16158)).
+
+### CSV
+
+It is now possible to change the field delimiter when writing a CSV file
+([ARROW-15672](https://issues.apache.org/jira/browse/ARROW-15672)).
+
+### Dataset
+
+The ORC dataset scanner now observes the batch size parameter ([ARROW-14153](https://issues.apache.org/jira/browse/ARROW-14153)).
+
+The dataset layer now supports filename-based partitioning, where the data
+files are all laid out in the dataset's base directory, their names prefixed
+with the partition values separated by underscore characters ([ARROW-14612](https://issues.apache.org/jira/browse/ARROW-14612)).
+
+Optional OpenTelemetry tracing has been added to the dataset scanner ([ARROW-15067](https://issues.apache.org/jira/browse/ARROW-15067)).
+
+### Filesystem
+
+It is possible to instantiate a Google Cloud Storage (GCS) filesystem
+from a URI, making GCS implicitly usable in the datasets layer ([ARROW-14893](https://issues.apache.org/jira/browse/ARROW-14893)).
+Recognized URI schemes are `gs` and `gcs`.
+
+`FileSystem::DeleteDirContents` can now optionally succeed when the directory
+doesn't exist ([ARROW-16159](https://issues.apache.org/jira/browse/ARROW-16159)).
+
+### IO
+
+It is possible to override the number of IO threads using the
+environment variable `ARROW_IO_THREADS` ([ARROW-15941](https://issues.apache.org/jira/browse/ARROW-15941)).
+
+### IPC
+
+The IPC file reader and writer now allow accessing the custom metadata
+associated with record batches ([ARROW-16131](https://issues.apache.org/jira/browse/ARROW-16131)).
+
+### Miscellaneous
+
+It is possible to enable lightweight memory checks on the standard memory pools
+using a dedicated environment variable `ARROW_DEBUG_MEMORY_POOL` ([ARROW-15550](https://issues.apache.org/jira/browse/ARROW-15550)).
+These checks are not a replacement for sophisticated checkers such as Address
+Sanitizer or Valgrind, but might come up useful if those tools are not
+available.
+
+Temporal data is now validated when doing full array validation ([ARROW-10924](https://issues.apache.org/jira/browse/ARROW-10924)).
+The validation catches values not matching the specification (for example,
+a time value being outside of the span of one day).
+
+The GDB plugin now attempts to print the data of an array, in addition to its
+metadata ([ARROW-15389](https://issues.apache.org/jira/browse/ARROW-15389)).  This only works for primitive datatypes.
+
+Pretty-printing is now shorter and more customizable for nested datatypes
+([ARROW-14798](https://issues.apache.org/jira/browse/ARROW-14798)).
+
+## C# notes
+
+With [.NET Core 2.1 reaching end-of-life](https://devblogs.microsoft.com/dotnet/net-core-2-1-will-reach-end-of-support-on-august-21-2021) in August 2021, the Apache.Arrow library has been updated to target `netcoreapp3.1` and higher. It still supports `netstandard1.3`, so the library works on .NET Framework. But to get the best performance, using .NET Core 3.1, .NET 5, or later is recommended.
+
+## Go notes
+
+### Bug fixes
+
+* parquet_reader / parquet_schema no longer crash ([ARROW-15509](https://github.com/apache/arrow/pull/12303))
+* Base64 encoding of origin Arrow schema properly uses padding and decodes both with or without the padding in pqarrow.getOriginSchema ([ARROW-15544](https://github.com/apache/arrow/pull/12679))
+* ipc.Writer no longer includes unnecessary offsets when encoding sliced arrays ([ARROW-15715](https://github.com/apache/arrow/pull/12453))
+* Use base64.StdEncoding for Arrow Flight Basic Auth middleware for proper encoding/decoding ([ARROW-15772](https://github.com/apache/arrow/pull/12503))
+* Fix panic during concurrent compression of ipc body buffers due to negative WaitGroup counter ([ARROW-15792](https://github.com/apache/arrow/pull/12518))
+* Fix memory leak in pqarrow.NewColumnWriter with nested structures ([ARROW-15946](https://github.com/apache/arrow/pull/12641))
+* ipc.FileReader no longer leaks memory when using ZSTD compression ([ARROW-16163](https://github.com/apache/arrow/pull/12857))
+
+### Enhancements
+
+#### Flight RPC
+
+* *Breaking Change* gRPC version is updated and Flight Server creation has been simplified. Flight servers must now embed flight.BaseFlightServer ([ARROW-15418](https://github.com/apache/arrow/pull/12233))
+* You can now provide a full net.Listener as an alternative to just providing an address to bind to when creating a Flight server ([ARROW-16082](https://github.com/apache/arrow/pull/12768))
+
+#### Parquet
+
+* 'unpack_bool', sum_float64, and bitmap handling functions have been given optimized assembly implementation for Arm64 NEON ([ARROW-15440](https://github.com/apache/arrow/pull/12398), [ARROW-15742](https://github.com/apache/arrow/pull/12502), [ARROW-15995](https://github.com/apache/arrow/pull/12687))
+* Go Parquet handling has been simplified to only need io.ReadSeeker instead of a ReadAtSeeker interface ([ARROW-15963](https://github.com/apache/arrow/pull/12658))
+* BitSetRunReader and helper functions have been lifted to internal/bitutils package to share between arrow and parquet implementations ([ARROW-15950](https://github.com/apache/arrow/pull/12926))
+* Parquet Reader now properly obeys the buffer size read property for buffered streams ([ARROW-16187](https://github.com/apache/arrow/pull/12876))
+* parquet NewBufferedReader no longer panics ([ARROW-16283](https://github.com/apache/arrow/pull/12960))
+
+#### Arrow
+
+* Go Arrow library now supports Dictionary Arrays ([ARROW-3039](https://issues.apache.org/jira/browse/ARROW-3039), [ARROW-9378](https://issues.apache.org/jira/browse/ARROW-9378), [GH-12158](https://github.com/apache/arrow/pull/12158))
+* array.ArrayEqual, array.ArrayApproxEqual have been renamed to array.Equal and array.ApproxEqual. Aliases are provided to avoid breaking existing code which will be removed in v9. ([ARROW-5598](https://github.com/apache/arrow/pull/12877))
+* Custom cpu discovery package replaced with using golang.org/x/sys/cpu ([ARROW-16193](https://github.com/apache/arrow/pull/12764))
+* *Breaking Change* array.Interface, array.Record, array.Table, etc. were deprecated in v7 in favor of arrow.Array, arrow.Record, etc. The deprecated aliases have been removed in v8 ([ARROW-16192](https://github.com/apache/arrow/pull/12960))
+
+#### CI
+
+* staticcheck is now run as part of CI to lint the Go code ([ARROW-15296](https://github.com/apache/arrow/pull/12540))
+* Travis builds on Arm64 for Go are no longer allowed to fail ([ARROW-15400](https://github.com/apache/arrow/pull/12214))
+
+## Java notes
+
+* Java 17 is now supported as a target and has been added to tested platforms
+* When scanning datasets an `ArrowReader` is now returned, which makes easier to create `VectorSchemaRoot` from it.
+* Java Documentation had a overall improvement, with few sections added and most sections rewritten as more clear tutorials
+* Overall improvements to FlightSQL support in Java
+* [Java Cookbook](https://arrow.apache.org/cookbook/java/index.html) is now available
+
+## JavaScript notes
+
+* Tables now allow setting arbitrary symbols, which enables support for passing Arrow tables to Vega. [ARROW-16209](https://github.com/apache/arrow/pull/12907)
+* Arrow now supports `tableFromJSON` and struct vectors in `vectorFromArray`. [ARROW-16210](https://github.com/apache/arrow/pull/12908)
+* Fixed support for appending null children in a StructBuilder. [ARROW-15705](https://github.com/apache/arrow/pull/12451)
+
+## Python notes
+
+In general, the Python bindings benefit from improvements in the C++ library
+(e.g. new compute functions); see the C++ notes above for additional details.
+In addition:
+
+* Tables and Datasets now support the `join` operation to perform `left`, `right`, `full` joins of `inner` or `outer` types. The result of the join operation will be a new table ([ARROW-14293](https://issues.apache.org/jira/browse/ARROW-14293)). See https://arrow.apache.org/docs/dev/python/compute.html#table-and-dataset-joins for examples.
+* Additional legacy keywords and properties of the `ParquetDataset` class have been deprecated and will issue a warning, in favor of functionality based on the `pyarrow.dataset` functionality ([ARROW-16119](https://issues.apache.org/jira/browse/ARROW-16119)).
+* It is now possible to create references to nested fields in a Table or Dataset using `py.field("a", "b")` ([ARROW-11259](https://issues.apache.org/jira/browse/ARROW-11259)).
+* Docstrings in `Schema`, `ChunkedArray`, `Tensor`, `RecordBatch`, `parquet` and `Table` now include examples on how to use the methods and classes ([ARROW-15367](https://issues.apache.org/jira/browse/ARROW-15367)).
+* Reading and writing Parquet files now supports encryption ([ARROW-9947](https://issues.apache.org/jira/browse/ARROW-9947)). See the [docs](https://arrow.apache.org/docs/python/parquet.html#parquet-modular-encryption-columnar-encryption) for more details.
+* Support for `zoneinfo` (Python 3.9+) and `dateutil` timezones in conversion to Arrow data structures ([ARROW-5248](https://issues.apache.org/jira/browse/)).
+* Multiple bugfixes and segfaults resolved.
+
+## R notes
+
+This release includes:
+
+ * Support for over 20 additional `lubridate` and `base` date and time functions in Arrow dpylr queries,
+ * An API to allow external packages to define custom extension array types and custom conversions into Arrow types,
+ * Support for concatenating Arrays, RecordBatches, and Tables, including with `c()`, `rbind()` and `cbind()`.
+
+For more on what’s in the 8.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+### Ruby
+
+* Add support for `#values` of `MonthInterval` Type ([ARROW-15749](https://issues.apache.org/jira/browse/ARROW-15749))
+* Add support for `#raw_records` of `MonthInterval` type ([ARROW-15750](https://issues.apache.org/jira/browse/ARROW-15750))
+* Add support for `#values` of `DayTimeInterval` type ([ARROW-15885](https://issues.apache.org/jira/browse/ARROW-15885))
+* Add `DayTimeIntervalArrayBuilder` to support to make `DayTimeIntervalArray` by a Hash with `:day` and `:millisecond` keys ([ARROW-15918](https://issues.apache.org/jira/browse/ARROW-15918))
+* Add support for `#raw_records` of `DayTimeInterval` type ([ARROW-15886](https://issues.apache.org/jira/browse/ARROW-15886))
+* Add support for `#values` of `MonthDayNanoInterval` type ([ARROW-15924](https://issues.apache.org/jira/browse/ARROW-15924))
+    * Also add `MonthDayNanoIntervalArrayBuilder` to support to make `MonthDayNanoIntervalArray` by a Hash with `:month`, `:day`, and `:nanosecond` keys
+* Add support for `#raw_records` of `MonthDayNanoInterval` type ([ARROW-15925](https://issues.apache.org/jira/browse/ARROW-15925))
+* Add Ruby-ish interfaces for `Parquet::BooleanStatistics` ([ARROW-16251](https://issues.apache.org/jira/browse/ARROW-16251))
+
+### C GLib
+
+* Add `gaflight_client_close` ([ARROW-15487](https://issues.apache.org/jira/browse/ARROW-15487))
+* Add `GParquetFileMetadata` and `gparquet_arrow_file_reader_get_metadata` ([ARROW-16214](https://issues.apache.org/jira/browse/ARROW-16214))
+* Fix `GArrowGIOInputStream` so that all the data is completely read ([ARROW-15626](https://issues.apache.org/jira/browse/ARROW-15626))
+* Add `garrow_string_array_builder_append_string_len` and `garrow_large_string_array_builder_append_string_len` ([ARROW-15629](https://issues.apache.org/jira/browse/ARROW-15629))
+* Add `GParquetRowGroupMetadata` ([ARROW-16245](https://issues.apache.org/jira/browse/ARROW-16245))
+* Add `GParquetColumnChunkMetadata` ([ARROW-16250](https://issues.apache.org/jira/browse/ARROW-16250))
+* Add `GArrowGCSFileSystem` ([ARROW-16247](https://issues.apache.org/jira/browse/ARROW-16247))
+* Add `GParquetStatistics` and its family ([ARROW-16251](https://issues.apache.org/jira/browse/ARROW-16251))
+    * `GParquetBooleanStatistics`
+    * `GParquetInt32Statistics`
+    * `GParquetInt64Statistics`
+    * `GParquetFloatStatistics`
+    * `GParquetDoubleStatistics`
+    * `GParquetByteArrayStatistics`
+    * `GParquetFixedLengthByteArrayStatistics`
+* Add missing casts for `GArrowRoundMode` ([ARROW-16296](https://issues.apache.org/jira/browse/ARROW-16296))
+
+## Rust notes
+
+The Rust projects have moved to separate repositories outside the
+main Arrow monorepo. For notes on the 13.0.0 release of the Rust
+implementation, see the [Arrow Rust changelog][5].
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%208.0.0
+[2]: {{ site.baseurl }}/release/8.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/8.0.0.html#changelog
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://github.com/apache/arrow-rs/blob/13.0.0/CHANGELOG.md