Posted to commits@arrow.apache.org by ra...@apache.org on 2023/05/08 10:31:44 UTC

[arrow-site] branch main updated: [Website] Version 12.0.0 blog post (#346)

This is an automated email from the ASF dual-hosted git repository.

raulcd pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/main by this push:
     new 47f69206be7 [Website] Version 12.0.0 blog post (#346)
47f69206be7 is described below

commit 47f69206be7fb50120a224409bb411624158e0f1
Author: Raúl Cumplido <ra...@gmail.com>
AuthorDate: Mon May 8 12:31:38 2023 +0200

    [Website] Version 12.0.0 blog post (#346)
    
    PR to start adding the blog post information for the Release 12.0.0
    
    The 12.0.0 milestone with all the GitHub closed issues can be found
    here:
    https://github.com/apache/arrow/milestone/51?closed=1
    
    ---------
    
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    Co-authored-by: Sutou Kouhei <ko...@clear-code.com>
    Co-authored-by: David Li <li...@gmail.com>
    Co-authored-by: Matt Topol <zo...@gmail.com>
    Co-authored-by: Dewey Dunnington <de...@dunnington.ca>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Alenka Frim <Al...@users.noreply.github.com>
    Co-authored-by: Weston Pace <we...@gmail.com>
---
 _posts/2023-05-02-12.0.0-release.md | 289 ++++++++++++++++++++++++++++++++++++
 1 file changed, 289 insertions(+)

diff --git a/_posts/2023-05-02-12.0.0-release.md b/_posts/2023-05-02-12.0.0-release.md
new file mode 100644
index 00000000000..ea2f63325e0
--- /dev/null
+++ b/_posts/2023-05-02-12.0.0-release.md
@@ -0,0 +1,289 @@
+---
+layout: post
+title: "Apache Arrow 12.0.0 Release"
+date: "2023-05-02 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 12.0.0 release. This covers
+over 3 months of development work and includes [**476 resolved issues**][1]
+with [**531 commits from 97 distinct contributors**][2].
+See the [Install Page](https://arrow.apache.org/install/)
+to learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 11.0.0 release, Wang Mingming, Mustafa Akur and Ruihang Xia
+have been invited to be committers.
+Will Jones has joined the Project Management Committee (PMC).
+
+Thanks for your contributions and participation in the project!
+
+## Columnar Format Notes
+
+A first "canonical" extension type has been formalized: `arrow.fixed_shape_tensor` to
+represent an Array where each slot contains a tensor, with all tensors having the same
+dimension and shape, [GH-33923](https://github.com/apache/arrow/issues/33923).
+This is based on a Fixed-Size List layout as storage array
+(<https://arrow.apache.org/docs/dev/format/CanonicalExtensions.html#fixed-shape-tensor-extension>).
+
+## Arrow Flight RPC notes
+
+The JDBC driver for Arrow Flight SQL has had some bugfixes, and has been refactored into a core library (which is not distributed as an uberjar with shaded names) and a driver (which is distributed as an uberjar).
+
+The Java server builder API now offers easier access to the underlying gRPC builder.
+
+Go now implements the Flight SQL extensions for Substrait and transaction support.
+
+## Plasma notes
+
+Plasma has been deprecated since 10.0.0 and is removed in this
+release. [GH-33243][GH-33243]
+
+## C++ notes
+
+* Run-End Encoded Arrays have been implemented and are accessible ([GH-32104](https://github.com/apache/arrow/issues/32104))
+* The FixedShapeTensor Logical value type has been implemented using ExtensionType ([GH-15483](https://github.com/apache/arrow/issues/15483), [GH-34796](https://github.com/apache/arrow/issues/34796))
+
+### Compute
+
+* New kernel to convert timestamp with timezone to wall time ([GH-33143](https://github.com/apache/arrow/issues/33143))
+* Cast kernels are now built into libarrow by default ([GH-34388](https://github.com/apache/arrow/issues/34388))
+
+### Acero
+
+* Acero has been moved out of libarrow into its own shared library, allowing for smaller builds of the core libarrow ([GH-15280](https://github.com/apache/arrow/issues/15280))
+* Exec nodes now can have a concept of "ordering" and will reject non-sensible plans ([GH-34136](https://github.com/apache/arrow/issues/34136))
+* New exec nodes: "pivot_longer" ([GH-34266](https://github.com/apache/arrow/issues/34266)), "order_by" ([GH-34248](https://github.com/apache/arrow/issues/34248)) and "fetch" ([GH-34059](https://github.com/apache/arrow/issues/34059))
+* _Breaking Change:_ Reorder output fields of "group_by" node so that keys/segment keys come before aggregates ([GH-33616](https://github.com/apache/arrow/issues/33616))
+
+### Substrait
+
+* Add support for the `round` function [GH-33588](https://github.com/apache/arrow/issues/33588)
+* Add support for the `cast` expression element [GH-31910](https://github.com/apache/arrow/issues/31910)
+* Added API reference documentation [GH-34011](https://github.com/apache/arrow/issues/34011)
+* Added an extension relation to support segmented aggregation [GH-34626](https://github.com/apache/arrow/issues/34626)
+* The output of the aggregate relation now conforms to the spec [GH-34786](https://github.com/apache/arrow/issues/34786)
+
+### Parquet
+
+* Added support for DeltaLengthByteArray encoding to the Parquet writer ([GH-33024](https://github.com/apache/arrow/issues/33024))
+* NaNs are correctly handled now for Parquet predicate push-downs ([GH-18481](https://github.com/apache/arrow/issues/18481))
+* Added support for reading Parquet page indexes ([GH-33596](https://github.com/apache/arrow/issues/33596)) and writing page indexes ([GH-34053](https://github.com/apache/arrow/issues/34053))
+* Parquet writer can write columns in parallel now ([GH-33655](https://github.com/apache/arrow/issues/33655))
+* Fixed incorrect number of rows in Parquet V2 page headers ([GH-34086](https://github.com/apache/arrow/issues/34086))
+* Fixed incorrect Parquet page null_count when stats are disabled ([GH-34326](https://github.com/apache/arrow/issues/34326))
+* Added support for reading BloomFilters to the Parquet Reader ([GH-34665](https://github.com/apache/arrow/issues/34665))
+* Parquet File-writer can now add additional key-value metadata after it has been opened ([GH-34888](https://github.com/apache/arrow/issues/34888))
+* _Breaking Change:_ The default row group size for the Arrow writer changed from 64Mi rows to 1Mi rows. [GH-34280](https://github.com/apache/arrow/issues/34280)
+
+### ORC
+
+* Added support for the union type in ORC writer ([GH-34262](https://github.com/apache/arrow/issues/34262))
+* Fixed ORC CHAR type mapping with Arrow ([GH-34823](https://github.com/apache/arrow/issues/34823))
+* Fixed timestamp type mapping between ORC and Arrow ([GH-34590](https://github.com/apache/arrow/issues/34590))
+
+### Datasets
+
+* Added support for reading JSON datasets ([GH-33209](https://github.com/apache/arrow/issues/33209))
+* Dataset writer now supports specifying a function callback to construct the file name in addition to the existing file name template ([GH-34565](https://github.com/apache/arrow/issues/34565))
+
+### Filesystems
+
+* GcsFileSystem::OpenInputFile avoids unnecessary downloads ([GH-34051](https://github.com/apache/arrow/issues/34051))
+
+### Other changes
+
+* Convenience Append(std::optional<T>...) methods have been added to array builders ([GH-14863](https://github.com/apache/arrow/issues/14863))
+* A deprecated OpenTelemetry header was removed from the Flight library ([GH-34417](https://github.com/apache/arrow/issues/34417))
+* Fixed crash in "take" kernels on ExtensionArrays with an underlying dictionary type ([GH-34619](https://github.com/apache/arrow/issues/34619))
+* Fixed bug where the C-Data bridge did not preserve nullability of map values on import ([GH-34983](https://github.com/apache/arrow/issues/34983))
+* Added support for EqualOptions to RecordBatch::Equals ([GH-34968](https://github.com/apache/arrow/issues/34968))
+* zstd dependency upgraded to v1.5.5 ([GH-34899](https://github.com/apache/arrow/issues/34899))
+* Improved handling of "logical" nulls such as with union and RunEndEncoded arrays ([GH-34361](https://github.com/apache/arrow/issues/34361))
+* Fixed incorrect handling of uncompressed body buffers in IPC reader, added IpcWriteOptions::min_space_savings for optional compression optimizations ([GH-15102](https://github.com/apache/arrow/issues/15102))
+
+## C# notes
+
+* Support added for importing / exporting schemas and types via the
+C data interface [GH-34737](https://github.com/apache/arrow/issues/34737)
+* Support added for the half-float data type [GH-25163](https://github.com/apache/arrow/issues/25163)
+* Schemas are now allowed to have multiple fields with the same name
+[GH-34076](https://github.com/apache/arrow/issues/34076)
+* Added support for reading compressed IPC files [GH-32240](https://github.com/apache/arrow/issues/32240)
+* Added `[]` operator to Schema [GH-34119](https://github.com/apache/arrow/issues/34119)
+
+## Go notes
+
+* Run-End Encoded Arrays have been added to the Golang implementation ([GH-32104](https://github.com/apache/arrow/issues/32104), [GH-32946](https://github.com/apache/arrow/issues/32946), [GH-20407](https://github.com/apache/arrow/issues/20407), [GH-32949](https://github.com/apache/arrow/issues/32949))
+* The SQLite Flight SQL Example has been improved and you can now `go get` a simple SQLite Flight SQL Server mainprog using `go get github.com/apache/arrow/go/v12/arrow/flight/flightsql/example/cmd/sqlite_flightsql_server` ([GH-33840](https://github.com/apache/arrow/issues/33840))
+* Fixed bug causing builds to fail with the `noasm` build tag ([GH-34044](https://github.com/apache/arrow/issues/34044)) and added a CI test run that uses the `noasm` tag ([GH-34055](https://github.com/apache/arrow/issues/34055))
+* Fixed issue that prevented building on riscv64-freebsd ([GH-34629](https://github.com/apache/arrow/issues/34629))
+* Fixed issue preventing building on 32-bit platforms ([GH-35133](https://github.com/apache/arrow/issues/35133))
+
+### Arrow
+
+* Fixed bug in C-Data handling of `ArrowArrayStream.get_next` when handling uninitialized `ArrowArrays` ([GH-33767](https://github.com/apache/arrow/issues/33767))
+* _Breaking Change:_ Added `Err()` method to `RecordReader` interface so that it can propagate errors ([GH-33789](https://github.com/apache/arrow/issues/33789))
+* Fixed potential panic in C-Data API for misusing an invalid handle ([GH-33864](https://github.com/apache/arrow/issues/33864))
+* A new cgo-based Allocator that does not depend on libarrow has been added to the memory package ([GH-33901](https://github.com/apache/arrow/issues/33901))
+* CSV Reader and Writer now support Extension type arrays ([GH-34334](https://github.com/apache/arrow/issues/34334))
+* Fixed bug preventing the reading of IPC streams/files with compression enabled but uncompressed buffers ([GH-34385](https://github.com/apache/arrow/issues/34385))
+* Added interface which can be added to an `ExtensionType` to allow Builders to be created via `NewBuilder`, enabling easy building of nested fields containing extension types ([GH-34453](https://github.com/apache/arrow/issues/34453))
+* Added utilities to perform Array diffing ([GH-34790](https://github.com/apache/arrow/issues/34790))
+* Added `SetColumn` method to `arrow.Record` ([GH-34832](https://github.com/apache/arrow/issues/34832))
+* Added `GetValue` method to `arrow.Metadata` ([GH-34855](https://github.com/apache/arrow/issues/34855))
+* Added `Pow` method to `decimal128.Num` and `decimal256.Num` ([GH-34863](https://github.com/apache/arrow/issues/34863))
+
+#### Flight
+
+* Fixed bug in `StreamChunks` for Flight SQL to correctly propagate to the gRPC client ([GH-33717](https://github.com/apache/arrow/issues/33717))
+* Fixed issue that prevented compatibility with gRPC < v1.45 ([GH-33734](https://github.com/apache/arrow/issues/33734))
+* Added support to bind a `RecordReader` for supplying parameters to a Flight SQL Prepared statement ([GH-33794](https://github.com/apache/arrow/issues/33794))
+* Prepared Statement methods for the FlightSQL client now allow gRPC Call Options ([GH-33867](https://github.com/apache/arrow/issues/33867))
+* FlightSQL Extensions have been implemented (for transactions and Substrait support) ([GH-33935](https://github.com/apache/arrow/issues/33935))
+* A driver compatible with `database/sql` for FlightSQL has been added ([GH-34332](https://github.com/apache/arrow/issues/34332))
+
+#### Compute
+
+* "run_end_encode" and "run_end_decode" functions added to compute functions ([GH-20408](https://github.com/apache/arrow/issues/20408))
+* "unique" function added ([GH-34171](https://github.com/apache/arrow/issues/34171))
+
+### Parquet
+
+* `pqarrow` pkg now handles DICTIONARY fields natively ([GH-33466](https://github.com/apache/arrow/issues/33466))
+* Fixed rare panic when writing list of 8 structs ([GH-33600](https://github.com/apache/arrow/issues/33600))
+* Added support for `pqarrow` pkg to write LargeString and LargeBinary types ([GH-33875](https://github.com/apache/arrow/issues/33875))
+* Fixed bug where `pqarrow.NewSchemaManifest` created the wrong field type for Array Object fields ([GH-34101](https://github.com/apache/arrow/issues/34101))
+* Added support to Parquet Writer for Extension type Arrays ([GH-34330](https://github.com/apache/arrow/issues/34330))
+
+## Java notes
+
+* Update Java JNI modules to consider Arrow ACERO [GH-34862](https://github.com/apache/arrow/issues/34862)
+* Ability to register additional gRPC services with FlightServer [GH-34778](https://github.com/apache/arrow/issues/34778)
+* Allow sending custom headers/properties through Arrow Flight SQL JDBC [GH-33874](https://github.com/apache/arrow/issues/33874)
+
+## JavaScript notes
+
+No changes.
+
+## Python notes
+
+Compatibility notes:
+
+* Plasma has been removed in this release ([GH-33243](https://github.com/apache/arrow/issues/33243)). In addition, the deprecated serialization module in PyArrow has been removed ([GH-29705](https://github.com/apache/arrow/issues/29705)). The IPC (Inter-Process Communication) functionality of pyarrow or the standard library pickle should be used instead.
+* The deprecated `use_async` keyword has been removed from the dataset module ([GH-30774](https://github.com/apache/arrow/issues/30774))
+* Minimum Cython version to build PyArrow from source has been raised to 0.29.31 ([GH-34933](https://github.com/apache/arrow/issues/34933)). In addition, PyArrow can now be compiled using Cython 3 ([GH-34564](https://github.com/apache/arrow/issues/34564)).
+
+New features:
+
+* A new `pyarrow.acero` module with initial bindings for the Acero execution engine has been added ([GH-33976](https://github.com/apache/arrow/issues/33976))
+* A new canonical extension type for fixed shape tensor data has been defined. This is exposed in PyArrow as the FixedShapeTensorType ([GH-34882](https://github.com/apache/arrow/issues/34882), [GH-34956](https://github.com/apache/arrow/issues/34956))
+* Run-End Encoded arrays binding has been implemented ([GH-34686](https://github.com/apache/arrow/issues/34686), [GH-34568](https://github.com/apache/arrow/issues/34568))
+* Method `is_nan` has been added to Array, ChunkedArray and Expression ([GH-34154](https://github.com/apache/arrow/issues/34154))
+* Dataframe interchange protocol has been implemented for RecordBatch ([GH-33926](https://github.com/apache/arrow/issues/33926))
+
+Other improvements:
+
+* Extension arrays can now be concatenated ([GH-31868](https://github.com/apache/arrow/issues/31868))
+* `get_partition_keys` helper function is implemented in the `dataset` module to access the partitioning field's key/value from the partition expression of a certain dataset fragment ([GH-33825](https://github.com/apache/arrow/issues/33825))
+* PyArrow Array objects can now be accepted by the `pa.array()` constructor ([GH-34411](https://github.com/apache/arrow/issues/34411))
+* The default row group size when writing parquet files has been changed ([GH-34280](https://github.com/apache/arrow/issues/34280))
+* RecordBatch has the `select()` method implemented ([GH-34359](https://github.com/apache/arrow/issues/34359))
+* New method `drop_column` on the `pyarrow.Table` supports passing a single column as a string ([GH-33377](https://github.com/apache/arrow/issues/33377))
+* User-defined tabular functions, which are user functions implemented in Python that return a stateful stream of tabular data, are now also supported ([GH-32916](https://github.com/apache/arrow/issues/32916))
+* [Arrow Archery tool](https://arrow.apache.org/docs/dev/developers/continuous_integration/archery.html) now includes linting of the Cython files ([GH-31905](https://github.com/apache/arrow/issues/31905))
+* _Breaking Change:_ Reorder output fields of "group_by" node so that keys/segment keys come before aggregates ([GH-33616](https://github.com/apache/arrow/issues/33616))
+
+Relevant bug fixes:
+
+* Acero can now detect and raise an error in case a join operation needs too many bytes of key data ([GH-34474](https://github.com/apache/arrow/issues/34474))
+* Fix for converting non-sequence object in `pa.array()` ([GH-34944](https://github.com/apache/arrow/issues/34944))
+* Fix erroneous table conversion to pandas if table includes an extension array that does not implement `to_pandas_dtype` ([GH-34906](https://github.com/apache/arrow/issues/34906))
+* Reading from a closed `ArrayStreamBatchReader` now returns invalid status instead of segfaulting ([GH-34165](https://github.com/apache/arrow/issues/34165))
+* `array()` now returns `pyarrow.Array` and not `pyarrow.ChunkedArray` for columns with `__arrow_array__` method and only one chunk so that the conversion of pandas dataframe with categorical column of dtype `string[pyarrow]` does not fail ([GH-33727](https://github.com/apache/arrow/issues/33727))
+* Custom type mapper in `to_pandas` now converts index dtypes together with column dtypes ([GH-34283](https://github.com/apache/arrow/issues/34283))
+
+## R notes
+
+* The `read_parquet()` and `read_feather()` functions can now accept URL
+  arguments.
+* The `json_credentials` argument in `GcsFileSystem$create()` now accepts
+  a file path containing the appropriate authentication token.
+* The `$options` member of `GcsFileSystem` objects can now be inspected.
+* The `read_csv_arrow()` and `read_json_arrow()` functions now accept literal text input wrapped in
+  `I()` to improve compatibility with `readr::read_csv()`.
+* Nested fields can now be accessed using `$` and `[[` in dplyr expressions.
+
+For more on what’s in the 12.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+* Flight SQL: Added support for authentication. [GH-34074][GH-34074]
+* Compute: Added `GArrowRankOptions`. [GH-34425][GH-34425]
+* Compute: Added `GArrowFilterOptions`. [GH-34650][GH-34650]
+* Compute: Added `GArrowIndexOptions`. [GH-15286][GH-15286]
+* Compute: Added `GArrowMatchSubstringOptions`. [GH-15285][GH-15285]
+* Added `GArrowDenseUnionArrayBuilder`. [GH-21429][GH-21429]
+* Added `GArrowSparseUnionArrayBuilder`. [GH-21430][GH-21430]
+
+### Ruby notes
+
+* Improved `Arrow::Table#join` API. [GH-15287][GH-15287]
+* Flight SQL: Added `ArrowFlight::RecordBatchReader#each`. [GH-15287][GH-15287]
+* Added `Arrow::DenseUnionArray#get_value`. [GH-34837][GH-34837]
+* Added `Arrow::SparseUnionArray#get_value`. [GH-34837][GH-34837]
+* Expression: Added support for
+  `table.slice {|slicer| slicer.column.match_substring(string)}` and
+  related shortcuts. [GH-34819][GH-34819] [GH-34951][GH-34951]
+* _Breaking change:_ `Arrow::Table#slice` with a filter removes null
+  records. [GH-34953][GH-34953]
+
+## Rust notes
+
+The Rust projects have moved to separate repositories outside the
+main Arrow monorepo. For notes on the latest release of the Rust
+implementation, see the latest [Arrow Rust changelog][5].
+
+[1]: https://github.com/apache/arrow/milestone/51?closed=1
+[2]: {{ site.baseurl }}/release/12.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/12.0.0.html#changelog
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://github.com/apache/arrow-rs/tags
+
+[GH-15285]: https://github.com/apache/arrow/issues/15285
+[GH-15286]: https://github.com/apache/arrow/issues/15286
+[GH-15287]: https://github.com/apache/arrow/issues/15287
+[GH-21429]: https://github.com/apache/arrow/issues/21429
+[GH-21430]: https://github.com/apache/arrow/issues/21430
+[GH-33243]: https://github.com/apache/arrow/issues/33243
+[GH-34819]: https://github.com/apache/arrow/issues/34819
+[GH-34837]: https://github.com/apache/arrow/issues/34837
+[GH-34074]: https://github.com/apache/arrow/issues/34074
+[GH-34425]: https://github.com/apache/arrow/issues/34425
+[GH-34650]: https://github.com/apache/arrow/issues/34650
+[GH-34951]: https://github.com/apache/arrow/issues/34951
+[GH-34953]: https://github.com/apache/arrow/issues/34953
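
The Columnar Format Notes in the post above describe `arrow.fixed_shape_tensor` as an array whose slots each hold a tensor of the same shape, stored on top of a Fixed-Size List layout. The following is a minimal pure-Python sketch of that idea, not Arrow code: it uses no Arrow API, the names `tensor_at` and `flat_values` are hypothetical, and it assumes row-major element order with no dimension permutation.

```python
from math import prod


def tensor_at(flat_values, shape, i):
    """Return the i-th tensor as nested lists.

    Hypothetical helper: `flat_values` stands in for the child values of a
    fixed-size list array whose list size is prod(shape). Row-major order
    is assumed here.
    """
    size = prod(shape)  # number of scalar elements per tensor slot
    chunk = flat_values[i * size:(i + 1) * size]

    def build(values, dims):
        # Innermost dimension: the values are the row itself.
        if len(dims) == 1:
            return list(values)
        # Otherwise split into dims[0] equal blocks and recurse.
        step = len(values) // dims[0]
        return [build(values[k * step:(k + 1) * step], dims[1:])
                for k in range(dims[0])]

    return build(chunk, list(shape))


# Two tensors of shape (2, 3) stored back to back in one flat buffer.
flat = list(range(12))
shape = (2, 3)
first = tensor_at(flat, shape, 0)
second = tensor_at(flat, shape, 1)
print(first)
print(second)
```

Because every slot has the same number of elements, locating tensor `i` only needs the multiplication `i * prod(shape)`; no per-slot offsets are stored, which is the property a fixed-size list layout provides.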