Posted to commits@arrow.apache.org by ko...@apache.org on 2022/10/31 01:37:32 UTC

[arrow-site] branch master updated: [Website] Version 10.0.0 release blog post (#242)

This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 3f5a14ad38 [Website] Version 10.0.0 release blog post (#242)
3f5a14ad38 is described below

commit 3f5a14ad38dd839f98428f7f4abaa23397e07eff
Author: Raúl Cumplido <ra...@gmail.com>
AuthorDate: Mon Oct 31 02:37:27 2022 +0100

    [Website] Version 10.0.0 release blog post (#242)
    
    Co-authored-by: Eric Erhardt <er...@microsoft.com>
    Co-authored-by: David Li <li...@gmail.com>
    Co-authored-by: Matt Topol <zo...@gmail.com>
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Co-authored-by: Sutou Kouhei <ko...@cozmixng.org>
    Co-authored-by: Kenta Murata <39...@users.noreply.github.com>
    Co-authored-by: Larry White <lj...@gmail.com>
    Co-authored-by: Will Jones <wi...@gmail.com>
    Co-authored-by: Alenka Frim <Al...@users.noreply.github.com>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Neal Richardson <ne...@gmail.com>
---
 _posts/2022-10-31-10.0.0-release.md | 297 ++++++++++++++++++++++++++++++++++++
 1 file changed, 297 insertions(+)

diff --git a/_posts/2022-10-31-10.0.0-release.md b/_posts/2022-10-31-10.0.0-release.md
new file mode 100644
index 0000000000..74774947d0
--- /dev/null
+++ b/_posts/2022-10-31-10.0.0-release.md
@@ -0,0 +1,297 @@
+---
+layout: post
+title: "Apache Arrow 10.0.0 Release"
+date: "2022-10-31 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 10.0.0 release. This covers
+over 3 months of development work and includes [**473 resolved issues**][1]
+from [**100 distinct contributors**][2]. See the [Install Page](https://arrow.apache.org/install/)
+to learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 9.0.0 release, Yanghong Zhong, Remzi Yang, Dan Harris and
+Bogumił Kamiński have been invited to be committers.
+L.C. Hsieh, Weston Pace, Raphael Taylor-Davies, Nicola Crane and
+Jacob Quinn have joined the Project Management Committee (PMC).
+Thanks for your contributions and participation in the project!
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+
+A JDBC driver based on [Arrow Flight SQL](https://arrow.apache.org/docs/format/FlightSql.html) is now available, courtesy of a code donation from Dremio ([ARROW-7744](https://issues.apache.org/jira/browse/ARROW-7744)).
+Flight SQL is now supported in Go ([ARROW-17326](https://issues.apache.org/jira/browse/ARROW-17326)).
+Protocol definitions for transactions and [Substrait](https://substrait.io) plans were added to Flight SQL and are implemented in C++ and Java ([ARROW-17688](https://issues.apache.org/jira/browse/ARROW-17688)).
+General "best practices" documentation was added for C++ ([ARROW-17407](https://issues.apache.org/jira/browse/ARROW-17407)).
+The C++ implementation now has basic [OpenTelemetry](https://opentelemetry.io/) integration ([ARROW-14958](https://issues.apache.org/jira/browse/ARROW-14958)).
+
+## C++ notes
+
+### C++11 is no longer supported
+
+The Arrow C++ codebase has moved to C++17 as its language compatibility
+standard ([ARROW-17545](https://issues.apache.org/jira/browse/ARROW-17545)). This means Arrow C++, including its header files,
+now requires a C++17-compliant compiler and standard library to be used.
+Such compilers are widely available on most platforms.
+
+Compatibility backports of C++14 and C++17 features, such as `std::string_view`
+or `std::variant`, have been removed in favor of the standard library version
+of these APIs ([ARROW-17546](https://issues.apache.org/jira/browse/ARROW-17546)). This will also make integration of Arrow C++
+with other codebases easier.
+
+It is expected that the Arrow C++ codebase will be gradually modernized to use
+C++17 features in subsequent releases, when the need arises.
+
+### Plasma is deprecated
+
+The Plasma module is deprecated and will be removed in a future release.
+
+### Compute / Acero
+
+Extension types are now supported in hash joins ([ARROW-16695](https://issues.apache.org/jira/browse/ARROW-16695)).
+
+### Datasets
+
+The fragments of a dataset can now be iterated in an asynchronous fashion,
+using `Dataset::GetFragmentsAsync` ([ARROW-17318](https://issues.apache.org/jira/browse/ARROW-17318)).
+
+### Filesystems
+
+It is now possible to configure a timeout policy for S3 ([ARROW-16521](https://issues.apache.org/jira/browse/ARROW-16521)).
+
+Error messages for S3 have been improved to give more context about the
+error ([ARROW-17079](https://issues.apache.org/jira/browse/ARROW-17079)).
+
+`GetFileInfoGenerator` has been optimized for local filesystems, with
+dedicated options to tune chunking and readahead ([ARROW-17306](https://issues.apache.org/jira/browse/ARROW-17306)).
+
+### JSON
+
+Previously, the JSON reader could only read Decimal fields from JSON strings
+(i.e. quoted). It can now read Decimal fields from JSON numbers as well
+([ARROW-17847](https://issues.apache.org/jira/browse/ARROW-17847)).
+
+### Parquet
+
+Before Arrow 3.0.0, data pages version 2 were incorrectly written out, making
+them unreadable with spec-compliant readers. A compatibility fix has been
+introduced so that they can still be read with contemporary versions of
+Arrow ([ARROW-17100](https://issues.apache.org/jira/browse/ARROW-17100)).
+
+### Substrait
+
+The Substrait consumer, which allows Substrait plans to be executed by the
+Acero execution engine, has received some improvements:
+
+* Aggregations are now supported ([ARROW-15591](https://issues.apache.org/jira/browse/ARROW-15591)).
+
+* Conversion options have been added so that the level of compliance and
+  roundtrippability can be chosen when converting between Substrait and
+  Acero representations of a plan ([ARROW-16988](https://issues.apache.org/jira/browse/ARROW-16988)).
+
+* Support for many standard Substrait functions has been added
+  ([ARROW-15582](https://issues.apache.org/jira/browse/ARROW-15582), [ARROW-17523](https://issues.apache.org/jira/browse/ARROW-17523)).
+
+Some work has also been done in the reverse direction, to allow Acero execution
+plans to be serialized as Substrait plans ([ARROW-16855](https://issues.apache.org/jira/browse/ARROW-16855)).
+
+### Other changes
+
+Our CMake package files have been overhauled ([ARROW-12175](https://issues.apache.org/jira/browse/ARROW-12175)).  As a result,
+namespaced targets are now exported, such as `Arrow::arrow_shared`.
+Legacy (non-namespaced) names are still available, for example `arrow_shared`.
+
+Compiling in release mode now uses `-O2`, not `-O3`, by default ([ARROW-17436](https://issues.apache.org/jira/browse/ARROW-17436)).
+
+The RISC-V architecture is now recognized at build time ([ARROW-17440](https://issues.apache.org/jira/browse/ARROW-17440)).
+
+The PyArrow-specific C++ code was moved into the PyArrow source tree
+(see below, [ARROW-16340](https://issues.apache.org/jira/browse/ARROW-16340)). The `ARROW_PYTHON` CMake variable has been deprecated
+and will be removed in a later release; you should instead enable the necessary
+components explicitly ([ARROW-17868](https://issues.apache.org/jira/browse/ARROW-17868)).
+
+Some classes with an `Equals` method now also support `operator==` ([ARROW-6772](https://issues.apache.org/jira/browse/ARROW-6772)).
+It was decided to only do this when equality is computationally cheap (i.e.
+not on data collections such as Array, RecordBatch...).
+
+## C# notes
+
+#### Bug Fixes
+
+* Fixed DecimalArray incorrectly appending very large and very small values. ([ARROW-17223](https://github.com/apache/arrow/pull/13732))
+
+## Gandiva notes
+
+Gandiva has been migrated to use LLVM opaque pointer types, as typed
+pointers had been deprecated ([ARROW-17790](https://issues.apache.org/jira/browse/ARROW-17790)).
+
+## Go notes
+
+* A new CI job has been added to run all of the tests with the `-asan` option using go1.18 ([ARROW-17324](https://issues.apache.org/jira/browse/ARROW-17324))
+* Go now passes all integration tests on data types and IPC handling.
+* The Go Arrow and Parquet packages now require go1.17+ ([ARROW-17646](https://issues.apache.org/jira/browse/ARROW-17646))
+
+### Compute
+
+The compute package that was importable via `github.com/apache/arrow/go/v9/arrow/compute` is now a separate module that requires go1.18+. Only the compute module has this requirement; the rest of the packages still work under go1.17 ([ARROW-17456](https://issues.apache.org/jira/browse/ARROW-17456)).
+
+Scalar and Vector kernel infrastructure for compute operations has been implemented, providing the following functionality:
+
+* Casting Arrow Arrays from one type to another ([ARROW-17454](https://issues.apache.org/jira/browse/ARROW-17454))
+* Using Filter and Take functions on an Arrow Array, Record Batch, or Table ([ARROW-17669](https://issues.apache.org/jira/browse/ARROW-17669))
+
+### Arrow
+
+* Sparse and Dense Union Arrays have been implemented along with appropriate builders and data type support, including IPC reading and writing ([ARROW-3678](https://issues.apache.org/jira/browse/ARROW-3678), [ARROW-17276](https://issues.apache.org/jira/browse/ARROW-17276)). This includes scalar types for Dense and Sparse union ([ARROW-17390](https://issues.apache.org/jira/browse/ARROW-17390))
+* LargeBinary, LargeString and LargeList arrays have been implemented for handling arrays with 64-bit offsets. This also included fixing a bug so that binary builders are correctly resettable. ([ARROW-8226](https://issues.apache.org/jira/browse/ARROW-8226), [ARROW-17275](https://issues.apache.org/jira/browse/ARROW-17275))
+* Support for Decimal256 arrays has been implemented ([ARROW-10600](https://issues.apache.org/jira/browse/ARROW-10600))
+* Automatic Endianness Conversion for non-native endianness is now an option for IPC streams ([ARROW-17219](https://issues.apache.org/jira/browse/ARROW-17219))
+* CSV Writer now supports Timestamp, Date32 and Date64 types ([ARROW-17273](https://issues.apache.org/jira/browse/ARROW-17273))
+* CSV Writer now supports custom formatting for boolean values ([ARROW-17277](https://issues.apache.org/jira/browse/ARROW-17277))
+* The Go Arrow Library now provides a FlightSQL client and server implementation ([ARROW-17326](https://issues.apache.org/jira/browse/ARROW-17326)). An example server implementation is provided for a FlightSQL server using SQLite ([ARROW-17359](https://issues.apache.org/jira/browse/ARROW-17359))
+* CSV Reader now supports schema type inference via NewInferringReader, along with options for specifying the type of some columns and skipping columns ([ARROW-17778](https://issues.apache.org/jira/browse/ARROW-17778))
+
+### Parquet
+
+* RowGroupReader.Column(index int) no longer panics if provided an invalid column index. Instead, its signature now returns (PageReader, error), like other methods in the codebase. ([ARROW-17274](https://issues.apache.org/jira/browse/ARROW-17274))
+* Bitpacking and other internal required implementations for ppc64le have been added for the Parquet library ([ARROW-17372](https://issues.apache.org/jira/browse/ARROW-17372))
+* A bug has been fixed that caused inconsistent row information data from a table written by Athena ([ARROW-17453](https://issues.apache.org/jira/browse/ARROW-17453))
+* Fixed a bug that caused panics when writing a Nullable List of Structs ([ARROW-17169](https://issues.apache.org/jira/browse/ARROW-17169))
+* Key Value metadata in an Arrow Schema will be propagated to the Parquet file when using pqarrow even when not using StoreSchema ([ARROW-17627](https://issues.apache.org/jira/browse/ARROW-17627))
+* A memory leak when using statistics on ByteArray columns has been fixed ([ARROW-17573](https://issues.apache.org/jira/browse/ARROW-17573))
+
+
+## Java notes
+This release includes new features, enhancements, and bug fixes, along with documentation updates and many improvements to build processes and project infrastructure. Selected highlights can be found below.
+
+#### New Features and Enhancements
+
+- JDBC Driver for Arrow Flight SQL ([13800](https://github.com/apache/arrow/pull/13800))
+- Initial implementation of immutable Table API ([14316](https://github.com/apache/arrow/pull/14316))
+- Substrait, transaction, cancellation for Flight SQL ([13492](https://github.com/apache/arrow/pull/13492))
+- Read Arrow IPC, CSV, and ORC files by NativeDatasetFactory ([13811](https://github.com/apache/arrow/pull/13811), [13973](https://github.com/apache/arrow/pull/13973), [14182](https://github.com/apache/arrow/pull/14182))
+- Add utility to bind Arrow data to JDBC parameters ([13589](https://github.com/apache/arrow/pull/13589))
+
+#### Build enhancements
+
+- Add Windows build script that produces DLLs ([14203](https://github.com/apache/arrow/pull/14203))
+- C Data Interface and Dataset libraries can now be built with mvn commands ([13881](https://github.com/apache/arrow/pull/13881), [13889](https://github.com/apache/arrow/pull/13889))
+
+#### Deprecations
+
+- Java Plasma JNI bindings have been deprecated ([14262](https://github.com/apache/arrow/pull/14262))
+## JavaScript notes
+- No major changes.
+
+## Python notes
+
+Compatibility notes:
+
+* Some APIs that had been deprecated since at least pyarrow 1.0.0 have been removed,
+  including the `RecordBatchReader.get_next_batch` method and the `DataType.num_children`
+  attribute ([ARROW-17649](https://issues.apache.org/jira/browse/ARROW-17649)).
+* When writing to Arrow IPC file format with `pyarrow.dataset.write_dataset` using
+  `format="ipc"` or `format="arrow"`, the default extension for the resulting files is
+  now `.arrow` instead of `.feather`. You can still use `format="feather"` to
+  write identical files with the `.feather` extension
+  ([ARROW-17089](https://issues.apache.org/jira/browse/ARROW-17089)).
+
+New features and improvements:
+
+* Filters in `pq.read_table()` can be passed as an expression in addition to the legacy
+  list of tuples. For example, `filters=pc.field("col") < 4` is equivalent to
+  `filters=[("col", "<", 4)]`
+  ([ARROW-17483](https://issues.apache.org/jira/browse/ARROW-17483)).
+* The `batch_readahead` and `fragment_readahead` arguments for scanning Datasets are
+  exposed in Python ([ARROW-17299](https://issues.apache.org/jira/browse/ARROW-17299)).
+* ExtensionArrays can now be created from a storage array through the `pa.array(..)`
+  constructor ([ARROW-17834](https://issues.apache.org/jira/browse/ARROW-17834)).
+* Converting ListArrays containing ExtensionArray values to numpy or pandas works by
+  falling back to the storage array
+  ([ARROW-17813](https://issues.apache.org/jira/browse/ARROW-17813)).
+* The `pyarrow.substrait.run_query()` function gained a `table_provider` keyword to run
+  the query against in-memory tables
+  ([ARROW-17521](https://issues.apache.org/jira/browse/ARROW-17521)).
+* The `StructType` class gained a `field()` method to retrieve a child field
+  ([ARROW-17131](https://issues.apache.org/jira/browse/ARROW-17131)).
+* Casting Tables to a new schema now honors the nullability flag in the target schema
+  ([ARROW-16651](https://issues.apache.org/jira/browse/ARROW-16651)).
+* Parquet files are now explicitly closed after reading
+  ([ARROW-13763](https://issues.apache.org/jira/browse/ARROW-13763)).
+
+Further, the Python bindings benefit from improvements in the C++ library
+(e.g. new compute functions); see the C++ notes above for additional details.
+
+**Build notes**
+
+The PyArrow-specific C++ code, previously part of the Arrow C++ codebase, is now integrated
+into PyArrow. The tests are run automatically as part of the PyArrow test suite. See
+[ARROW-16340](https://issues.apache.org/jira/browse/ARROW-16340),
+[ARROW-17122](https://issues.apache.org/jira/browse/ARROW-17122) and
+[PyArrow C++ API notes](https://arrow.apache.org/docs/dev/python/integration/extending.html#c-api).
+
+The build process is generally not affected by the change, but the `ARROW_PYTHON` CMake
+variable has been deprecated and will be removed in a later release; you should instead
+enable the necessary components explicitly
+([ARROW-17868](https://issues.apache.org/jira/browse/ARROW-17868)).
+
+## R notes
+Many improvements to Arrow dplyr queries are added in this version, including:
+
+ * `dplyr::across()` can be used to apply the same computation across multiple columns;
+ * long-running queries can now be cancelled;
+ * the data source file name can be added as a column when reading multi-file datasets with `add_filename()`;
+ * joins now support extension arrays;
+ * and all supported Arrow dplyr functions are now documented on the [R documentation site][6].
+
+For more on what’s in the 10.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+### Ruby
+
+- Plasma binding has been deprecated ([ARROW-17864](https://issues.apache.org/jira/browse/ARROW-17864))
+
+### C GLib
+
+- Plasma binding has been deprecated ([ARROW-17862](https://issues.apache.org/jira/browse/ARROW-17862))
+
+## Rust notes
+
+The Rust projects have moved to separate repositories outside the
+main Arrow monorepo. For notes on the latest release of the Rust
+implementation, see the latest [Arrow Rust changelog][5].
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%2010.0.0
+[2]: {{ site.baseurl }}/release/10.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/10.0.0.html#changelog
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://github.com/apache/arrow-rs/tags
+[6]: https://arrow.apache.org/docs/r/reference/acero.html