Posted to commits@arrow.apache.org by ko...@apache.org on 2022/08/16 07:08:01 UTC

[arrow-site] branch master updated: Version 9.0.0 release blog post (#227)

This is an automated email from the ASF dual-hosted git repository.

kou pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new 06412519eb Version 9.0.0 release blog post (#227)
06412519eb is described below

commit 06412519ebed0b74f6c8bc44532fc9293244edff
Author: Raúl Cumplido <ra...@gmail.com>
AuthorDate: Tue Aug 16 09:07:57 2022 +0200

    Version 9.0.0 release blog post (#227)
    
    Co-authored-by: David Li <li...@gmail.com>
    Co-authored-by: Eric Erhardt <er...@microsoft.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Co-authored-by: Matt Topol <zo...@gmail.com>
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    Co-authored-by: Larry White <lj...@gmail.com>
    Co-authored-by: Neal Richardson <ne...@gmail.com>
    Co-authored-by: Ian Cook <ia...@gmail.com>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
    Co-authored-by: Sutou Kouhei <ko...@clear-code.com>
    Co-authored-by: Weston Pace <we...@gmail.com>
---
 _posts/2022-08-16-9.0.0-release.md | 311 +++++++++++++++++++++++++++++++++++++
 1 file changed, 311 insertions(+)

diff --git a/_posts/2022-08-16-9.0.0-release.md b/_posts/2022-08-16-9.0.0-release.md
new file mode 100644
index 0000000000..f69e3edfc2
--- /dev/null
+++ b/_posts/2022-08-16-9.0.0-release.md
@@ -0,0 +1,311 @@
+---
+layout: post
+title: "Apache Arrow 9.0.0 Release"
+date: "2022-08-16 00:00:00"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 9.0.0 release. This covers
+over 3 months of development work and includes [**509 resolved issues**][1]
+from [**114 distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and cover only selected highlights
+of the release. Many other bug fixes and improvements have been made; see
+the [complete changelog][3] for details.
+
+## Community
+
+Since the 8.0.0 release, Dewey Dunnington, Alenka Frim and Rok Mihevc
+have been invited to be committers.
+Thanks for your contributions and participation in the project!
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+
+Arrow Flight is now available in macOS M1 Python wheels ([ARROW-16779](https://issues.apache.org/jira/browse/ARROW-16779)).
+Arrow Flight SQL is now buildable on Windows ([ARROW-16902](https://issues.apache.org/jira/browse/ARROW-16902)).
+Ruby now exposes more of the Flight and Flight SQL APIs (various JIRAs).
+
+## Linux packages notes
+
+AlmaLinux 9 is now supported. ([ARROW-16745](https://issues.apache.org/jira/browse/ARROW-16745))
+
+AmazonLinux 2 aarch64 is now supported. ([ARROW-16477](https://issues.apache.org/jira/browse/ARROW-16477))
+
+## C++ notes
+
+STL-like iteration is now provided over chunked arrays ([ARROW-602](https://issues.apache.org/jira/browse/ARROW-602)).
+
+### Compute
+
+The C++ compute and execution engine is now officially named "Acero", though
+its C++ namespaces have not changed.
+
+New lightweight data holder abstractions have been introduced in order
+to reduce the overhead of invoking compute functions and kernels, especially
+at the small data sizes desirable for efficient parallelization (typically
+L1- or L2-sized).  Specifically, the non-owning `ArraySpan` and `ExecSpan`
+structures have internally superseded the much heavier `ExecBatch`, which
+is still supported for compatibility at the API level
+([ARROW-16756](https://issues.apache.org/jira/browse/ARROW-16756), [ARROW-16824](https://issues.apache.org/jira/browse/ARROW-16824), [ARROW-16852](https://issues.apache.org/jira/browse/ARROW-16852)).
+
+In a similar vein, the `ValueDescr` class was removed and `ScalarKernel`
+implementations now always receive at least one non-scalar input, removing
+the special case where a `ScalarKernel` needs to output a scalar rather than
+an array. The higher-level compute APIs still allow executing a scalar function
+over all-scalar inputs, but those scalars are internally broadcast to
+1-element arrays so as to simplify kernel implementation ([ARROW-16757](https://issues.apache.org/jira/browse/ARROW-16757)).
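
The broadcasting idea described above can be illustrated with a small pure-Python sketch. This is not Arrow's C++ code: `add_kernel` and `call_scalar_function` are hypothetical names, and plain lists stand in for arrays. It only shows why promoting scalars to 1-element arrays lets a kernel drop its scalar-output special case.

```python
def add_kernel(left, right):
    # Array-only kernel (hypothetical): it never receives a bare scalar.
    # A length-1 input is repeated against the longer input.
    n = max(len(left), len(right))
    if len(left) not in (1, n) or len(right) not in (1, n):
        raise ValueError("input lengths do not match")
    return [left[i % len(left)] + right[i % len(right)] for i in range(n)]


def call_scalar_function(kernel, *args):
    # Higher-level entry point (hypothetical): promote each scalar argument
    # to a 1-element list so the kernel only has to handle arrays.
    all_scalar = not any(isinstance(a, list) for a in args)
    arrays = [a if isinstance(a, list) else [a] for a in args]
    result = kernel(*arrays)
    # Unwrap the single element only when every input was a scalar.
    return result[0] if all_scalar else result


scalar_result = call_scalar_function(add_kernel, 2, 3)
mixed_result = call_scalar_function(add_kernel, [1, 2, 3], 10)
```

In this sketch the caller still gets a scalar back for all-scalar inputs, while the kernel itself contains a single code path.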
+
+The hash join exec node now uses the CPU cache more efficiently and makes
+better use of available vectorization hardware. These performance improvements
+do not require additional configuration ([ARROW-14182](https://issues.apache.org/jira/browse/ARROW-14182)).
+
+Some plans containing a sequence of hash join operators will now use bloom
+filters to eliminate rows earlier in the plan, reducing the overall CPU
+cost of the plan ([ARROW-15498](https://issues.apache.org/jira/browse/ARROW-15498)).
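
As a rough illustration of why a Bloom filter reduces join cost, here is a minimal pure-Python sketch. It is not Acero's implementation: the `BloomFilter` class, its sizes, and `filtered_hash_join` are assumptions made for the example, and it covers a single join rather than a sequence of them. The filter is built from the build-side keys, and probe-side rows that cannot match are dropped before the hash table lookup.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive several bit positions per key by hashing with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_hash_join(build_rows, probe_rows, key):
    bloom = BloomFilter()
    table = {}
    for row in build_rows:
        bloom.add(row[key])
        table.setdefault(row[key], []).append(row)
    # Rows rejected by the filter cannot match, so they are dropped early.
    survivors = [row for row in probe_rows if bloom.might_contain(row[key])]
    joined = [{**b, **p} for p in survivors for b in table.get(p[key], [])]
    return joined, len(survivors)


build = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
probe = [{"id": i, "value": i * 10} for i in range(100)]
joined, kept = filtered_hash_join(build, probe, "id")
```

Because the filter has no false negatives, the join result is unchanged; only the number of rows reaching the hash lookup shrinks.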
+
+Timestamp comparison is now supported ([ARROW-16425](https://issues.apache.org/jira/browse/ARROW-16425)).
+
+A cumulative sum function is implemented over numeric inputs ([ARROW-13530](https://issues.apache.org/jira/browse/ARROW-13530)). Note that this is a vector
+function, so it cannot be used in an Acero ExecPlan.
+
+A rank vector kernel has been added ([ARROW-16234](https://issues.apache.org/jira/browse/ARROW-16234)).
+
+Temporal rounding functions received additional options to control how
+rounding is done ([ARROW-14821](https://issues.apache.org/jira/browse/ARROW-14821)).
+
+Improper computation of the "mode" function on boolean input was fixed
+([ARROW-17096](https://issues.apache.org/jira/browse/ARROW-17096)).
+
+Function registries can now be nested ([ARROW-16677](https://issues.apache.org/jira/browse/ARROW-16677)).
+
+### Dataset
+
+The `autogenerate_column_names` option for CSV reading is now handled correctly
+([ARROW-16436](https://issues.apache.org/jira/browse/ARROW-16436)).
+
+`InMemoryDataset::ReplaceSchema` now actually replaces the schema
+([ARROW-16085](https://issues.apache.org/jira/browse/ARROW-16085)).
+
+`FilenamePartitioning` now properly supports null values ([ARROW-16302](https://issues.apache.org/jira/browse/ARROW-16302)).
+
+### Filesystem
+
+A number of bug fixes and improvements were made to the Google Cloud Storage
+filesystem implementation ([ARROW-14892](https://issues.apache.org/jira/browse/ARROW-14892)).
+
+By default, the S3 filesystem implementation does not create or drop buckets
+anymore ([ARROW-15906](https://issues.apache.org/jira/browse/ARROW-15906)). This is a compatibility-breaking change intended
+to prevent user errors from having potentially catastrophic consequences.
+Options have been added to restore the previous behavior if necessary.
+
+### Parquet
+
+The default Parquet version is now 2.4 for writing, enabling use of
+more recent logical types by default ([ARROW-12203](https://issues.apache.org/jira/browse/ARROW-12203)).
+
+Non-nullable fields are now handled correctly by the Parquet reader
+([ARROW-16116](https://issues.apache.org/jira/browse/ARROW-16116)).
+
+Reading encrypted files should now be thread-safe ([ARROW-14114](https://issues.apache.org/jira/browse/ARROW-14114)).
+
+Statistics equality now works correctly with minmax ([ARROW-16487](https://issues.apache.org/jira/browse/ARROW-16487)).
+
+The minimum Thrift version required for building is now 0.13 ([ARROW-16721](https://issues.apache.org/jira/browse/ARROW-16721)).
+
+The Thrift deserialization limits can now be configured to accommodate
+data files with very large metadata ([ARROW-16546](https://issues.apache.org/jira/browse/ARROW-16546)).
+
+### Substrait
+
+The Substrait spec has been updated to 0.6.0 ([ARROW-16816](https://issues.apache.org/jira/browse/ARROW-16816)). In addition, a
+larger subset of the Substrait specification is now supported ([ARROW-15587](https://issues.apache.org/jira/browse/ARROW-15587),
+[ARROW-15590](https://issues.apache.org/jira/browse/ARROW-15590),
+[ARROW-15901](https://issues.apache.org/jira/browse/ARROW-15901),
+[ARROW-16657](https://issues.apache.org/jira/browse/ARROW-16657),
+[ARROW-15591](https://issues.apache.org/jira/browse/ARROW-15591)).
+
+## C# notes
+
+#### New Features
+
+* Added support for Time32Array and Time64Array ([ARROW-16660](https://github.com/apache/arrow/pull/13279))
+
+#### Bug Fixes
+
+* Fixed `TableFromRecordBatches` producing table columns with no data array. ([ARROW-13129](https://github.com/apache/arrow/pull/10562))
+* Fix intermittent test failures due to async memory management bug. ([ARROW-16978](https://github.com/apache/arrow/pull/13573))
+
+## Go notes
+
+### Security
+
+* Updated testify dependency to address CVE-2022-28948. ([ARROW-16759](https://issues.apache.org/jira/browse/ARROW-16759)) (This was also backported to previous versions and released as patch versions v6.0.2, v7.0.1, and v8.0.1)
+
+### Arrow
+
+#### New Features
+
+* Dictionary Scalars are now available ([ARROW-16323](https://issues.apache.org/jira/browse/ARROW-16323))
+* Introduced a DictionaryUnifier object along with functions for unifying Chunked Arrays and Tables ([ARROW-16324](https://issues.apache.org/jira/browse/ARROW-16324))
+* New CSV examples added to documentation to demonstrate error handling ([ARROW-16450](https://issues.apache.org/jira/browse/ARROW-16450))
+* CSV Reader now supports arrow.TimestampType ([ARROW-16504](https://issues.apache.org/jira/browse/ARROW-16504))
+* JSON parsing for Temporal Types now allows numeric values in addition to strings. Timezones are properly parsed if they exist in the string, and a function was added to retrieve a time.Location object from a TimestampType ([ARROW-16551](https://issues.apache.org/jira/browse/ARROW-16551))
+* New utilities added to decimal128 for rescaling and easy conversion to and from float32/float64 ([ARROW-16552](https://issues.apache.org/jira/browse/ARROW-16552))
+* The Arrow DataType interface now has a Layout method which returns the physical layout of the given datatype, such as the number of buffers, types, etc. This matches the behavior of the layout() methods in C++ for data types. ([ARROW-16556](https://issues.apache.org/jira/browse/ARROW-16556))
+* Added a SliceBuffer function to the memory package to allow better re-using of memory across buffer objects ([ARROW-16557](https://issues.apache.org/jira/browse/ARROW-16557))
+* Dictionary Arrays can now be concatenated using array.Concatenate ([ARROW-17095](https://issues.apache.org/jira/browse/ARROW-17095))
+
+#### Bug Fixes
+
+* ipc.FileReader now properly uses the memory.Allocator interface ([ARROW-16002](https://issues.apache.org/jira/browse/ARROW-16002))
+* Addressed issue with Integration tests between Go and Java ([ARROW-16441](https://issues.apache.org/jira/browse/ARROW-16441))
+* RecordBuilder.UnmarshalJSON now properly ignores extra unknown fields rather than panicking ([ARROW-16456](https://issues.apache.org/jira/browse/ARROW-16456))
+* StructBuilder.UnmarshalJSON will no longer fail and panic when Nullable fields are missing ([ARROW-16502](https://issues.apache.org/jira/browse/ARROW-16502))
+* ipc.Reader no longer silently accepts string columns with invalid offsets, preventing unexpected panics later when writing or accessing the resulting arrays. ([ARROW-16831](https://issues.apache.org/jira/browse/ARROW-16831))
+* Arrow CSV reader no longer clobbers its reported errors and properly surfaces them ([ARROW-16926](https://issues.apache.org/jira/browse/ARROW-16926))
+
+### Parquet
+
+#### New Features
+
+* The CreatedBy version string for the Parquet writer will now correctly reflect the library version, and will be updated by the release scripts ([ARROW-16484](https://issues.apache.org/jira/browse/ARROW-16484))
+* Parquet bit_packing functions now have ARM64 NEON implementations for performance ([ARROW-16486](https://issues.apache.org/jira/browse/ARROW-16486))
+* It is now possible to customize the root node in the Parquet writer instead of hardcoding it to be named "schema" with a repetition type of Repeated. This was needed to allow producing files similar to Apache Spark where the root node has a repetition type of Required. It still defaults to the spec definition of Repeated. ([ARROW-16561](https://issues.apache.org/jira/browse/ARROW-16561))
+* parquet_reader CLI mainprog has been enhanced to dump values out as JSON and CSV along with setting an output file instead of just dumping to the terminal. ([ARROW-16934](https://issues.apache.org/jira/browse/ARROW-16934))
+
+#### Bug Fixes
+
+* Fixed a memory leak with Parquet page reading ([ARROW-16473](https://issues.apache.org/jira/browse/ARROW-16473))
+* Parquet Reader properly parallelizes column reads when the parallel option is set to true. ([ARROW-16530](https://issues.apache.org/jira/browse/ARROW-16530))
+* Fixed bug in the Bool decoder for plain encoding ([ARROW-16563](https://issues.apache.org/jira/browse/ARROW-16563))
+* Fixed a bug in the Parquet bool column reader where it failed to properly skip rows ([ARROW-16638](https://issues.apache.org/jira/browse/ARROW-16638))
+* Fixed the flaky Travis ARM64 builds by reducing the size of a test case in the pqarrow unit tests, which lowers the memory usage of the tests. ([ARROW-16669](https://issues.apache.org/jira/browse/ARROW-16669))
+* Parquet writer now properly handles writing arrow.NULL type arrays ([ARROW-16749](https://issues.apache.org/jira/browse/ARROW-16749))
+* Column level dictionary encoding configuration for Parquet writing now correctly respects the input value ([ARROW-16813](https://issues.apache.org/jira/browse/ARROW-16813))
+* Memory leak in DeltaByteArray encoding fixed ([ARROW-16983](https://issues.apache.org/jira/browse/ARROW-16983))
+
+
+## Java notes
+#### New Features
+* Allow overriding column nullability in arrow-jdbc ([#13558](https://github.com/apache/arrow/pull/13558))
+* Enable skip BOUNDS_CHECKING with setBytes and getBytes of ArrowBuf ([#13161](https://github.com/apache/arrow/pull/13161))
+* Initialize JNI components on use instead of statically ([#13146](https://github.com/apache/arrow/pull/13146))
+* Provide explicit JDBC column type mapping ([#13166](https://github.com/apache/arrow/pull/13166))
+* Allow duplicated field names in Java C data interface ([#13247](https://github.com/apache/arrow/pull/13247))
+* Improve and document StackTrace ([#12656](https://github.com/apache/arrow/pull/12656))
+* Keep more context when marshaling errors through JNI ([#13246](https://github.com/apache/arrow/pull/13246))
+* Make RoundingMode configurable to handle inconsistent scale in BigDecimals ([#13433](https://github.com/apache/arrow/pull/13433))
+* Improve Java dev experience with IntelliJ ([#13017](https://github.com/apache/arrow/pull/13017))
+* Implement ArrowArrayStream ([#13465](https://github.com/apache/arrow/pull/13465))
+
+#### Bug Fixes
+* Fix variable-width vectors in integration JSON writer ([#13676](https://github.com/apache/arrow/pull/13676))
+* Handle empty JDBC ResultSet ([#13049](https://github.com/apache/arrow/pull/13049))
+* Fix hasNext() in ArrowVectorIterator ([#13107](https://github.com/apache/arrow/pull/13107))
+* Fix ArrayConsumer when using ArrowVectorIterator ([#12692](https://github.com/apache/arrow/pull/12692))
+* Update Gandiva Protobuf library to enable builds on Apple M1 ([#13121](https://github.com/apache/arrow/pull/13121))
+* Patch dataset module testing failure with JSE11+ ([#13200](https://github.com/apache/arrow/pull/13200))
+* Don't duplicate generated Protobuf classes between flight-core and flight-sql ([#13596](https://github.com/apache/arrow/pull/13596))
+
+## JavaScript notes
+
+* Fix error iterating tables with no batches ([ARROW-16371](https://issues.apache.org/jira/browse/ARROW-16371))
+* Handle case where `tableFromIPC` input is an async `RecordBatchReader` ([ARROW-16704](https://issues.apache.org/jira/browse/ARROW-16704))
+
+## Python notes
+
+Compatibility notes:
+
+* PyArrow now requires Python >= 3.7 ([ARROW-16474](https://issues.apache.org/jira/browse/ARROW-16474)).
+
+* The default behaviour regarding memory mapping has changed in several APIs (reading of Feather or Parquet files, IPC RecordBatchFileReader and RecordBatchStreamReader) to disable memory mapping by default ([ARROW-16382](https://issues.apache.org/jira/browse/ARROW-16382)).
+
+* The default Parquet version is now 2.4 for writing, enabling use of
+more recent logical types by default such as unsigned integers ([ARROW-12203](https://issues.apache.org/jira/browse/ARROW-12203)). One can specify `version="2.6"` to also enable support for nanosecond timestamps. Use `version="1.0"` to restore the old behaviour and maximize file compatibility.
+
+* Some deprecated APIs (deprecated at least since pyarrow 1.0.0) have been removed: IPC methods in the top-level namespace, the `Value` scalar classes and the `pyarrow.compat` module ([ARROW-17010](https://issues.apache.org/jira/browse/ARROW-17010)).
+
+New features:
+
+* Google Cloud Storage (GCS) File System support is now available in the Python bindings ([ARROW-14892](https://issues.apache.org/jira/browse/ARROW-14892)).
+
+* The `Table.filter()` method now supports passing an expression in addition to a boolean array ([ARROW-16469](https://issues.apache.org/jira/browse/ARROW-16469)).
+
+* When implementing extension types in Python, it is now possible to also customize which Python scalar gets returned (in `Array.to_pylist()` or `Scalar.as_py()`) by subclassing `ExtensionScalar` ([ARROW-13612](https://issues.apache.org/jira/browse/ARROW-13612), [ARROW-17065](https://issues.apache.org/jira/browse/ARROW-17065)).
+
+* It is now possible to register User Defined Functions (UDF) for scalar functions using `register_scalar_function` ([ARROW-15639](https://issues.apache.org/jira/browse/ARROW-15639)).
+
+* Basic support for consuming a Substrait plan has been exposed in Python as `pyarrow.substrait.run_query` ([ARROW-15779](https://issues.apache.org/jira/browse/ARROW-15779)).
+
+* The `cast` method and compute kernel now expose the fine-grained options in addition to safe/unsafe casting ([ARROW-15365](https://issues.apache.org/jira/browse/ARROW-15365)).
+
+In addition, this release includes several bug fixes and documentation improvements (such as expanded examples in docstrings ([ARROW-16091](https://issues.apache.org/jira/browse/ARROW-16091))).
+
+Further, the Python bindings benefit from improvements in the C++ library
+(e.g. new compute functions); see the C++ notes above for additional details.
+
+
+## R notes
+
+Highlights include several new `dplyr` verbs, including `glimpse()` and `union_all()`, as well as many more datetime functions from `lubridate`. There is also experimental support for user-defined scalar functions in the query engine, and most packages include native support for datasets in Google Cloud Storage (opt-in in the Linux full source build). 
+
+For more on what’s in the 9.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+Flight SQL is now supported, though only a minimal set of features is available for now.
+
+More Flight features are now supported.
+
+### Ruby
+
+`Enumerable` compatible methods such as `#min` and `#max` on `Arrow::Array`, `Arrow::ChunkedArray` and `Arrow::Column` are implemented by C++'s [compute functions]({{ site.baseurl }}/docs/cpp/compute.html). This improves performance. ([ARROW-15222](https://issues.apache.org/jira/browse/ARROW-15222))
+
+This release fixed some memory leaks. ([ARROW-14790](https://issues.apache.org/jira/browse/ARROW-14790))
+
+This release improved support for interval type arrays such as `Arrow::MonthIntervalArray`. ([ARROW-16206](https://issues.apache.org/jira/browse/ARROW-16206))
+
+This release improved auto data type conversion. ([ARROW-16874](https://issues.apache.org/jira/browse/ARROW-16874))
+
+### C GLib
+
+Vala is now supported ([ARROW-15671](https://issues.apache.org/jira/browse/ARROW-15671)). See [`c_glib/example/vala/`](https://github.com/apache/arrow/tree/apache-arrow-9.0.0/c_glib/example/vala) for examples.
+
+`GArrowQuantileOptions` is added. ([ARROW-16623](https://issues.apache.org/jira/browse/ARROW-16623))
+
+## Rust notes
+
+The Rust projects have moved to separate repositories outside the
+main Arrow monorepo. For notes on the 19.0.0 release of the Rust
+implementation, see the [Arrow Rust changelog][5].
+
+[1]: https://issues.apache.org/jira/issues/?jql=project%20%3D%20ARROW%20AND%20status%20%3D%20Resolved%20AND%20fixVersion%20%3D%209.0.0
+[2]: {{ site.baseurl }}/release/9.0.0.html#contributors
+[3]: {{ site.baseurl }}/release/9.0.0.html#changelog
+[4]: {{ site.baseurl }}/docs/r/news/
+[5]: https://github.com/apache/arrow-rs/blob/19.0.0/CHANGELOG.md