You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2021/11/01 17:42:24 UTC

[GitHub] [arrow-site] nugend commented on a change in pull request #153: 6.0 release blog post

nugend commented on a change in pull request #153:
URL: https://github.com/apache/arrow-site/pull/153#discussion_r740404696



##########
File path: _posts/2021-10-22-6.0.0-release.md
##########
@@ -0,0 +1,212 @@
+---
+layout: post
+title: "Apache Arrow 6.0.0 Release"
+date: "2020-10-22 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 6.0.0 release. This covers
+over 3 months of development work and includes [**572 resolved issues**][1]
+from [**77 distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 5.0.0 release, Nic Crane, QP Hou, Jiayu Liu, and Matt Topol have been invited to be committers, and Neville Dipale has joined the Project Management Committee (PMC). Thanks for your contributions and participation in the project!
+
+## Columnar Format Notes
+* A new calendar interval type consisting of Month, Day and Nanoseconds has been added to the specification.  Reference implementations existing in Java, C++ and Python.
+## Arrow Flight RPC notes
+
+GLib and Ruby have added bindings for Arrow Flight.
+
+While not part of the release, work is ongoing on Arrow Flight SQL, which defines a protocol for clients to communicate with SQL databases using Arrow Flight. For those interested in the project, please reach out on the [mailing list](https://arrow.apache.org/community/).
+## C++ notes
+
+The month-day-nano interval type has been added (ARROW-13628).
+
+Various APIs, including extension types and scalars, are no longer experimental (ARROW-5244).
+
+Support for Visual Studio 2015 was dropped (ARROW-14070).
+
+### Compute Layer
+
+A basic in-memory query engine has been implemented and is accessible from the R bindings. Operations including filter, project, sort, equality joins, and various aggregations are supported.
+
+The following compute functions have been added:
+
+* aggregate functions: `approximate_median`, `count_distinct`, `max`, `min`, `product`
+* hash aggregate functions: `hash_all`, `hash_any`, `hash_approximate_median`, `hash_count_distinct`, `hash_distinct`, `hash_max`, `hash_mean`, `hash_min`, `hash_product`, `hash_stdev`, `hash_variance`
+* scalar arithmetic functions: `logb`, `round`, `round_to_multiple`
+* scalar string functions: `ascii_capitalize`, `ascii_swapcase`, `ascii_title`, `utf8_capitalize`, `utf8_swapcase`, `utf8_title`
+* scalar temporal functions: `assume_timezone`, `day_time_interval_between`, `days_between`, `hours_between`, `microseconds_between`, `milliseconds_between`, `minutes_between`, `month_day_nano_interval_between`, `month_interval_between`, `nanoseconds_between`, `quarters_between`, `seconds_between`, `strftime`, `us_week`, `week`, `weeks_between`, `years_between`
+* other scalar functions: `choose`, `list_element`
+* vector functions: `drop_null`, `select_k_unstable`
+
+In general, type support has been improved for most of the compute functions, but work here is ongoing, particularly around decimal support.
+
+Crashes have been fixed in particular cases for `take`, `filter`, `unique`, and `value_counts` (ARROW-13474, ARROW-13509, ARROW-14129).
+
+Hash aggregations (i.e. group by) supports scalar and array values (ARROW-13737, ARROW-14027).
+
+Temporal functions are now timezone-aware (e.g. when extracting the hour of a timestamp) (ARROW-12980).
+
+`count` can optionally count all values, not just null or non-null values (ARROW-13574).
+
+`fill_null` has been replaced by the more general `coalesce` (ARROW-7179).
+
+`is_null` can optionally consider NaN as null (ARROW-12959).
+
+Sorting has been optimized (ARROW-10898, ARROW-14165). Also, null values can now be sorted at either the beginning or the end (ARROW-12063).
+
+### CSV
+
+The CSV reader can read time32 and time64 types, and will infer time32 values for columns in the format "hh:mm" and "hh:mm:ss" (ARROW-11243).
+
+The decimal point can be customized when reading (ARROW-13421).
+
+The streaming reader will not unintentionally infer null-typed columns when using the various skip options (ARROW-13441).
+
+If a row has an incorrect number of columns, now the row can be skipped instead of raising an error (ARROW-12673).
+
+The option `quoted_strings_can_be_null` applies to all column types now, not just strings (ARROW-13580). When quoting is disabled entirely, the reader now takes advantage of this to improve performance (ARROW-14150).
+
+A CSVWriter object is now exposed, allowing for incremental writing (ARROW-11828). Dates can now be written (ARROW-12540).
+
+### Dataset Layer
+
+The dataset writer was refactored, and now supports more options, including a limit on the number of files open at once, compatibility with the async scanner, a limit on the number of rows written per file, and control over what to do when files already exist in the target directory (ARROW-13650). Additionally, the query engine can feed into the dataset writer as a sink (ARROW-13542).
+
+The asynchronous scanner now properly respects backpressure (ARROW-13611, ARROW-14192), as does the writer (ARROW-14191).
+
+ORC datasets are supported (ARROW-13572) with support for column projection pushdown (ARROW-13797).
+
+The Parquet/IPC format readers now respect the batch_size scanner option (ARROW-14024). Also, the Parquet reader now properly implements readahead for better performance (ARROW-14026).
+
+### IO and Filesystem Layer
+
+The retry strategy of S3FileSystem can be customized (ARROW-13508). When writing to an existing bucket as a user with limited permissions, Arrow will no longer emit a spurious "Access Denied" error (ARROW-13685).
+
+On MacOS with NFS mounts, a "[errno 25] Inappropriate ioctl for device" error was fixed (ARROW-13983). 
+
+The basics of a Google Cloud Storage filesystem have been added; work is in progress for full support (ARROW-8147, ARROW-14222, ARROW-14223, ARROW-14232, ARROW-14236, ARROW-14345, ARROW-14157).
+
+### JSON
+
+A crash was fixed when duplicate keys were present (ARROW-14109). 
+
+### Parquet
+
+Written min/max and null_count statistics for dictionary types were corrected (ARROW-11634, ARROW-12513).
+null_count statistics for columns that contain repeated data where corrected.
+
+file_offset for row groups was not being populated according to the specification, this issue has been corrected.
+
+Column selection now works for repeated columns and structs of more then one level.
+
+An error with large files when built with Thrift 0.14 was fixed (ARROW-13655). 
+
+The ParquetVersion enum was updated with more values to support finer-grained Parquet format version selection (ARROW-13794). 
+
+Writer performance was improved by avoiding repeated dynamic casts (ARROW-13965). 
+
+## C# notes
+
+This release includes improved support for dictionary arrays, as well as integration testing with the other Arrow implementations for the primitive and decimal types 
+
+## Go notes
+
+### Bug Fixes
+
+* Fixed handling of the zero value for Decimal128 `FromBigInt` [#10796](https://github.com/apache/arrow/pull/10796)
+* Fixed various "too many releases" errors in the tests allowing all tests to be run using the `assert` build tag in CI from now on, including a bug when writing slices of String, Binary or FixedWidthType arrays via ipc.Writer [#11270](https://github.com/apache/arrow/pull/11270), [#11276](https://github.com/apache/arrow/pull/11276)
+* Fixed builds on ARM and s390x architectures [#11299](https://github.com/apache/arrow/pull/11299), [#11305]((https://github.com/apache/arrow/pull/11305)
+
+### Enhancements
+
+* Added [Concatenate](https://github.com/apache/arrow/pull/11128) function for concatenating arrays
+* Implemented [Scalar](https://github.com/apache/arrow/pull/11024) values and `MakeArrayFromScalar` function [#11252]((https://github.com/apache/arrow/pull/11252)
+* Added cgo [optional allocator](https://github.com/apache/arrow/pull/11206) for allocating memory using the C++ memory pool for use with the C Data [import](https://github.com/apache/arrow/pull/11037) and [export](https://github.com/apache/arrow/pull/11220) APIs 
+* Added support for Month, Day, Nano interval type [#11310](https://github.com/apache/arrow/pull/11310)
+* Completed [Encoding](https://github.com/apache/arrow/pull/10716) package for Parquet, added [Metadata](https://github.com/apache/arrow/pull/10951) package.
+
+### Version Compatibility Update
+
+* Release process updated to add proper git tags for Go release for Module aware version tracking ([#11312](https://github.com/apache/arrow/pull/11312)) meaning that this release will be correctly tagged as v6.0.0 in go.mod and pkg.go.dev and future releases will be correctly versioned with the Go module system.
+
+## Java notes
+
+## JavaScript notes
+
+This release fixes builds with the latest TypeScript versions and ESM tree shaking. 
+
+## Python notes
+* Many new `pyarrow.compute` functions are available (see the C++ notes above for more details), and introspection of the functions was improved so that they look more like standard Python functions.
+* All Python functions and classes should now have documented parameters in the API reference.
+* SIMD optimization is now enabled in M1 wheels
+* Wheels are now built for more Python versions on M1 systems.
+* PyArrow is now compatible with Python 3.10
+* Creating Arrow arrays now supports more than just numpy arrays as masks
+* Printing Tables now previews the values in the columns
+* `copy_files` is now available in Python
+* Datasets now support ORC files
+* Sets are now supported when building arrays or converting from pandas.
+* 39 bugs have been fixed.
+
+## R notes
+
+This release adds grouped aggregation and joins in the `dplyr` interface, on top of the new Arrow C++ query engine. There is also support for using `duckdb` to query Arrow datasets. For more details, see the [complete R changelog][4]. 
+
+## Ruby and C GLib notes
+
+### Ruby
+
+### C GLib
+
+## Rust notes
+
+Rust continues to release minor versions every 2 weeks in addition to
+a major version with the rest of the Arrow language
+implementations. Thus most enhancements have been incrementally
+released over the last 3 months as part of the 5.x.
+
+The DataFusion and Ballista sub projects have begun releasing at their
+own cadence which is expected to continue in the new few weeks.

Review comment:
       ```suggestion
   own cadence which is expected to continue in the next few weeks.
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org