You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by ia...@apache.org on 2021/08/04 13:17:06 UTC

[arrow-site] branch master updated: Blog post for 5.0.0 release (#127)

This is an automated email from the ASF dual-hosted git repository.

ianmcook pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow-site.git


The following commit(s) were added to refs/heads/master by this push:
     new ec70f41  Blog post for 5.0.0 release (#127)
ec70f41 is described below

commit ec70f41ce7688b35fb463d56b03de6eb49b773c2
Author: Neal Richardson <ne...@gmail.com>
AuthorDate: Wed Aug 4 09:17:02 2021 -0400

    Blog post for 5.0.0 release (#127)
    
    * Start 5.0.0 blog post
    
    * Update _posts/2021-07-20-5.0.0-release.md
    
    Co-authored-by: David Li <li...@gmail.com>
    
    * JS notes
    
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    
    * Python from @amol-
    
    * Drop C#
    
    * Add Ruby and C GLib notes
    
    * Add C++ notes
    
    * Add Java notes
    
    * Include arrow-rs details
    
    * Update links for arrow-rs changes
    
    * Refer to Rust changelog and blog post
    
    * Change quotes to backticks
    
    * Update _posts/2021-07-20-5.0.0-release.md
    
    Co-authored-by: Weston Pace <we...@gmail.com>
    
    * Go
    
    Co-authored-by: Matt Topol <zo...@gmail.com>
    
    * Organize and trim the go notes
    
    * Delete comment
    
    * Update _posts/2021-07-20-5.0.0-release.md
    
    Co-authored-by: David Li <li...@gmail.com>
    
    * Update _posts/2021-07-20-5.0.0-release.md
    
    Adding spacing
    
    * Update numbers of commits and contributors
    
    * Update date
    
    * Update _posts/2021-07-20-5.0.0-release.md
    
    * Update _posts/2021-07-20-5.0.0-release.md
    
    Co-authored-by: liyafan82 <li...@gmail.com>
    
    * fix typo + add refer to C++
    
    * Remove unneeded quotes
    
    * Remove errant space
    
    * Fix bad link
    
    * Mention arrange() dplyr verb too
    
    Co-authored-by: David Li <li...@gmail.com>
    Co-authored-by: Dominik Moritz <do...@gmail.com>
    Co-authored-by: Sutou Kouhei <ko...@clear-code.com>
    Co-authored-by: Antoine Pitrou <an...@python.org>
    Co-authored-by: Bryan Cutler <cu...@gmail.com>
    Co-authored-by: Ian Cook <ia...@gmail.com>
    Co-authored-by: Weston Pace <we...@gmail.com>
    Co-authored-by: Matt Topol <zo...@gmail.com>
    Co-authored-by: liyafan82 <li...@gmail.com>
    Co-authored-by: Joris Van den Bossche <jo...@gmail.com>
---
 _posts/2021-07-20-5.0.0-release.md | 267 +++++++++++++++++++++++++++++++++++++
 1 file changed, 267 insertions(+)

diff --git a/_posts/2021-07-20-5.0.0-release.md b/_posts/2021-07-20-5.0.0-release.md
new file mode 100644
index 0000000..c6d574b
--- /dev/null
+++ b/_posts/2021-07-20-5.0.0-release.md
@@ -0,0 +1,267 @@
+---
+layout: post
+title: "Apache Arrow 5.0.0 Release"
+date: "2020-07-29 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 5.0.0 release. This covers
+3 months of development work and includes **684 commits** from
+[**99 distinct contributors**][1] in 2 repositories. See the Install Page to
+learn how to get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the complete changelogs for the [`apache/arrow`][2] and
+[`apache/arrow-rs`][3] repositories.
+
+## Community
+
+Since the 4.0.0 release, Daniël Heres, Kazuaki Ishizaki, Dominik Moritz, and Weston Pace
+have been invited as committers to Arrow,
+and Benjamin Kietzman and David Li have joined the Project Management Committee
+(PMC). Thank you for all of your contributions!
+
+## Columnar Format Notes
+
+Official IANA Media types (MIME types) have been registered for Apache
+Arrow IPC protocol data, both [stream]({{ site.baseurl }}/docs/format/Columnar.html#ipc-streaming-format)
+and [file]({{ site.baseurl }}/docs/format/Columnar.html#ipc-file-format) variants:
+
+* https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.stream
+* https://www.iana.org/assignments/media-types/application/vnd.apache.arrow.file
+
+We recommend `.arrow` as the IPC file format file extension and `.arrows` for
+the IPC streaming format file extension.
+
+## Arrow Flight RPC notes
+
+The Go implementation now supports custom metadata and middleware, and has
+been added to integration testing.
+
+In Python, some operations can now be interrupted via Control-C.
+
+## C++ notes
+
+`MakeArrayFromScalar` now works for fixed-size binary types (ARROW-13321).
+
+### Compute layer
+
+The following [compute functions]({{site.baseurl}}/docs/cpp/compute.html)
+were added:
+
+* aggregations: `index`
+
+* scalar arithmetic and math functions: `abs`, `abs_checked`, `acos`,
+  `acos_checked`, `asin`, `asin_checked`, `atan`, `atan2`, `ceil`, `cos`,
+  `cos_checked`, `floor`, `ln`, `ln_checked`, `log10`, `log10_checked`,
+  `log1p`, `log1p_checked`, `log2`, `log2_checked`, `negate`, `negate_checked`,
+  `sign`, `sin`, `sin_checked`, `tan`, `tan_checked`, `trunc`
+
+* scalar bitwise functions: `bit_wise_and`, `bit_wise_not`, `bit_wise_or`,
+  `bit_wise_xor`, `shift_left`, `shift_left_checked`, `shift_right`,
+  `shift_right_checked`
+
+* scalar string functions: `ascii_center`, `ascii_lpad`, `ascii_reverse`,
+  `ascii_rpad`, `binary_join`, `binary_join_element_wise`,
+  `binary_replace_slice`, `count_substring`, `count_substring_regex`,
+  `ends_with`, `find_substring`, `find_substring_regex`, `match_like`,
+  `split_pattern_regex`, `starts_with`, `utf8_center`, `utf8_lpad`,
+  `utf8_replace_slice`, `utf8_rpad`, `utf8_reverse`, `utf8_slice_codepoints`
+
+* scalar temporal functions: `day`, `day_of_week`, `day_of_year`,
+  `iso_calendar`, `iso_week`, `iso_year`, `hour`, `microsecond`, `millisecond`,
+  `minute`, `month`, `nanosecond`, `quarter`, `second`, `subsecond`, `year`
+
+* other scalar functions: `case_when`, `coalesce`, `if_else`, `is_finite`,
+  `is_inf`, `is_nan`, `max_element_wise`, `min_element_wise`, `make_struct`
+
+* vector functions: `replace_with_mask`
+
+Duplicates are now allowed in `SetLookupOptions::value_set` (ARROW-12554).
+
+Decimal types are now supported by some basic arithmetic functions (ARROW-12074).
+
+The `take` function now supports dense unions (ARROW-13005).
+
+It is now possible to cast between dictionary types with different index
+types (ARROW-11673).
+
+Sorting is now implemented for boolean input (ARROW-12016).
+
+### CSV
+
+The streaming CSV reader can now take some advantage of multiple threads (ARROW-11889).
+
+The CSV reader tries to make its errors more informative by adding the
+row number when it is known, i.e. when parallel reading is disabled (ARROW-12675).
+
+A new option `ReaderOptions::skip_rows_after_names` allows skipping a number
+of rows _after_ reading the column names (as opposed to
+`ReaderOptions::skip_rows`).
+
+Quoted strings can now be treated as always non-null (ARROW-10115).
+
+### Dataset layer
+
+The asynchronous scanner introduced in 4.0.0 has been improved with truly 
+asynchronous readers implemented for CSV, Parquet, and IPC file formats and 
+file-level parallelism added.  This mode is controlled by a flag `use_async` that
+can be passed into methods which scan a dataset.  Setting this flag to True
+will have significant improvements on filesystems with high latency or parallel
+reads (e.g. S3).
+
+A CountRows method has been added to count rows matching a predicate; where
+possible, this will use metadata in files instead of reading the data itself.
+
+CSV datasets can now be written, and when reading a CSV dataset, explicit types can
+now be specified for a subset of columns while allowing the rest to still be inferred. 
+
+### IO and Filesystem layer
+
+The I/O thread pool size can now be adjusted at runtime (ARROW-12760).
+The default size remains 8 threads.
+
+Streams now can have auxiliary metadata, depending on the backend.  This
+has been implemented for the S3 filesystems, where a couple metadata
+keys are supported such as `Content-Type` and `ACL` (ARROW-11161, ARROW-12719).
+
+The HadoopFileSystem implementation now implements the FileSystem abstraction
+more faithfully (ARROW-12790).
+
+### Parquet
+
+The new LZ4_RAW compression scheme was implemented (PARQUET-1998).
+Unlike the legacy LZ4 compression scheme, it is defined unambiguously
+and should provide better portability once other Parquet implementations
+catch up.
+
+## Go notes
+
+
+### Flight 
+
+* Flight Client and Server now support Custom Metadata through the functions `flight.NewClientWithMiddleware` and `flight.NewServerWithMiddleware`. Functions
+`flight.NewFlightClient`, `flight.NewFlightServer`, `flight.CreateServerBearerTokenAuthInterceptors` have been deprecated in favor of using the new middleware. [#10633](https://github.com/apache/arrow/pull/10633)
+* Flight Client `AuthHandler` no longer overwrites outgoing metadata, correctly appending new metadata without overwriting existing metadata [#10297](https://github.com/apache/arrow/pull/10297)
+* Flight AppMetadata field is now exposed both for Reading and Writing via `flight.Reader#LatestAppMetadata()` and `flight.Writer#WriteWithAppMetadata` functions [#10142](https://github.com/apache/arrow/pull/10142)
+
+### Other enhancements 
+
+* [Map](https://github.com/apache/arrow/pull/10106) and [Extension](https://github.com/apache/arrow/pull/10203) Datatypes are now implemented for Arrow Arrays 
+* [Schema package](https://github.com/apache/arrow/10071) and first part of [Encoding package](https://github.com/apache/arrow/10379) added for Golang Parquet Implementation
+
+## Java notes
+
+Highlighted improvements and fixes:
+
+* Improved support for extension types using a complex storage type, e.g. struct, map or union. These can now extend
+the `ExtensionTypeVector` base class.
+* Union vectors now extend `AbstractContainerVector` to be consistent with other vectors.
+* Guava dependency updated to 30.1.1
+* Memory leak fixed if an exception occurs when reading IPC messages from a channel. [#10423](https://github.com/apache/arrow/pull/10423)
+* Flight error metadata is now propagated to the client. [#10370](https://github.com/apache/arrow/pull/10370)
+* JDBC adapter now preserves nullability. [#10285](https://github.com/apache/arrow/pull/10285)
+* The memory rounding policy is respected when allocating vector buffers. This helps saving memory space. [#10576](https://github.com/apache/arrow/pull/10576)
+
+API compatibility changes:
+
+* Complex vectors now return covariant types from `getObject(int)`. [#9964](https://github.com/apache/arrow/pull/9964)
+
+## JavaScript notes
+
+* Tables do not extend DataFrames anymore. This enables smaller bundles. [#10277](https://github.com/apache/arrow/pull/10277)
+* Arrow uses closure compiler for all UMD bundles, making them smaller. [#10281](https://github.com/apache/arrow/pull/10281)
+* The npm package now comes with declaration maps for better navigation from types to source code. [#10673](https://github.com/apache/arrow/pull/10673)
+* Updated dependencies and improvements to the code.
+
+## Python notes
+
+* Datasets can now scan files asynchronously when the `use_async=True` option is provided to `Dataset.scanner`, `Dataset.to_table`, or `Dataset.to_batches` methods. This should provide better performance in environments where I/O can be slow, such as with remote sources.
+* Arrow now provides builtin support for writing CSV files through `pyarrow.csv.write_csv`
+* Wheels for Apple M1 Macs are now provided.
+* Many new `pyarrow.compute` functions are available (see the C++ notes above
+  for more details), and introspection of the functions was improved so that
+  they look more like standard Python functions.
+* It is now possible to access ORC file metadata from Python `ORCFile` objects
+* Building a `StructArray` now accepts a `mask` like other arrays
+* Many updates and fixes for the documentation
+
+## R notes
+
+In this release, we've more than doubled the number of functions you can call on Arrow Datasets inside `dplyr::filter()`, `mutate()`, and `arrange()`, including many more string, datetime, and math functions. You can also write Datasets to CSV files, in addition to Parquet and Feather. We've also deepened support for the Arrow C interface, which is used in the Python interface and allows integration with other projects, such as DuckDB.
+
+For more on what’s in the 5.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+Apache Arrow Flight support is started. But `ListFlights` is only supported for now. More features will be implemented in the next major release.
+
+### Ruby
+
+You need gobject-introspection gem 3.4.5 or later to implement your Apache Arrow Flight server. If you only use Apache Arrow Flight client, gobject-introspection gem 3.4.5 or later isn't required.
+
+Here are highlighted improvements:
+
+* Compute functions accept raw Ruby objects such as `true`, `Integer`, `Array` and `String`:
+
+  ```ruby
+  add_function = Arrow::Function.find("add")
+  # Not shortcut version
+  augend = Arrow::Int8Array.new([1, 2, 3])
+  addend = Arrow::Int8Scalar.new(5)
+  args = [
+    Arrow::ArrayDatum.new(augend),
+    Arrow::ScalarDatum.new(addend),
+  ]
+  add_function.execute(args).value.to_a # => [6, 7, 8]
+  # Shortcut version
+  add_function.execute([[1, 2, 3], 5]).value.to_a # => [6, 7, 8]
+  ```
+* `Arrow::PrimaryArray` and `Arrow::Buffer` can be used as MemoryView that is added in Ruby 3.0.
+
+There are some backward incompatible changes:
+
+* `Arrow::CountOptions` and `Arrow::CountMode` are removed. Use `Arrow::ScalarAggregateOptions` instead.
+
+### C GLib
+
+There are some backward incompatible changes:
+
+* `GArrowCountOptions` and `GArrowCountMode` are removed. Use `GArrowScalarAggregateOptions` instead.
+* `garrow_array_equal_range()` requires `GArrowEqualOptions`.
+* Prefix in arrow-dataset-glib is changed to `gadataset_`/`GADATASET_` from `gad_`/`GAD_`.
+* `GADScanOptions`, `GADScanTask`  and `GADInMemoryScanTask` are removed. Use `gadataset_begin_scan()` or `gadataset_to_table()` instead.
+* `GArrowCompareOptions`, `GArrowCompareOperator` and `garrow_*_array_compare()` are removed. Use `equal`, `not_equal`, `less_than`, `less_than_equal`, `greater_than` and `greater_than_equal` compute functions directly instead.
+
+## Rust notes
+
+The Rust projects have moved to separate repositories outside the
+main Arrow monorepo. For notes on the 5.0.0 release of the Rust
+implementation, see the [Arrow Rust changelog][3] and the
+[Apache Arrow Rust 5.0.0 Release blog post]({% post_url 2021-07-20-5.0.0-rs-release %}).
+
+[1]: {{ site.baseurl }}/release/5.0.0.html#contributors
+[2]: {{ site.baseurl }}/release/5.0.0.html#changelog
+[3]: https://github.com/apache/arrow-rs/blob/5.0.0/CHANGELOG.md
+[4]: {{ site.baseurl }}/docs/r/news/