You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/29 21:49:05 UTC

[GitHub] [arrow-site] andygrove commented on a change in pull request #104: Start 4.0 blog post

andygrove commented on a change in pull request #104:
URL: https://github.com/apache/arrow-site/pull/104#discussion_r623425332



##########
File path: _posts/2021-04-26-4.0.0-release.md
##########
@@ -0,0 +1,161 @@
+---
+layout: post
+title: "Apache Arrow 4.0.0 Release"
+date: "2021-04-26 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+
+The Apache Arrow team is pleased to announce the 4.0.0 release. This covers
+3 months of development work and includes [**711 resolved issues**][1]
+from [**114 distinct contributors**][2]. See the Install Page to learn how to
+get the libraries for your platform.
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release. Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## Community
+
+Since the 3.0.0 release, Yibo Cai has been invited as a committer to Arrow, and Andrew Lamb and Jorge Leitão have joined the Project Management Committee (PMC). Thank you for all of your contributions!
+
+## Columnar Format Notes
+
+## Arrow Flight RPC notes
+
+In Java, applications can now enable zero-copy optimizations when writing data (ARROW-11066). This potentially breaks source compatibility, so it is not enabled by default.
+
+Arrow Flight is now packaged for C#/.NET.
+
+## Linux packages notes
+
+There are Linux packages for C++ and C GLib. They were provided by Bintray but [Bintray will no longer be available on 2021-05-01][5]. They are provided by Artifactory now. Users needs to change install instructions because URL is changed. See [the install document][6] for new instructions. Here is a summary of needed changes.
+
+For Debian GNU Linux and Ubuntu users:
+
+  * Users need to change `apache-arrow-archive-keyring` install instruction:
+    * Package name is changed to `apache-arrow-apt-source`.
+    * Download URL is changed to `https://apache.jfrog.io/artifactory/arrow/...` from `https://apache.bintray.com/arrow/...`.
+
+For CentOS and Red Hat Enterprise Linux users:
+
+  * Users need to change `apache-arrow-release` install instruction:
+    * Download URL is changed to `https://apache.jfrog.io/artifactory/arrow/...` from `https://apache.bintray.com/arrow/...`.
+
+## C++ notes
+
+The Arrow C++ library now includes a [`vcpkg.json`](https://github.com/apache/arrow/blob/master/cpp/vcpkg.json) manifest file and a new CMake option `-DARROW_DEPENDENCY_SOURCE=VCPKG` to simplify installation of dependencies using the [vcpkg](https://github.com/microsoft/vcpkg) package manager. This provides an alternative means of installing C++ library dependencies on Linux, macOS, and Windows. See the [Building Arrow C++]({{ site.baseurl }}/docs/developers/cpp/building.html) and [Developing on Windows]({{ site.baseurl }}/docs/developers/cpp/windows.html) docs pages for details.
+
+
+* Parquet library version will now be updated with each release.
+* Parquet DECIMAL statistics where previously calculated incorrectly, this has now been fixed.
+* The Arrow C++ library will now include an ORC writer. Hence it is possible to read and write ORC files in the Arrow C++ library. 
+
+* The S3FileSystem now supports specifying a proxy url (ARROW-8900)
+
+TODO, possible issues to mention:
+
+ARROW-11838 - [C++] Support reading IPC data with shared dictionaries
+ARROW-12316 - [C++] Switch default memory allocator from jemalloc to mimalloc on macOS
+
+### Compute layer
+
+Automatic implicit casting in compute kernels (ARROW-8919). For example, for
+the addition of two arrays, the arrays are first cast to their common numeric
+type instead of erroring when the types are not equal.
+
+ARROW-11744 - [C++] Add xsimd dependency
+
+Compute functions `quantile` (ARROW-10831) and `power` (ARROW-11070) have been
+added for numeric data.
+
+Compute functions for string processing have been added for:
+
+* Trim characters (ARROW-9128).
+* Extract substrings captured by a regex pattern (`extract_regex`, ARROW-10195).
+* Compute UTF8 string lengths (`utf8_length`, ARROW-11693).
+* Match strings against regex pattern (`match_substring_regex`, ARROW-12134).
+* Replace non-overlapping substrings that match (regex) pattern with
+  replacement (`replace_substring` and `replace_substring_regex`,
+  ARROW-10306).
+
+### Dataset
+
+TODO, possible issues to mention:
+
+ARROW-10438 - [C++][Dataset] Partitioning::Format on nulls
+ARROW-10846 - [C++] Add async filesystem operations
+ARROW-11174 - [C++][Dataset] Make Expressions available for projection
+ARROW-11601 - [C++][Dataset] Expose pre-buffering in ParquetFileFormatReaderOptions
+
+
+## C# notes
+
+## Go notes
+
+The go implementation now supports IPC buffer compression
+
+## Java notes
+Java now supports IPC buffer compression (ZSTD is recommended as the current performance of LZ4 is quite slow).
+
+## JavaScript notes
+* The Arrow JS module is now tree-shakeable. 
+* Iterating over Tables or Vectors is ~2X faster. [Demo](https://observablehq.com/@domoritz/arrow-js-3-vs-4-iterator) 
+* The default bundles use modern JS.
+## Python notes
+- Limited support for writing out CSV files (only types that have cast implementation to String) is now available.
+- Writing parquet list types now has the option of enabling the canonical group naming according to the Parquet specification.
+- The ORC Writer is now available.
+
+Creating a dataset with `pyarrow.dataset.write_dataset` is now possible from a
+Python iterator of record batches (ARROW-10882).
+The Dataset interface can now use custom projections using expressions when
+scanning (ARROW-11750). The expressions gained basic support for arithmetic
+operations (e.g. `ds.field('a') / ds.field('b')`) (ARROW-12058). See
+the [Dataset docs][7] for more details.
+
+See the C++ notes above for additional details.
+
+## R notes
+
+The `dplyr` interface to Arrow data gained many new features in this release, including support for `mutate()`, `relocate()`, and more. You can also call in `filter()` or `mutate()` over 100 functions supported by the Arrow C++ library, and many string functions are available both by their base R (`grepl()`, `gsub()`, etc.) and `stringr` (`str_detect()`, `str_replace()`) spellings.
+
+Datasets can now read compressed CSVs automatically, and you can also open a dataset that is based on a single file, enabling you to use `write_dataset()` to partition a very large file without having to read the whole file into memory.
+
+For more on what’s in the 4.0.0 R package, see the [R changelog][4].
+
+## Ruby and C GLib notes
+
+### Ruby
+
+### C GLib
+
+## Rust notes
+

Review comment:
       ```suggestion
   This release of Arrow continues to add new features and performance improvements. Much of our time this release was spent hammering out the necessary details so we can release the Rust versions to cargo at a more regular rate. In addition, we welcomed the Ballista distributed compute project officially to the fold. 
   
   ### Arrow
   
   - Improved LargeUtf8 support
   
   - Improved null handling in AND/OR kernels
   
   - Added JSON writer support (ARROW-11310)
   
   - JSON reader improvements
   
   - - LargeUTF8
     - Improved schema inference for nested list and struct types
   
   - Various performance improvements
   
   - IPC writer no longer calls finish() implicitly on drop
   
   - Compute kernels
   
   - - Support for optional `limit` in sort kernel 
     - Divide by a single scalar
     - Support for casting to timestamps
     - Cast: Improved support between casting List, LargeList, Int32, Int64, Date64
     - Kernel to combine two arrays based on boolean mask
     - Pow kernel
   
   - `new_null_array` for creating Arrays full of nulls. 
   
   ### Parquet
   
   - Added support for filtering row groups (used by DataFusion to implement filter push-down)
   - Added support for Parquet v 2.0 logical types 
   
   ### DataFusion
   
   New Features
   
   - SQL Support
   
   - - CTEs
     - UNION
     - HAVING
     - EXTRACT
     - SHOW TABLES
     - SHOW COLUMNS
     - INTERVAL
     - SQL Information schema
     - Support GROUP BY for more data types, including dictionary columns, boolean, Date32
   
   - Extensibility API
   
   - - Catalogs and schemas support
     - Table deregistration
     - Better support for multiple optimizers 
     - User defined functions can now provide specialized implementations for scalar values
   
   - Physical Plans
   
   - - Hash Repartitioning
   
   - SQL Metrics
   
   - Additional Postgres compatible function library:
   
   - - Length functions
     - Pad/trim functions
     - Concat functions
     - Ascii/Unicode functions
     - Regex
   
   - Proper identifier case identification (e.g. “Foo” vs Foo vs foo)
   
   - Upgraded to Tokio 1.x
   
     
   
   Performance Improvements:
   
   - LIMIT pushdown
   - Constant folding
   - Partitioned hash join
   - Create hashes vectorized in hash join
   - Improve parallelism using repartitioning pass
   - Improved hash aggregate performance with large number of grouping values
   - Predicate pushdown support for table scans
   - Predicate push-down to parquet enables DataFusion to quickly eliminate entire parquet row-groups based on query filter expressions and parquet row group min/max statistics
   
   
   
   API Changes
   
   - DataFrame methods now take `Vec<Expr>` rather than `&[Expr]`
   - TableProvider now consistently uses `Arc<TableProvider>` rather than `Box<TableProvider>`
   
   ### Ballista
   
   - Ballista was donated shortly before the Arrow 4.0.0 release and there is no new release of Ballista as part of Arrow 4.0.0
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org