Posted to commits@arrow.apache.org by GitBox <gi...@apache.org> on 2020/01/14 21:33:15 UTC

[GitHub] [arrow-site] nealrichardson opened a new pull request #41: ARROW-7580: [Website] 0.16 release post

nealrichardson opened a new pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [arrow-site] bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r370340034
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -92,6 +92,33 @@ and result value, rather than taking a pointer-out function parameter
 
 ### C++: Datasets
 
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably `FileSystemSourceFactory`
+can crawl directories to find the candidates files (`Fragment`) supporting by a
+given `FileFormat`. The factories will also try to unify schemas, transparently
+supporting missing/added columns to the final unified schema. (ARROW-6614,
+ARROW-7061, ARROW-843, ARROW-7380)
+
+A partitioning facility was added to support partition pruning with predicate
+pushdown (ARROW-6494). The partitioning extracts partition from the `Fragment`'s
+path, e.g. `/data/year=2015/month=04/day=29` would extract the `year`, `month`
+and `day` partitions. Partitions can injected in the schema and the RecordBatch
 
 Review comment:
   ```suggestion
   and `day` partitions. Partition fields can be injected into the schema and into the RecordBatch
   ```


[GitHub] [arrow-site] paddyhoran commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
paddyhoran commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r370896853
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -0,0 +1,250 @@
+---
+layout: post
+title: "Apache Arrow 0.16.0 Release"
+date: "2020-01-25 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 0.16.0 release. This covers
+about 4 months of development work and includes [**XXX resolved issues**][1]
+from [**YY distinct contributors**][2].  See the Install Page to learn how to
+get the libraries for your platform.
+
+<!-- Another paragraph here -->
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release.  Many other bugfixes and improvements have been made, we refer
+you to the [complete changelog][3].
+
+## New committers
+
+Since the 0.15.0 release, we've added two new committers:
+
+* [Eric Erhardt][4]
+* [Joris Van den Bossche][5]
+
+Thank you for all your contributions!
+
+## Columnar Format Notes
+
+
+## Arrow Flight notes
+
+## C++ notes
+
+Some work has been done to make the default build configuration of Arrow C++
+as lean as possible:
+
+* The Arrow C++ core can now be built without any dependency on Boost
+(ARROW-6613, ARROW-6782, ARROW-6743, ARROW-6742).
+
+* Flatbuffers and its generated files are vendored within the Arrow C++ source
+tree (ARROW-6634), as well as the double-conversion library (ARROW-6633) and the
+uriparser library (ARROW-7169).
+
+* Compression support (ARROW-6631) and GLog integration (ARROW-6635) are disabled
+by default.
+
+* The filesystem (ARROW-6610), CSV, compute, dataset and JSON layers (ARROW-6637),
+as well as command-line utilities (ARROW-6636), are disabled by default.
+
+When enabled, the default jemalloc configuration has been tweaked to return
+memory more aggressively to the OS (ARROW-6910, ARROW-6994).
+
+The array validation facilities have been vastly expanded and now exist in
+two flavors: the `Validate` method does a light-weight validation that's
+O(1) in array size, while the potentially O(N) method `ValidateFull` does
+thorough data validation (ARROW-6157).
+
+The IO APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7235).
+
+### C++: CSV
+
+An option is added to attempt automatic dictionary encoding of string columns
+during reading a CSV file, until a cardinality limit is reached. When
+successful, it can make reading faster and the resulting Arrow data is
+much more memory-efficient (ARROW-3408).
+
+The CSV APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7236).
+
+### C++: Datasets
+
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably `FileSystemSourceFactory`
+can crawl directories to find the candidates files (`Fragment`) supporting by a
+given `FileFormat`. The factories will also try to unify schemas, transparently
+supporting missing/added columns to the final unified schema. (ARROW-6614,
+ARROW-7061, ARROW-843, ARROW-7380)
+
+A partitioning facility was added to support partition pruning with predicate
+pushdown (ARROW-6494). The partitioning extracts partition from the `Fragment`'s
+path, e.g. `/data/year=2015/month=04/day=29` would extract the `year`, `month`
+and `day` partitions. Partitions can injected in the schema and the RecordBatch
+as materialized columns (ARROW-6965). Add a `PartitioningFactory` discovery
+facility such that types of the `Partitioning`'s schema are inferred if possible.
+
+The `ParquetFileFormat` transparently supports predicate pushdown by ignoring
+RowGroups based on their statistic (ARROW-6952). It also supports column
+projection, effectively only reading data from columns of interest (ARROW-6951).
+
+The dataset layer now compiles and passes tests on Visual Studio (ARROW-7650).
+
+### C++: Filesystem layer
+
+An HDFS implementation of the FileSystem class is available (ARROW-6720).
+
+The filesystem APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7161).
+
+### C++: IPC
+
+The Arrow IPC reader is being fuzzed continuously by the [OSS-Fuzz][6]
+infrastructure, to detect undesirable behavior on invalid or malicious input.
+Several issues have already been found and fixed.
+
+### C++: Parquet
+
+[Modular encryption][10] is supported (PARQUET-1300).
+
+A performance regression when reading a file with a large number of columns
+has been fixed (ARROW-6876, ARROW-7059).
+
+A number of severe bugs were fixed (PARQUET-1766, ARROW-6895).
+
+### C++: Tensors
+
+CSC sparse matrices are supported (ARROW-4225).
+
+The Tensor APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7420).
+
+## C# Notes
+
+
+## Java notes
+
+
+## Python notes
+
+Arrow is now tested against Python 3.8 as well.
+
+Python now has bindings for the datasets API (ARROW-6341) as well as the S3
+(ARROW-6655) and HDFS (ARROW-7310) filesystem implementations.
+
+The Duration (ARROW-5855) and Fixed Size List (ARROW-7261) types are exposed
+in Python.
+
+Sparse tensors can be converted to dense tensors (ARROW-6624).  They are
+also interoperable with the `pydata/sparse` and `scipy.sparse` libraries
+(ARROW-4223, ARROW-4224).
+
+A memory leak when converting Arrow data to Pandas "object" data has been
+fixed (ARROW-6874).
+
+Pandas extension arrays now are able to roundtrip through Arrow conversion
+(ARROW-2428).
+
+We now build manylinux2014 wheels for Python 3 (ARROW-7344).
+
+## Ruby and C GLib notes
+
+
+## Rust notes
+
+Support for Arrow data types has been improved, with the following array types now supported (ARROW-3690):
+
+* Fixed Size List and Fixed Size Binary
+* Adding a String Array for utf-8 strings, and keeping the Binary Array for general binary data
+* Duration and interval arrays.
+
+IPC readers for files and streams have been implemented (ARROW-5180).
+
+### Rust: DataFusion
+
+Query execution has been reimplemented with an extensible physical query plan. This allows other projec ts to add other plans, such as for distributed computing or for specific database servers (ARROW-5227).
 
 Review comment:
   ```suggestion
   Query execution has been reimplemented with an extensible physical query plan. This allows other projects to add other plans, such as for distributed computing or for specific database servers (ARROW-5227).
   ```


[GitHub] [arrow-site] nealrichardson commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-579008786
 
 
   I removed the Jira references from the dataset section since it’s all new functionality. I agree that they aren’t interesting to probably anyone but us. 


[GitHub] [arrow-site] nevi-me commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
nevi-me commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-578205665
 
 
   Hi @nealrichardson, my sincere apologies; I was actually trying to push my commit on Rust changes into my fork, then open a PR. I think I had finger trouble. If you prefer, I can revert my commit and make it a PR instead.
   
   @andygrove @paddyhoran @liurenjie1024 are you otherwise happy with the Rust notes additions? (Rendered: https://github.com/apache/arrow-site/pull/41/commits/74447c81cebbdc6f7164c48aa38981958234bc54?short_path=6283d65#diff-6283d65beb8b589c142c7e3d9c16ad64)


[GitHub] [arrow-site] nealrichardson commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
nealrichardson commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-584665957
 
 
   Please no, not without reading through it first. There are still some placeholders that need to be filled in, e.g. https://github.com/apache/arrow-site/pull/41/files#diff-6283d65beb8b589c142c7e3d9c16ad64R28-R29


[GitHub] [arrow-site] bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r370341160
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -92,6 +92,33 @@ and result value, rather than taking a pointer-out function parameter
 
 ### C++: Datasets
 
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably `FileSystemSourceFactory`
+can crawl directories to find the candidates files (`Fragment`) supporting by a
 
 Review comment:
   ```suggestion
   can crawl directories to find the candidate files (`FileFragment`s) supported by a
   ```


[GitHub] [arrow-site] kszucs commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
kszucs commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-584613648
 
 
   @wesm @nealrichardson shall we merge this?


[GitHub] [arrow-site] wesm commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
wesm commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r371476426
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -0,0 +1,271 @@
+---
+layout: post
+title: "Apache Arrow 0.16.0 Release"
+date: "2020-01-25 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 0.16.0 release. This covers
+about 4 months of development work and includes [**XXX resolved issues**][1]
+from [**YY distinct contributors**][2].  See the Install Page to learn how to
+get the libraries for your platform.
+
+<!-- Another paragraph here -->
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release.  Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## New committers
+
+Since the 0.15.0 release, we've added two new committers:
+
+* [Eric Erhardt][4]
+* [Joris Van den Bossche][5]
+
+Thank you for all your contributions!
+
+## Columnar Format Notes
+
+We still have work to do to complete comprehensive columnar format integration
+testing between the Java and C++ libraries. Once this work is completed, we
+intend to make a 1.0.0 release with [forward and backward compatibility
+guarantees][23].
+
+## Arrow Flight RPC notes
+
+Flight development work has recently focused on robustness and stability. If
+you are not yet familiar with Flight, read the [introductory blog post from
+October][24].
+
+We are also discussing adding a "bidirectional RPC" which enables
+request-response workflows requiring both client and server to send data
+streams to be performed in a single RPC request.
+
+## C++ notes
+
+Some work has been done to make the default build configuration of Arrow C++ as
+lean as possible. The Arrow C++ core can now be built without any external
+dependencies other than a new enough C++ compiler (gcc 4.9 or higher). Notably,
+Boost is no longer required. We invested effort to vendor some small essential
+dependencies: Flatbuffers, double-conversion, and uriparser. Many optional
+features requiring external libraries, like compression and GLog integration,
+are now disabled by default. Several subcomponents of the C++ project like the
+filesystem API, CSV, compute, dataset and JSON layers, as well as command-line
+utilities, are now disabled by default. The only toolchain dependency enabled
+by default is jemalloc, the recommended memory allocator, but this can also be
+disabled if desired.
+
+When enabled, the default jemalloc configuration has been tweaked to return
+memory more aggressively to the OS (ARROW-6910, ARROW-6994). We welcome
+feedback from users about our memory allocation configuration and performance
+in applications.
+
+The array validation facilities have been vastly expanded and now exist in
+two flavors: the `Validate` method does a light-weight validation that's
+O(1) in array size, while the potentially O(N) method `ValidateFull` does
+thorough data validation (ARROW-6157).
+
+The IO APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7235).
+
+### C++: CSV
+
+An option is added to attempt automatic dictionary encoding of string columns
+during reading a CSV file, until a cardinality limit is reached. When
+successful, it can make reading faster and the resulting Arrow data is
+much more memory-efficient (ARROW-3408).
+
+The CSV APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7236).
+
+### C++: Datasets
+
+The 0.16 release introduces the Datasets API to the C++ library, along with
+bindings in Python and R.
+This API allows you to treat multiple files as a single logical dataset entity
+and make efficient selection queries against it.
+This release includes support for Parquet and Arrow IPC file formats.
+Factory objects allow you to discover files in a directory recursively, inspect the schemas in the files, and perform some basic schema unification.
+You may specify how file path segments map to partitions, and there is support for auto-detecting some partition information, including Hive-style partitioning.
+The Datasets API includes a filter expression syntax as well as column selection.
+These are evaluated with predicate pushdown, and for Parquet, evaluation is pushed down to row groups.
+
+### C++: Filesystem layer
+
+An HDFS implementation of the FileSystem class is available (ARROW-6720). We
+plan to deprecate the prior bespoke C++ HDFS class in favor of the standardized
+filesystem API.
+
+The filesystem APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7161).
+
+### C++: IPC
+
+The Arrow IPC reader is being fuzzed continuously by the [OSS-Fuzz][6]
+infrastructure, to detect undesirable behavior on invalid or malicious input.
+Several issues have already been found and fixed.
+
+### C++: Parquet
+
+[Modular encryption][10] is now supported (PARQUET-1300).
+
+A performance regression when reading a file with a large number of columns
+has been fixed (ARROW-6876, ARROW-7059), as well as several bugs (PARQUET-1766, ARROW-6895).
+
+### C++: Tensors
+
+CSC sparse matrices are supported (ARROW-4225).
+
+The Tensor APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7420).
+
+## C# Notes
+
+
+## Java notes
+
+
+## Python notes
+
 
 Review comment:
   yes


[GitHub] [arrow-site] pitrou commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
pitrou commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-577674517
 
 
   We need to add a summary of the dataset additions. @fsaintjacques @bkietz ?


[GitHub] [arrow-site] wesm commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-584682335
 
 
   I think ample time has passed... I'm giving it a read through and will merge after that


[GitHub] [arrow-site] wesm commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-584686108
 
 
   I rebased and gave a skim. I'll leave this open for another half hour or so in case anyone wants to give it a skim and fix typos or reword things


[GitHub] [arrow-site] bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r370340319
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -92,6 +92,33 @@ and result value, rather than taking a pointer-out function parameter
 
 ### C++: Datasets
 
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably `FileSystemSourceFactory`
+can crawl directories to find the candidates files (`Fragment`) supporting by a
+given `FileFormat`. The factories will also try to unify schemas, transparently
+supporting missing/added columns to the final unified schema. (ARROW-6614,
+ARROW-7061, ARROW-843, ARROW-7380)
+
+A partitioning facility was added to support partition pruning with predicate
+pushdown (ARROW-6494). The partitioning extracts partition from the `Fragment`'s
+path, e.g. `/data/year=2015/month=04/day=29` would extract the `year`, `month`
+and `day` partitions. Partitions can injected in the schema and the RecordBatch
+as materialized columns (ARROW-6965). Add a `PartitioningFactory` discovery
 
 Review comment:
   ```suggestion
   as materialized columns (ARROW-6965). There is now a `PartitioningFactory` discovery
   ```


[GitHub] [arrow-site] wesm commented on issue #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
wesm commented on issue #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#issuecomment-578891994
 
 
   I added some notes and compacted part of the C++ section. We might in general consider omitting most JIRA references to keep the information density as high as possible for readers who are less interested in the gory details. 


[GitHub] [arrow-site] wesm merged pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
wesm merged pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41
 
 
   


[GitHub] [arrow-site] bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
bkietz commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r370340520
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -92,6 +92,33 @@ and result value, rather than taking a pointer-out function parameter
 
 ### C++: Datasets
 
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably `FileSystemSourceFactory`
+can crawl directories to find the candidates files (`Fragment`) supporting by a
+given `FileFormat`. The factories will also try to unify schemas, transparently
+supporting missing/added columns to the final unified schema. (ARROW-6614,
+ARROW-7061, ARROW-843, ARROW-7380)
+
+A partitioning facility was added to support partition pruning with predicate
+pushdown (ARROW-6494). The partitioning extracts partition from the `Fragment`'s
+path, e.g. `/data/year=2015/month=04/day=29` would extract the `year`, `month`
+and `day` partitions. Partitions can injected in the schema and the RecordBatch
+as materialized columns (ARROW-6965). Add a `PartitioningFactory` discovery
+facility such that types of the `Partitioning`'s schema are inferred if possible.
+
+The `ParquetFileFormat` transparently supports predicate pushdown by ignoring
+RowGroups based on their statistic (ARROW-6952). It also supports column
 
 Review comment:
   ```suggestion
   RowGroups based on their statistics (ARROW-6952). It also supports column
   ```


[GitHub] [arrow-site] jorisvandenbossche commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post

Posted by GitBox <gi...@apache.org>.
jorisvandenbossche commented on a change in pull request #41: ARROW-7580: [Website] 0.16 release post 
URL: https://github.com/apache/arrow-site/pull/41#discussion_r371442130
 
 

 ##########
 File path: _posts/2020-01-25-0.16.0-release.md
 ##########
 @@ -0,0 +1,271 @@
+---
+layout: post
+title: "Apache Arrow 0.16.0 Release"
+date: "2020-01-25 00:00:00 -0600"
+author: pmc
+categories: [release]
+---
+<!--
+{% comment %}
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements.  See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to you under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License.  You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+{% endcomment %}
+-->
+
+The Apache Arrow team is pleased to announce the 0.16.0 release. This covers
+about 4 months of development work and includes [**XXX resolved issues**][1]
+from [**YY distinct contributors**][2].  See the Install Page to learn how to
+get the libraries for your platform.
+
+<!-- Another paragraph here -->
+
+The release notes below are not exhaustive and only expose selected highlights
+of the release.  Many other bugfixes and improvements have been made: we refer
+you to the [complete changelog][3].
+
+## New committers
+
+Since the 0.15.0 release, we've added two new committers:
+
+* [Eric Erhardt][4]
+* [Joris Van den Bossche][5]
+
+Thank you for all your contributions!
+
+## Columnar Format Notes
+
+We still have work to do to complete comprehensive columnar format integration
+testing between the Java and C++ libraries. Once this work is completed, we
+intend to make a 1.0.0 release with [forward and backward compatibility
+guarantees][23].
+
+## Arrow Flight RPC notes
+
+Flight development work has recently focused on robustness and stability. If
+you are not yet familiar with Flight, read the [introductory blog post from
+October][24].
+
+We are also discussing adding a "bidirectional RPC", which would allow
+request-response workflows that require both client and server to send data
+streams to be performed in a single RPC request.
+
+## C++ notes
+
+Some work has been done to make the default build configuration of Arrow C++ as
+lean as possible. The Arrow C++ core can now be built without any external
+dependencies other than a new enough C++ compiler (gcc 4.9 or higher). Notably,
+Boost is no longer required. We invested effort to vendor some small essential
+dependencies: Flatbuffers, double-conversion, and uriparser. Many optional
+features requiring external libraries, like compression and GLog integration,
+are now disabled by default. Several subcomponents of the C++ project like the
+filesystem API, CSV, compute, dataset and JSON layers, as well as command-line
+utilities, are now disabled by default. The only toolchain dependency enabled
+by default is jemalloc, the recommended memory allocator, but this can also be
+disabled if desired.
+
+When enabled, the default jemalloc configuration has been tweaked to return
+memory more aggressively to the OS (ARROW-6910, ARROW-6994). We welcome
+feedback from users about our memory allocation configuration and performance
+in applications.
+
+The array validation facilities have been vastly expanded and now exist in
+two flavors: the `Validate` method does a light-weight validation that's
+O(1) in array size, while the potentially O(N) method `ValidateFull` does
+thorough data validation (ARROW-6157).
+
+The IO APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7235).
+
+### C++: CSV
+
+An option was added to attempt automatic dictionary encoding of string columns
+while reading a CSV file, until a cardinality limit is reached. When
+successful, it can make reading faster and the resulting Arrow data is
+much more memory-efficient (ARROW-3408).
+
+The CSV APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7236).
+
+### C++: Datasets
+
+The 0.16 release introduces the Datasets API to the C++ library, along with
+bindings in Python and R.
+This API allows you to treat multiple files as a single logical dataset entity
+and make efficient selection queries against it.
+This release includes support for Parquet and Arrow IPC file formats.
+Factory objects allow you to discover files in a directory recursively, inspect the schemas in the files, and perform some basic schema unification.
+You may specify how file path segments map to partitions, and there is support for auto-detecting some partition information, including Hive-style partitioning.
+The Datasets API includes a filter expression syntax as well as column selection.
+These are evaluated with predicate pushdown, and for Parquet, evaluation is pushed down to row groups.
+
+### C++: Filesystem layer
+
+An HDFS implementation of the FileSystem class is available (ARROW-6720). We
+plan to deprecate the prior bespoke C++ HDFS class in favor of the standardized
+filesystem API.
+
+The filesystem APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7161).
+
+### C++: IPC
+
+The Arrow IPC reader is being fuzzed continuously by the [OSS-Fuzz][6]
+infrastructure, to detect undesirable behavior on invalid or malicious input.
+Several issues have already been found and fixed.
+
+### C++: Parquet
+
+[Modular encryption][10] is now supported (PARQUET-1300).
+
+A performance regression when reading a file with a large number of columns
+has been fixed (ARROW-6876, ARROW-7059), as well as several bugs (PARQUET-1766, ARROW-6895).
+
+### C++: Tensors
+
+CSC sparse matrices are supported (ARROW-4225).
+
+The Tensor APIs now use `Result<T>` when returning both a Status
+and result value, rather than taking a pointer-out function parameter
+(ARROW-7420).
+
+## C# Notes
+
+
+## Java notes
+
+
+## Python notes
+
 
 Review comment:
   Should we mention here that 0.16 is the last release to support Python 2.7?
