Posted to commits@arrow.apache.org by fs...@apache.org on 2020/01/23 14:28:17 UTC

[arrow-site] branch pr/41 created (now c40438b)

This is an automated email from the ASF dual-hosted git repository.

fsaintjacques pushed a change to branch pr/41
in repository https://gitbox.apache.org/repos/asf/arrow-site.git.


      at c40438b  Add dataset notes

This branch includes the following new commits:

     new c40438b  Add dataset notes

The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.



[arrow-site] 01/01: Add dataset notes

Posted by fs...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

fsaintjacques pushed a commit to branch pr/41
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit c40438b473b711903f51abd0d262fd5fd680b2f5
Author: François Saint-Jacques <fs...@gmail.com>
AuthorDate: Thu Jan 23 09:27:58 2020 -0500

    Add dataset notes
---
 _posts/2020-01-25-0.16.0-release.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/_posts/2020-01-25-0.16.0-release.md b/_posts/2020-01-25-0.16.0-release.md
index d9498be..5d45e71 100644
--- a/_posts/2020-01-25-0.16.0-release.md
+++ b/_posts/2020-01-25-0.16.0-release.md
@@ -92,6 +92,33 @@ and result value, rather than taking a pointer-out function parameter
 
 ### C++: Datasets
 
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably, `FileSystemSourceFactory`
+can crawl directories to find the candidate files (`Fragment`s) supported by a
+given `FileFormat`. The factories also try to unify schemas, transparently
+handling missing or added columns in the final unified schema (ARROW-6614,
+ARROW-7061, ARROW-843, ARROW-7380).
+
+A partitioning facility was added to support partition pruning with predicate
+pushdown (ARROW-6494). The partitioning extracts partition keys from the
+`Fragment`'s path, e.g. `/data/year=2015/month=04/day=29` yields the `year`,
+`month`, and `day` partitions. Partitions can be injected into the schema and
+the RecordBatch as materialized columns (ARROW-6965). A `PartitioningFactory`
+discovery facility was added to infer the `Partitioning`'s schema types if possible.
+
+The `ParquetFileFormat` transparently supports predicate pushdown by skipping
+RowGroups based on their statistics (ARROW-6952). It also supports column
+projection, reading data only from the columns of interest (ARROW-6951).
+
 The dataset layer now compiles and passes tests on Visual Studio (ARROW-7650).
 
 ### C++: Filesystem layer
@@ -115,6 +142,8 @@ Several issues have already been found and fixed.
 A performance regression when reading a file with a large number of columns
 has been fixed (ARROW-6876, ARROW-7059).
 
+A number of severe bugs were fixed (PARQUET-1766, ARROW-6895).
+
 ### C++: Tensors
 
 CSC sparse matrices are supported (ARROW-4225).