Posted to commits@arrow.apache.org by fs...@apache.org on 2020/01/23 14:28:18 UTC

[arrow-site] 01/01: Add dataset notes

This is an automated email from the ASF dual-hosted git repository.

fsaintjacques pushed a commit to branch pr/41
in repository https://gitbox.apache.org/repos/asf/arrow-site.git

commit c40438b473b711903f51abd0d262fd5fd680b2f5
Author: François Saint-Jacques <fs...@gmail.com>
AuthorDate: Thu Jan 23 09:27:58 2020 -0500

    Add dataset notes
---
 _posts/2020-01-25-0.16.0-release.md | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/_posts/2020-01-25-0.16.0-release.md b/_posts/2020-01-25-0.16.0-release.md
index d9498be..5d45e71 100644
--- a/_posts/2020-01-25-0.16.0-release.md
+++ b/_posts/2020-01-25-0.16.0-release.md
@@ -92,6 +92,33 @@ and result value, rather than taking a pointer-out function parameter
 
 ### C++: Datasets
 
+
+Added support for Arrow IPC files via `IPCFileFormat` (ARROW-7415), and the
+`IN` and `IS_VALID` filter operators (ARROW-7185).
+
+Classes were renamed to avoid the repeated `Data` prefix. The core classes are
+now `Dataset`, `Source`, `Fragment`, and `Partitioning` (ARROW-7498). The dataset
+APIs now use `Result<T>` when returning both a Status and result value, rather
+than taking a pointer-out function parameter (ARROW-7148).
+
+A discovery facility was added to create a `Dataset` and/or `Source` via the
+`DatasetFactory` and `SourceFactory` interfaces. Notably, `FileSystemSourceFactory`
+can crawl directories to find the candidate files (`Fragment`s) supported by a
+given `FileFormat`. The factories will also try to unify schemas, transparently
+supporting missing/added columns in the final unified schema. (ARROW-6614,
+ARROW-7061, ARROW-843, ARROW-7380)
+
+A partitioning facility was added to support partition pruning with predicate
+pushdown (ARROW-6494). The partitioning extracts partitions from the `Fragment`'s
+path, e.g. `/data/year=2015/month=04/day=29` would yield the `year`, `month`
+and `day` partitions. Partitions can be injected into the schema and the RecordBatch
+as materialized columns (ARROW-6965). A `PartitioningFactory` discovery facility
+was added so that the types of the `Partitioning`'s schema are inferred if possible.
+
+The `ParquetFileFormat` transparently supports predicate pushdown by skipping
+RowGroups based on their statistics (ARROW-6952). It also supports column
+projection, reading data only from the columns of interest (ARROW-6951).
+
 The dataset layer now compiles and passes tests on Visual Studio (ARROW-7650).
 
 ### C++: Filesystem layer
@@ -115,6 +142,8 @@ Several issues have already been found and fixed.
 A performance regression when reading a file with a large number of columns
 has been fixed (ARROW-6876, ARROW-7059).
 
+A number of severe bugs were fixed (PARQUET-1766, ARROW-6895).
+
 ### C++: Tensors
 
 CSC sparse matrices are supported (ARROW-4225).