Posted to issues@arrow.apache.org by "Neal Richardson (Jira)" <ji...@apache.org> on 2021/04/30 23:37:00 UTC
[jira] [Created] (ARROW-12620) [C++] Dataset writing can only include projected columns if input columns are also included
Neal Richardson created ARROW-12620:
---------------------------------------
Summary: [C++] Dataset writing can only include projected columns if input columns are also included
Key: ARROW-12620
URL: https://issues.apache.org/jira/browse/ARROW-12620
Project: Apache Arrow
Issue Type: Bug
Components: C++
Affects Versions: 4.0.0
Reporter: Neal Richardson
I discovered this while working on https://github.com/apache/arrow/pull/10191. You can project new columns when writing a dataset, but only if the columns they are derived from are also included in the output. If the input column is dropped, the projected column is written as all nulls. Here's an R-based example:
{code}
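library(arrow)
library(dplyr)

# Assumption: ds_dir is not defined in the original report; any scratch directory works
ds_dir <- tempfile()

# Simple function to write and re-open the new dataset
write_then_open <- function(ds, path, ...) {
  write_dataset(ds, path, ...)
  open_dataset(path)
}

tab <- Table$create(a = 1:5)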
tab %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       a
#   <int>
# 1     1
# 2     2
# 3     3
# 4     4
# 5     5

# If you rename a column, it's all nulls
tab %>%
  select(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA

# If you derive a new column and keep the original, it works
tab %>%
  mutate(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 2
#       a     b
#   <int> <int>
# 1     1     1
# 2     2     2
# 3     3     3
# 4     4     4
# 5     5     5

# transmute() only keeps the added columns, so it also illustrates the failure
tab %>%
  transmute(b = a) %>%
  write_then_open(ds_dir) %>%
  collect()
# # A tibble: 5 x 1
#       b
#   <int>
# 1    NA
# 2    NA
# 3    NA
# 4    NA
# 5    NA
{code}