Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/10/11 15:51:00 UTC
[jira] [Created] (ARROW-14287) [R] Selecting columns while reading
Parquet file with nested types can give wrong column
Joris Van den Bossche created ARROW-14287:
---------------------------------------------
Summary: [R] Selecting columns while reading Parquet file with nested types can give wrong column
Key: ARROW-14287
URL: https://issues.apache.org/jira/browse/ARROW-14287
Project: Apache Arrow
Issue Type: Bug
Components: R
Reporter: Joris Van den Bossche
I created two small files (using Python for my convenience):
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
# Simple file: two flat integer columns.
table = pa.table({"a": [1, 2], "b": [3, 4]})
pq.write_table(table, "test1.parquet")
# Same columns, plus a list-of-struct column between "a" and "b".
table = pa.table({"a": [1, 2], "nested": [[{"f1": 1, "f2": 3}, {"f1": 3, "f2": 4}], None], "b": [3, 4]})
pq.write_table(table, "test2.parquet")
{code}
The first is a simple file; the second contains a nested column (a list of structs) between "a" and "b".
Reading the files in R with a column selection works for the first file, but for the second file it returns the second column ("nested") instead of the requested third column ("b"):
{code:r}
> arrow::read_parquet("test1.parquet", col_select=c("b"))
b
1 3
2 4
> arrow::read_parquet("test2.parquet", col_select=c("b"))
nested
1 3, 4
2 NULL
{code}
This is due to the simple conversion of column names to integer indices in the R code, while Parquet counts the individual fields of nested columns separately.
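To illustrate the mismatch described above, here is a minimal sketch in plain Python. It does not call Arrow or Parquet: the `SCHEMA` list and the helper names are hypothetical, and the leaf names are shortened (real Parquet column paths include intermediate list levels). It only models the idea that a nested column contributes one leaf column per field.

```python
# Toy model of the schema in test2.parquet (an assumption for illustration).
# None marks a primitive column; a list of names marks the leaf fields of
# the nested list-of-struct column.
SCHEMA = [("a", None), ("nested", ["f1", "f2"]), ("b", None)]


def leaf_columns(schema):
    """Flatten the schema into one entry per leaf column."""
    leaves = []
    for name, children in schema:
        if children is None:
            leaves.append(name)
        else:
            leaves.extend(f"{name}.{child}" for child in children)
    return leaves


def naive_index(schema, name):
    """0-based position of `name` among the top-level fields only."""
    return [field for field, _ in schema].index(name)


def leaf_indices(schema, name):
    """0-based leaf-column indices covered by the top-level field `name`."""
    indices, position = [], 0
    for field, children in schema:
        width = 1 if children is None else len(children)
        if field == name:
            indices = list(range(position, position + width))
        position += width
    return indices


leaves = leaf_columns(SCHEMA)
print(leaves)                            # ['a', 'nested.f1', 'nested.f2', 'b']
print(leaves[naive_index(SCHEMA, "b")])  # nested.f2
print(leaf_indices(SCHEMA, "b"))         # [3]
```

In this toy model the top-level position of "b" points at the `f2` leaf of "nested", which would be consistent with the values 3, 4 shown in the R output. Mapping the name to its leaf indices first, as `leaf_indices` does here, selects "b" instead.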