
Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/10/11 15:51:00 UTC

[jira] [Created] (ARROW-14287) [R] Selecting columns while reading Parquet file with nested types can give wrong column

Joris Van den Bossche created ARROW-14287:
---------------------------------------------

             Summary: [R] Selecting columns while reading Parquet file with nested types can give wrong column
                 Key: ARROW-14287
                 URL: https://issues.apache.org/jira/browse/ARROW-14287
             Project: Apache Arrow
          Issue Type: Bug
          Components: R
            Reporter: Joris Van den Bossche


I created two small files (using Python for my convenience):

{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"a": [1, 2], "b": [3, 4]})
pq.write_table(table, "test1.parquet")

table = pa.table({"a": [1, 2], "nested": [[{'f1': 1, 'f2': 3}, {'f1': 3, 'f2': 4}], None], "b": [3, 4]})
pq.write_table(table, "test2.parquet")
{code}

The first is a simple flat file; the second contains a column with a nested type (a list of structs).

Reading these files in R with a column selection works for the first file, but for the second file it returns the second column ("nested") instead of the third ("b"):

{code:r}
> arrow::read_parquet("test1.parquet", col_select=c("b"))
  b
1 3
2 4
> arrow::read_parquet("test2.parquet", col_select=c("b"))
  nested
1   3, 4
2   NULL
{code}

This happens because the R code converts column names to integer indices based on their position among the top-level columns, whereas Parquet indexes leaf columns, counting each field of a nested column separately.
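
To illustrate the mismatch, here is a minimal pure-Python sketch. It does not use the Arrow or Parquet APIs; the dict-based schema model and the helper names `count_leaves` and `leaf_indices` are hypothetical and exist only for this illustration. It compares the position of a name among the top-level columns with the leaf-column indices that the same name covers when every nested field is counted separately.

```python
# Hypothetical schema model (not an Arrow API): None marks a leaf column,
# a dict marks a nested column whose values are its child fields.
flat_schema = {"a": None, "b": None}                                  # like test1.parquet
nested_schema = {"a": None, "nested": {"f1": None, "f2": None}, "b": None}  # like test2.parquet


def count_leaves(node):
    """Count the leaf columns below one schema node."""
    if node is None:
        return 1
    return sum(count_leaves(child) for child in node.values())


def leaf_indices(schema, name):
    """Return the leaf-column indices covered by the top-level column `name`."""
    offset = 0
    for field, node in schema.items():
        n_leaves = count_leaves(node)
        if field == name:
            return list(range(offset, offset + n_leaves))
        offset += n_leaves
    raise KeyError(name)


def top_level_index(schema, name):
    """Position of `name` among the top-level columns (the naive conversion)."""
    return list(schema).index(name)


for label, schema in [("flat", flat_schema), ("nested", nested_schema)]:
    print(label, "top-level index of b:", top_level_index(schema, "b"),
          "leaf indices of b:", leaf_indices(schema, "b"))
```

In the flat schema both numberings agree for "b" (index 1). In the nested schema the top-level position of "b" is 2, but leaf index 2 belongs to the "f2" field of "nested", and "b" is leaf index 3. That is consistent with the R output above, which shows the values 3, 4 under "nested".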


