Posted to issues@arrow.apache.org by "Young-Jun Ko (JIRA)" <ji...@apache.org> on 2017/11/21 16:11:01 UTC

[jira] [Created] (ARROW-1842) ParquetDataset.read(): selectively reading array column

Young-Jun Ko created ARROW-1842:
-----------------------------------

             Summary: ParquetDataset.read(): selectively reading array column
                 Key: ARROW-1842
                 URL: https://issues.apache.org/jira/browse/ARROW-1842
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.7.1
            Reporter: Young-Jun Ko


Scenario:
- Created a DataFrame in Spark and saved it as Parquet.
- Columns include simple types, e.g. String, but also an array of doubles.

Issue:
I can read the whole dataset using ParquetDataset in pyarrow.
Selectively reading a column of a simple type works.
Selectively reading the array column raises a KeyError at the following location:

KeyError: 'c'

/home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
    513                 self.column_idx_map[col_bytes] = i
    514 
--> 515         return self.column_idx_map[tobytes(column_name)]

When I read the whole dataset, I get the correct metadata:


pyarrow.Table
a: string
b: string
c: list<element: double not null>
  child 0, element: double
d: int64
metadata
--------
{'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}


I might just be missing the correct naming convention for the array column.
If so, that name should be reflected in the metadata.
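
To show what the metadata does expose, here is a small sketch that parses the Spark row metadata string from the output above using only the standard library. The helper name describe_fields is hypothetical and introduced purely for illustration. It lists only the top-level field names, so the array column appears simply as 'c', with no nested name.

```python
import json

# Value copied from the 'org.apache.spark.sql.parquet.row.metadata' key above.
spark_row_metadata = (
    '{"type":"struct","fields":['
    '{"name":"a","type":"string","nullable":true,"metadata":{}},'
    '{"name":"b","type":"string","nullable":true,"metadata":{}},'
    '{"name":"c","type":{"type":"array","elementType":"double",'
    '"containsNull":false},"nullable":true,"metadata":{}},'
    '{"name":"d","type":"long","nullable":false,"metadata":{}}]}'
)


def describe_fields(metadata_json):
    """Return (name, type label) pairs for the top-level fields (hypothetical helper)."""
    schema = json.loads(metadata_json)
    described = []
    for field in schema["fields"]:
        field_type = field["type"]
        if isinstance(field_type, dict) and field_type.get("type") == "array":
            # Complex types are nested objects; arrays carry an elementType.
            label = "array<{}>".format(field_type["elementType"])
        elif isinstance(field_type, dict):
            label = field_type.get("type", "unknown")
        else:
            # Simple types are plain strings such as "string" or "long".
            label = field_type
        described.append((field["name"], label))
    return described


fields = describe_fields(spark_row_metadata)
for name, label in fields:
    print(name, label)
```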

Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)