Posted to issues@arrow.apache.org by "Young-Jun Ko (JIRA)" <ji...@apache.org> on 2017/11/21 16:11:01 UTC
[jira] [Created] (ARROW-1842) ParquetDataset.read(): selectively reading array column
Young-Jun Ko created ARROW-1842:
-----------------------------------
Summary: ParquetDataset.read(): selectively reading array column
Key: ARROW-1842
URL: https://issues.apache.org/jira/browse/ARROW-1842
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.7.1
Reporter: Young-Jun Ko
Scenario:
- Created a DataFrame in Spark and saved it as Parquet.
- The columns include simple types, e.g. String, but also an array of doubles.
Issue:
I can read the whole dataset using ParquetDataset in pyarrow.
Selectively reading a simple-type column works.
Selectively reading the array column fails with a KeyError at the following place:
KeyError: 'c'
/home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
513 self.column_idx_map[col_bytes] = i
514
--> 515 return self.column_idx_map[tobytes(column_name)]
When I read the whole dataset without selecting columns, I get the correct metadata:
pyarrow.Table
a: string
b: string
c: list<element: double not null>
child 0, element: double
d: int64
metadata
--------
{'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}
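To make the Spark metadata above easier to read, here is a minimal sketch that parses the JSON string with the standard library and lists each top-level field with its type. The helper name top_level_fields is hypothetical and only for illustration; the JSON is copied from the output above. It shows that, as far as Spark is concerned, the array column is simply named 'c'.

```python
import json

# Spark row metadata copied from the table metadata shown above.
SPARK_ROW_METADATA = (
    '{"type":"struct","fields":['
    '{"name":"a","type":"string","nullable":true,"metadata":{}},'
    '{"name":"b","type":"string","nullable":true,"metadata":{}},'
    '{"name":"c","type":{"type":"array","elementType":"double",'
    '"containsNull":false},"nullable":true,"metadata":{}},'
    '{"name":"d","type":"long","nullable":false,"metadata":{}}]}'
)


def top_level_fields(row_metadata):
    """Return (name, type kind) pairs for the top-level Spark fields.

    Hypothetical helper, not part of pyarrow or Spark.
    """
    schema = json.loads(row_metadata)
    result = []
    for field in schema["fields"]:
        ftype = field["type"]
        # Nested types are JSON objects; simple types are plain strings.
        kind = ftype["type"] if isinstance(ftype, dict) else ftype
        result.append((field["name"], kind))
    return result


fields = top_level_fields(SPARK_ROW_METADATA)
for name, kind in fields:
    print(name, kind)
```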
I might just be missing the correct naming convention for the array column.
But in that case, that name should be reflected in the metadata.
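To illustrate the naming-convention guess, here is a pure-Python sketch that mimics the exact-match lookup visible in the traceback. It does not call pyarrow. The leaf path 'c.list.element' is an assumption based on the standard three-level Parquet list layout and was not checked against the actual file; the function leaf_indices_for is a hypothetical alternative lookup, not an existing pyarrow API.

```python
# Assumed leaf column paths for the schema in this report. The entry for
# "c" is a guess based on the standard Parquet list layout, not something
# read from the actual file.
LEAF_PATHS = ["a", "b", "c.list.element", "d"]


def build_column_idx_map(leaf_paths):
    """Map each dotted leaf path (as bytes) to its column index."""
    return {path.encode("utf8"): i for i, path in enumerate(leaf_paths)}


def column_name_idx(column_idx_map, column_name):
    """Exact-match lookup, like the failing line in the traceback."""
    return column_idx_map[column_name.encode("utf8")]


def leaf_indices_for(column_idx_map, column_name):
    """Hypothetical alternative: match the name itself or any leaf below it."""
    name = column_name.encode("utf8")
    prefix = name + b"."
    return sorted(
        i
        for path, i in column_idx_map.items()
        if path == name or path.startswith(prefix)
    )


idx_map = build_column_idx_map(LEAF_PATHS)

print(column_name_idx(idx_map, "a"))  # simple column: exact match works
try:
    column_name_idx(idx_map, "c")  # top-level name of the nested column
except KeyError as exc:
    print("KeyError:", exc)
print(leaf_indices_for(idx_map, "c"))  # prefix match finds the leaf
```

Under these assumptions, selecting 'c' fails only because the map is keyed by full leaf paths, while a prefix match on the top-level name would find the nested column.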
Thanks!