Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2017/11/21 21:20:00 UTC

[jira] [Commented] (ARROW-1842) ParquetDataset.read(): selectively reading array column

    [ https://issues.apache.org/jira/browse/ARROW-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261496#comment-16261496 ] 

Wes McKinney commented on ARROW-1842:
-------------------------------------

I think this is a duplicate of https://issues.apache.org/jira/browse/ARROW-1684. If you pass {{'c.element'}} as the column name instead of {{'c'}}, I think it will read the column of interest, but please confirm.
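
To illustrate why the plain name {{'c'}} might fail while a dotted path could work, here is a minimal sketch of a lookup keyed by leaf column path, loosely modelled on the two lines visible in the traceback quoted below. It is not pyarrow's actual implementation: the helper names and the list of leaf paths (in particular {{'c.element'}}) are assumptions made for illustration, and the real path for this file has not been confirmed.

```python
# Simplified model only; this is NOT pyarrow's code.
# Assumption: the reader indexes *leaf* columns by their full dotted path,
# so a nested list column has no entry under its top-level name.


def build_column_idx_map(leaf_paths):
    """Map each leaf column path (encoded as bytes) to its position."""
    return {path.encode("utf8"): i for i, path in enumerate(leaf_paths)}


def column_name_idx(column_idx_map, column_name):
    """Exact-match lookup; raises KeyError when the path is not a leaf."""
    return column_idx_map[column_name.encode("utf8")]


# Hypothetical leaf paths for the table in this report (a, b, c, d),
# assuming the list column is exposed as 'c.element'.
leaf_paths = ["a", "b", "c.element", "d"]
idx_map = build_column_idx_map(leaf_paths)

# A simple column resolves by its plain name.
print(column_name_idx(idx_map, "a"))

# The nested column resolves only by its full leaf path.
print(column_name_idx(idx_map, "c.element"))

# The top-level name alone is not a key, which mirrors the reported KeyError.
try:
    column_name_idx(idx_map, "c")
except KeyError as exc:
    print("KeyError:", exc)
```

Under that assumption, the suggestion above amounts to passing {{'c.element'}} rather than {{'c'}} in the columns argument of ParquetDataset.read(); whether that is the right path for a Spark-written file still needs to be confirmed by the reporter.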

> ParquetDataset.read(): selectively reading array column
> -------------------------------------------------------
>
>                 Key: ARROW-1842
>                 URL: https://issues.apache.org/jira/browse/ARROW-1842
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.7.1
>            Reporter: Young-Jun Ko
>
> Scenario:
> - created a dataframe in spark and saved it as parquet
> - columns include simple types, e.g. String, but also an array of doubles
> Issue:
> I can read the whole data using ParquetDataset in pyarrow.
> I tried reading selectively a simple type => works
> I tried reading selectively the array column => key error in the following place:
> KeyError: 'c'
> /home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
>     513                 self.column_idx_map[col_bytes] = i
>     514 
> --> 515         return self.column_idx_map[tobytes(column_name)]
> When I just read the whole dataset, I get the correct metadata
> pyarrow.Table
> a: string
> b: string
> c: list<element: double not null>
>   child 0, element: double
> d: int64
> metadata
> --------
> {'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}
> I might just be missing the correct naming convention of the array column.
> But then this name should be reflected in the metadata.
> Thanks!


