Posted to issues@arrow.apache.org by "Miki Tebeka (JIRA)" <ji...@apache.org> on 2017/03/14 19:14:41 UTC

[jira] [Comment Edited] (ARROW-539) [Python] Support reading Parquet datasets with standard partition directory schemes

    [ https://issues.apache.org/jira/browse/ARROW-539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924820#comment-15924820 ] 

Miki Tebeka edited comment on ARROW-539 at 3/14/17 7:14 PM:
------------------------------------------------------------

We can do this either at the Arrow level, returning a table with extra fields generated from the directory structure, or at the Pandas level, reading only the values from the Parquet files and then generating the DataFrame columns from the directory structure.

In both cases we'll need to guess the types of the fields from the directory structure, unless metadata files are present.
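A minimal sketch of what that type guessing could look like (hypothetical helper names, not actual Arrow API; assumes Hive-style "key=value" partition directories):

```python
import re

def parse_partition_keys(path):
    """Extract key=value partition pairs from a Hive-style file path."""
    return dict(re.findall(r"([^/=]+)=([^/=]+)", path))

def infer_value(raw):
    """Guess a partition value's type from its string form: int, then float, else str."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            pass
    return raw

keys = parse_partition_keys("year=2017/month=03/part-0.parquet")
# Note the ambiguity: "03" infers as int 3 and loses its zero-padding,
# which is one reason guessing types without metadata files is lossy.
values = {k: infer_value(v) for k, v in keys.items()}
```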

Which is better?



> [Python] Support reading Parquet datasets with standard partition directory schemes
> -----------------------------------------------------------------------------------
>
>                 Key: ARROW-539
>                 URL: https://issues.apache.org/jira/browse/ARROW-539
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Python
>            Reporter: Wes McKinney
>         Attachments: partitioned_parquet.tar.gz
>
>
> Currently, we only support multi-file directories with a flat structure (non-partitioned). 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)