Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/04/26 13:25:15 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue #133: Add support for reading partitioned Parquet files

alamb opened a new issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133


   *Note*: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-11019
   
   Add support for reading Parquet files that are partitioned by key where the files are under a directory structure based on partition keys and values.
   
   /path/to/files/KEY1=value/KEY2=value/files
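
A minimal sketch of how such a path could be turned into partition key/value pairs. The function name `parse_hive_partitions` is hypothetical and is not part of DataFusion; it assumes the path points at a file and that every directory of the form `KEY=value` is a partition directory.

```rust
/// Extract `KEY=value` pairs from the directory part of a file path.
/// Directories without `=` (for example the table root) are ignored,
/// and the file name itself is never inspected.
fn parse_hive_partitions(file_path: &str) -> Vec<(String, String)> {
    // Drop the final component (the file name) so only directories remain.
    let dir = match file_path.rsplit_once('/') {
        Some((dir, _file_name)) => dir,
        None => return Vec::new(),
    };
    dir.split('/')
        .filter_map(|segment| segment.split_once('='))
        .map(|(key, value)| (key.to_string(), value.to_string()))
        .collect()
}
```

The values stay as strings here; turning them into typed values is a separate step that depends on the table schema.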


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] rdettai edited a comment on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
rdettai edited a comment on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-945630854


   @houqp I opened #1139 for adding the feature in the listing provider, we can close this one!





[GitHub] [arrow-datafusion] rdettai commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-945398603


   `ListingTable` does not implement it yet, but I will open a PR, probably this week, to get started on it 😉





[GitHub] [arrow-datafusion] Dandandan closed issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
Dandandan closed issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133


   





[GitHub] [arrow-datafusion] heymind removed a comment on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
heymind removed a comment on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-830729871


   I would like to implement it.
   
   For schema inference, sampling only the first N items may be enough. Schemaless JSON representation is much more difficult to implement, but its usage scenarios are probably limited.





[GitHub] [arrow-datafusion] nugend commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
nugend commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-839088592


   Is there a name for this sort of thing? I've seen it called Hive partitioning somewhere, but I couldn't find any kind of standard, particularly regarding the way that values should be parsed into types.





[GitHub] [arrow-datafusion] alamb commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-839653973


   I do not know of any standard -- the systems I have heard of basically "follow what hive did" -- though if someone else has a reference that would be great.





[GitHub] [arrow-datafusion] houqp commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-840044928


   Hive partitioning is the most commonly used scheme, but there are other schemes as well, for example, the python arrow package supports both directory partitioning and hive partitioning: https://arrow.apache.org/docs/python/generated/pyarrow.dataset.partitioning.html?highlight=partition.
   
   I agree with @Dandandan that we should add the concept of partition column first, then tackle how we ser/de partition values from file paths. I can see us going the python arrow route as well, i.e. supporting multiple partitioning schemes.
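
To illustrate the difference between the two schemes, here is a rough sketch. The `PartitionScheme` enum is hypothetical and only mirrors the idea behind pyarrow's directory and hive partitioning: with hive partitioning each directory names its own column, while with directory partitioning the directories are positional and the column names have to be supplied separately.

```rust
/// Hypothetical description of how partition values are encoded in paths.
enum PartitionScheme {
    /// `year=2021/month=04`: each directory carries its own column name.
    Hive,
    /// `2021/04`: directories are positional; names come from this list.
    Directory(Vec<String>),
}

impl PartitionScheme {
    /// Map a directory path relative to the table root (file name excluded)
    /// to (column, value) pairs, or `None` if it does not fit the scheme.
    fn parse(&self, relative_dir: &str) -> Option<Vec<(String, String)>> {
        let segments: Vec<&str> = relative_dir
            .split('/')
            .filter(|segment| !segment.is_empty())
            .collect();
        match self {
            PartitionScheme::Hive => segments
                .iter()
                .map(|segment| {
                    segment
                        .split_once('=')
                        .map(|(key, value)| (key.to_string(), value.to_string()))
                })
                .collect(), // yields None as soon as one segment lacks `=`
            PartitionScheme::Directory(columns) => {
                if segments.len() != columns.len() {
                    return None;
                }
                Some(
                    columns
                        .iter()
                        .cloned()
                        .zip(segments.iter().map(|segment| segment.to_string()))
                        .collect(),
                )
            }
        }
    }
}
```

Both variants produce the same pairs for the same logical partition, which is what would let the rest of the engine stay independent of the path encoding.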





[GitHub] [arrow-datafusion] heymind commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
heymind commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-830729871


   I would like to implement it.
   
   For schema inference, sampling only the first N items may be enough. Schemaless JSON representation is much more difficult to implement, but its usage scenarios are probably limited.





[GitHub] [arrow-datafusion] Dandandan commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
Dandandan commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-839771888


   @jorgecarleitao yes
   
   I am also not aware of any standard - also implementations do differ in some subtle ways. I think we have to compare to hive / spark / etc.
   
   On the types - it depends if the type already is set in the schema or if some inference is used for the paths. I think we can first start with adding partition columns to the table schema so we can actually parse the locations based on the type - and add automatic detection of types (like CSV) later.
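
A small sketch of why declaring the partition column type first helps: the same path segment parses to different values depending on the declared type. `PartitionType` and `PartitionValue` are hypothetical stand-ins, not Arrow or DataFusion types.

```rust
/// Hypothetical declared type of a partition column.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PartitionType {
    Int64,
    Utf8,
    Boolean,
}

/// Hypothetical typed partition value.
#[derive(Debug, PartialEq)]
enum PartitionValue {
    Int64(i64),
    Utf8(String),
    Boolean(bool),
}

/// Parse the raw text of a path segment according to the declared type.
fn parse_partition_value(raw: &str, ty: PartitionType) -> Result<PartitionValue, String> {
    match ty {
        PartitionType::Int64 => raw
            .parse::<i64>()
            .map(PartitionValue::Int64)
            .map_err(|e| format!("invalid Int64 partition value {:?}: {}", raw, e)),
        PartitionType::Boolean => raw
            .parse::<bool>()
            .map(PartitionValue::Boolean)
            .map_err(|e| format!("invalid Boolean partition value {:?}: {}", raw, e)),
        // Strings are kept verbatim, so "04" keeps its leading zero.
        PartitionType::Utf8 => Ok(PartitionValue::Utf8(raw.to_string())),
    }
}
```

With a declared type, `month=04` is unambiguous: it is the integer 4 for an `Int64` column and the string "04" for a `Utf8` column. Automatic inference would have to guess between the two.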





[GitHub] [arrow-datafusion] dispanser edited a comment on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
dispanser edited a comment on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-827854576


   Is there any reason to limit this to parquet files? In spark, this functionality is shared between csv, json, orc and parquet.
   
   Maybe the implementation could target the shared file listing in `physical_plan::common::build_file_list()` which seems to be shared between parquet and csv.
   
   Considering #204 (adding partition pruning), it may be sensible to already implement the partition pruning logic early in the file listing procedure itself, as it could save on file listing operations, which tend to be expensive in particular on cloud storage (EBS).
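
A sketch of what pruning inside the listing step could look like. Everything here is hypothetical: `Store` is an in-memory stand-in for a file system or object store, and `list_pruned` is not an existing DataFusion function. The point is only that a rejected `key=value` directory is never listed, so the number of list operations drops.

```rust
use std::collections::HashMap;

/// In-memory stand-in for a store: directory path -> child names.
/// A trailing '/' marks a subdirectory.
type Store = HashMap<String, Vec<String>>;

/// Recursively list files under `dir`, skipping every `key=value` directory
/// rejected by `keep`. Returns the files found and the number of list calls.
fn list_pruned(
    store: &Store,
    dir: &str,
    keep: &dyn Fn(&str, &str) -> bool,
) -> (Vec<String>, usize) {
    let mut files = Vec::new();
    let mut list_calls = 1; // one (potentially expensive) list call for `dir`
    if let Some(children) = store.get(dir) {
        for child in children {
            match child.strip_suffix('/') {
                Some(subdir) => {
                    let rejected = subdir
                        .split_once('=')
                        .map_or(false, |(key, value)| !keep(key, value));
                    if !rejected {
                        let path = format!("{}/{}", dir, subdir);
                        let (sub_files, sub_calls) = list_pruned(store, &path, keep);
                        files.extend(sub_files);
                        list_calls += sub_calls;
                    }
                }
                None => files.push(format!("{}/{}", dir, child)),
            }
        }
    }
    (files, list_calls)
}

/// Hypothetical table with two `year` partitions, used as example input.
fn sample_store() -> Store {
    let mut store = Store::new();
    store.insert(
        "/t".to_string(),
        vec!["year=2020/".to_string(), "year=2021/".to_string()],
    );
    store.insert("/t/year=2020".to_string(), vec!["a.parquet".to_string()]);
    store.insert("/t/year=2021".to_string(), vec!["b.parquet".to_string()]);
    store
}
```

With a filter that keeps only `year=2021`, the sample table needs two list calls instead of three, because the `year=2020` directory is never opened.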
   
   I'd love to work on this, but I'd need a bit of guidance on the preferred approach.
   
   





[GitHub] [arrow-datafusion] alamb commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-872902468


   > The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.
   
   I agree





[GitHub] [arrow-datafusion] dispanser commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
dispanser commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-827854576


   Is there any reason to limit this to parquet files? In spark, this functionality is shared between csv, json, orc and parquet.
   
   Maybe the implementation could target the shared file listing in `physical_plan::common::build_file_list()` which seems to be shared between parquet and csv.
   
   Considering #204 (adding partition pruning), it may be sensible to already implement the partition pruning logic early in the file listing procedure itself, as it could save on file listing operations, which tend to be expensive in particular on cloud storage (EBS).
   
   I'd love to work on this, but I'd need a bit of guidance on the preferred approach.
   
   





[GitHub] [arrow-datafusion] rdettai commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-908395161


   I have tried to come up with a design document regarding table formats and partitioning:
   - https://docs.google.com/document/d/1Bd4-PLLH-pHj0BquMDsJ6cVr_awnxTuvwNJuWsTHxAQ/edit?usp=sharing
   
   Sorry for its length. Input is very welcome!





[GitHub] [arrow-datafusion] houqp commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-945411956


   oh right, but at least we now have a single implementation to cover all file formats :D





[GitHub] [arrow-datafusion] jorgecarleitao commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-839722834


   just to check, what hive did in this context is the `column=X/`, `column=Y/`, right?





[GitHub] [arrow-datafusion] houqp commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-912284478


   Thank you @rdettai for the detailed write-up. I recommend sending it to the arrow dev mailing list too, since it's a pretty major design change.








[GitHub] [arrow-datafusion] alamb commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-828446299


   > Is there any reason to limit this to parquet files? 
   
   I do not think there is any reason to limit this to Parquet files. Parquet files are probably the most important use case initially, but the functionality would be useful for everyone.
   
   I think the first thing to do might be to write up a high level proposal (we have used google docs to good effect in the past). The first work needed (for this ticket) is probably to do a recursive directory traversal and find all parquet (or other) formats in subdirectories.
   
   Then there is probably work to interpret paths as their relevant partition keys, and then implement partition pruning (based on the existing row group pruning code, I would think) 
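
The first step (recursive directory traversal) could be sketched with the standard library alone. `find_files` is a hypothetical helper, it only handles a local file system, and it makes no attempt at the later steps of interpreting partition keys or pruning.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Walk `root` and all of its subdirectories, returning every file whose
/// extension equals `extension` (for example "parquet"), in sorted order.
fn find_files(root: &Path, extension: &str) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    // Explicit stack instead of recursion, so deep trees do not grow the call stack.
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else if path.extension().and_then(|ext| ext.to_str()) == Some(extension) {
                found.push(path);
            }
        }
    }
    // Directory iteration order is platform dependent; sort for stable output.
    found.sort();
    Ok(found)
}
```

Each returned path still contains its `KEY=value` directories, so the partition values can be recovered from it afterwards.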
   





[GitHub] [arrow-datafusion] snoe925 commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
snoe925 commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-872375783


   The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.
   I would like to have the dynamic discovery as the default.  But there is a means to do explicit mappings in Athena/Presto SQL.
   This is perhaps a companion to the feature requested in this issue.  The benefit is perhaps faster operation as you don't have to scan the filesystem to discover partitions.  A secondary benefit is using this scheme for version snapshot support.  This is how delta-io works with Athena/Presto/Trino.
   
   Here is an example of syntax.  Definitely needs a Google Doc treatment to outline the details.
   
   I just wanted to comment to show how one can split the filesystem / storage discovery from the idea of partitions.  This is certainly easy syntax for test cases as 100% SQL based interaction.
   
   CREATE EXTERNAL TABLE users (
   first string,
   last string,
   username string
   )
   PARTITIONED BY (id string, id2 string)  -- same as the create table column syntax
   STORED AS PARQUET
   -- omit LOCATION because we are going to explicitly partition with ALTER TABLE
   
   ALTER TABLE users
       ADD PARTITION (id='a', id2='02') LOCATION '/id=a/id2=02/data.parquet'
       ADD PARTITION (id='a', id2='03') LOCATION '/id=a/id2=03/data.parquet'
   
   This is perhaps a UNION ALL of hidden tables for each partition.





[GitHub] [arrow-datafusion] houqp commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
houqp commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-945332283


   I think this can be closed now with @rdettai 's new awesome listing table provider.





[GitHub] [arrow-datafusion] rdettai commented on issue #133: Add support for reading partitioned Parquet files

Posted by GitBox <gi...@apache.org>.
rdettai commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-945630854


   @houqp I opened an issue for adding the feature in the listing provider, we can close this one!

