Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/01 16:13:56 UTC

[GitHub] [arrow-datafusion] snoe925 commented on issue #133: Add support for reading partitioned Parquet files

snoe925 commented on issue #133:
URL: https://github.com/apache/arrow-datafusion/issues/133#issuecomment-872375783

The Presto/Athena syntax is nice for declaring partitions without dynamic discovery on the filesystem.
I would like dynamic discovery to be the default, but Athena/Presto SQL also provides a way to declare explicit mappings.
This is perhaps a companion to the feature requested in this issue. The main benefit is potentially faster operation, since you don't have to scan the filesystem to discover partitions. A secondary benefit is that the same scheme can be used for version snapshot support. This is how delta-io works with Athena/Presto/Trino.
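
To make the distinction concrete, here is a rough Python sketch (not DataFusion code) of the two approaches. The function name `parse_hive_partition` and the example path are hypothetical, chosen only for illustration. Dynamic discovery has to derive partition values from paths found by listing storage, while an explicit mapping simply declares them.

```python
from pathlib import PurePosixPath


def parse_hive_partition(path: str) -> dict:
    """Extract key=value directory segments from a Hive-style path (hypothetical helper)."""
    values = {}
    for segment in PurePosixPath(path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key:  # skip segments such as "/" that are not key=value
            values[key] = value
    return values


# Dynamic discovery: values are inferred from a path found by listing storage.
discovered = parse_hive_partition("/id=a/id2=02/data.parquet")

# Explicit mapping: values are declared up front, so no listing is needed
# and the location is free to be any path.
explicit = {("a", "02"): "/id=a/id2=02/data.parquet"}
```

With the explicit form the location does not have to follow the key=value layout at all, which is presumably what makes it usable for pointing a table at specific snapshot files.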

Here is an example of the syntax. It definitely needs a Google Doc to outline the details.

I just wanted to comment to show how one can separate filesystem/storage discovery from the idea of partitions. This syntax is certainly easy to use in test cases, since the interaction is 100% SQL-based.

CREATE EXTERNAL TABLE users (
  first string,
  last string,
  username string
)
PARTITIONED BY (id string, id2 string) -- same as the create table column syntax
STORED AS PARQUET
-- omit LOCATION because we are going to explicitly partition with ALTER TABLE

ALTER TABLE users
ADD PARTITION (id='a', id2='02') LOCATION '/id=a/id2=02/data.parquet'
ADD PARTITION (id='a', id2='03') LOCATION '/id=a/id2=03/data.parquet'

Internally, this could perhaps be treated as a UNION ALL over one hidden table per partition.
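
A minimal Python sketch of that idea, under stated assumptions: `PARTITIONS`, `prune`, `scan`, and the `read_rows` callback are all hypothetical names, the locations are illustrative, and `read_rows` stands in for whatever actually reads a Parquet file. Each declared partition acts like a hidden table, the scan concatenates them (the UNION ALL), and the partition values are appended to every row as constant columns.

```python
# Explicitly declared partitions: (partition values, location). Illustrative only.
PARTITIONS = [
    ({"id": "a", "id2": "02"}, "/id=a/id2=02/data.parquet"),
    ({"id": "a", "id2": "03"}, "/id=a/id2=03/data.parquet"),
]


def prune(partitions, filters):
    """Keep only partitions whose declared values satisfy every equality filter."""
    return [
        (values, location)
        for values, location in partitions
        if all(values.get(col) == val for col, val in filters.items())
    ]


def scan(partitions, read_rows):
    """Concatenate per-partition scans, adding partition values as constant columns."""
    for values, location in partitions:
        for row in read_rows(location):
            yield {**row, **values}


# Stand-in for a Parquet reader: maps a location to the rows stored there.
fake_storage = {
    "/id=a/id2=02/data.parquet": [{"username": "u1"}],
    "/id=a/id2=03/data.parquet": [{"username": "u2"}],
}

all_rows = list(scan(PARTITIONS, fake_storage.__getitem__))
pruned_rows = list(scan(prune(PARTITIONS, {"id2": "03"}), fake_storage.__getitem__))
```

Because the partition values come from the declaration rather than from the files, a filter on `id` or `id2` can skip whole locations without opening them.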
