
Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:04:19 UTC

[jira] [Updated] (SPARK-21706) Support Custom PartitionSpec Provider for Kinesis Firehose or similar

     [ https://issues.apache.org/jira/browse/SPARK-21706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-21706:
---------------------------------
    Labels: bulk-closed custom firehose kinesis partition partitioning spark sql  (was: custom firehose kinesis partition partitioning spark sql)

> Support Custom PartitionSpec Provider for Kinesis Firehose or similar
> ---------------------------------------------------------------------
>
>                 Key: SPARK-21706
>                 URL: https://issues.apache.org/jira/browse/SPARK-21706
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.6.3, 2.1.1, 2.2.0
>            Reporter: Sebastian Herold
>            Priority: Major
>              Labels: bulk-closed, custom, firehose, kinesis, partition, partitioning, spark, sql
>
> Many people use Kinesis Firehose to ingest data into an S3-based data lake. Kinesis Firehose produces a directory layout like this:
> {code}
> s3://data-lake-bucket/my-prefix/2017/08/11/10/my-stream-2017-08-11-11-10-10
> s3://data-lake-bucket/my-prefix/2017/08/11/11/my-stream-2017-08-11-11-11-10
>   .
>   .
>   .
> s3://data-lake-bucket/my-prefix/2017/08/12/00/my-stream-2017-08-12-00-01-01
> {code}
> Spark, like Hive, does not support this kind of partitioning, because partition discovery expects {{key=value}} directory names. It would therefore be great to be able to configure a {{CustomPartitionDiscoverer}} or {{PartitionSpecProvider}} that supplies a custom partition mapping, so that a date range of files can be selected easily afterwards. Sadly, partition discovery is deeply integrated into {{DataSource}}.
> *Could this be encapsulated better, so that the default behaviour can be intercepted?*
> Another partition layout that I've seen a lot in this context is:
> {code}
> s3://data-lake-bucket/prefix/2017-08-11/file.1.json
> s3://data-lake-bucket/prefix/2017-08-11/file.2.json
>   .
>   .
>   .
> s3://data-lake-bucket/prefix/2017-08-12/file.1.json
> {code}
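
To make the requested mapping concrete, here is a minimal sketch in plain Python of what a custom partition mapping for the two layouts above might compute. It is not part of Spark and not a proposed implementation: the names partition_values and hourly_prefixes are hypothetical, and the regular expressions assume exactly the directory shapes shown in the examples. The second helper illustrates the workaround that is available without any Spark change, namely enumerating the hourly directories for a date range and passing them to the reader as an explicit list of paths.

```python
import re
from datetime import datetime, timedelta

# Hypothetical helpers for illustration only; nothing here is Spark API.
# Firehose-style layout: <prefix>/YYYY/MM/DD/HH/<file>
FIREHOSE_RE = re.compile(
    r"/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/(?P<hour>\d{2})/[^/]+$"
)
# Daily layout: <prefix>/YYYY-MM-DD/<file>
DAILY_RE = re.compile(
    r"/(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})/[^/]+$"
)


def partition_values(path):
    """Map an object path to partition values, or None if no layout matches."""
    for pattern in (FIREHOSE_RE, DAILY_RE):
        match = pattern.search(path)
        if match:
            return {key: int(value) for key, value in match.groupdict().items()}
    return None


def hourly_prefixes(base, start, end):
    """List Firehose-style hourly directories for start <= hour < end."""
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        prefixes.append(f"{base.rstrip('/')}/{current:%Y/%m/%d/%H}/")
        current += timedelta(hours=1)
    return prefixes
```

The list returned by hourly_prefixes could then be handed to a reader that accepts several input paths. The resulting DataFrame would still lack year, month, day and hour columns unless they are derived separately, which is the gap a pluggable partition provider would close.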


