Posted to issues@spark.apache.org by "Wang, Gang (JIRA)" <ji...@apache.org> on 2018/09/12 02:51:00 UTC

[jira] [Created] (SPARK-25411) Implement range partition in Spark

Wang, Gang created SPARK-25411:
----------------------------------

             Summary: Implement range partition in Spark
                 Key: SPARK-25411
                 URL: https://issues.apache.org/jira/browse/SPARK-25411
             Project: Spark
          Issue Type: New Feature
          Components: SQL
    Affects Versions: 2.3.0
            Reporter: Wang, Gang


In our PROD environment, there are some partitioned fact tables, all of which are quite huge. To accelerate join execution, we need to make them bucketed as well. Then comes the problem: if the bucket number is large enough, there may be too many files (file count = bucket number * partition count), which puts pressure on HDFS. And if the bucket number is small, Spark will launch only that many tasks to read/write it, limiting parallelism.
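To make the file-count pressure concrete, here is a minimal arithmetic sketch (the partition and bucket numbers are hypothetical, chosen only for illustration):

```python
def file_count(partitions: int, buckets: int) -> int:
    """Upper bound on HDFS files when every partition contains every bucket."""
    return partitions * buckets

# A year of daily partitions combined with 1024 buckets already yields
# hundreds of thousands of files.
print(file_count(365, 1024))  # 373760
```

This is why neither a large nor a small bucket count works well once the table is also partitioned: the product grows with both factors.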

 

So, could we implement a new partition type that supports range values, just like range partitioning in Oracle/MySQL ([https://docs.oracle.com/cd/E17952_01/mysql-5.7-en/partitioning-range.html])? For example, we could partition by a date column and make every two months one partition, or partition by an integer column with an interval of 10000 per partition.
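The proposed mapping can be sketched as follows. This is a hypothetical illustration of the idea, not Spark's API: `range_partition` assigns an integer key to a partition by fixed interval, and `month_partition` shows the "every two months" variant:

```python
def range_partition(value: int, interval: int) -> int:
    """Index of the range partition containing `value` (interval of e.g. 10000)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return value // interval  # floor division handles negative keys consistently

def month_partition(year: int, month: int) -> int:
    """Index of a two-month range partition for a date's (year, month)."""
    return (year * 12 + (month - 1)) // 2

# With interval 10000: values 0..9999 -> partition 0, 10000..19999 -> 1, ...
print(range_partition(12345, 10000))  # 1
# Jan/Feb 2018 share a partition; Mar 2018 starts the next one.
print(month_partition(2018, 1) == month_partition(2018, 2))  # True
print(month_partition(2018, 3) - month_partition(2018, 2))   # 1
```

The key property is that the partition count is bounded by the value range divided by the interval, rather than by the number of distinct values.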

 

Ideally, a feature like range partitioning would be implemented in Hive. However, it has always been hard to upgrade the Hive version in a prod environment, and implementing it in Spark would be much more lightweight and flexible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org