Posted to issues@spark.apache.org by "Yin Huai (JIRA)" <ji...@apache.org> on 2015/01/09 23:44:34 UTC

[jira] [Commented] (SPARK-5182) Partitioning support for tables created by the data source API

    [ https://issues.apache.org/jira/browse/SPARK-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14272002#comment-14272002 ] 

Yin Huai commented on SPARK-5182:
---------------------------------

Here is the doc from [~marmbrus].

Partitioning data by one or more columns is a very important optimization for many analytic workloads.  Right now, the implementation of partitioning in the Data Sources API suffers from several shortcomings.
First, each data source must implement partitioning support on its own, leading to code duplication.  This duplication applies both to the code for discovering / cataloging partitions and to the code required to evaluate predicates against a given partition.
Second, only a limited set of predicates is pushed down, so partitioning misses opportunities to prune.  While we can continue to expand the set of filters, this does not solve the underlying problem: each data source would still need to implement its own version of expression evaluation for every new (Filter x DataType) combination, as sketched below.
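
As an illustration, here is a minimal sketch (not code from the doc; evaluate and partitionValues are hypothetical names) of the per-source evaluation cascade the proposal aims to eliminate. It assumes the existing org.apache.spark.sql.sources filter classes:
{code}
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan}

// Hypothetical per-source helper: today every data source must hand-roll
// a cascade like this for each (Filter x DataType) combination it supports.
def evaluate(filter: Filter, partitionValues: Map[String, Any]): Boolean =
  filter match {
    case EqualTo(attr, value) =>
      partitionValues.get(attr).exists(_ == value)
    case GreaterThan(attr, value: Int) =>
      partitionValues.get(attr).exists(_.asInstanceOf[Int] > value)
    case GreaterThan(attr, value: String) =>
      partitionValues.get(attr).exists(_.asInstanceOf[String] > value)
    // ... one more case for every remaining filter type and column type.
    case _ => true // a filter we cannot evaluate must not prune anything
  }
{code}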

Requirements for the new API:
* Built-in support for telling a data source which partitions it should read, based on arbitrary predicates (including things like UDFs).
* Support for multiple levels of nested directories that store data based on partitioning attributes (e.g., /table/col1=a/col2=b).
* Rapid auto-discovery of large numbers of partitions.
* Discovery of partition column types using schema inference, similar to what is done for JSON (see the sketch after this list).
* Support for user-defined partitioning schemes? (e.g., /table/2001/02/03)
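
As one possible shape for the last two requirements, here is a minimal sketch with hypothetical helper names (the real logic would live behind parsePartitions in the interface below); the type classes are assumed to come from the org.apache.spark.sql.types package:
{code}
import scala.util.Try
import org.apache.spark.sql.types.{DataType, DoubleType, LongType, StringType}

// Parse one Hive-style partition path, e.g. /table/col1=a/col2=3,
// into (column name, raw string value) pairs.
def parsePartitionPath(path: String): Seq[(String, String)] =
  path.split("/").filter(_.contains("=")).map { segment =>
    val Array(name, value) = segment.split("=", 2)
    name -> value
  }

// Pick the narrowest type that fits every observed value for a column,
// in the spirit of JSON schema inference.
def inferType(values: Seq[String]): DataType =
  if (values.forall(v => Try(v.toLong).isSuccess)) LongType
  else if (values.forall(v => Try(v.toDouble).isSuccess)) DoubleType
  else StringType
{code}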

Proposed interface:
{code}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// A single partition: its column values plus the directory that stores it.
case class Partition(values: Row, path: String)

case class PartitionSpec(
    partitionColumns: StructType,
    partitions: Array[Partition])

abstract class PartitionedRelation {
  // Has a default implementation (Hive-style path discovery).
  def parsePartitions(paths: Array[String]): PartitionSpec

  def basePath: String

  // Called with only the partitions selected by partition pruning.
  def buildScan(
      partitions: Array[Partition],
      requiredColumns: Array[String],
      filters: Array[Filter]): RDD[Row]
}
{code}
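
To show how the pieces fit together, here is a hypothetical planner-side flow (pruneAndScan and partitionPredicate are illustrative names, not part of the proposal): arbitrary predicates, UDFs included, are compiled once into a Row => Boolean function and evaluated centrally against the cataloged partition values, so each data source only ever sees the partitions it must read.
{code}
// Hypothetical planner-side pruning: predicates are evaluated once against
// the cataloged partition values, so no data source re-implements this.
def pruneAndScan(
    relation: PartitionedRelation,
    spec: PartitionSpec,
    partitionPredicate: Row => Boolean, // compiled from Catalyst expressions
    requiredColumns: Array[String],
    filters: Array[Filter]): RDD[Row] = {
  val selected = spec.partitions.filter(p => partitionPredicate(p.values))
  relation.buildScan(selected, requiredColumns, filters)
}
{code}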
Open Questions:
* Is it okay to store all of the partition metadata in memory initially? Or should we consider storing this data locally in something like BDB (Berkeley DB)?
* Should we be using metastore partitioning instead?


> Partitioning support for tables created by the data source API
> --------------------------------------------------------------
>
>                 Key: SPARK-5182
>                 URL: https://issues.apache.org/jira/browse/SPARK-5182
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>            Reporter: Yin Huai
>            Priority: Blocker
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org