Posted to issues@spark.apache.org by "Russell Spitzer (JIRA)" <ji...@apache.org> on 2016/07/18 22:34:20 UTC

[jira] [Created] (SPARK-16616) Allow Catalyst to take Advantage of Hash Partitioned DataSources

Russell Spitzer created SPARK-16616:
---------------------------------------

             Summary: Allow Catalyst to take Advantage of Hash Partitioned DataSources
                 Key: SPARK-16616
                 URL: https://issues.apache.org/jira/browse/SPARK-16616
             Project: Spark
          Issue Type: New Feature
            Reporter: Russell Spitzer


Many distributed databases provide hash-partitioned data (in contrast to data partitioned on the raw values of a specific column), and this information can be used to greatly improve Spark performance.

For example: 

Data within Cassandra is distributed based on a hash of the "partition key", which is a set of columns. This means all rows read from the database that share the same partition key will end up in the same Spark partition. When these rows are joined with themselves or aggregated on the partition key columns, there is no need to do a shuffle.

{code}
CREATE TABLE purchases (UserID int, purchase int, amount int, PRIMARY KEY (UserID, purchase))
{code}

Would internally (using the Spark Cassandra Connector) produce an RDD that looks like:

{code}
Spark Partition 1 :  (1, 1, 5), (1, 2, 6), (432, 1, 10) .... 
Spark Partition 2 :  (2, 1, 4), (2, 2, 5), (700, 1, 1) ...
{code}

Here, all values for {{UserID}} 1 are in the first partition, but the values contained within Spark Partition 1 do not cover a contiguous range of {{UserID}} values.
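For example (a hedged sketch: it assumes a spark-shell style {{sqlContext}} with the connector on the classpath, and the keyspace name {{shop}} is made up for illustration; the format string and option names follow the connector's documented DataFrame read path), both queries below are keyed on the partition key, so every group and every matching join row is already co-located in a single Spark partition:

{code}
// Read the "purchases" table defined above through the Spark Cassandra
// Connector's DataFrame source ("shop" is a hypothetical keyspace).
val purchases = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "shop", "table" -> "purchases"))
  .load()

// Both operations are keyed on the Cassandra partition key (UserID), so all
// rows for a given key already live in one Spark partition. Today Catalyst
// still inserts a shuffle for them; with partitioning info it would not.
val totals   = purchases.groupBy("UserID").sum("amount")
val selfJoin = purchases.as("a").join(purchases.as("b"), Seq("UserID"))
{code}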

As with normal RDDs, it would be nice if we could expose a partitioning function that, given the key values, indicates which partition a row resides in. This information could also be used in aggregates and joins.
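Concretely, this might look like a small contract a data source can implement (a minimal sketch; the trait and method names below are invented for illustration and are not an existing Spark or connector API). Catalyst physical plans already report an {{outputPartitioning}}, and the {{EnsureRequirements}} rule inserts an {{Exchange}} only when a child's partitioning does not satisfy the required distribution, so a relation that reported its native hash partitioning could slot into that mechanism:

{code}
import org.apache.spark.sql.Row

// Hypothetical contract a hash-partitioned source could implement so that
// Catalyst can recognize the existing layout and skip unnecessary shuffles.
trait SourceHashPartitioning {
  /** Columns the source hashes on (e.g. Cassandra's partition key). */
  def partitionKeyColumns: Seq[String]

  /** Total number of Spark partitions produced by the scan. */
  def numPartitions: Int

  /** Given a row's partition key values, the Spark partition that holds it. */
  def partitionIdFor(key: Row): Int
}
{code}

A scan exposing this could then declare a concrete {{outputPartitioning}} over its partition key columns instead of defaulting to {{UnknownPartitioning}}, letting joins and aggregates on those columns skip the exchange.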


