Posted to issues@spark.apache.org by "Hyukjin Kwon (JIRA)" <ji...@apache.org> on 2019/05/21 04:21:29 UTC

[jira] [Updated] (SPARK-16614) DirectJoin with DataSource for SparkSQL

     [ https://issues.apache.org/jira/browse/SPARK-16614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-16614:
---------------------------------
    Labels: bulk-closed  (was: )

> DirectJoin with DataSource for SparkSQL
> ---------------------------------------
>
>                 Key: SPARK-16614
>                 URL: https://issues.apache.org/jira/browse/SPARK-16614
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Russell Spitzer
>            Priority: Major
>              Labels: bulk-closed
>
> Joins against some data sources can be improved by skipping the full scan and performing a series of point lookups instead.
> For example:
> {code}
> DataFrame A contains {key1, key5, key302, ... key50923423}
> DataFrame B is a source reading from a C* database with keys {key1, key2, key3, ...}
> a.join(b)
> {code}
> Currently this causes the entirety of DataFrame B to be read into memory before the join is performed. Instead, it would be useful to expose another API, {{DirectJoinSource}}, that allows connectors to provide a means of requesting a non-contiguous subset of keys from a DataSource.
> This kind of lookup would behave like the {{joinWithCassandraTable}} call in the Spark Cassandra Connector (see https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable).
> We find that this is much more useful when the end user requests only a small, well-defined set of records. I believe this could apply to a variety of data sources where reading the entire source is inefficient compared to point lookups.
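
The difference between the two join strategies described above can be illustrated with a minimal sketch in plain Python. This is not Spark code and not the proposed API: {{DirectJoinSource}} is modeled here as a hypothetical Protocol with a single lookup method, and InMemoryKeyValueSource, scan_join, and direct_join are illustrative names invented for this example. The in-memory dictionary only stands in for a key-addressable store such as a C* table.

```python
from typing import Dict, Iterable, Iterator, List, Protocol, Tuple

Row = Dict[str, object]
Joined = Tuple[str, Row, Row]


class DirectJoinSource(Protocol):
    """Hypothetical interface: a source that can fetch rows for arbitrary keys."""

    def lookup(self, keys: Iterable[str]) -> Iterator[Tuple[str, Row]]:
        ...


class InMemoryKeyValueSource:
    """Illustrative stand-in for a key-addressable store (assumption, not a real connector)."""

    def __init__(self, rows: Dict[str, Row]) -> None:
        self._rows = rows
        self.rows_read = 0  # rows handed back by the source, to compare strategies

    def scan(self) -> Iterator[Tuple[str, Row]]:
        # Full scan: every stored row is read, whether or not it is needed.
        for key, row in self._rows.items():
            self.rows_read += 1
            yield key, row

    def lookup(self, keys: Iterable[str]) -> Iterator[Tuple[str, Row]]:
        # Point lookups: only the requested keys are touched.
        for key in keys:
            row = self._rows.get(key)
            if row is not None:
                self.rows_read += 1
                yield key, row


def scan_join(left: Dict[str, Row], source: InMemoryKeyValueSource) -> List[Joined]:
    """Inner join that reads the whole source first, then matches keys."""
    right = dict(source.scan())
    return [(key, lrow, right[key]) for key, lrow in left.items() if key in right]


def direct_join(left: Dict[str, Row], source: DirectJoinSource) -> List[Joined]:
    """Inner join that asks the source only for the keys present on the left side."""
    return [(key, left[key], rrow) for key, rrow in source.lookup(left.keys())]


if __name__ == "__main__":
    stored = {f"key{i}": {"value": i} for i in range(1, 1001)}
    wanted = {"key1": {"tag": "a"}, "key5": {"tag": "b"}, "key302": {"tag": "c"}}

    scanning = InMemoryKeyValueSource(stored)
    direct = InMemoryKeyValueSource(stored)

    assert scan_join(wanted, scanning) == direct_join(wanted, direct)
    print("rows read by scan join:  ", scanning.rows_read)
    print("rows read by direct join:", direct.rows_read)
```

Both functions return the same joined rows; the only difference is how many rows the source has to hand back. In a real connector the lookups would presumably be batched and issued per partition, which this sketch does not attempt to model.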



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
