Posted to user@spark.apache.org by drewrobb <dr...@gmail.com> on 2017/06/13 23:18:15 UTC

Join pushdown on two external tables from the same external source?

I'm trying to figure out how to join multiple tables from a single external
source directly in Spark SQL. Say I do the following in Spark SQL:

CREATE OR REPLACE TEMPORARY VIEW t1 USING jdbc OPTIONS (url 'jdbc:...', dbtable 't1', ...);
CREATE OR REPLACE TEMPORARY VIEW t2 USING jdbc OPTIONS (url 'jdbc:...', dbtable 't2', ...);

SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10;

This query results in a full table scan of both t1 and t2 in my JDBC
source, with the join and the limit applied on the Spark side. That isn't
great, but it's understandable given how I have defined the tables. An
optimized query would only need to fetch 10 rows from the underlying
database.
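(You can confirm this from the physical plan; on the versions I've tried,
each view shows up as its own JDBCRelation scan, with the join and limit
handled by Spark:)

EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10;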

This would work using the Scala API (the closest Spark SQL equivalent I'm
aware of is sketched after the snippet):

spark.read.jdbc(
  "jdbc:...",
  "(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) AS t",
  new java.util.Properties())
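The nearest Spark SQL analogue I've found is to embed the whole query in
the dbtable option, since the JDBC source accepts a parenthesized subquery
in place of a table name. That pushes the join and limit down, but the
query still lives inside an option string (the view name joined below is
just for illustration):

CREATE OR REPLACE TEMPORARY VIEW joined USING jdbc OPTIONS (
  url 'jdbc:...',
  dbtable '(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) AS t'
);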

However, either way seems cumbersome for every query I might want to run
against the remote JDBC database: it requires writing each query as a
string, and it doesn't really use Spark SQL. Ideally, I would like
something like defining an entire database via the JDBC source, so that
any query touching only tables from that source could be pushed down
wholesale to the underlying database. Does anyone know a better approach
to this problem, or, more generally, a nicer way to integrate Spark SQL
with a remote database using some other approach or tool?
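In the meantime, one workaround is to register each pushed-down query as a
temporary view so that later Spark SQL queries at least see the result. A
minimal sketch, assuming a SparkSession named spark and the same
placeholder URL and tables as above (the view name t1_t2_top10 is
arbitrary):

import java.util.Properties

// Push the join and limit down to the database as one subquery,
// then expose the result to Spark SQL as a temporary view.
val joined = spark.read.jdbc(
  "jdbc:...",
  "(SELECT * FROM t1 JOIN t2 ON t1.id = t2.id LIMIT 10) AS t",
  new Properties())
joined.createOrReplaceTempView("t1_t2_top10")

// Follow-on queries run in Spark SQL, but only over the 10 rows
// already fetched; nothing else is pushed down.
spark.sql("SELECT * FROM t1_t2_top10").show()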




