Posted to user@spark.apache.org by Gavin Ray <ra...@gmail.com> on 2022/05/19 01:21:22 UTC

[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)

I did some basic testing of multi-source queries with the most recent Spark (3.4.0-SNAPSHOT):
https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115

The output of "spark.time()" surprised me:

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 1

+---+----+---+------+
| id|name| id| title|
+---+----+---+------+
|  1| Bob|  1|Todo 1|
|  1| Bob|  2|Todo 2|
+---+----+---+------+
Time taken: 168 ms

SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 2
LIMIT 1

+---+-----+---+------+
| id| name| id| title|
+---+-----+---+------+
|  2|Alice|  3|Todo 3|
+---+-----+---+------+
Time taken: 228 ms


Calcite and Teiid manage to do this on the order of 5-50ms for basic
queries, so I'm curious about the technical specifics of why Spark appears
to be so much slower here.