Posted to user@spark.apache.org by Gavin Ray <ra...@gmail.com> on 2022/05/19 01:21:22 UTC
[SQL] Why does a small two-source JDBC query take ~150-200ms with all optimizations (AQE, CBO, pushdown, Kryo, unsafe) enabled? (v3.4.0-SNAPSHOT)
I did some basic testing of multi-source queries with the most recent Spark (3.4.0-SNAPSHOT):
https://github.com/GavinRay97/spark-playground/blob/44a756acaee676a9b0c128466e4ab231a7df8d46/src/main/scala/Application.scala#L46-L115
The output of "spark.time()" surprised me:
SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 1
+---+----+---+------+
| id|name| id| title|
+---+----+---+------+
|  1| Bob|  1|Todo 1|
|  1| Bob|  2|Todo 2|
+---+----+---+------+
Time taken: 168 ms
SELECT p.id, p.name, t.id, t.title
FROM db1.public.person p
JOIN db2.public.todos t
ON p.id = t.person_id
WHERE p.id = 2
LIMIT 1
+---+-----+---+------+
| id| name| id| title|
+---+-----+---+------+
|  2|Alice|  3|Todo 3|
+---+-----+---+------+
Time taken: 228 ms
Calcite and Teiid manage to run basic queries like these in roughly 5-50 ms,
so I'm curious about the technical reasons Spark appears to be so much
slower here.