Posted to user@spark.apache.org by Shi Yu <sh...@gmail.com> on 2016/10/06 15:50:58 UTC
Best practice for complicated SQL queries in Spark/Hive
Hello,
I wonder what the state-of-the-art best practice is, as of 2016, for getting the
best performance out of complicated SQL queries. I am new to this topic and
have read about:
Hive on Tez
Spark on Hive
Spark SQL 2.0 (it seems Spark 2.0 supports complicated nested queries)
The documentation I have read suggests Hive on Tez is the winner, but I wonder
whether the new Spark 2.0 improves performance enough to become the new
champion?
And FYI, the queries I am running are complicated nested queries, for instance:
with CTEA as (
    select hsld_dmi as hsld_dmi_A,
           cycle_num as cycle_num_A
    from TableA
    where stat_fl = 0.0
),
CTEB as (
    select hsld_dmi as hsld_dmi_B,
           cycle_num as cycle_num_B
    from TableB O
    where O.hsld_dmi in (
            select distinct hsld_dmi
            from TableC T
            where T.cycle_num = 282.0
              and T.stat_fl = 1.0
          )
      and O.stat_fl = 0.0
)
select cycle_num_B,
       count(distinct hsld_dmi_B) as attrition_count
from CTEB
left join CTEA
  on CTEB.hsld_dmi_B = CTEA.hsld_dmi_A
 and CTEB.cycle_num_B = CTEA.cycle_num_A + 1
where CTEA.hsld_dmi_A is NULL
group by CTEB.cycle_num_B
order by CTEB.cycle_num_B;
Does Spark 2.0 support this type of query?
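In case it helps clarify what the query is meant to compute: the left join plus
the IS NULL filter is an anti-join, counting, per cycle, the distinct households
in CTEB that have no matching CTEA row in the previous cycle. Here is a tiny
pure-Python sketch of that logic on made-up data (the hsld_dmi values and
cycles are hypothetical, and the stat_fl / TableC filters are assumed to have
been applied already):

```python
# Stand-ins for the two CTEs after their filters have been applied.
cte_a = {("h1", 1), ("h2", 1), ("h1", 2)}             # (hsld_dmi_A, cycle_num_A)
cte_b = [("h1", 2), ("h2", 2), ("h3", 2), ("h1", 3)]  # (hsld_dmi_B, cycle_num_B)

# left join ... on (dmi match and cycle_num_B = cycle_num_A + 1)
# where CTEA.hsld_dmi_A is NULL
# i.e. keep only CTEB rows with no matching CTEA row in the previous cycle.
attrition = {}
for dmi, cyc in cte_b:
    if (dmi, cyc - 1) not in cte_a:
        attrition.setdefault(cyc, set()).add(dmi)

# count(distinct hsld_dmi_B) group by cycle_num_B order by cycle_num_B
result = sorted((cyc, len(dmis)) for cyc, dmis in attrition.items())
print(result)  # [(2, 1)] -- only h3 appears in cycle 2 without a cycle-1 row
```

So on this toy data the answer would be an attrition count of 1 for cycle 2.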
Best,
Shi