Posted to user@spark.apache.org by Shi Yu <sh...@gmail.com> on 2016/10/06 15:50:58 UTC

Best practice for complicated SQL queries in Spark/Hive

Hello,

I wonder what the state-of-the-art best practice is for getting the best
performance out of complicated SQL queries today, in 2016. I am new to this
topic and have read about:

Hive on Tez
Spark on Hive
Spark SQL 2.0 (it seems Spark 2.0 supports complicated nested queries)

The documentation I have read suggests Hive on Tez is the winner, but I wonder
whether the new Spark 2.0 improves performance and becomes the new champion.
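
For reference, here is a minimal sketch of how I would submit such a query
through Spark 2.0's new SparkSession entry point. The appName and the
enableHiveSupport() call are my assumptions, so that Spark SQL reads the same
Hive metastore tables that Hive on Tez would see:

    import org.apache.spark.sql.SparkSession

    // Spark 2.0 unified entry point; enableHiveSupport() exposes the
    // existing Hive metastore tables (TableA/TableB/TableC below) to Spark SQL.
    val spark = SparkSession.builder()
      .appName("nested-cte-query")
      .enableHiveSupport()
      .getOrCreate()

    // The full CTE query shown below would go in verbatim.
    val result = spark.sql("""with CTEA as ( ... ) select ... """)
    result.show()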


And FYI, the query I am running is a complicated nested query, for instance:


with CTEA as (
    select hsld_dmi as hsld_dmi_A,
           cycle_num as cycle_num_A
    from TableA
    where stat_fl = 0.0
),
CTEB as (
    select hsld_dmi as hsld_dmi_B,
           cycle_num as cycle_num_B
    from TableB O
    where O.hsld_dmi in (
            select distinct hsld_dmi
            from TableC T
            where T.cycle_num = 282.0 and T.stat_fl = 1.0
          )
      and O.stat_fl = 0.0
)
select cycle_num_B, count(distinct hsld_dmi_B) as attrition_count
from CTEB
left join CTEA on (
    CTEB.hsld_dmi_B = CTEA.hsld_dmi_A
    and CTEB.cycle_num_B = CTEA.cycle_num_A + 1
)
where CTEA.hsld_dmi_A is NULL
group by CTEB.cycle_num_B
order by CTEB.cycle_num_B;

Does Spark 2.0 support this type of query?
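
In case it helps frame the question: the same logic can also be written
against Spark 2.0's DataFrame API, where the IN subquery becomes a left_semi
join and the left join plus IS NULL filter becomes a left_anti join. This is
only a sketch, reusing the spark session from above; the table and column
names are taken from the query:

    import org.apache.spark.sql.functions.countDistinct
    import spark.implicits._

    // CTEA: households with stat_fl = 0.0 in TableA
    val cteA = spark.table("TableA")
      .where("stat_fl = 0.0")
      .select($"hsld_dmi".as("hsld_dmi_A"), $"cycle_num".as("cycle_num_A"))

    // Distinct keys from TableC, i.e. the IN subquery
    val keysC = spark.table("TableC")
      .where("cycle_num = 282.0 and stat_fl = 1.0")
      .select("hsld_dmi")
      .distinct()

    // CTEB: TableB rows whose hsld_dmi appears in keysC
    // (a semi join plays the role of the IN predicate)
    val cteB = spark.table("TableB")
      .where("stat_fl = 0.0")
      .join(keysC, Seq("hsld_dmi"), "left_semi")
      .select($"hsld_dmi".as("hsld_dmi_B"), $"cycle_num".as("cycle_num_B"))

    // left join + IS NULL filter is an anti join: keep CTEB rows with
    // no CTEA match at the previous cycle
    val attrition = cteB
      .join(cteA,
        cteB("hsld_dmi_B") === cteA("hsld_dmi_A") &&
        cteB("cycle_num_B") === cteA("cycle_num_A") + 1,
        "left_anti")
      .groupBy("cycle_num_B")
      .agg(countDistinct("hsld_dmi_B").as("attrition_count"))
      .orderBy("cycle_num_B")

    attrition.show()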

Best,

Shi