Posted to by zjzzjz <> on 2019/03/10 01:10:57 UTC

Optimize tables used more than once: make dataframe persistent or save as parquet

I heard Spark SQL is lazy: whenever a result table is referenced, Spark
recalculates the table :(

For example,

WITH tab0 AS (
   -- some complicated SQL that generates a table
   -- with a size of gigabytes or terabytes
),
tab1 AS (
   -- use tab0
),
tab2 AS (
   -- use tab0
),
...
tabn AS (
   -- use tab0
)
select * from tab1
join tab2 on ...
join tabn on ...

Spark could recalculate tab0 N times.

To avoid this, it is possible to save tab0 as a temp table. I found two
options:

1) save tab0 into parquet, then load it into a temp view
   (see "How does createOrReplaceTempView work in Spark?")

2) make tab0 persistent
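
As a rough sketch of what the two options might look like in Spark SQL:
the `events` view, its `key`/`value` columns, and the name `tab0_parquet`
are made up here for illustration, and the aggregate stands in for the
"complicated SQL" behind tab0. This only shows the mechanics, not which
option is faster.

```sql
-- Hypothetical input data, only so the sketch is self-contained.
CREATE OR REPLACE TEMPORARY VIEW events AS
SELECT * FROM VALUES ('a', 1), ('a', 2), ('b', 3) AS t(key, value);

-- Option 2: make tab0 persistent in Spark's cache.
-- CACHE TABLE ... AS registers tab0 as a temp view and caches its result
-- (eagerly by default; CACHE LAZY TABLE defers it until first use).
CACHE TABLE tab0 AS
SELECT key, SUM(value) AS total
FROM events
GROUP BY key;

-- Later references to tab0 can read the cached result.
SELECT t1.key, t1.total, t2.total
FROM tab0 t1
JOIN tab0 t2 ON t1.key = t2.key;

-- Release the cached data when it is no longer needed.
UNCACHE TABLE tab0;

-- Option 1: save the result as Parquet, then query the stored table.
CREATE TABLE tab0_parquet USING parquet AS
SELECT key, SUM(value) AS total
FROM events
GROUP BY key;

SELECT * FROM tab0_parquet;
```

With the cached variant, the data lives only for the current Spark
session; the Parquet table is written to storage and stays available
until it is dropped.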

Which one is better in terms of query speed?
