Posted to user@spark.apache.org by Gireesh Puthumana <gi...@augmentiq.in> on 2016/01/29 05:10:02 UTC

Persisting DataFrames in transformation workflows

Hi All,

I am trying to run a series of three join transformations over a set of
DataFrames. After each transformation, I want to persist the resulting DF
and save it to a text file. The steps I am following are:

*Step0:*
Create DF1
Create DF2
Create DF3
Create DF4
(no persist no save yet)

*Step1:*
Create RESULT-DF1 by joining DF1 and DF2
Persist it to disk and memory
Save it to text file

*Step2:*
Create RESULT-DF2 by joining RESULT-DF1 and DF3
Persist it to disk and memory
Save it to text file

*Step3:*
Create RESULT-DF3 by joining RESULT-DF2 and DF4
Persist it to disk and memory
Save it to text file

*Observation:*
Number of tasks created at Step1 is 601
Number of tasks created at Step2 is 1004 (nothing was skipped)
Number of tasks created at Step3 is 1400 (400 tasks were skipped)

As a different approach, I broke the above steps into three separate runs, i.e.:

   - Start, load DF1 and DF2, do Step1, save RESULT-DF1 & exit
   - Start, load DF3, load RESULT-DF1 from file, do Step2, save RESULT-DF2
   & exit
   - Start, load DF4, load RESULT-DF2 from file, do Step3, save RESULT-DF3
   & exit

The latter approach runs faster.

*My question is:*

   1. Am I missing something on the persisting side in the first approach?
   2. Why didn't the Step2 run just reuse the persisted result from Step1
   instead of redoing all of its tasks, i.e., skip the 601 Step1 tasks out
   of the 1004?
   3. What are some good reads on best practices for implementing such
   series of transformation workflows?

Thanks in advance,
Gireesh