Posted to user@spark.apache.org by Zeming Yu <ze...@gmail.com> on 2017/05/07 04:10:16 UTC

how to check whether a spill to disk has happened or not

hi,

I'm running PySpark on my local PC in standalone mode.

After applying a window function to a DataFrame, I ran a groupBy query
on the same DataFrame. The groupBy query turned out to be very slow
(10+ minutes on a small data set). I then cached the DataFrame and
re-ran the same query, but it remained very slow.
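
Here's roughly the shape of what I'm doing (the input path, column
names and the window spec below are just placeholders, not my actual
code):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = spark.read.parquet("some_small_dataset.parquet")  # placeholder input

# window function: e.g. a row number per key, ordered by timestamp
w = Window.partitionBy("key_col").orderBy("ts_col")
df2 = df.withColumn("rn", F.row_number().over(w))

# cache before re-running the aggregation
df2.cache()

# the groupBy that stays slow even after caching
agg = (df2.groupBy("key_col")
          .agg(F.count("*").alias("cnt"),
               F.avg("value_col").alias("avg_val")))
agg.show()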

I can also hear the hard drive working - I assume the PC is busy
reading from and writing to disk. Is this an indication that the
DataFrame has spilled to disk?
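
Is there a way to confirm this from the Spark UI or the monitoring
REST API? For example, I'm guessing something like this (reading the
stage metrics off the local UI on the default port 4040) would show
any spill:

import requests

base = "http://localhost:4040/api/v1"
app_id = requests.get(base + "/applications").json()[0]["id"]

for stage in requests.get(base + "/applications/" + app_id + "/stages").json():
    print(stage["stageId"], stage["name"][:40],
          "memoryBytesSpilled:", stage["memoryBytesSpilled"],
          "diskBytesSpilled:", stage["diskBytesSpilled"])

Or is looking at the shuffle spill columns on the stage detail page of
the UI the right way to check?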

What's the best way to monitor what's actually happening? And how can
I prevent the spill from happening in the first place?
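
For example, would lowering spark.sql.shuffle.partitions (the default
of 200 seems like a lot for a small data set on a single PC) or giving
the driver more memory help? Something along these lines (just a guess
on my part):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")   # placeholder master URL
         # fewer shuffle partitions for a small local data set
         .config("spark.sql.shuffle.partitions", "8")
         # more driver memory; I understand this may need to be set via
         # spark-defaults.conf or spark-submit if the JVM is already running
         .config("spark.driver.memory", "4g")
         .getOrCreate())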

Thanks!