Posted to user@spark.apache.org by Vadim Semenov <va...@datadoghq.com> on 2017/12/19 14:45:26 UTC

Re: /tmp fills up to 100GB when using a window function

Spark doesn't remove intermediate shuffle files if they're part of the same
job.
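
For context, the blockmgr-* directories are created under spark.local.dir, which defaults to /tmp, so a common mitigation while a long job runs is to point that setting at a larger scratch volume. A minimal sketch, assuming a standalone deployment; the path below is a placeholder, not from the original thread:

    from pyspark.sql import SparkSession

    # Shuffle scratch (the blockmgr-* directories) goes under spark.local.dir,
    # which defaults to /tmp. /data/spark-scratch is a placeholder path.
    # Note: on YARN this setting is overridden by the node manager's local dirs.
    spark = (
        SparkSession.builder
        .appName("window-example")
        .config("spark.local.dir", "/data/spark-scratch")
        .getOrCreate()
    )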

On Mon, Dec 18, 2017 at 3:10 PM, Mihai Iacob <mi...@ca.ibm.com> wrote:

> This code generates files under /tmp...blockmgr... which do not get
> cleaned up after the job finishes.
>
> Is anything wrong with the code below? Or are there any known issues with
> Spark not cleaning up /tmp files?
>
>
> from pyspark.sql import Window
> from pyspark.sql.functions import rank
>
> window = Window.\
>               partitionBy('***', 'date_str').\
>               orderBy(sqlDf['***'])
>
> sqlDf = sqlDf.withColumn("***", rank().over(window))
> df_w_least = sqlDf.filter("***=1")
>
>
>
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics
>

Re: /tmp fills up to 100GB when using a window function

Posted by Vadim Semenov <va...@datadoghq.com>.
They're kept until an action (e.g. save/count/reduce) completes, or until
you explicitly truncate the DAG by checkpointing.

Spark needs to keep all shuffle files because, if some task/stage/node
fails, it then only has to recompute the missing partitions from the parts
that were already computed.
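
To force that truncation, DataFrame.checkpoint() materializes the result and cuts the lineage, after which the shuffle files behind it are no longer needed and can be cleaned up. A rough sketch against the code from the original post; the checkpoint directory and the column names (which were redacted as *** above) are placeholders:

    from pyspark.sql import Window
    from pyspark.sql.functions import rank

    # A checkpoint directory must be set first; the path is a placeholder.
    spark.sparkContext.setCheckpointDir("/data/checkpoints")

    # "key" and "ts" stand in for the redacted *** columns.
    window = Window.partitionBy("key", "date_str").orderBy("ts")
    ranked = sqlDf.withColumn("rnk", rank().over(window))

    # checkpoint() is eager by default: it runs the job, saves the result,
    # and truncates the DAG, so the earlier shuffle files can be removed.
    df_w_least = ranked.filter("rnk = 1").checkpoint()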

On Tue, Dec 19, 2017 at 10:08 AM, Mihai Iacob <mi...@ca.ibm.com> wrote:

> When does Spark remove them?
>
>
> Regards,
>
> *Mihai Iacob*
> DSX Local <https://datascience.ibm.com/local> - Security, IBM Analytics