Posted to user@spark.apache.org by NathanMarin <na...@teads.tv> on 2015/03/26 15:40:12 UTC
[Spark Streaming] Disk not being cleaned up during runtime after RDD being processed
Hi,
I’ve been trying to use Spark Streaming for my real-time analysis
application, using the Kafka stream API on a cluster (running on YARN)
of 6 executors with 4 dedicated cores and 8192 MB of dedicated RAM.
The thing is, my application should run 24/7, but its disk usage keeps
growing. This leads to exceptions when Spark tries to write to a file
system where no space is left.
Here are some graphs showing the disk space remaining on a node where my
application is deployed:
http://i.imgur.com/vdPXCP0.png
The "drops" occurred at 3-minute intervals.
The disk usage goes back to normal once I kill my application:
http://i.imgur.com/ERZs2Cj.png
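For anyone who wants to reproduce this kind of measurement on a node, here is
a minimal sketch, assuming Python 3 is available there. It is not how the
graphs above were produced; the function names and the "/" path are
placeholders chosen for illustration.

```python
import shutil
import time


def free_bytes(path="/"):
    """Return the free space, in bytes, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free


def sample_free_space(path="/", samples=3, interval_s=1.0):
    """Take `samples` readings of free space, sleeping `interval_s` between them."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), free_bytes(path)))
        if i < samples - 1:
            time.sleep(interval_s)
    return readings


if __name__ == "__main__":
    # "/" is a placeholder; point it at the disk the application writes to.
    for timestamp, free in sample_free_space("/", samples=3, interval_s=1.0):
        print(f"{timestamp:.0f}\t{free / 1024 ** 3:.2f} GiB free")
```

Logging these readings over a longer period, while the application runs and
after it is killed, should show the same pattern as the graphs if the disk is
filling up in the same way.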
The persistence level of my RDD is MEMORY_AND_DISK_SER_2, but even when I
tried MEMORY_ONLY_SER_2 the same thing happened.
My question is: how can I force Spark (Streaming?) to remove whatever it
stores immediately after it has processed it? The disk clearly isn’t being
cleaned up (even though the memory is), even though I call the
rdd.unpersist() method on each RDD processed.
Here’s a sample of my application code:
http://pastebin.com/K86LE1J6
Maybe something is wrong in my app too?
Thanks for your help,
NM
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Disk-not-being-cleaned-up-during-runtime-after-RDD-being-processed-tp22240.html