Posted to user@spark.apache.org by Prashant Sharma <pr...@plume.com> on 2018/09/06 04:13:57 UTC

Spark Streaming RDD Cleanup too slow

I have a Spark Streaming job that takes too long to delete temporary RDDs. I
collect about 4 million telemetry metrics per minute and do minor aggregations
in the streaming job.

I am using Amazon R4 instances. The driver RPC call is asynchronous, but I
believe it is slow to return the handle to the future object at the "askAsync"
call. Here is the Spark code that does the cleanup:
https://github.com/apache/spark/blob/5264164a67df498b73facae207eda12ee133be7d/core/src/main/scala/org/apache/spark/storage/BlockManagerMaster.scala#L125
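
For reference, a minimal sketch of the cleaner setting that looks relevant
here, assuming the delay comes from the ContextCleaner thread waiting on each
removal before moving to the next one. spark.cleaner.referenceTracking.blocking
is a documented Spark property (default true); whether turning it off actually
helps this job is untested, and the application name below is a placeholder.

```scala
import org.apache.spark.SparkConf

// Sketch only, not a verified fix: ask the ContextCleaner thread not to
// block on each RDD/broadcast cleanup task. Untested against this workload.
val conf = new SparkConf()
  .setAppName("telemetry-streaming") // hypothetical application name
  .set("spark.cleaner.referenceTracking.blocking", "false")
```

The trade-off is that cleanup requests may queue up faster than executors
process them, so memory usage would need to be watched after the change.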

Has anyone else encountered a similar issue with their streaming jobs?
About 20% of our time (~60 secs) is spent cleaning up the temp RDDs.

Best,
Prashant