Posted to user@spark.apache.org by Haopu Wang <HW...@qilinsoft.com> on 2014/07/23 09:00:05 UTC
"spark.streaming.unpersist" and "spark.cleaner.ttl"
I have a DStream receiving data from a socket. I'm using local mode.
I set "spark.streaming.unpersist" to "false" and leave
"spark.cleaner.ttl" at infinite.
I can see files for input and shuffle blocks under the "spark.local.dir"
folder, and the size of the folder keeps increasing, although the JVM's
memory usage seems to be stable.
[question] In this case, the input RDDs are persisted but don't fit into
memory, so they are written to disk, right? And where can I see the
details about these RDDs? I don't see them in the web UI.
Then I set "spark.streaming.unpersist" to "true", and the size of the
"spark.local.dir" folder and the JVM's used heap size are reduced regularly.
[question] In this case, since I didn't change "spark.cleaner.ttl",
which component is doing the cleanup? And what difference would it make
if I set "spark.cleaner.ttl" to some duration in this case?
Thank you!
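One way to put numbers on the "folder keeps increasing" observation is to sample the total size of the directory between batches. The following is only a minimal sketch using the Python standard library; the function name dir_size_bytes is made up for illustration, and the path you pass in is whatever your own "spark.local.dir" points to.

```python
import os


def dir_size_bytes(root):
    """Sum the sizes of all files under root (hypothetical helper, not a Spark API)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                # A file can be deleted between the directory listing and the
                # size lookup, for example when old blocks are being cleaned up.
                pass
    return total
```

Calling it every few seconds and logging the result should show steady growth with "spark.streaming.unpersist" set to "false" and a sawtooth pattern with "true", if the behaviour described above holds on your setup.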
Re: "spark.streaming.unpersist" and "spark.cleaner.ttl"
Posted by Tathagata Das <ta...@gmail.com>.
Yeah, I wrote those lines a while back; I wanted to contrast storage
levels with and without serialization. I should have realized that
StorageLevel.MEMORY_ONLY_SER could be mistaken for the default level.
TD
RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"
Posted by "Shao, Saisai" <sa...@intel.com>.
Yeah, the documentation may not be precisely aligned with the latest code, so the best way is to check the code.
RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"
Posted by Haopu Wang <HW...@qilinsoft.com>.
Jerry, thanks for the response.
For the default storage level of DStream, it looks like Spark's documentation is wrong. See this link: http://spark.apache.org/docs/latest/streaming-programming-guide.html#memory-tuning
It mentions:
"Default persistence level of DStreams: Unlike RDDs, the default persistence level of DStreams serializes the data in memory (that is, StorageLevel.MEMORY_ONLY_SER for DStream compared to StorageLevel.MEMORY_ONLY for RDDs). Even though keeping the data serialized incurs higher serialization/deserialization overheads, it significantly reduces GC pauses."
I will take a look at DStream.scala although I have no Scala experience.
RE: "spark.streaming.unpersist" and "spark.cleaner.ttl"
Posted by "Shao, Saisai" <sa...@intel.com>.
Hi Haopu,
Please see the inline comments.
Thanks
Jerry
-----Original Message-----
From: Haopu Wang [mailto:HWang@qilinsoft.com]
Sent: Wednesday, July 23, 2014 3:00 PM
To: user@spark.apache.org
Subject: "spark.streaming.unpersist" and "spark.cleaner.ttl"
I have a DStream receiving data from a socket. I'm using local mode.
I set "spark.streaming.unpersist" to "false" and leave
"spark.cleaner.ttl" to be infinite.
I can see files for input and shuffle blocks under "spark.local.dir"
folder and the size of folder keeps increasing, although JVM's memory usage seems to be stable.
[question] In this case, because input RDDs are persisted but they don't fit into memory, so write to disk, right? And where can I see the details about these RDDs? I don't see them in web UI.
[answer] Yes, if there is not enough memory to hold the input RDDs, the data will be flushed to disk, because the default storage level is "MEMORY_AND_DISK_SER_2", as you can see in StreamingContext.scala. Also, you cannot see the input RDDs in the web UI; it only shows cached RDDs.
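To make the storage level names in this thread easier to compare, here is a small stand-in model in plain Python. It is not the Spark API: the namedtuple and the describe helper are made up for illustration, the off-heap flag of the real class is left out, and the flag values are meant to mirror how Spark's Scala StorageLevel constants are defined.

```python
from collections import namedtuple

# Toy stand-in for Spark's StorageLevel (hypothetical, for illustration only).
StorageLevel = namedtuple(
    "StorageLevel", ["use_disk", "use_memory", "deserialized", "replication"]
)

MEMORY_ONLY = StorageLevel(False, True, True, 1)            # plain RDD default
MEMORY_ONLY_SER = StorageLevel(False, True, False, 1)       # level named in the docs
MEMORY_AND_DISK_SER_2 = StorageLevel(True, True, False, 2)  # level named in the answer


def describe(level):
    """Summarize how blocks are held and what happens when memory runs out."""
    form = "deserialized objects" if level.deserialized else "serialized bytes"
    overflow = "spilled to disk" if level.use_disk else "evicted without a disk copy"
    return "{} in memory, {} when memory is full, {} replica(s)".format(
        form, overflow, level.replication
    )


if __name__ == "__main__":
    for name in ("MEMORY_ONLY", "MEMORY_ONLY_SER", "MEMORY_AND_DISK_SER_2"):
        print(name, "->", describe(globals()[name]))
```

Only the level with the disk flag set would explain block files appearing under "spark.local.dir", which is consistent with the behaviour reported in the question.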
Then I set "spark.streaming.unpersist" to "true", the size of "spark.local.dir" folder and JVM's used heap size are reduced regularly.
[question] In this case, because I didn't change "spark.cleaner.ttl", which component is doing the cleanup? And what's the difference if I set "spark.cleaner.ttl" to some duration in this case?
[answer] If you set "spark.streaming.unpersist" to true, old unused RDDs will be deleted, as you can see in DStream.scala. "spark.cleaner.ttl", by contrast, is a timer-based Spark cleaner that cleans not only streaming data but also broadcast, shuffle, and other data.
Thank you!
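The difference between the two cleanup paths described in the last answer can be sketched with a toy model. This is not Spark's code: every function name, the integer batch times, and the "remember" window are assumptions chosen for illustration. The first function imitates batch-driven cleanup, where each new batch forgets RDDs older than the window and drops their blocks only when unpersist is enabled. The second imitates a timer-based sweep that removes any tracked entry older than a TTL, whether or not it belongs to the stream.

```python
def clear_old_batches(generated, blocks, batch_time, remember, unpersist):
    """Batch-driven cleanup: forget RDDs outside the remember window.

    Blocks are removed only when unpersist is True; otherwise they stay behind.
    """
    old = sorted(t for t in generated if t <= batch_time - remember)
    for t in old:
        del generated[t]
        if unpersist:
            blocks.pop(t, None)
    return old


def ttl_sweep(entries, now, ttl):
    """Timer-based cleanup: drop every entry older than ttl, streaming or not."""
    expired = sorted(k for k, created in entries.items() if now - created > ttl)
    for k in expired:
        del entries[k]
    return expired


def run_stream(num_batches, remember, unpersist):
    """Simulate batches at times 1..num_batches; return surviving block times."""
    generated, blocks = {}, {}
    for t in range(1, num_batches + 1):
        generated[t] = "rdd-%d" % t
        blocks[t] = "block-%d" % t
        clear_old_batches(generated, blocks, t, remember, unpersist)
    return sorted(blocks)


if __name__ == "__main__":
    print(run_stream(5, 2, unpersist=False))  # blocks accumulate
    print(run_stream(5, 2, unpersist=True))   # only recent blocks remain
```

In this model, leaving unpersist off lets blocks pile up unless something like the TTL sweep eventually removes them, while turning it on bounds the block count without any timer.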