Posted to user@spark.apache.org by swetha <sw...@gmail.com> on 2015/10/16 21:02:19 UTC

How to put an object in cache forever in Streaming

Hi,

How can we keep a changing object in cache forever in Streaming? I know
that we can do rdd.cache, but I think the cached data would be cleaned up
if we set a TTL in Streaming. Our requirement is to have an object in
memory; the object would be updated every minute based on the records that
we get in our Streaming job.

Currently I am keeping that object in updateStateByKey. But my
updateStateByKey is tracking the real-time session information as well. So
my updateStateByKey state carries four fields that track session
information, plus this object that tracks the performance info separately.
I was thinking it may be too much to keep so much data in
updateStateByKey.
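
For concreteness, here is a minimal sketch of this kind of combined state
in updateStateByKey. All type, field, and variable names below are
hypothetical, invented for illustration; they are not from this thread:

    // Hypothetical state: session fields plus the performance object.
    case class SessionInfo(start: Long, lastSeen: Long, hits: Int)
    case class PerfInfo(totalLatencyMs: Long, events: Long)
    case class TrackedState(session: SessionInfo, perf: PerfInfo)

    // Fold one batch of (timestamp, latencyMs) records for a key into the
    // state carried over from previous batches.
    def update(batch: Seq[(Long, Long)],
               prev: Option[TrackedState]): Option[TrackedState] = {
      val zero = TrackedState(SessionInfo(0L, 0L, 0), PerfInfo(0L, 0L))
      val merged = batch.foldLeft(prev.getOrElse(zero)) {
        case (acc, (ts, latency)) =>
          TrackedState(
            SessionInfo(
              if (acc.session.start == 0L) ts else acc.session.start,
              math.max(acc.session.lastSeen, ts),
              acc.session.hits + 1),
            PerfInfo(acc.perf.totalLatencyMs + latency, acc.perf.events + 1))
      }
      Some(merged)
    }

    // keyed: DStream[(String, (Long, Long))], keyed by session id; note
    // that updateStateByKey requires ssc.checkpoint(...) to be set.
    // val state = keyed.updateStateByKey(update _)

One caveat worth keeping in mind: updateStateByKey touches every key's
state on every batch, so the larger the per-key state, the longer each
batch takes.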

Is it recommended to hold a lot of data using updateStateByKey?


Thanks,
Swetha



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-put-an-object-in-cache-for-ever-in-Streaming-tp25098.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to put an object in cache forever in Streaming

Posted by Tathagata Das <td...@databricks.com>.
That should also get cleaned up through the GC, though you may have to
explicitly run the GC periodically for faster cleanup.

RDDs are by definition split into partitions that are distributed across
the executors. When an RDD is cached, its partitions are stored in memory
across the executors, so any task in the application can read them, not
just tasks on one executor.
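
To illustrate both points (a sketch, assuming sc is the job's
SparkContext; the periodic-GC scheduler is just one possible way to act on
the suggestion above, not an official Spark facility):

    import java.util.concurrent.{Executors, TimeUnit}
    import org.apache.spark.storage.StorageLevel

    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)
    rdd.persist(StorageLevel.MEMORY_ONLY)
    rdd.count() // materializes the RDD; each executor caches the partitions it computed

    // The Storage tab of the web UI (or sc.getRDDStorageInfo) shows which
    // executors hold the cached partitions; tasks anywhere in the
    // application can read them.

    // Periodically trigger a GC on the driver so unreferenced RDDs,
    // shuffles, and broadcasts get noticed and cleaned up sooner.
    val gcTicker = Executors.newSingleThreadScheduledExecutor()
    gcTicker.scheduleAtFixedRate(new Runnable {
      def run(): Unit = System.gc()
    }, 10, 10, TimeUnit.MINUTES)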


Re: How to put an object in cache forever in Streaming

Posted by swetha kasireddy <sw...@gmail.com>.
What about cleaning up the temp data that gets generated by shuffles? We
have a lot of temporary data generated by shuffles in the /tmp folder;
that's why we are using a TTL. Also, if I keep an RDD in cache, is it
available across all the executors or just the one that cached it?
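
For context, shuffle temp files are written under spark.local.dir, which
defaults to /tmp. One alternative to a TTL is a configuration along these
lines (the path and app name are hypothetical):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("streaming-app")                // hypothetical name
      // Write shuffle/spill temp files to a larger disk instead of /tmp.
      .set("spark.local.dir", "/data/spark-tmp")  // hypothetical path
      // Keep the reference-tracking cleaner enabled (the default) instead
      // of setting spark.cleaner.ttl.
      .set("spark.cleaner.referenceTracking", "true")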


Re: How to put an object in cache forever in Streaming

Posted by Tathagata Das <td...@databricks.com>.
Setting a TTL is no longer recommended, because Spark works with the Java
GC to clean up anything (RDDs, shuffles, broadcasts, etc.) that is no
longer referenced.

So you can keep an RDD cached in Spark, and every minute unpersist the
previous one and cache a new one.
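
A minimal sketch of that rotation (the stream, element types, and merge
logic are hypothetical, just to show the cache/unpersist handoff):

    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.dstream.DStream

    def keepLatestCached(stream: DStream[(String, Long)]): Unit = {
      // The currently cached snapshot; replaced on every batch.
      var current: Option[RDD[(String, Long)]] = None
      stream.foreachRDD { batch =>
        val next = current match {
          case Some(prev) => prev.union(batch).reduceByKey(_ + _)
          case None       => batch.reduceByKey(_ + _)
        }
        next.cache()
        next.count() // materialize before letting go of the old snapshot
        current.foreach(_.unpersist(blocking = false)) // now GC-eligible
        current = Some(next)
      }
    }

Because each snapshot's lineage grows out of the previous one, a
long-running job would also checkpoint the snapshot periodically to
truncate the lineage.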

TD
