Posted to user@spark.apache.org by rindra <ri...@gmail.com> on 2014/07/19 22:01:13 UTC

Caching issue with msg: RDD block could not be dropped from memory as it does not exist

Hi,

I am working with a small dataset, about 13 MB, in the spark-shell. After
doing a groupBy on the RDD, I wanted to cache the RDD in memory, but I keep
getting these warnings:

scala> rdd.cache()
res28: rdd.type = MappedRDD[63] at repartition at <console>:28


scala> rdd.count()
14/07/19 12:45:18 WARN BlockManager: Block rdd_63_82 could not be dropped
from memory as it does not exist
14/07/19 12:45:18 WARN BlockManager: Putting block rdd_63_82 failed
14/07/19 12:45:18 WARN BlockManager: Block rdd_63_40 could not be dropped
from memory as it does not exist
14/07/19 12:45:18 WARN BlockManager: Putting block rdd_63_40 failed
res29: Long = 5

It seems that I cannot cache the data in memory, even though my local
machine has 16 GB of RAM and the data is only 13 MB split across 100
partitions.

How can I prevent this caching issue from happening? Thanks.

Rindra




Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

Posted by Rico <ri...@gmail.com>.
I figured out the issue. In fact, I had not realized that the data is
deserialized when it is loaded into memory. As a result, what looks like a
21 GB dataset occupies 77 GB in memory.

This is explained clearly in the tuning guide, in the section on
determining memory consumption:
http://spark.apache.org/docs/latest/tuning.html#determining-memory-consumption
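
For anyone hitting the same warnings, the measurement that guide suggests
is simply to cache the RDD, trigger an action, and read the estimated block
sizes that the storage layer logs. A minimal sketch (the file name is just
a placeholder):

    scala> val rdd = sc.textFile("data.txt").cache()
    scala> rdd.count()

With INFO logging enabled, the driver log then shows one MemoryStore line
per cached block with its estimated in-memory size; aggregating these gives
the deserialized footprint of the dataset.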





Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

Posted by rindra <ri...@gmail.com>.
Hello Andrew,

Thank you very much for your great tips. Your solution worked perfectly.

In fact, I was not aware that the right option for local mode is
--driver-memory 1g

Cheers,

Rindra



Re: Caching issue with msg: RDD block could not be dropped from memory as it does not exist

Posted by Andrew Or <an...@databricks.com>.
Hi Rindra,

Depending on what you're doing with your groupBy, you may end up inflating
your data quite a bit. Even though your machine has 16 GB, spark-shell by
default only uses 512 MB, and the amount used for storing blocks is only
60% of that (spark.storage.memoryFraction), so this space becomes ~300 MB.
That is still many multiples of the size of your dataset, but not orders of
magnitude larger. If you are running Spark 1.0+, you can increase the
amount of memory used by spark-shell by adding "--driver-memory 1g" as a
command line argument in local mode, or "--executor-memory 1g" in any other
mode.
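
For example, assuming a stock Spark 1.0 installation, the launch command
would look roughly like this (a sketch, not the only way to set these):

    # local mode: the driver hosts the cached blocks, so raise its memory
    $ ./bin/spark-shell --driver-memory 1g

    # standalone/YARN/Mesos: raise the executors' memory instead
    $ ./bin/spark-shell --executor-memory 1g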

(Also, it seems that you have set your log level to WARN. The cause is most
likely that the cache is not big enough, but setting the log level to INFO
will give you more information on the exact sizes being used by the storage
layer and by the individual blocks.)
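
One quick way to do that from inside the shell, assuming the stock log4j
setup that Spark ships with, is:

    scala> import org.apache.log4j.{Level, Logger}
    import org.apache.log4j.{Level, Logger}

    scala> Logger.getLogger("org.apache.spark.storage").setLevel(Level.INFO)

After this, the BlockManager and MemoryStore messages will include the
estimated block sizes. Alternatively, copy conf/log4j.properties.template
to conf/log4j.properties and set the root category to INFO.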

Andrew

