Posted to user@spark.apache.org by Gourav Sengupta <go...@gmail.com> on 2017/07/27 19:04:07 UTC

SPARK Storagelevel issues

Hi,

I cached a table in a large EMR cluster and it takes up 62 MB, so I know
the size of the table when cached.

But when I try to cache the same table in a smaller cluster, which still
has a total of 3 GB of driver memory and two executors with close to 2.5 GB
of memory, the job keeps failing with JVM out-of-memory errors.

Is there something that I am missing?

CODE:
=================================================================
import pyspark
from pyspark.sql import SparkSession

# Kryo serialisation, compressed RDD storage, and GC logging on the executors
sparkSession = SparkSession.builder \
    .config("spark.rdd.compress", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.executor.extraJavaOptions",
            "-XX:+UseCompressedOops -XX:+PrintGCDetails -XX:+PrintGCTimeStamps") \
    .appName("test").enableHiveSupport().getOrCreate()

testdf = sparkSession.sql("select * from tablename")
testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
=================================================================

This causes a JVM out-of-memory error.
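
For reference, persist() on its own is lazy, so the error only surfaces once
an action runs. A minimal sketch of the step that actually builds the cache
(the count() is only there to force materialisation, and MEMORY_ONLY stands
in for the serialised level used above):

=================================================================
from pyspark.storagelevel import StorageLevel

testdf = sparkSession.sql("select * from tablename")
# persist() only marks the DataFrame for caching; nothing happens yet
testdf.persist(StorageLevel.MEMORY_ONLY)
# the first action builds the cached blocks, and this is where any
# memory error will show up
testdf.count()
=================================================================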


Regards,
Gourav Sengupta

Re: SPARK Storagelevel issues

Posted by 周康 <zh...@gmail.com>.
All right, I did not catch the point, sorry for that.
But you can take a snapshot of the heap and then analyse the heap dump with
MAT (Eclipse Memory Analyzer) or other tools.
From the code I cannot find any clue.
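
For example, the executors can be told to write a heap dump automatically
when they hit the OOM; a minimal sketch (the dump path is just a placeholder):

=================================================================
from pyspark.sql import SparkSession

# -XX:+HeapDumpOnOutOfMemoryError makes the executor JVM write a .hprof
# file when an OutOfMemoryError is thrown; the dump can then be opened
# in MAT or a similar analyser.  The path below is only a placeholder.
sparkSession = SparkSession.builder \
    .config("spark.executor.extraJavaOptions",
            "-XX:+HeapDumpOnOutOfMemoryError "
            "-XX:HeapDumpPath=/tmp/executor_oom.hprof") \
    .appName("test").enableHiveSupport().getOrCreate()
=================================================================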


Re: SPARK Storagelevel issues

Posted by Gourav Sengupta <go...@gmail.com>.
Hi,

I have done all of that, but my question is "why should 62 MB of data give a
memory error when we have over 2 GB of memory available?"

Therefore, what Zhoukang mentions is not pertinent at all.


Regards,
Gourav Sengupta


Re: SPARK Storagelevel issues

Posted by 周康 <zh...@gmail.com>.
Maybe the StorageLevel in
testdf.persist(pyspark.storagelevel.StorageLevel.MEMORY_ONLY_SER)
should be changed. Also check your config "spark.memory.storageFraction",
whose default value is 0.5.
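
A rough sketch along those lines, reusing the table from the original snippet
(the 0.6 value is only an example): let cached partitions spill to disk
instead of being memory-only, and optionally reserve a little more of the
unified memory region for storage.

=================================================================
from pyspark.sql import SparkSession
from pyspark.storagelevel import StorageLevel

# spark.memory.storageFraction (default 0.5) is the fraction of the
# unified memory region reserved for storage and protected from
# eviction; 0.6 here is just an illustrative value.
sparkSession = SparkSession.builder \
    .config("spark.memory.storageFraction", "0.6") \
    .appName("test").enableHiveSupport().getOrCreate()

testdf = sparkSession.sql("select * from tablename")
# MEMORY_AND_DISK writes partitions that do not fit in memory to disk
# rather than keeping everything on the JVM heap
testdf.persist(StorageLevel.MEMORY_AND_DISK)
testdf.count()  # trigger the cache
=================================================================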
