Posted to issues@spark.apache.org by "Sean Owen (JIRA)" <ji...@apache.org> on 2017/05/20 09:37:04 UTC

[jira] [Commented] (SPARK-20809) PySpark: Java heap space issue despite apparently being within memory limits

    [ https://issues.apache.org/jira/browse/SPARK-20809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16018384#comment-16018384 ] 

Sean Owen commented on SPARK-20809:
-----------------------------------

You're setting driver memory in your program -- but that happens after the driver JVM has already launched, so the setting has no effect. You need to look at the driver memory you actually allocated, which is probably only 512m.
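Driver memory has to be fixed before the driver JVM starts, for example on the spark-submit command line or in spark-defaults.conf. A minimal sketch, reusing the paths from the report below:

{noformat}
<system path>/spark-2.1.1-bin-hadoop2.7/bin/spark-submit \
    --driver-memory 16g \
    --executor-memory 16g \
    --conf spark.driver.maxResultSize=0 \
    <home directory>/writeTest.py
{noformat}

(As an aside, the script's spark.driver.maxResultsSize and spark.executor.memory_overhead are not recognized Spark 2.1 property names; the standard ones are spark.driver.maxResultSize and, on YARN, spark.yarn.executor.memoryOverhead.)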
Also, it's not clear the data is really just 1.2g. How big are the sentences?
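One way to check, outside Spark, is to measure the generated text directly in Python before calling parallelize. A minimal sketch, assuming the same loremipsum-based generation as in writeTest.py below (the pickled batches that PythonRDD.readRDDFromFile loads into the driver heap will be somewhat larger than the raw text):

{code}
import loremipsum

# Same text generation as writeTest.py: 600 sentences per row,
# 500 distinct rows, repeated 100 times by the list multiplication.
sentences = [' '.join(s[2] for s in loremipsum.generate_sentences(600))
             for _ in range(500)]
avg_row = sum(len(s) for s in sentences) / len(sentences)
print("avg text bytes per row: %d" % avg_row)
print("approx total text bytes for 50000 rows: %d" % (avg_row * 500 * 100))
{code}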

> PySpark: Java heap space issue despite apparently being within memory limits
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-20809
>                 URL: https://issues.apache.org/jira/browse/SPARK-20809
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 2.1.1
>         Environment: Linux x86_64
>            Reporter: James Porritt
>
> I have the following script:
> {code}
> import itertools
> import loremipsum
> from pyspark import SparkContext, SparkConf
> from pyspark.sql import SparkSession
> conf = SparkConf().set("spark.cores.max", "16") \
>     .set("spark.driver.memory", "16g") \
>     .set("spark.executor.memory", "16g") \
>     .set("spark.executor.memory_overhead", "16g") \
>     .set("spark.driver.maxResultsSize", "0")
> sc = SparkContext(appName="testRDD", conf=conf)
> ss = SparkSession(sc)
> j = itertools.cycle(range(8))
> rows = [(i, j.next(), ' '.join(map(lambda x: x[2], loremipsum.generate_sentences(600)))) for i in range(500)] * 100
> rrd = sc.parallelize(rows, 128)
> {code}
> When I run it with:
> {noformat}
> <system path>/spark-2.1.1-bin-hadoop2.7/bin/spark-submit <home directory>/writeTest.py
> {noformat}
> it fails with a 'Java heap space' error:
> {noformat}
> py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.readRDDFromFile.
> : java.lang.OutOfMemoryError: Java heap space
>         at org.apache.spark.api.python.PythonRDD$.readRDDFromFile(PythonRDD.scala:468)
>         at org.apache.spark.api.python.PythonRDD.readRDDFromFile(PythonRDD.scala)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:497)
>         at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>         at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>         at py4j.Gateway.invoke(Gateway.java:280)
>         at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>         at py4j.commands.CallCommand.execute(CallCommand.java:79)
>         at py4j.GatewayConnection.run(GatewayConnection.java:214)
>         at java.lang.Thread.run(Thread.java:745)
> {noformat}
> The data I create here approximates my actual data. The third element of each tuple should be around 25k, and there are 50k tuples overall, so I estimate around 1.2G of data in total.
> Why then does it fail? All parts of the system should have enough memory.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org