Posted to user@spark.apache.org by Paul Tremblay <pa...@gmail.com> on 2017/04/07 14:57:14 UTC

small job runs out of memory using wholeTextFiles

As part of my processing, I have the following code:

rdd = sc.wholeTextFiles("s3://paulhtremblay/noaa_tmp/", 100000)
rdd.count()

The s3 directory has about 8GB of data and 61,878 files. I am using Spark
2.1, running on EMR with 15 m3.xlarge nodes.

The job fails with this error:

: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 35532 in stage 0.0 failed 4 times, most recent failure: Lost task
35532.3 in stage 0.0 (TID 35543,
ip-172-31-36-192.us-west-2.compute.internal, executor 6):
ExecutorLostFailure (executor 6 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
7.4 GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.


I have run it dozens of times, increasing the number of partitions and
reducing the size of my data set (the original is 60GB), but get the same
error each time.
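
I have not yet tried raising the overhead the error suggests. As far as I
understand, spark.yarn.executor.memoryOverhead can be set before the
SparkContext is created, roughly like this (the 4096 MB value is just a
placeholder, not something I have tested):

from pyspark import SparkConf, SparkContext

# Raise the per-executor YARN off-heap overhead; the value is in MB
# (4096 is only an example, not a tested setting)
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "4096")
sc = SparkContext(conf=conf)

The same setting can also be passed on the command line when submitting,
e.g. --conf spark.yarn.executor.memoryOverhead=4096.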

In contrast, if I run a simple:

rdd = sc.textFile("s3://paulhtremblay/noaa_tmp/")
rdd.count()

The job finishes in 15 minutes, even with just 3 nodes.

Thanks

-- 
Paul Henry Tremblay
Robert Half Technology