Posted to user@spark.apache.org by Henry Tremblay <pa...@gmail.com> on 2017/02/14 08:36:55 UTC
wholeTextfiles not parallel, runs out of memory
When I use wholeTextFiles, Spark does not run in parallel, and YARN runs
out of memory.
I have documented the steps below. First I copy 6 S3 files to HDFS. Then
I create an RDD with:
sc.wholeTextFiles("/mnt/temp")
Then I process the files line by line with a simple function. When I
look at my nodes, I see that only one executor is running. (I assume the
other node is the name node?) I then get an error message that YARN has
run out of memory.
Steps below:
========================
[hadoop@ip-172-31-40-213 mnt]$ hadoop fs -ls /mnt/temp
Found 6 items
-rw-r--r-- 3 hadoop hadoop 3684566 2017-02-14 07:58
/mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r-- 3 hadoop hadoop 3486510 2017-02-14 08:01
/mnt/temp/CC-MAIN-20170116095122-00571-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r-- 3 hadoop hadoop 3498649 2017-02-14 08:05
/mnt/temp/CC-MAIN-20170116095122-00572-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r-- 3 hadoop hadoop 4007644 2017-02-14 08:06
/mnt/temp/CC-MAIN-20170116095122-00573-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r-- 3 hadoop hadoop 3990553 2017-02-14 08:07
/mnt/temp/CC-MAIN-20170116095122-00574-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r-- 3 hadoop hadoop 3689213 2017-02-14 07:54
/mnt/temp/CC-MAIN-20170116095122-00575-ip-10-171-10-70.ec2.internal.warc.gz
In [6]: rdd1 = sc.wholeTextFiles("/mnt/temp")
In [7]: rdd1.count()
Out[7]: 6
from pyspark.sql import Row

def process_file(s):
    text = s[1]
    d = {}
    l = text.split("\n")
    final = []
    the_id = "init"
    for line in l:
        if line[0:15] == 'WARC-Record-ID:':
            the_id = line[15:]
        d[the_id] = line
        final.append(Row(**d))
    return final
In [8]: rdd2 = rdd1.map(process_file)
In [9]: rdd2.take(1)
17/02/14 08:25:25 ERROR YarnScheduler: Lost executor 2 on
ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for
exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider
boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:25:25 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:25:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID
3, ip-172-31-35-32.us-west-2.compute.internal, executor 2):
ExecutorLostFailure (executor 2 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5
GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
17/02/14 08:29:34 ERROR YarnScheduler: Lost executor 3 on
ip-172-31-45-106.us-west-2.compute.internal: Container killed by YARN
for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:29:34 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID
4, ip-172-31-45-106.us-west-2.compute.internal, executor 3):
ExecutorLostFailure (executor 3 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5
GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
17/02/14 08:29:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:33:44 ERROR YarnScheduler: Lost executor 4 on
ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for
exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider
boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:33:44 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID
5, ip-172-31-35-32.us-west-2.compute.internal, executor 4):
ExecutorLostFailure (executor 4 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5
GB of 5.5 GB physical memory used. Consider boosting
spark.yarn.executor.memoryOverhead.
17/02/14 08:33:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint:
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
--
Henry Tremblay
Robert Half Technology
Re: wholeTextfiles not parallel, runs out of memory
Posted by Jörn Franke <jo...@gmail.com>.
Well, 1) wholeTextFiles reads each whole file as a single record, so parallelism is limited to the number of files, and 2) you use .gz, i.e. you will have at most one task per file, because gzip is not splittable.
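For point 2), one hypothetical workaround is to store the data uncompressed before loading it (a sketch only, not a tested recipe; the output path is illustrative):

```shell
# gzip is not splittable, so Spark reads each .gz file in a single task.
# Decompressing into a new HDFS directory restores splittability.
# 'hadoop fs -put -' reads the stream from stdin.
hadoop fs -cat /mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz \
  | gunzip \
  | hadoop fs -put - /mnt/uncompressed/CC-MAIN-20170116095122-00570.warc
```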
> On 14 Feb 2017, at 09:36, Henry Tremblay <pa...@gmail.com> wrote:
> [quoted original message snipped]
Re: wholeTextfiles not parallel, runs out of memory
Posted by Gourav Sengupta <go...@gmail.com>.
Hi Henry,
Any reason why you are not using dataframes?
Regards,
Gourav Sengupta
On Tue, Feb 14, 2017 at 8:36 AM, Henry Tremblay <pa...@gmail.com>
wrote:
> [quoted original message snipped]
Re: wholeTextfiles not parallel, runs out of memory
Posted by Henry Tremblay <pa...@gmail.com>.
My earlier screenshot didn't show the top of the web UI. I retook it,
and it shows that only 2 VCores were active. I would expect 6, one per
file.
I am using map because I need to add information to each line based on
previous lines. The data looks like this:
u'WARC/1.0',
u'WARC-Type: warcinfo',
u'WARC-Date: 2016-12-08T13:00:23Z',
u'WARC-Record-ID: <urn:uuid:1>',
u'Content-Length: 344',
u'Content-Type: application/warc-fields',
u'WARC-Filename:
CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz',
....
<html>
<head>
FIRST
....
u'WARC/1.0'
u'WARC-Record-ID: <urn:uuid:2>',
.....
<html>
SECOND
...
I want the results of the second rdd to be:
u'WARC-Record-ID: <urn:uuid:1>', <html>
u'WARC-Record-ID: <urn:uuid:1>', FIRST
.....
u'WARC-Record-ID: <urn:uuid:2>', <html>
u'WARC-Record-ID: <urn:uuid:2>', SECOND
...
I am trying to give structure, in the form of keywords, to an otherwise
flat file, then process the file further with flatMap, and finally turn
it into a DataFrame.
Thanks!
Henry
On 02/14/2017 04:21 AM, Koert Kuipers wrote:
> you have 6 files, so you should be able to use up to 6 cores (which
> means maybe only 1 executor is active if you have 6+ cores per
> executor). you cannot achieve any parallelism beyond 6.
>
> executors died because they exceeded yarn memory limits. this is not
> an out-of-memory error (although it does mean you are using all
> memory, not sure why, these files are rather small). anyhow the error
> is because executors are using more off-heap memory than yarn
> expected. you need to increase spark.yarn.executor.memoryOverhead to
> deal with this.
>
> also i am not familiar with python api, but shouldn't you use flatMap
> instead of map to go from file to lines?
>
> On Tue, Feb 14, 2017 at 3:36 AM, Henry Tremblay
> <paulhtremblay@gmail.com <ma...@gmail.com>> wrote:
> [quoted original message snipped]
--
Henry Tremblay
Robert Half Technology
Re: wholeTextfiles not parallel, runs out of memory
Posted by Koert Kuipers <ko...@tresata.com>.
you have 6 files, so you should be able to use up to 6 cores (which means
maybe only 1 executor is active if you have 6+ cores per executor). you
cannot achieve any parallelism beyond 6.
executors died because they exceeded yarn memory limits. this is not an
out-of-memory error (although it does mean you are using all memory, not
sure why, these files are rather small). anyhow the error is because
executors are using more off-heap memory than yarn expected. you need to
increase spark.yarn.executor.memoryOverhead to deal with this.
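for example (the value here is only an illustration; the right number
depends on your workload):

```shell
# Sketch: raise the off-heap overhead YARN grants each executor container.
# 1024 (MB) is an illustrative value, not a recommendation.
pyspark --conf spark.yarn.executor.memoryOverhead=1024
```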
also i am not familiar with python api, but shouldn't you use flatMap
instead of map to go from file to lines?
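the difference can be sketched in plain python, no spark required
(split_lines and the sample data are illustrative):

```python
def split_lines(kv):
    # kv is a (path, content) pair, as wholeTextFiles would produce
    path, text = kv
    return text.split("\n")

files = [("a.txt", "x\ny"), ("b.txt", "z")]

# map keeps one element per input: a list of lines per file
mapped = [split_lines(f) for f in files]
# flatMap concatenates the lists: one element per line
flat_mapped = [line for f in files for line in split_lines(f)]

print(mapped)       # [['x', 'y'], ['z']]
print(flat_mapped)  # ['x', 'y', 'z']
```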
On Tue, Feb 14, 2017 at 3:36 AM, Henry Tremblay <pa...@gmail.com>
wrote:
> [quoted original message snipped]