Posted to user@spark.apache.org by Henry Tremblay <pa...@gmail.com> on 2017/02/14 08:36:55 UTC

wholeTextfiles not parallel, runs out of memory

When I use wholeTextFiles, spark does not run in parallel, and yarn runs 
out of memory.

I have documented the steps below. First I copy 6 s3 files to hdfs. Then 
I create an rdd by:


sc.wholeTextFiles("/mnt/temp")


Then I process the files line by line using a simple function. When I 
look at my nodes, I see only one executor is running. (I assume the 
other is the name node?) I then get an error message that yarn has run 
out of memory.


Steps below:

========================


[hadoop@ip-172-31-40-213 mnt]$ hadoop fs -ls /mnt/temp
Found 6 items
-rw-r--r--   3 hadoop hadoop    3684566 2017-02-14 07:58 
/mnt/temp/CC-MAIN-20170116095122-00570-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r--   3 hadoop hadoop    3486510 2017-02-14 08:01 
/mnt/temp/CC-MAIN-20170116095122-00571-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r--   3 hadoop hadoop    3498649 2017-02-14 08:05 
/mnt/temp/CC-MAIN-20170116095122-00572-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r--   3 hadoop hadoop    4007644 2017-02-14 08:06 
/mnt/temp/CC-MAIN-20170116095122-00573-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r--   3 hadoop hadoop    3990553 2017-02-14 08:07 
/mnt/temp/CC-MAIN-20170116095122-00574-ip-10-171-10-70.ec2.internal.warc.gz
-rw-r--r--   3 hadoop hadoop    3689213 2017-02-14 07:54 
/mnt/temp/CC-MAIN-20170116095122-00575-ip-10-171-10-70.ec2.internal.warc.gz


In [6]: rdd1 = sc.wholeTextFiles("/mnt/temp")

In [7]: rdd1.count()

Out[7]: 6

from pyspark.sql import Row

def process_file(s):
     # s is a (path, contents) pair from wholeTextFiles
     text = s[1]
     d = {}
     l = text.split("\n")
     final = []
     the_id = "init"
     for line in l:
         if line[0:15] == 'WARC-Record-ID:':
             the_id = line[15:]
         d[the_id] = line
         final.append(Row(**d))
     return final


In [8]: rdd2 = rdd1.map(process_file)
In [9]: rdd2.take(1)





17/02/14 08:25:25 ERROR YarnScheduler: Lost executor 2 on 
ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for 
exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:25:25 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:25:25 WARN TaskSetManager: Lost task 0.0 in stage 2.0 (TID 
3, ip-172-31-35-32.us-west-2.compute.internal, executor 2): 
ExecutorLostFailure (executor 2 exited caused by one of the running 
tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 
GB of 5.5 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
17/02/14 08:29:34 ERROR YarnScheduler: Lost executor 3 on 
ip-172-31-45-106.us-west-2.compute.internal: Container killed by YARN 
for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. 
Consider boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:29:34 WARN TaskSetManager: Lost task 0.1 in stage 2.0 (TID 
4, ip-172-31-45-106.us-west-2.compute.internal, executor 3): 
ExecutorLostFailure (executor 3 exited caused by one of the running 
tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 
GB of 5.5 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
17/02/14 08:29:34 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:33:44 ERROR YarnScheduler: Lost executor 4 on 
ip-172-31-35-32.us-west-2.compute.internal: Container killed by YARN for 
exceeding memory limits. 5.5 GB of 5.5 GB physical memory used. Consider 
boosting spark.yarn.executor.memoryOverhead.
17/02/14 08:33:44 WARN TaskSetManager: Lost task 0.2 in stage 2.0 (TID 
5, ip-172-31-35-32.us-west-2.compute.internal, executor 4): 
ExecutorLostFailure (executor 4 exited caused by one of the running 
tasks) Reason: Container killed by YARN for exceeding memory limits. 5.5 
GB of 5.5 GB physical memory used. Consider boosting 
spark.yarn.executor.memoryOverhead.
17/02/14 08:33:44 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB 
physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.

-- 
Henry Tremblay
Robert Half Technology


Re: wholeTextfiles not parallel, runs out of memory

Posted by Jörn Franke <jo...@gmail.com>.
Well 1) wholeTextFiles reads each file as a single record, so you get at most one task per file 2) you use .gz, which is not splittable, i.e. you will have only one task per file maximum
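Jörn's second point can be illustrated with the standard library alone: a gzip stream has no block boundaries a reader can seek to, so it must be decoded sequentially from byte 0, which is why Spark can never split one .gz file across tasks. A small sketch (plain Python, no Spark; the sample text is made up):

```python
import gzip

# compress some text the way the .warc.gz files are compressed
data = ("WARC/1.0\nWARC-Record-ID: <urn:uuid:1>\n" * 200).encode("utf-8")
blob = gzip.compress(data)

# reading from the start works: the stream is decoded sequentially
assert gzip.decompress(blob) == data

# reading from an arbitrary offset fails: there is no valid gzip header
# mid-stream, so one .gz file cannot be cut into multiple input splits
try:
    gzip.decompress(blob[10:])
    splittable = True
except OSError:
    splittable = False
assert splittable is False
```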

> On 14 Feb 2017, at 09:36, Henry Tremblay <pa...@gmail.com> wrote:
> 
> When I use wholeTextFiles, spark does not run in parallel, and yarn runs out of memory. 
> [...]

Re: wholeTextfiles not parallel, runs out of memory

Posted by Gourav Sengupta <go...@gmail.com>.
Hi Henry,

Any reason why you are not using dataframes?


Regards,
Gourav Sengupta

On Tue, Feb 14, 2017 at 8:36 AM, Henry Tremblay <pa...@gmail.com>
wrote:

> When I use wholeTextFiles, spark does not run in parallel, and yarn runs
> out of memory.
> [...]
>

Re: wholeTextfiles not parallel, runs out of memory

Posted by Henry Tremblay <pa...@gmail.com>.
My picture doesn't show the top of the web UI. I retook the picture 
(see below), and you can see that only 2 VCores were active. I would 
expect 6, one per file.

I am using map because I need to add information per line based on 
previous lines. The data looks like this:

u'WARC/1.0',
  u'WARC-Type: warcinfo',
  u'WARC-Date: 2016-12-08T13:00:23Z',
  u'WARC-Record-ID: <urn:uuid:1>',
  u'Content-Length: 344',
  u'Content-Type: application/warc-fields',
  u'WARC-Filename: 
CC-MAIN-20161202170900-00000-ip-10-31-129-80.ec2.internal.warc.gz',

....

<html>

<head>

FIRST

....

u'WARC/1.0'

u'WARC-Record-ID: <urn:uuid:2>',

.....

<html>

SECOND

...


I want the results of the second rdd to be:

u'WARC-Record-ID: <urn:uuid:1>', <html>

u'WARC-Record-ID: <urn:uuid:1>', FIRST

.....


u'WARC-Record-ID: <urn:uuid:2>', <html>

u'WARC-Record-ID: <urn:uuid:2>', SECOND

...


I am trying to give structure in the form of keywords to an otherwise 
flat file, then process the file further using flatMap, and then turn 
it into a data frame.
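The tagging described above can be sketched without Spark; a minimal version in plain Python, pairing each line with the most recently seen WARC-Record-ID (pyspark's Row is left out so it runs standalone, and the sample lines are made up):

```python
def tag_lines(text):
    """Pair every line with the WARC-Record-ID most recently seen."""
    the_id = "init"
    out = []
    for line in text.split("\n"):
        if line.startswith("WARC-Record-ID:"):
            the_id = line[len("WARC-Record-ID:"):].strip()
        out.append((the_id, line))
    return out

sample = "\n".join([
    "WARC/1.0",
    "WARC-Record-ID: <urn:uuid:1>",
    "FIRST",
    "WARC/1.0",
    "WARC-Record-ID: <urn:uuid:2>",
    "SECOND",
])

tagged = tag_lines(sample)
# lines after each header carry that record's id
assert tagged[2] == ("<urn:uuid:1>", "FIRST")
assert tagged[5] == ("<urn:uuid:2>", "SECOND")
```

On the cluster a function like this would go through flatMap rather than map, so the resulting RDD holds (id, line) pairs rather than one list per file.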


Thanks!

Henry






On 02/14/2017 04:21 AM, Koert Kuipers wrote:
> you have 6 files, so you should be able to use up to 6 cores (which 
> means maybe only 1 executor is active if you have 6+ cores per 
> executor). you cannot achieve any parallelism beyond 6.
>
> executors died because they exceeded yarn memory limits. this is not 
> an out-of-memory error (although it does mean you are using all 
> memory, not sure why, these files are rather small). anyhow the error 
> is because executors are using more off-heap memory than yarn 
> expected. you need to increase spark.yarn.executor.memoryOverhead to 
> deal with this.
>
> also i am not familiar with python api, but shouldn't you use flatMap 
> instead of map to go from file to lines?
>
> On Tue, Feb 14, 2017 at 3:36 AM, Henry Tremblay
> <paulhtremblay@gmail.com> wrote:
>
>     When I use wholeTextFiles, spark does not run in parallel, and
>     yarn runs out of memory.
>     [...]
>

-- 
Henry Tremblay
Robert Half Technology


Re: wholeTextfiles not parallel, runs out of memory

Posted by Koert Kuipers <ko...@tresata.com>.
you have 6 files, so you should be able to use up to 6 cores (which means
maybe only 1 executor is active if you have 6+ cores per executor). you
cannot achieve any parallelism beyond 6.

executors died because they exceeded yarn memory limits. this is not an
out-of-memory error (although it does mean you are using all memory, not
sure why, these files are rather small). anyhow the error is because
executors are using more off-heap memory than yarn expected. you need to
increase spark.yarn.executor.memoryOverhead to deal with this.
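The overhead setting mentioned above is passed at submit time; a sketch (the value and the script name `your_job.py` are illustrative, not a recommendation):

```shell
# raise the off-heap headroom YARN grants each executor container;
# 1024 (MB) is an illustrative value -- tune it to your workload
spark-submit \
  --conf spark.yarn.executor.memoryOverhead=1024 \
  your_job.py
```

Note that from Spark 2.3 onward the property was renamed to spark.executor.memoryOverhead.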

also i am not familiar with python api, but shouldn't you use flatMap
instead of map to go from file to lines?

On Tue, Feb 14, 2017 at 3:36 AM, Henry Tremblay <pa...@gmail.com>
wrote:

> When I use wholeTextFiles, spark does not run in parallel, and yarn runs
> out of memory.
> [...]
>