Posted to user@spark.apache.org by bethesda <sw...@mac.com> on 2014/12/19 13:46:33 UTC

"Fetch Failure"

I have a job that runs fine on relatively small input datasets but then
reaches a threshold where I begin to consistently get "Fetch failure" for
the Failure Reason, late in the job, during a saveAsText() operation. 

The first error we are seeing on the "Details for Stage" page is
"ExecutorLostFailure"

My Shuffle Read is 3.3 GB, and that's the only thing that seems high. We have
three servers, and they are configured with 5g of memory for this job; the job
is running in spark-shell.  The first error in the shell is "Lost executor 2
on (servername): remote Akka client disassociated."

We are still trying to understand how best to diagnose jobs using the web UI,
so it's likely that there's some helpful info here that we just don't know
how to interpret -- is there any kind of "troubleshooting guide" beyond the
Spark Configuration page?  I don't know if I'm providing enough info here.

thanks.
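
For concreteness, a rough sketch of the kind of job described above; the thread
doesn't show the actual code, so the paths and the shuffle operation here are
made-up placeholders:

    // Hypothetical shape of the job: a shuffle followed by a text save, run
    // from spark-shell with ~5g executors. The "Fetch failure" appears while
    // the save stage pulls the shuffled data (the ~3.3 GB shuffle read) from
    // the other executors.
    val data = sc.textFile("hdfs:///input/...")
    val aggregated = data.map(line => (line.split("\t")(0), 1L))
                         .reduceByKey(_ + _)              // shuffle boundary
    aggregated.saveAsTextFile("hdfs:///output/...")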





Re: "Fetch Failure"

Posted by Stefano Ghezzi <st...@icteam.it>.
I've eliminated the fetch failures with these parameters (I don't know which
one was the right one for the problem), passed to spark-submit, running 1.2.0:

         --conf spark.shuffle.compress=false \
         --conf spark.file.transferTo=false \
         --conf spark.shuffle.manager=hash \
         --conf spark.akka.frameSize=50 \
         --conf spark.core.connection.ack.wait.timeout=600
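
For reference, the same settings can also be set programmatically on the
SparkConf before the context is created; a minimal sketch (keys and values
copied from the flags above, the app name is made up):

         import org.apache.spark.{SparkConf, SparkContext}

         // Equivalent of the spark-submit flags above, as a SparkConf.
         val conf = new SparkConf()
           .setAppName("shuffle-tuning-test")
           .set("spark.shuffle.compress", "false")
           .set("spark.file.transferTo", "false")
           .set("spark.shuffle.manager", "hash")
           .set("spark.akka.frameSize", "50")
           .set("spark.core.connection.ack.wait.timeout", "600")
         val sc = new SparkContext(conf)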

...but like the original poster I'm still unable to finish the job; now I'm
facing OOMs. Still trying, but at least the fetch failures are gone.

bye

On 23/12/2014 21:10, Chen Song wrote:
> I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) 
> but I am still seeing FetchFailedException.


-- 
____________________________________________________________
Stefano Ghezzi                     ICTeam S.p.A
Project Manager - PMP
tel     035 4232129                fax 035 4522034
email   stefano.ghezzi@icteam.it   url http://www.icteam.com
mobile  335 7308587
____________________________________________________________


Re: "Fetch Failure"

Posted by Chen Song <ch...@gmail.com>.
I tried both 1.1.1 and 1.2.0 (built against cdh5.1.0 and hadoop2.3) but I
am still seeing FetchFailedException.

On Mon, Dec 22, 2014 at 8:27 AM, steghe <st...@icteam.it> wrote:

> Which version of spark are you running?
>
> It could be related to this
> https://issues.apache.org/jira/browse/SPARK-3633
>
> fixed in 1.1.1 and 1.2.0


-- 
Chen Song

Re: "Fetch Failure"

Posted by steghe <st...@icteam.it>.
Which version of spark are you running?

It could be related to this
https://issues.apache.org/jira/browse/SPARK-3633

fixed in 1.1.1 and 1.2.0
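
For what it's worth, a quick way to check the running version from spark-shell
or the driver (sc is the SparkContext that spark-shell provides):

    println(sc.version)   // e.g. "1.1.0" or "1.2.0"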







Re: "Fetch Failure"

Posted by Jon Chase <jo...@gmail.com>.
Yes, same problem.

On Fri, Dec 19, 2014 at 11:29 AM, Sandy Ryza <sa...@cloudera.com>
wrote:

> Do you hit the same errors?  Is it now saying your containers are exceeding
> ~10 GB?

Re: "Fetch Failure"

Posted by Sandy Ryza <sa...@cloudera.com>.
Do you hit the same errors?  Is it now saying your containers are exceeding
~10 GB?

On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jo...@gmail.com> wrote:
>
> I'm actually already running 1.1.1.
>
> I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
> luck.  Still getting "ExecutorLostFailure (executor lost)".

Re: "Fetch Failure"

Posted by Jon Chase <jo...@gmail.com>.
Hmmm, I see this a lot (multiple times per second) in the stdout logs of my
application:

2014-12-19T16:12:35.748+0000: [GC (Allocation Failure) [ParNew:
286663K->12530K(306688K), 0.0074579 secs] 1470813K->1198034K(2063104K),
0.0075189 secs] [Times: user=0.03 sys=0.00, real=0.01 secs]
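
(Reading that line: the young generation dropped from about 280 MB to about
12 MB of its ~300 MB, and the whole heap went from roughly 1.4 GB to 1.15 GB
of a ~2 GB maximum, in about 7 ms. "Allocation Failure" is just the normal
trigger for a minor collection, not an error, and the heap is nowhere near
its limit; so the SIGTERM below presumably comes from YARN's physical-memory
check on the container rather than from a Java OutOfMemoryError.)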


And finally I see

2014-12-19 16:12:36,116 ERROR [SIGTERM handler]
executor.CoarseGrainedExecutorBackend (SignalLogger.scala:handle(57)) -
RECEIVED SIGNAL 15: SIGTERM

which I assume is coming from Yarn, after which the log contains this and
then ends:

Heap
 par new generation   total 306688K, used 23468K [0x0000000080000000,
0x0000000094cc0000, 0x0000000094cc0000)
  eden space 272640K,   4% used [0x0000000080000000, 0x0000000080abff10,
0x0000000090a40000)
  from space 34048K,  36% used [0x0000000092b80000, 0x00000000937ab488,
0x0000000094cc0000)
  to   space 34048K,   0% used [0x0000000090a40000, 0x0000000090a40000,
0x0000000092b80000)
 concurrent mark-sweep generation total 1756416K, used 1186756K
[0x0000000094cc0000, 0x0000000100000000, 0x0000000100000000)
 Metaspace       used 52016K, capacity 52683K, committed 52848K, reserved
1095680K
  class space    used 7149K, capacity 7311K, committed 7392K, reserved
1048576K







On Fri, Dec 19, 2014 at 11:16 AM, Jon Chase <jo...@gmail.com> wrote:

> I'm actually already running 1.1.1.
>
> I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
> luck.  Still getting "ExecutorLostFailure (executor lost)".

Re: "Fetch Failure"

Posted by Jon Chase <jo...@gmail.com>.
I'm actually already running 1.1.1.

I also just tried --conf spark.yarn.executor.memoryOverhead=4096, but no
luck.  Still getting "ExecutorLostFailure (executor lost)".



On Fri, Dec 19, 2014 at 10:43 AM, Rafal Kwasny <ra...@gmail.com>
wrote:

> Hi,
> Just upgrade to 1.1.1 - it was fixed some time ago
>
> /Raf

Re: "Fetch Failure"

Posted by sa...@cloudera.com.
Hi Jon,

The fix for this is to increase spark.yarn.executor.memoryOverhead to something greater than its default of 384.

This will increase the gap between the executor's heap size and what it requests from YARN. It's needed because JVMs take up some memory beyond their heap size.

-Sandy
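
For anyone following along, a minimal sketch of what that looks like; the
1024 MB value below is just an illustration, not a recommendation:

    import org.apache.spark.{SparkConf, SparkContext}

    // The YARN container has to hold the executor heap *plus* this overhead
    // (thread stacks, NIO buffers, JVM metadata, etc.). With the default of
    // 384 MB the container is only slightly larger than -Xmx, so any off-heap
    // growth can push the process over the container limit.
    val conf = new SparkConf()
      .set("spark.executor.memory", "6g")
      .set("spark.yarn.executor.memoryOverhead", "1024")   // in MB; default is 384
    val sc = new SparkContext(conf)

    // Or equivalently on the command line:
    //   --conf spark.yarn.executor.memoryOverhead=1024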


Re: "Fetch Failure"

Posted by Jon Chase <jo...@gmail.com>.
I'm getting the same error ("ExecutorLostFailure") - input RDD is 100k
small files (~2MB each).  I do a simple map, then keyBy(), and then
rdd.saveAsHadoopDataset(...).  Depending on the memory settings given to
spark-submit, the time before the first ExecutorLostFailure varies (more
memory == longer until failure) - but this usually happens after about 100
files being processed.
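
For reference, a minimal sketch of that shape of job; the file paths, the key
function, and the output types below are made-up placeholders, not the actual
code:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileOutputFormat, JobConf, TextOutputFormat}
    import org.apache.spark.SparkContext._   // pair-RDD functions on Spark 1.x

    // ~100k small files -> simple map -> keyBy() -> saveAsHadoopDataset(...)
    val keyed = sc.wholeTextFiles("s3://bucket/input/")           // (path, contents)
      .map { case (path, body) => body.trim }                      // "simple map" placeholder
      .keyBy(_.hashCode % 1000)                                    // keyBy() placeholder
      .map { case (k, v) => (new Text(k.toString), new Text(v)) }

    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.setOutputKeyClass(classOf[Text])
    jobConf.setOutputValueClass(classOf[Text])
    jobConf.setOutputFormat(classOf[TextOutputFormat[Text, Text]])
    FileOutputFormat.setOutputPath(jobConf, new Path("s3://bucket/output/"))

    keyed.saveAsHadoopDataset(jobConf)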

I'm running Spark 1.1.0 on AWS EMR with YARN.  It appears that YARN is
killing the executor because it thinks it's exceeding memory.  However, I can't
repro any OOM issues when running locally, no matter the size of the data
set.

It seems like YARN thinks the heap size is increasing, according to the YARN
logs:

2014-12-18 22:06:43,505 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.1 GB of 6.5 GB physical memory
used; 13.8 GB of 32.5 GB virtual memory used
2014-12-18 22:06:46,516 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory
used; 13.9 GB of 32.5 GB virtual memory used
2014-12-18 22:06:49,524 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.2 GB of 6.5 GB physical memory
used; 14.0 GB of 32.5 GB virtual memory used
2014-12-18 22:06:52,531 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.4 GB of 6.5 GB physical memory
used; 14.1 GB of 32.5 GB virtual memory used
2014-12-18 22:06:55,538 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory
used; 14.2 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 INFO
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Memory usage of ProcessTree 24273 for container-id
container_1418928607193_0011_01_000002: 6.5 GB of 6.5 GB physical memory
used; 14.3 GB of 32.5 GB virtual memory used
2014-12-18 22:06:58,549 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Process tree for container:
container_1418928607193_0011_01_000002 has processes older than 1 iteration
running over the configured limit. Limit=6979321856, current usage =
6995812352
2014-12-18 22:06:58,549 WARN
org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
(Container Monitor): Container
[pid=24273,containerID=container_1418928607193_0011_01_000002] is running
beyond physical memory limits. Current usage: 6.5 GB of 6.5 GB physical
memory used; 14.3 GB of 32.5 GB virtual memory used. Killing container.
Dump of the process-tree for container_1418928607193_0011_01_000002 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 24273 4304 24273 24273 (bash) 0 0 115630080 302 /bin/bash -c
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError='kill %p'
-Xms6144m -Xmx6144m  -verbose:gc -XX:+HeapDumpOnOutOfMemoryError
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4 1>
/mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stdout
2>
/mnt/var/log/hadoop/userlogs/application_1418928607193_0011/container_1418928607193_0011_01_000002/stderr
|- 24277 24273 24273 24273 (java) 13808 1730 15204556800 1707660
/usr/java/latest/bin/java -server -XX:OnOutOfMemoryError=kill %p -Xms6144m
-Xmx6144m -verbose:gc -XX:+HeapDumpOnOutOfMemoryError -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70
-Djava.io.tmpdir=/mnt1/var/lib/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1418928607193_0011/container_1418928607193_0011_01_000002/tmp
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://sparkDriver@ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal:54357/user/CoarseGrainedScheduler
1 ip-xx-xxx-xxx-xxx.eu-west-1.compute.internal 4


I've analyzed some heap dumps and see nothing out of the ordinary.   Would
love to know what could be causing this.
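
(One thing that may explain the numbers above: with the 6144m heap shown in the
container command line and the default spark.yarn.executor.memoryOverhead of
384, the executor asks YARN for 6144 + 384 = 6528 MB, which YARN appears to
round up to the 6656 MB / 6.5 GB limit in the log. Anything the JVM uses beyond
its heap then pushes the process over that limit and the container gets killed,
which is what Sandy's suggestion above to raise the overhead addresses.)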

