Posted to user@spark.apache.org by Nathan Kronenfeld <nk...@oculusinfo.com> on 2014/03/24 14:13:02 UTC

Akka error with largish job (works fine for smaller versions)

What does this error mean:

@hadoop-s2.oculus.local:45186]: Error [Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://spark@hadoop-s2.oculus.local:45186]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: hadoop-s2.oculus.local/192.168.0.47:45186
]

?

-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com

Re: Akka error with largish job (works fine for smaller versions)

Posted by Nathan Kronenfeld <nk...@oculusinfo.com>.
After digging deeper, I realized all the workers had run out of memory; each
left an hs_error.log file in /tmp/jvm-<PID> with the following header:

# Native memory allocation (malloc) failed to allocate 2097152 bytes for
committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
#  Out of Memory Error (os_linux.cpp:2761), pid=31426, tid=139549745604352
#
# JRE version: OpenJDK Runtime Environment (7.0_51-b02) (build
1.7.0_51-mockbuild_2014_01_15_01_39-b00)
# Java VM: OpenJDK 64-Bit Server VM (24.45-b08 mixed mode linux-amd64 )
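
The notable bit is that this is a native malloc failure rather than a
Java-heap OutOfMemoryError, so presumably the heap (plus thread stacks and
code cache) is crowding out native allocations. If it comes to tuning, I
assume it would look something like this in spark-env.sh (0.9 standalone;
variable names from memory, so double-check against the docs):

    # spark-env.sh on each worker -- a sketch, not our actual file
    # Leave the OS and the JVM's native side some headroom instead of
    # giving the heap nearly all of the machine's RAM.
    SPARK_WORKER_MEMORY=180g
    # Smaller per-thread stacks, one of the knobs the hs_error log suggests.
    SPARK_JAVA_OPTS="-Xss512k"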



We have 3 workers, each assigned 200G for Spark.
The dataset is ~250G.

All I'm doing is data.map(r => (getKey(r),
r)).sortByKey().map(_._2).coalesce(n).saveAsTextFile(), where n is the
original number of files in the dataset.
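
Spelled out, the whole job is just the following (a sketch: getKey, the
paths, the master URL, and n here are stand-ins for our actual code):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._  // implicits that give us sortByKey on pair RDDs

    object SortJob {
      def main(args: Array[String]) {
        val sc = new SparkContext("spark://hadoop-s1:7077", "SortJob")  // stand-in master URL
        val data = sc.textFile("hdfs:///data/input")  // ~250G spread over n files
        val n = 200                                   // stand-in for the real input file count
        def getKey(r: String): String = r.takeWhile(_ != '\t')  // stand-in key extractor
        data.map(r => (getKey(r), r))  // key each record
            .sortByKey()               // full shuffle + sort by key
            .map(_._2)                 // drop the keys again
            .coalesce(n)               // back down to the original file count
            .saveAsTextFile("hdfs:///data/output")
      }
    }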

This worked fine under Spark 0.8.1, with the same setup; I haven't changed
this code since upgrading to 0.9.0.

I took a look at one worker's memory before it ran out, using jmap and jhat;
they indicated file handles as the biggest memory user (which I guess makes
sense for a sort), but the total was nowhere close to 200G, so I find
their output somewhat suspect.
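
For the record, the inspection was just the stock JDK tools, roughly:

    # Dump the live heap of one worker JVM, then browse it with jhat.
    jmap -dump:live,format=b,file=/tmp/worker-heap.bin <worker-pid>
    jhat -port 7000 /tmp/worker-heap.bin   # then open http://localhost:7000/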



On Tue, Mar 25, 2014 at 6:59 AM, Andrew Ash <an...@andrewash.com> wrote:

> Possibly one of your executors is in the middle of a long stop-the-world
> GC and doesn't respond to network traffic during that period?  If you
> shared some information about how each node in your cluster is set up (heap
> size, memory, CPU, etc.), that might help with debugging.
>
> Andrew


-- 
Nathan Kronenfeld
Senior Visualization Developer
Oculus Info Inc
2 Berkeley Street, Suite 600,
Toronto, Ontario M5A 4J5
Phone:  +1-416-203-3003 x 238
Email:  nkronenfeld@oculusinfo.com

Re: Akka error with largish job (works fine for smaller versions)

Posted by Andrew Ash <an...@andrewash.com>.
Possibly one of your executors is in the middle of a long stop-the-world
GC and doesn't respond to network traffic during that period?  If you
shared some information about how each node in your cluster is set up (heap
size, memory, CPU, etc.), that might help with debugging.
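
If it is GC, logging would show it directly; these are the standard HotSpot
flags (the log path is just an example), which in 0.9 I believe you can pass
to the executors via SPARK_JAVA_OPTS in spark-env.sh:

    -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/tmp/executor-gc.log

Multi-second pauses in that log lining up with the association failures would
confirm the theory.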

Andrew


On Mon, Mar 24, 2014 at 9:13 PM, Nathan Kronenfeld <nkronenfeld@oculusinfo.com> wrote:

> What does this error mean:
>
> [...]