Posted to user@spark.apache.org by JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com> on 2016/02/15 16:42:35 UTC

Memory problems and missing heartbeats

Hello,

I am facing two different issues with Spark in my project that are driving me crazy. I am currently running on EMR (Spark 1.5.2 + YARN), using the "--executor-memory 40G" option.

Problem #1
=========

Some of my processes get killed by YARN because the container is exceeding the physical memory YARN assigned it. I have been able to work around this issue by increasing the spark.yarn.executor.memoryOverhead parameter to 8G, but that doesn't seem like a good solution.
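
For reference, my submit command currently looks roughly like this (application jar and class elided; note that memoryOverhead is given in megabytes, so 8G is 8192):

  spark-submit \
    --master yarn \
    --executor-memory 40G \
    --conf spark.yarn.executor.memoryOverhead=8192 \
    ...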

My understanding is that the JVM running my Spark process will get 40 GB of heap memory (-Xmx40G), and if there is memory pressure then the GC should kick in to ensure the heap never exceeds those 40 GB. My PermGen is set to 510MB, which is a very long way from the 8GB I need to set as overhead. This seems to happen when I .cache() very big RDDs and then perform operations that require shuffling (cogroup & co.).

- Who is using all that off-heap memory?
- Are there any tools in the Spark ecosystem that might help me debug this?


Problem #2
=========

Some tasks fail because the heartbeat didn't get back to the master in 120 seconds. Again, I can more or less work around this by increasing the timeout to 5 minutes, but I don't feel this is addressing the real problem.

- Does the heartbeat have its own thread or would a long-running .map() block the heartbeat?
- What conditions would prevent the heartbeat from being sent?
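
For reference, the timeout workaround amounts to something like this (as far as I can tell, spark.network.timeout is the property that governs the heartbeat timeout in 1.5, but please correct me if a different one applies):

  --conf spark.network.timeout=300s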

Many thanks in advance for any help with this,
Ximo.



RE: Memory problems and missing heartbeats

Posted by JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>.
A GC pause fits nicely with what I’m seeing. Many thanks for the link!
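
To confirm it on my side, I will enable GC logging on the executors, roughly like this (the exact flag set is just my first guess at something useful):

  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

and then look for long pauses in the executor stdout logs in the YARN UI.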

Ximo

From: Iulian Dragoș [mailto:iulian.dragos@typesafe.com]
Sent: Tuesday, February 16, 2016 15:14
To: JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>
CC: user@spark.apache.org
Subject: Re: Memory problems and missing heartbeats


Regarding your 2nd problem, my best guess is that you’re seeing GC pauses. It’s not unusual, given you’re using 40GB heaps. See for instance this blog post: <http://gridgain.blogspot.ch/2014/06/jdk-g1-garbage-collector-pauses-for.html>

From conducting numerous tests, we have concluded that unless you are utilizing some off-heap technology (e.g. GridGain OffHeap), no Garbage Collector provided with JDK will render any kind of stable GC performance with heap sizes larger than 16GB. For example, on 50GB heaps we can often encounter up to 5 minute GC pauses, with average pauses of 2 to 4 seconds.

Not sure if YARN can do this, but I would try to run with a smaller executor heap, and more executors per node.

iulian




RE: Memory problems and missing heartbeats

Posted by Ignacio Blasco <el...@gmail.com>.
Hi Ximo. Regarding #1, you can try increasing the number of partitions
used for the cogroup or reduce. AFAIK Spark needs enough memory to hold
all the data processed by a given partition in memory, so by increasing
the number of partitions you reduce that per-partition load. We would
probably need to know more about your workflow to assess whether that
is your case.
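
As a rough sketch of the idea (the RDDs here are invented; the only
point is the explicit partition count passed to cogroup):

  import org.apache.spark.{SparkConf, SparkContext}

  object CogroupPartitions {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("cogroup-partitions"))
      val left  = sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toLong))
      val right = sc.parallelize(1 to 1000000).map(i => (i % 1000, i.toString))
      // An explicit partition count (2000 is purely illustrative) gives
      // more, smaller shuffle partitions, so each task holds less data
      // in memory at once.
      val grouped = left.cogroup(right, 2000)
      println(grouped.count())
      sc.stop()
    }
  }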

Nacho
On Feb 16, 2016 4:58 PM, "JOAQUIN GUANTER GONZALBEZ" <
joaquin.guantergonzalbez@telefonica.com> wrote:

> Thanks. I'll take a look at Graphite to see if that helps me out with my
> first problem.
>
> Ximo.
>
> -----Original Message-----
> From: Arkadiusz Bicz [mailto:arkadiusz.bicz@gmail.com]
> Sent: Tuesday, February 16, 2016 16:06
> To: Iulian Dragoș <iu...@typesafe.com>
> CC: JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>;
> user@spark.apache.org
> Subject: Re: Memory problems and missing heartbeats
>
> I had a problem similar to #2 when I used a lot of caching and then did
> shuffling. It looks like when I cached too much there was not enough
> space for other Spark tasks and the job just hung.
>
> You can try to cache less and see if that improves things. Executor
> logs also help a lot (watch for log entries about spills), and you can
> monitor the job JVMs through Spark monitoring
> http://spark.apache.org/docs/latest/monitoring.html together with
> Graphite and Grafana.
>
> On Tue, Feb 16, 2016 at 2:14 PM, Iulian Dragoș <iu...@typesafe.com>
> wrote:
> > Regarding your 2nd problem, my best guess is that you’re seeing GC
> pauses.
> > It’s not unusual, given you’re using 40GB heaps. See for instance this
> > blog post
> >
> > From conducting numerous tests, we have concluded that unless you are
> > utilizing some off-heap technology (e.g. GridGain OffHeap), no Garbage
> > Collector provided with JDK will render any kind of stable GC
> > performance with heap sizes larger than 16GB. For example, on 50GB
> > heaps we can often encounter up to 5 minute GC pauses, with average
> pauses of 2 to 4 seconds.
> >
> > Not sure if YARN can do this, but I would try to run with a smaller
> > executor heap, and more executors per node.
> >
> > iulian

RE: Memory problems and missing heartbeats

Posted by JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>.
Thanks. I'll take a look at Graphite to see if that helps me out with my first problem.

Ximo.

-----Original Message-----
From: Arkadiusz Bicz [mailto:arkadiusz.bicz@gmail.com]
Sent: Tuesday, February 16, 2016 16:06
To: Iulian Dragoș <iu...@typesafe.com>
CC: JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>; user@spark.apache.org
Subject: Re: Memory problems and missing heartbeats

I had a problem similar to #2 when I used a lot of caching and then did shuffling. It looks like when I cached too much there was not enough space for other Spark tasks and the job just hung.

You can try to cache less and see if that improves things. Executor logs also help a lot (watch for log entries about spills), and you can monitor the job JVMs through Spark monitoring http://spark.apache.org/docs/latest/monitoring.html together with Graphite and Grafana.

On Tue, Feb 16, 2016 at 2:14 PM, Iulian Dragoș <iu...@typesafe.com> wrote:
> Regarding your 2nd problem, my best guess is that you’re seeing GC pauses.
> It’s not unusual, given you’re using 40GB heaps. See for instance this
> blog post
>
> From conducting numerous tests, we have concluded that unless you are
> utilizing some off-heap technology (e.g. GridGain OffHeap), no Garbage
> Collector provided with JDK will render any kind of stable GC
> performance with heap sizes larger than 16GB. For example, on 50GB
> heaps we can often encounter up to 5 minute GC pauses, with average pauses of 2 to 4 seconds.
>
> Not sure if YARN can do this, but I would try to run with a smaller
> executor heap, and more executors per node.
>
> iulian
>
>



Re: Memory problems and missing heartbeats

Posted by Arkadiusz Bicz <ar...@gmail.com>.
I had a problem similar to #2 when I used a lot of caching and then
did shuffling. It looks like when I cached too much there was not
enough space for other Spark tasks and the job just hung.

You can try to cache less and see if that improves things. Executor
logs also help a lot (watch for log entries about spills), and you can
monitor the job JVMs through Spark monitoring
http://spark.apache.org/docs/latest/monitoring.html together with
Graphite and Grafana.
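
For the Graphite part, a minimal conf/metrics.properties would look
roughly like this (host, port and prefix are placeholders for your
setup):

  *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
  *.sink.graphite.host=graphite.example.com
  *.sink.graphite.port=2003
  *.sink.graphite.period=10
  *.sink.graphite.unit=seconds
  *.sink.graphite.prefix=spark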

On Tue, Feb 16, 2016 at 2:14 PM, Iulian Dragoș
<iu...@typesafe.com> wrote:
> Regarding your 2nd problem, my best guess is that you’re seeing GC pauses.
> It’s not unusual, given you’re using 40GB heaps. See for instance this blog
> post
>
> From conducting numerous tests, we have concluded that unless you are
> utilizing some off-heap technology (e.g. GridGain OffHeap), no Garbage
> Collector provided with JDK will render any kind of stable GC performance
> with heap sizes larger than 16GB. For example, on 50GB heaps we can often
> encounter up to 5 minute GC pauses, with average pauses of 2 to 4 seconds.
>
> Not sure if YARN can do this, but I would try to run with a smaller executor
> heap, and more executors per node.
>
> iulian
>
>



Re: Memory problems and missing heartbeats

Posted by Iulian Dragoș <iu...@typesafe.com>.
Regarding your 2nd problem, my best guess is that you’re seeing GC pauses.
It’s not unusual, given you’re using 40GB heaps. See for instance this blog
post
<http://gridgain.blogspot.ch/2014/06/jdk-g1-garbage-collector-pauses-for.html>

From conducting numerous tests, we have concluded that unless you are
utilizing some off-heap technology (e.g. GridGain OffHeap), no Garbage
Collector provided with JDK will render any kind of stable GC performance
with heap sizes larger than 16GB. For example, on 50GB heaps we can often
encounter up to 5 minute GC pauses, with average pauses of 2 to 4 seconds.

Not sure if YARN can do this, but I would try to run with a smaller
executor heap, and more executors per node.
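
Something along these lines, for example (numbers purely illustrative,
and whether several such containers fit on one node depends on your
node size):

  spark-submit --master yarn \
    --num-executors 8 \
    --executor-memory 10G \
    ...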

iulian

RE: Memory problems and missing heartbeats

Posted by JOAQUIN GUANTER GONZALBEZ <jo...@telefonica.com>.
Bumping this thread in hopes that someone will answer.

Ximo

-----Original Message-----
From: JOAQUIN GUANTER GONZALBEZ [mailto:joaquin.guantergonzalbez@telefonica.com]
Sent: Monday, February 15, 2016 16:43
To: user@spark.apache.org
Subject: Memory problems and missing heartbeats

Hello,

I am facing two different issues with Spark in my project that are driving me crazy. I am currently running on EMR (Spark 1.5.2 + YARN), using the "--executor-memory 40G" option.

Problem #1
=========

Some of my processes get killed by YARN because the container is exceeding the physical memory YARN assigned it. I have been able to work around this issue by increasing the spark.yarn.executor.memoryOverhead parameter to 8G, but that doesn't seem like a good solution.

My understanding is that the JVM running my Spark process will get 40 GB of heap memory (-Xmx40G), and if there is memory pressure then the GC should kick in to ensure the heap never exceeds those 40 GB. My PermGen is set to 510MB, which is a very long way from the 8GB I need to set as overhead. This seems to happen when I .cache() very big RDDs and then perform operations that require shuffling (cogroup & co.).

- Who is using all that off-heap memory?
- Are there any tools in the Spark ecosystem that might help me debug this?


Problem #2
=========

Some tasks fail because the heartbeat didn't get back to the master in 120 seconds. Again, I can more or less work around this by increasing the timeout to 5 minutes, but I don't feel this is addressing the real problem.

- Does the heartbeat have its own thread or would a long-running .map() block the heartbeat?
- What conditions would prevent the heartbeat from being sent?

Many thanks in advance for any help with this,
Ximo.




---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org