Posted to user@spark.apache.org by Irina Truong <ir...@parsely.com> on 2016/11/18 01:51:18 UTC

Long-running job OOMs driver process

We have an application that reads text files, converts them to DataFrames,
and saves them in Parquet format. The application runs fine when processing
a few files, but several thousand are produced every day. When we run the
job for all of the files, the spark-submit process gets killed on OOM:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 27226"...

The job is written in Python. We’re running it on Amazon EMR 5.0 (Spark
2.0.0) with spark-submit, on a cluster with one c3.2xlarge master instance
(8 cores and 15g of RAM) and 3 c3.4xlarge core instances (16 cores and 30g
of RAM each). Spark config settings are as follows:

('spark.serializer', 'org.apache.spark.serializer.KryoSerializer'),
('spark.executors.instances', '3'),
('spark.yarn.executor.memoryOverhead', '9g'),
('spark.executor.cores', '15'),
('spark.executor.memory', '12g'),
('spark.scheduler.mode', 'FIFO'),
('spark.cleaner.ttl', '1800'),

The job processes each file in a thread, and we have 10 threads running
concurrently. The process will OOM after about 4 hours, at which point
Spark has processed over 20,000 jobs.
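
Roughly, the driver-side pattern looks like this (a simplified sketch, not the
actual code; paths, parsing, and the output_path_for() helper are made up):

    from concurrent.futures import ThreadPoolExecutor
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("text-to-parquet").getOrCreate()

    def process_file(path):
        # one self-contained unit of work: read a text file, transform it, write Parquet
        df = spark.read.text(path)
        # ... real parsing/conversion into a structured DataFrame happens here ...
        df.write.mode("overwrite").parquet(output_path_for(path))

    with ThreadPoolExecutor(max_workers=10) as pool:   # 10 files in flight at a time
        list(pool.map(process_file, input_paths))      # input_paths: the day's several thousand files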

It seems like the driver is running out of memory, but each individual job
is quite small. Are there any known memory leaks for long-running Spark
applications on Yarn?

Re: Long-running job OOMs driver process

Posted by Steve Loughran <st...@hortonworks.com>.
On 18 Nov 2016, at 14:31, Keith Bourgoin <ke...@parsely.com> wrote:

We thread the file processing to amortize the cost of things like getting files from S3.

Define cost here: actual $ amount, or merely time to read the data?

If it's read times, you should really try the new work coming in the hadoop-2.8+ s3a client, which has put a lot of effort into higher-performance reading of ORC & Parquet data, plus general improvements in listing, opening, etc., trying to cut down on slow metadata queries. You are still going to have delays of tens to hundreds of millis on every HTTP request (bigger ones for DNS problems and/or s3 load balancer overload), but once a file is open, seek + read of s3 data will be much faster (not end-to-end reads of an s3 file, though; that's just a bandwidth limitation after the HTTPS negotiation).

http://www.slideshare.net/steve_l/hadoop-hive-spark-and-object-stores

Also, do make sure you are using s3a:// URLs, if you weren't already.
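
For example (illustrative bucket and paths only):

    df = spark.read.text("s3a://my-bucket/incoming/2016-11-18/*")   # s3a://, not s3:// or s3n://
    df.write.parquet("s3a://my-bucket/parquet/2016-11-18/")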

-Steve

Re: Long-running job OOMs driver process

Posted by Keith Bourgoin <ke...@parsely.com>.
Thanks for the input. I had read somewhere that s3:// was the way to go due
to some recent changes, but apparently that was outdated. I’m working on
creating some dummy data and a script to process it right now. I’ll post
here with code and logs when I can successfully reproduce the issue on
non-production data.

Yong, that's a good point about the web content. I had forgotten to mention
that when I first saw this a few months ago, on another project, I could
sometimes trigger the OOM by trying to view the web UI for the job. That's
another case I'll try to reproduce.

Thanks again!

Keith.

Re: Long-running job OOMs driver process

Posted by Yong Zhang <ja...@hotmail.com>.
Just wondering, is it possible the memory usage keeps going up due to the web UI content?


Yong


Re: Long-running job OOMs driver process

Posted by Alexis Seigneurin <as...@ipponusa.com>.
+1 for using S3A.

It would also depend on what format you're using. I agree with Steve that
Parquet, for instance, is a good option. If you're using plain text files,
some people use gzipped files, but those cannot be split, which puts a lot
of pressure on the driver. It doesn't look like this is the issue you're
running into, though, because that would not be a progressive slowdown.
Please provide as much detail as possible about your app.

The cache could be an issue but the OOM would come from an executor, not
from the driver.

From what you're saying, Keith, it indeed looks like some memory is not
being freed. Seeing the code would help. If you can, also send all the logs
(with Spark logging at least at the INFO level).
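
For example, from PySpark you can set the log level at runtime (or adjust conf/log4j.properties on the cluster):

    spark.sparkContext.setLogLevel("INFO")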

Alexis


Re: Long-running job OOMs driver process

Posted by Nathan Lande <na...@gmail.com>.
+1 to not threading.

What does your load look like? If you are loading many files and caching
them in N RDDs rather than one RDD, this could be an issue.

If the above two things don't fix your OOM issue, without knowing anything
else about your job, I would focus on your caching strategy as a potential
culprit. Try running without any caching to isolate the issue; a bad
caching strategy is the source of most of my OOM issues.
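
For example, if the per-file loop caches anything, a pattern along these lines
(sketch; names made up) keeps one cached DataFrame alive at a time instead of N:

    df = spark.read.text(path)
    df.cache()                 # only worthwhile if df is reused within this unit of work
    try:
        df.write.parquet(out_path)
        # ... any other actions on df ...
    finally:
        df.unpersist()         # release it as soon as this file is done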


Re: Long-running job OOMs driver process

Posted by Keith Bourgoin <ke...@parsely.com>.
Hi Alexis,

Thanks for the response. I've been working with Irina on trying to sort
this issue out.

We thread the file processing to amortize the cost of things like getting
files from S3. It's a pattern we've seen recommended in many places, but I
don't have any of those links handy. The problem isn't the threading, per
se, but clearly some sort of memory leak in the driver itself. Each file
is a self-contained unit of work, so once it's done, all memory related to
it should be freed. Nothing in the script itself grows over time, so if it
can process 10 files concurrently, it should be able to run like that forever.

I've hit this same issue working on another Spark app which wasn't
threaded but produced tens of thousands of jobs. Eventually, the Spark UI
would get slow, then unresponsive, and the driver would then be killed due to OOM.

I'll try to cook up some examples of this today, threaded and not. We were
hoping that someone had seen this before and it rang a bell. Maybe there's
a setting to clean up info from old jobs that we can adjust.
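
For example, the Spark UI retention limits look like candidates to try (untested; values picked arbitrarily):

    from pyspark import SparkConf

    conf = SparkConf()
    conf.set("spark.ui.retainedJobs", "200")     # default 1000: finished jobs kept in driver memory for the UI
    conf.set("spark.ui.retainedStages", "200")   # default 1000: same, for stages
    # or set spark.ui.enabled=false to rule the UI out entirely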

Cheers,

Keith.


Re: Long-running job OOMs driver process

Posted by Alexis Seigneurin <as...@ipponusa.com>.
Hi Irina,

I would question the use of multiple threads in your application. Since
Spark is going to run the processing of each DataFrame on all the cores of
your cluster, the processes will be competing for resources. In fact, they
would not only compete for CPU cores but also for memory.

Spark is designed to run your processes in a sequence, and each process
will be run in a distributed manner (multiple threads on multiple
instances). I would suggest following this principle.
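
In other words, a plain sequential loop on the driver, letting Spark parallelize
each job across the cluster (sketch; helper names made up):

    for path in input_paths:
        df = spark.read.text(path)
        df.write.parquet(output_path_for(path))   # each read/write is already distributed over the executors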

Feel free to share the code if you can. It's always helpful so that we can
give better advice.

Alexis
