Posted to user@spark.apache.org by Aniket Bhatnagar <an...@gmail.com> on 2014/09/11 17:31:56 UTC

Out of memory with Spark Streaming

I am running a simple Spark Streaming program that pulls in data from
Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
data and persists to a store.
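
For reference, the pipeline is roughly the following (a simplified sketch
rather than my actual code; the stream name, endpoint and the parse/save
helpers are placeholders, and the exact KinesisUtils.createStream signature
depends on the Spark version):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // placeholder helpers standing in for the real parsing / persistence logic
    def parseRecord(bytes: Array[Byte]): String = new String(bytes, "UTF-8")
    def saveToStore(record: String): Unit = println(record)

    val conf = new SparkConf().setMaster("local[4]").setAppName("kinesis-ingest")
    val ssc = new StreamingContext(conf, Seconds(10))      // 10 second batch interval

    val stream = KinesisUtils.createStream(
      ssc, "my-stream", "https://kinesis.us-east-1.amazonaws.com",
      Seconds(10), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK)

    stream.window(Seconds(10))                             // 10 second window
      .map(parseRecord)
      .foreachRDD(rdd => rdd.foreach(saveToStore))

    ssc.start()
    ssc.awaitTermination()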

The program is running in local mode right now and runs out of memory after
a while. I am yet to investigate heap dumps but I think Spark isn't
releasing memory after processing is complete. I have even tried changing
storage level to disk only.

Help!

Thanks,
Aniket

Re: Out of memory with Spark Streaming

Posted by Tim Smith <se...@gmail.com>.
I noticed that, by default, in CDH 5.1 (Spark 1.0.0), in both Standalone
and YARN mode, no GC options are set when an executor is launched. The only
options passed in Standalone mode are
"-XX:MaxPermSize=128m -Xms16384M -Xmx16384M" (when I give each
executor 16G).

In YARN mode, even fewer JVM options are set: "-server
-XX:OnOutOfMemoryError=kill %p -Xms16384m -Xmx16384m".

Monitoring OS and heap usage side by side (using top and jmap), I see
that my physical memory usage is anywhere between 2x and 5x the heap
usage (all heap, not just live objects).

So I set this, SPARK_JAVA_OPTS="-XX:MaxPermSize=128m -XX:NewSize=1024m
-XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
-XX:MaxHeapFreeRatio=70"

I am still monitoring, but I think my app is more stable now in standalone
mode, whereas earlier, under YARN, the container would get killed for using
too much memory.

How do I get YARN to enforce SPARK_JAVA_OPTS? Setting
"spark.executor.extrajavaoptions" doesn't seem to work.



On Thu, Sep 11, 2014 at 1:50 PM, Tathagata Das
<ta...@gmail.com> wrote:
> Which version of spark are you running?
>
> If you are running the latest one, then could you try running not a window but a
> simple event count on every 2 second batch, and see if you are still running
> out of memory?
>
> TD
>
>
> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar
> <an...@gmail.com> wrote:
>>
>> I did change it to be 1 gb. It still ran out of memory but a little later.
>>
>> The streaming job isn't handling a lot of data. In every 2 seconds, it
>> doesn't get more than 50 records. Each record size is not more than 500
>> bytes.
>>
>> On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bv...@gmail.com> wrote:
>>>
>>> You could set "spark.executor.memory" to something bigger than the
>>> default (512mb)
>>>
>>>
>>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar
>>> <an...@gmail.com> wrote:
>>>>
>>>> I am running a simple Spark Streaming program that pulls in data from
>>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
>>>> data and persists to a store.
>>>>
>>>> The program is running in local mode right now and runs out of memory
>>>> after a while. I am yet to investigate heap dumps but I think Spark isn't
>>>> releasing memory after processing is complete. I have even tried changing
>>>> storage level to disk only.
>>>>
>>>> Help!
>>>>
>>>> Thanks,
>>>> Aniket
>>>
>>>
>



Re: Out of memory with Spark Streaming

Posted by Aniket Bhatnagar <an...@gmail.com>.
Thanks Chris for looking at this. I was putting data onto the stream at
roughly that same rate of 50 records per batch at most. This issue was
purely because of a bug in my persistence logic that was leaking memory.

Overall, I haven't seen a lot of lag with the Kinesis + Spark setup, and I
am able to process records at roughly the same rate as data is fed into
Kinesis, with acceptable latency.

Thanks,
Aniket
On Oct 31, 2014 1:15 AM, "Chris Fregly" <ch...@fregly.com> wrote:

> curious about why you're only seeing 50 records max per batch.
>
> how many receivers are you running?  what is the rate that you're putting
> data onto the stream?
>
> per the default AWS kinesis configuration, the producer can do 1000 PUTs
> per second with max 50k bytes per PUT and max 1mb per second per shard.
>
> on the consumer side, you can only do 5 GETs per second and 2mb per second
> per shard.
>
> my hunch is that the 5 GETs per second is what's limiting your consumption
> rate.
>
> can you verify that these numbers match what you're seeing?  if so, you
> may want to increase your shards and therefore the number of kinesis
> receivers.
>
> otherwise, this may require some further investigation on my part.  i
> wanna stay on top of this if it's an issue.
>
> thanks for posting this, aniket!
>
> -chris
>
> On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar <
> aniket.bhatnagar@gmail.com> wrote:
>
>> Hi all
>>
>> Sorry but this was totally my mistake. In my persistence logic, I was
>> creating async http client instance in RDD foreach but was never closing it
>> leading to memory leaks.
>>
>> Apologies for wasting everyone's time.
>>
>> Thanks,
>> Aniket
>>
>> On 12 September 2014 02:20, Tathagata Das <ta...@gmail.com>
>> wrote:
>>
>>> Which version of spark are you running?
>>>
>>> If you are running the latest one, then could you try running not a window
>>> but a simple event count on every 2 second batch, and see if you are still
>>> running out of memory?
>>>
>>> TD
>>>
>>>
>>> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
>>> aniket.bhatnagar@gmail.com> wrote:
>>>
>>>> I did change it to be 1 gb. It still ran out of memory but a little
>>>> later.
>>>>
>>>> The streaming job isn't handling a lot of data. In every 2 seconds, it
>>>> doesn't get more than 50 records. Each record size is not more than 500
>>>> bytes.
>>>>  On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bv...@gmail.com>
>>>> wrote:
>>>>
>>>>> You could set "spark.executor.memory" to something bigger than the
>>>>> default (512mb)
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
>>>>> aniket.bhatnagar@gmail.com> wrote:
>>>>>
>>>>>> I am running a simple Spark Streaming program that pulls in data from
>>>>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
>>>>>> data and persists to a store.
>>>>>>
>>>>>> The program is running in local mode right now and runs out of memory
>>>>>> after a while. I am yet to investigate heap dumps but I think Spark isn't
>>>>>> releasing memory after processing is complete. I have even tried changing
>>>>>> storage level to disk only.
>>>>>>
>>>>>> Help!
>>>>>>
>>>>>> Thanks,
>>>>>> Aniket
>>>>>>
>>>>>
>>>>>
>>>
>>
>

Re: Out of memory with Spark Streaming

Posted by Chris Fregly <ch...@fregly.com>.
curious about why you're only seeing 50 records max per batch.

how many receivers are you running?  what is the rate that you're putting
data onto the stream?

per the default AWS kinesis configuration, the producer can do 1000 PUTs
per second with max 50k bytes per PUT and max 1mb per second per shard.

on the consumer side, you can only do 5 GETs per second and 2mb per second
per shard.

my hunch is that the 5 GETs per second is what's limiting your consumption
rate.

can you verify that these numbers match what you're seeing?  if so, you may
want to increase your shards and therefore the number of kinesis receivers.
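
for reference, the usual pattern is one kinesis receiver per shard, unioned
into a single dstream. a rough sketch (shard count, stream name and endpoint
are placeholders):

    import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kinesis.KinesisUtils
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(
      new SparkConf().setAppName("kinesis-multi-receiver"), Seconds(10))

    val numShards = 4   // hypothetical; match this to your stream's shard count
    val kinesisStreams = (0 until numShards).map { _ =>
      KinesisUtils.createStream(
        ssc, "my-stream", "https://kinesis.us-east-1.amazonaws.com",
        Seconds(10), InitialPositionInStream.LATEST, StorageLevel.MEMORY_AND_DISK)
    }
    val unifiedStream = ssc.union(kinesisStreams)   // process unifiedStream as before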

otherwise, this may require some further investigation on my part.  i wanna
stay on top of this if it's an issue.

thanks for posting this, aniket!

-chris

On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar <
aniket.bhatnagar@gmail.com> wrote:

> Hi all
>
> Sorry but this was totally my mistake. In my persistence logic, I was
> creating async http client instance in RDD foreach but was never closing it
> leading to memory leaks.
>
> Apologies for wasting everyone's time.
>
> Thanks,
> Aniket
>
> On 12 September 2014 02:20, Tathagata Das <ta...@gmail.com>
> wrote:
>
>> Which version of spark are you running?
>>
>> If you are running the latest one, then could you try running not a window
>> but a simple event count on every 2 second batch, and see if you are still
>> running out of memory?
>>
>> TD
>>
>>
>> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
>> aniket.bhatnagar@gmail.com> wrote:
>>
>>> I did change it to be 1 gb. It still ran out of memory but a little
>>> later.
>>>
>>> The streaming job isn't handling a lot of data. In every 2 seconds, it
>>> doesn't get more than 50 records. Each record size is not more than 500
>>> bytes.
>>>  On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bv...@gmail.com>
>>> wrote:
>>>
>>>> You could set "spark.executor.memory" to something bigger than the
>>>> default (512mb)
>>>>
>>>>
>>>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
>>>> aniket.bhatnagar@gmail.com> wrote:
>>>>
>>>>> I am running a simple Spark Streaming program that pulls in data from
>>>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
>>>>> data and persists to a store.
>>>>>
>>>>> The program is running in local mode right now and runs out of memory
>>>>> after a while. I am yet to investigate heap dumps but I think Spark isn't
>>>>> releasing memory after processing is complete. I have even tried changing
>>>>> storage level to disk only.
>>>>>
>>>>> Help!
>>>>>
>>>>> Thanks,
>>>>> Aniket
>>>>>
>>>>
>>>>
>>
>

Re: Out of memory with Spark Streaming

Posted by Aniket Bhatnagar <an...@gmail.com>.
Hi all

Sorry, but this was totally my mistake. In my persistence logic, I was
creating an async HTTP client instance in RDD foreach but never closing it,
leading to a memory leak.
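
Roughly, the broken pattern was the following (a simplified sketch, with a
stand-in class for the real async http client), and closing the client once
the partition is done fixes it:

    import org.apache.spark.streaming.dstream.DStream

    // stand-ins for the real async http client and persistence call
    class AsyncStoreClient { def post(r: String): Unit = (); def close(): Unit = () }
    def buildAsyncStoreClient(): AsyncStoreClient = new AsyncStoreClient

    // what I had, roughly: a client created inside the RDD loop and never
    // closed, so its threads and buffers pile up batch after batch
    def persistLeaky(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val client = buildAsyncStoreClient()
          records.foreach(r => client.post(r))
          // missing: client.close()
        }
      }

    // the fix: always close the client once the partition is done
    def persistFixed(stream: DStream[String]): Unit =
      stream.foreachRDD { rdd =>
        rdd.foreachPartition { records =>
          val client = buildAsyncStoreClient()
          try records.foreach(r => client.post(r))
          finally client.close()
        }
      }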

Apologies for wasting everyone's time.

Thanks,
Aniket

On 12 September 2014 02:20, Tathagata Das <ta...@gmail.com>
wrote:

> Which version of spark are you running?
>
> If you are running the latest one, then could you try running not a window but
> a simple event count on every 2 second batch, and see if you are still
> running out of memory?
>
> TD
>
>
> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
> aniket.bhatnagar@gmail.com> wrote:
>
>> I did change it to be 1 gb. It still ran out of memory but a little later.
>>
>> The streaming job isn't handling a lot of data. In every 2 seconds, it
>> doesn't get more than 50 records. Each record size is not more than 500
>> bytes.
>>  On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bv...@gmail.com>
>> wrote:
>>
>>> You could set "spark.executor.memory" to something bigger than the
>>> default (512mb)
>>>
>>>
>>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
>>> aniket.bhatnagar@gmail.com> wrote:
>>>
>>>> I am running a simple Spark Streaming program that pulls in data from
>>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
>>>> data and persists to a store.
>>>>
>>>> The program is running in local mode right now and runs out of memory
>>>> after a while. I am yet to investigate heap dumps but I think Spark isn't
>>>> releasing memory after processing is complete. I have even tried changing
>>>> storage level to disk only.
>>>>
>>>> Help!
>>>>
>>>> Thanks,
>>>> Aniket
>>>>
>>>
>>>
>

Re: Out of memory with Spark Streaming

Posted by Tathagata Das <ta...@gmail.com>.
Which version of spark are you running?

If you are running the latest one, could you try running not a window but
a simple event count on every 2 second batch, and see if you are still
running out of memory?
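
Concretely, something like this on the same stream would do (just a sketch,
with the batch interval dropped to Seconds(2) and the window removed):

    stream.count().print()   // prints the number of records in each 2 second batch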

TD


On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
aniket.bhatnagar@gmail.com> wrote:

> I did change it to be 1 gb. It still ran out of memory but a little later.
>
> The streaming job isn't handling a lot of data. In every 2 seconds, it
> doesn't get more than 50 records. Each record size is not more than 500
> bytes.
>  On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bv...@gmail.com>
> wrote:
>
>> You could set "spark.executor.memory" to something bigger than the
>> default (512mb)
>>
>>
>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
>> aniket.bhatnagar@gmail.com> wrote:
>>
>>> I am running a simple Spark Streaming program that pulls in data from
>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
>>> data and persists to a store.
>>>
>>> The program is running in local mode right now and runs out of memory
>>> after a while. I am yet to investigate heap dumps but I think Spark isn't
>>> releasing memory after processing is complete. I have even tried changing
>>> storage level to disk only.
>>>
>>> Help!
>>>
>>> Thanks,
>>> Aniket
>>>
>>
>>

Re: Out of memory with Spark Streaming

Posted by Aniket Bhatnagar <an...@gmail.com>.
I did change it to 1 GB. It still ran out of memory, just a little later.

The streaming job isn't handling a lot of data. In every 2 seconds, it
doesn't get more than 50 records, and each record is no more than 500
bytes.
 On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bv...@gmail.com> wrote:

> You could set "spark.executor.memory" to something bigger than the
> default (512mb)
>
>
> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
> aniket.bhatnagar@gmail.com> wrote:
>
>> I am running a simple Spark Streaming program that pulls in data from
>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
>> data and persists to a store.
>>
>> The program is running in local mode right now and runs out of memory
>> after a while. I am yet to investigate heap dumps but I think Spark isn't
>> releasing memory after processing is complete. I have even tried changing
>> storage level to disk only.
>>
>> Help!
>>
>> Thanks,
>> Aniket
>>
>
>

Re: Out of memory with Spark Streaming

Posted by Bharat Venkat <bv...@gmail.com>.
You could set "spark.executor.memory" to something bigger than the default
(512mb)


On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
aniket.bhatnagar@gmail.com> wrote:

> I am running a simple Spark Streaming program that pulls in data from
> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds, maps
> data and persists to a store.
>
> The program is running in local mode right now and runs out of memory
> after a while. I am yet to investigate heap dumps but I think Spark isn't
> releasing memory after processing is complete. I have even tried changing
> storage level to disk only.
>
> Help!
>
> Thanks,
> Aniket
>