Posted to user@storm.apache.org by Indra Nath Bardhan <in...@gmail.com> on 2014/06/17 19:15:55 UTC

Topology Memory leaks

Hi All,

We have a topology running on 16 workers with a 2 GB heap each.

However, the worker processes' RES memory usage keeps climbing: it starts
at around 1.1 GB and grows past the 2 GB mark until it overwhelms the
entire node.

This possibly indicates that either:

1) we have slow-consuming bolts and thus need to throttle the spouts, or
2) there is a memory leak in the ZMQ buffer allocation or in some of the JNI code.

Based on responses in other discussions, we tried making the topology
reliable and using MAX_SPOUT_PENDING to throttle the spouts. However,
this did not yield much: with values of 1000 and 100 we see the same
growth in memory usage, although it is a bit slower in the latter case.
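
For reference, a minimal sketch of the kind of setup we are using
(component names and parallelism here are illustrative placeholders, not
our actual code): the spout emits with a message id and the bolts anchor
and ack, so that MAX_SPOUT_PENDING can actually cap the tuples in flight.

    import backtype.storm.Config;
    import backtype.storm.StormSubmitter;
    import backtype.storm.topology.TopologyBuilder;

    public class IngestTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            // ContentSpout/EnrichBolt are placeholder names for our components.
            builder.setSpout("content-spout", new ContentSpout(), 4);
            builder.setBolt("enrich-bolt", new EnrichBolt(), 16)
                   .shuffleGrouping("content-spout");

            Config conf = new Config();
            conf.setNumWorkers(16);
            // Caps un-acked tuples in flight per spout task; only effective when
            // the spout emits with a message id and every bolt anchors and acks.
            conf.setMaxSpoutPending(1000);

            StormSubmitter.submitTopology("content-ingest", conf,
                    builder.createTopology());
        }
    }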

We also ran pmap on the offending PIDs and did not see much memory usage
attributed to the native lib*.so files.

Is there any way to identify the source of this native leak, or to fix
it? We need some urgent help on this.

[NOTE: Using Storm - 0.9.0_wip21]

Thanks,
Indra

Re: Topology Memory leaks

Posted by Indra Nath Bardhan <in...@gmail.com>.
As Michael mentioned, we see exactly the same pattern in pmap: a lot of
anon blocks, indicating malloc or direct byte buffer allocations, and
they keep increasing over time.

The topology is used for content ingestion; the bolts are responsible
for curating and enriching the content.

A couple of bolts do make native calls through JNI, which we suspect to
be the culprits. We found one place where we were not freeing memory and
fixed it. However, even after the fix, memory usage still grows, though
at a much slower rate than before.
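
For what it's worth, the fix was along these lines (a simplified sketch
with invented names such as NativeEnricher, not our actual bolt): the
native handle is released in a finally block, so a tuple can never leave
native memory behind.

    import java.util.Map;

    import backtype.storm.task.OutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;

    public class EnrichBolt extends BaseRichBolt {
        private OutputCollector collector;
        private NativeEnricher enricher; // hypothetical JNI wrapper

        @Override
        public void prepare(Map conf, TopologyContext context,
                            OutputCollector collector) {
            this.collector = collector;
            this.enricher = new NativeEnricher();
        }

        @Override
        public void execute(Tuple input) {
            long handle = enricher.allocate(input.getString(0)); // native allocation
            try {
                collector.emit(input, new Values(enricher.enrich(handle))); // anchored emit
                collector.ack(input);
            } finally {
                enricher.free(handle); // the kind of call that was originally missing
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("enriched"));
        }
    }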

It would be interesting to know whether you found any other sources of
leaks that we need to be aware of, and how you fixed them. As things
stand, we may have to do a daily rebalance or restart of the topologies
to remediate this build-up.

Thanks
Indra
On 18 Jun 2014 05:33, "P. Taylor Goetz" <pt...@gmail.com> wrote:


Re: Topology Memory leaks

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
I've not seen this, (un)fortunately. :)

Are there any other relevant details you might be able to provide?

Or better yet, can you distill it down to a bare-bones topology that reproduces it and share the code?
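
Something along these lines would do: a bare-bones local-mode skeleton
(ReproSpout/SuspectBolt are just placeholders for whichever components
you suspect, not code from this thread) that you can leave running while
watching RES and direct buffer growth.

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;

    public class LeakRepro {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("repro-spout", new ReproSpout(), 1);
            builder.setBolt("suspect-bolt", new SuspectBolt(), 1)
                   .shuffleGrouping("repro-spout");

            Config conf = new Config();
            conf.setNumWorkers(1);

            // Run in-process long enough to observe the memory growth.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("leak-repro", conf, builder.createTopology());
            Thread.sleep(30 * 60 * 1000);
            cluster.shutdown();
        }
    }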

-Taylor


Re: Topology Memory leaks

Posted by Michael Rose <mi...@fullcontact.com>.
I've run into similar leaks with one of our topologies. ZMQ vs. Netty
didn't make any difference for us. We'd been looking into the Netty-based
HTTP client we're using as a suspect, but maybe it is Storm.

8 workers, 1.5GB heap, CMS collector, Java 1.7.0_25-b15, Storm 0.9.0.1

What kinds of things do your topologies do?

One thing we've observed is a bump in direct buffers; the count usually
starts around 100. Java can't account for the memory used, but the size
and count of the allocations shown by pmap are suspicious.

...
00007f30ac1bc000  63760K -----    [ anon ]
00007f30b0000000    864K rw---    [ anon ]
00007f30b00d8000  64672K -----    [ anon ]
00007f30b4000000    620K rw---    [ anon ]
00007f30b409b000  64916K -----    [ anon ]
00007f30b8000000   1780K rw---    [ anon ]
00007f30b81bd000  63756K -----    [ anon ]
00007f30bc000000   1376K rw---    [ anon ]
00007f30bc158000  64160K -----    [ anon ]
00007f30c0000000   1320K rw---    [ anon ]
...

      "buffers":{
         "direct":{
            "count":721,
            "memoryUsed":16659150,
            "totalCapacity":16659150
         },
         "mapped":{
            "count":0,
            "memoryUsed":0,
            "totalCapacity":0
         }
      },
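
The numbers above look like the JVM's buffer pool MXBean counters; a
minimal sketch of pulling the same figures yourself (using the standard
java.lang.management APIs, not our actual monitoring code):

    import java.lang.management.BufferPoolMXBean;
    import java.lang.management.ManagementFactory;
    import java.util.List;

    public class BufferPoolReport {
        public static void main(String[] args) {
            // Java 7+ exposes the "direct" and "mapped" buffer pools as MXBeans.
            List<BufferPoolMXBean> pools =
                    ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
            for (BufferPoolMXBean pool : pools) {
                System.out.printf("%s: count=%d memoryUsed=%d totalCapacity=%d%n",
                        pool.getName(), pool.getCount(),
                        pool.getMemoryUsed(), pool.getTotalCapacity());
            }
        }
    }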

Do you have a similar bump in direct buffer counts?

Michael

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com


On Tue, Jun 17, 2014 at 11:15 AM, Indra Nath Bardhan <
indranath.bardhan@gmail.com> wrote:
