Posted to user@storm.apache.org by Drew Goya <dr...@gradientx.com> on 2014/03/02 20:46:59 UTC

Netty Errors, chain reaction, topology breaks down

Hey All, I'm running a 0.9.0.1 Storm topology in AWS EC2 and I occasionally
run into a strange and pretty catastrophic error.  One of my workers is
either overloaded or stuck and gets killed and restarted.  That usually
works fine, but once in a while the whole topology breaks down and all the
workers are killed and restarted continually.  Looking through the logs, it
looks like some Netty errors on initialization kill the Async Loop.  The
topology is never able to recover; I have to kill it manually and relaunch
it.

Is this something anyone else has come across?  Any tips? Config settings I
could change?

This is a pastebin of the errors:  http://pastebin.com/XXZBsEj1

Re: Netty Errors, chain reaction, topology breaks down

Posted by Drew Goya <dr...@gradientx.com>.
I'm not sure what the Storm release schedule looks like, but this is what
I've done internally.

I've forked and patched storm:
https://github.com/hiloboy0119/incubator-storm

I'm running off trunk, but if you want to you can check out the 0.9.1
tag and cherry-pick the fix: d9d637eab5f3761a4768fc960b2574d45962b466

There isn't any really good documentation on building a Storm release, but
this is what I've figured out:


   1. Clone the repo.
   2. cd to incubator-storm
   3. mvn clean package
   4. cd to incubator-storm/storm-dist
   5. mvn package
   6. The final zips will be in incubator-storm/storm-dist

If that all sounds like a bit much, you can switch back to 0mq or wait for
0.9.2 =)
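
(For reference, going back to 0mq should just be a transport setting in
storm.yaml. This is from memory, so double-check the class names against
the defaults.yaml that ships with your release:)

# storm.yaml, pick one transport
storm.messaging.transport: "backtype.storm.messaging.zmq"
# or, to stay on Netty:
# storm.messaging.transport: "backtype.storm.messaging.netty.Context"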


On Tue, Mar 4, 2014 at 2:36 AM, Richards Peter <hb...@gmail.com> wrote:

> Hi Drew,
>
> Good that you identified the root cause for the issue.
>
> Our team is planning to migrate to storm 0.9.* series. We are planning to
> use Netty as the transport layer. Could you please give us a tentative
> release date for the build with this fix?
>
> Thanks,
> Richards Peter.
>

Re: Netty Errors, chain reaction, topology breaks down

Posted by 李家宏 <jh...@gmail.com>.
Seems like we are facing the same problem: a negative timeout value
exception was thrown from the storm-netty Client. We are heading back to zmq.

Regards


2014-03-05 9:59 GMT+08:00 Richards Peter <hb...@gmail.com>:

> Thanks Drew and Ted,
>
> Richards Peter.
>



-- 

======================================================

Gvain

Email: jh.li.em@gmail.com

Re: Netty Errors, chain reaction, topology breaks down

Posted by Richards Peter <hb...@gmail.com>.
Thanks Drew and Ted,

Richards Peter.

Re: Netty Errors, chain reaction, topology breaks down

Posted by Ted Dunning <te...@gmail.com>.
Feedback from Drill is that the next version of Netty (4.x) works better
than the version Storm currently uses (3.x).

Drill also does more explicit memory management so this might be a red
herring.




On Tue, Mar 4, 2014 at 2:36 AM, Richards Peter <hb...@gmail.com> wrote:

> Hi Drew,
>
> Good that you identified the root cause for the issue.
>
> Our team is planning to migrate to storm 0.9.* series. We are planning to
> use Netty as the transport layer. Could you please give us a tentative
> release date for the build with this fix?
>
> Thanks,
> Richards Peter.
>

Re: Netty Errors, chain reaction, topology breaks down

Posted by Richards Peter <hb...@gmail.com>.
Hi Drew,

Good that you identified the root cause for the issue.

Our team is planning to migrate to the Storm 0.9.* series and to use Netty
as the transport layer. Could you please give us a tentative release date
for the build with this fix?

Thanks,
Richards Peter.

Re: Netty Errors, chain reaction, topology breaks down

Posted by Drew Goya <dr...@gradientx.com>.
Pull request sent:

https://github.com/apache/incubator-storm/pull/41


On Mon, Mar 3, 2014 at 12:03 PM, Drew Goya <dr...@gradientx.com> wrote:

> Hey All, dug into the netty Client.jar and the various versions/tags of
> the file.  So it looks like there were two commits which attempted to fix
> this issue:
>
> https://github.com/nathanmarz/storm/commit/213102b36f890
>
> and then
>
>
> https://github.com/nathanmarz/storm/commit/c638db0e88e3c56f808c8a76a88f94d7bf1988c4
>
> It looks like the affected method is getSleepTimeMs()
>
> In the 0.9.0 tag (
>
> In the 0.9.0.1 tag (
> https://github.com/nathanmarz/storm/blob/0.9.0.1/storm-netty/src/jvm/backtype/storm/messaging/netty/Client.java)
> the method is:
>
> private int getSleepTimeMs()
> {
>   int backoff = 1 << retries.get();
>   int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
>   if ( sleepMs > max_sleep_ms )
>     sleepMs = max_sleep_ms;
>   return sleepMs;
> }
>
> I put together a simple test which demonstrates the method is still
> broken, it is still possible to overflow sleepMs and end up with a large
> negative timeout:
>
> private static int getSleepTimeMs(int retries, int base_sleep_ms, int
> max_sleep_ms, Random random)
> {
>   int backoff = 1 << retries;
>   int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
>   if ( sleepMs > max_sleep_ms )
>     sleepMs = max_sleep_ms;
>   return sleepMs;
> }
>
> public static void main(String[] args) throws Exception{
>   Random random = new Random();
>   int base_sleep_ms = 100;
>   int max_sleep_ms = 1000;
>   for(int i = 0; i < 30; i++){
>     System.out.println(getSleepTimeMs(i, base_sleep_ms, max_sleep_ms,
> random));
>   }
> }
>
> To fix the issue a few of the integers should be converted to longs.  I'll
> send a pull request in a few.
>
> On Mon, Mar 3, 2014 at 11:17 AM, Drew Goya <dr...@gradientx.com> wrote:
>
>> Thanks for sharing your experiences guys, we will be heading back to 0mq
>> as well.  It's a shame as we really got some nice throughput improvements
>> with Netty.
>>
>>
>> On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <mi...@fullcontact.com> wrote:
>>
>>> Right now we're having slow, off-heap memory leaks, unknown if these are
>>> linked to Netty (yet). When the workers inevitably get OOMed, the topology
>>> will rarely recover gracefully with similar Netty timeouts. Sounds like
>>> we'll be heading back to 0mq.
>>>
>>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>>> michael@fullcontact.com
>>>
>>>
>>> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <se...@monkeysnatchbanana.com>wrote:
>>>
>>>> We have the same issue and after attempting a few fixes, we switched
>>>> back to using 0mq for now.
>>>>
>>>>
>>>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <dr...@gradientx.com> wrote:
>>>>
>>>>> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
>>>>> occasionally run into a strange and pretty catastrophic error.  One of my
>>>>> workers is either overloaded or stuck and gets killed and restarted.  This
>>>>> usually works fine but once in a while the whole topology breaks down, all
>>>>> the workers are killed and restarted continually.  Looking through the logs
>>>>> it looks like some netty errors on initialization kill the Async Loop.  The
>>>>> topology is never able to recover, I have to kill it manually and relaunch
>>>>> it.
>>>>>
>>>>> Is this something anyone else has come across?  Any tips? Config
>>>>> settings I could change?
>>>>>
>>>>> This is a pastebin of the errors:  http://pastebin.com/XXZBsEj1
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Ce n'est pas une signature
>>>>
>>>
>>>
>>
>

Re: Netty Errors, chain reaction, topology breaks down

Posted by Drew Goya <dr...@gradientx.com>.
Hey All, I dug into the Netty Client.java and the various versions/tags of
the file.  It looks like there were two commits which attempted to fix this
issue:

https://github.com/nathanmarz/storm/commit/213102b36f890

and then

https://github.com/nathanmarz/storm/commit/c638db0e88e3c56f808c8a76a88f94d7bf1988c4

It looks like the affected method is getSleepTimeMs()

In the 0.9.0 tag (

In the 0.9.0.1 tag (
https://github.com/nathanmarz/storm/blob/0.9.0.1/storm-netty/src/jvm/backtype/storm/messaging/netty/Client.java)
the method is:

private int getSleepTimeMs()
{
  int backoff = 1 << retries.get();
  int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
  if ( sleepMs > max_sleep_ms )
    sleepMs = max_sleep_ms;
  return sleepMs;
}

I put together a simple test which demonstrates that the method is still
broken; it is still possible to overflow sleepMs and end up with a large
negative timeout:

private static int getSleepTimeMs(int retries, int base_sleep_ms, int
max_sleep_ms, Random random)
{
  int backoff = 1 << retries;
  int sleepMs = base_sleep_ms * Math.max(1, random.nextInt(backoff));
  if ( sleepMs > max_sleep_ms )
    sleepMs = max_sleep_ms;
  return sleepMs;
}

public static void main(String[] args) throws Exception{
  Random random = new Random();
  int base_sleep_ms = 100;
  int max_sleep_ms = 1000;
  for(int i = 0; i < 30; i++){
    System.out.println(getSleepTimeMs(i, base_sleep_ms, max_sleep_ms,
random));
  }
}

To fix the issue, a few of the integers should be converted to longs.  I'll
send a pull request in a few.
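
For reference, this is roughly the shape of change I have in mind (an
untested sketch, not necessarily what the actual pull request will look
like; it reuses the retries/base_sleep_ms/max_sleep_ms/random fields from
Client.java, and the cap of 30 on the shift is just an arbitrary bound I
picked):

private int getSleepTimeMs()
{
  // Cap the exponent so 1 << retries can't wrap around in int arithmetic.
  int backoff = 1 << Math.min(retries.get(), 30);
  // Do the multiply in long so base_sleep_ms * backoff can't overflow either.
  long sleepMs = (long) base_sleep_ms * Math.max(1, random.nextInt(backoff));
  if ( sleepMs > max_sleep_ms )
    sleepMs = max_sleep_ms;
  return (int) sleepMs;
}

Plugging the same long-based math into the standalone test above keeps the
printed values capped at max_sleep_ms instead of going negative.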

On Mon, Mar 3, 2014 at 11:17 AM, Drew Goya <dr...@gradientx.com> wrote:

> Thanks for sharing your experiences guys, we will be heading back to 0mq
> as well.  It's a shame as we really got some nice throughput improvements
> with Netty.
>
>
> On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <mi...@fullcontact.com> wrote:
>
>> Right now we're having slow, off-heap memory leaks, unknown if these are
>> linked to Netty (yet). When the workers inevitably get OOMed, the topology
>> will rarely recover gracefully with similar Netty timeouts. Sounds like
>> we'll be heading back to 0mq.
>>
>> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
>> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
>> michael@fullcontact.com
>>
>>
>> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <se...@monkeysnatchbanana.com> wrote:
>>
>>> We have the same issue and after attempting a few fixes, we switched
>>> back to using 0mq for now.
>>>
>>>
>>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <dr...@gradientx.com> wrote:
>>>
>>>> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
>>>> occasionally run into a strange and pretty catastrophic error.  One of my
>>>> workers is either overloaded or stuck and gets killed and restarted.  This
>>>> usually works fine but once in a while the whole topology breaks down, all
>>>> the workers are killed and restarted continually.  Looking through the logs
>>>> it looks like some netty errors on initialization kill the Async Loop.  The
>>>> topology is never able to recover, I have to kill it manually and relaunch
>>>> it.
>>>>
>>>> Is this something anyone else has come across?  Any tips? Config
>>>> settings I could change?
>>>>
>>>> This is a pastebin of the errors:  http://pastebin.com/XXZBsEj1
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Ce n'est pas une signature
>>>
>>
>>
>

Re: Netty Errors, chain reaction, topology breaks down

Posted by Drew Goya <dr...@gradientx.com>.
Thanks for sharing your experiences, guys; we will be heading back to 0mq as
well.  It's a shame, as we really got some nice throughput improvements with
Netty.


On Sun, Mar 2, 2014 at 5:18 PM, Michael Rose <mi...@fullcontact.com> wrote:

> Right now we're having slow, off-heap memory leaks, unknown if these are
> linked to Netty (yet). When the workers inevitably get OOMed, the topology
> will rarely recover gracefully with similar Netty timeouts. Sounds like
> we'll be heading back to 0mq.
>
> Michael Rose (@Xorlev <https://twitter.com/xorlev>)
> Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
> michael@fullcontact.com
>
>
> On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <se...@monkeysnatchbanana.com> wrote:
>
>> We have the same issue and after attempting a few fixes, we switched back
>> to using 0mq for now.
>>
>>
>> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <dr...@gradientx.com> wrote:
>>
>>> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
>>> occasionally run into a strange and pretty catastrophic error.  One of my
>>> workers is either overloaded or stuck and gets killed and restarted.  This
>>> usually works fine but once in a while the whole topology breaks down, all
>>> the workers are killed and restarted continually.  Looking through the logs
>>> it looks like some netty errors on initialization kill the Async Loop.  The
>>> topology is never able to recover, I have to kill it manually and relaunch
>>> it.
>>>
>>> Is this something anyone else has come across?  Any tips? Config
>>> settings I could change?
>>>
>>> This is a pastebin of the errors:  http://pastebin.com/XXZBsEj1
>>>
>>
>>
>>
>> --
>>
>> Ce n'est pas une signature
>>
>
>

Re: Netty Errors, chain reaction, topology breaks down

Posted by Michael Rose <mi...@fullcontact.com>.
Right now we're having slow, off-heap memory leaks; it's unknown whether
these are linked to Netty (yet). When the workers inevitably get OOMed, the
topology will rarely recover gracefully, with similar Netty timeouts. Sounds
like we'll be heading back to 0mq.

Michael Rose (@Xorlev <https://twitter.com/xorlev>)
Senior Platform Engineer, FullContact <http://www.fullcontact.com/>
michael@fullcontact.com


On Sun, Mar 2, 2014 at 5:44 PM, Sean Allen <se...@monkeysnatchbanana.com> wrote:

> We have the same issue and after attempting a few fixes, we switched back
> to using 0mq for now.
>
>
> On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <dr...@gradientx.com> wrote:
>
>> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
>> occasionally run into a strange and pretty catastrophic error.  One of my
>> workers is either overloaded or stuck and gets killed and restarted.  This
>> usually works fine but once in a while the whole topology breaks down, all
>> the workers are killed and restarted continually.  Looking through the logs
>> it looks like some netty errors on initialization kill the Async Loop.  The
>> topology is never able to recover, I have to kill it manually and relaunch
>> it.
>>
>> Is this something anyone else has come across?  Any tips? Config settings
>> I could change?
>>
>> This is a pastebin of the errors:  http://pastebin.com/XXZBsEj1
>>
>
>
>
> --
>
> Ce n'est pas une signature
>

Re: Netty Errors, chain reaction, topology breaks down

Posted by Sean Allen <se...@monkeysnatchbanana.com>.
We have the same issue and after attempting a few fixes, we switched back
to using 0mq for now.


On Sun, Mar 2, 2014 at 2:46 PM, Drew Goya <dr...@gradientx.com> wrote:

> Hey All, I'm running a 0.9.0.1 storm topology in AWS EC2 and I
> occasionally run into a strange and pretty catastrophic error.  One of my
> workers is either overloaded or stuck and gets killed and restarted.  This
> usually works fine but once in a while the whole topology breaks down, all
> the workers are killed and restarted continually.  Looking through the logs
> it looks like some netty errors on initialization kill the Async Loop.  The
> topology is never able to recover, I have to kill it manually and relaunch
> it.
>
> Is this something anyone else has come across?  Any tips? Config settings
> I could change?
>
> This is a pastebin of the errors:  http://pastebin.com/XXZBsEj1
>



-- 

Ce n'est pas une signature