Posted to user@storm.apache.org by Kashyap Mhaisekar <ka...@gmail.com> on 2015/09/03 06:07:29 UTC

Netty reconnect

Hi,
Has anyone experienced repeated Netty reconnects? My workers seem to be
eternally in a reconnect state and the topology doesn't serve messages at all. It
gets connected once in a while and then goes back to reconnecting.

Any fixes for this?
"Reconnect started for Netty-Client"

Thanks
Kashyap

unsubscribe

Posted by Carey Stewart <ca...@aspirent.com>.


On Sep 3, 2015, at 7:52 AM, John Yost <so...@gmail.com> wrote:

Hi Everyone,

When I see this, it is evidence that one or more of the workers are not starting up, which results in connections either not occurring or reconnects occurring when supervisors kill workers that don't start up properly. I recommend checking the supervisor and nimbus logs to see if there are any root causes other than network issues causing the connect/reconnect.

--John

On Thu, Sep 3, 2015 at 7:32 AM, Nick R. Katsipoulakis <ni...@gmail.com> wrote:
Hello Kashyap,

I have been having the same issue for some time now on my AWS cluster. To be honest, I do not know how to resolve it.

Regards,
Nick

2015-09-03 0:07 GMT-04:00 Kashyap Mhaisekar <ka...@gmail.com>:

Hi,
Has anyone experienced repeated Netty reconnects? My workers seem to be eternally in a reconnect state and the topology doesn't serve messages at all. It gets connected once in a while and then goes back to reconnecting.

Any fixes for this?
"Reconnect started for Netty-Client"

Thanks
Kashyap

--
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

Re: Netty reconnect

Posted by Erik Weathers <ew...@groupon.com>.

Storm makes an attempt to be OK with having fewer workers than you've
configured, so I'd assume it was the upgrade to Storm 0.9.5 that helped you.


*Explanation of the behavior from Nathan Marz, Storm's original author:*

https://groups.google.com/forum/#!msg/storm-user/msqO6zNseM8/OvbjgJfoV6IJ

If Storm doesn’t have enough worker slots to run a topology, it will “pack”
all the tasks for that topology into whatever slots there are on the
cluster. Then, when there are more worker slots available (like you add
another machine), it will redistribute the tasks to the proper number of
workers. Of course, running a topology with less workers than intended will
probably lead to performance problems. So it’s best practice to have extra
capacity available to handle any failures.

- nathan marz


*Also here's a reference in the storm core code itself:*

Comment in nimbus.clj code:
;; otherwise, package rest of executors into available slots (up to how much it needs)
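
The packing behavior described above can be illustrated with a toy model. This is only a sketch and not Storm's scheduler code: the function name `pack_executors`, the slot labels, and the round-robin strategy are all assumptions made for illustration.

```python
def pack_executors(executors, requested_workers, available_slots):
    """Toy model: spread executors over min(requested, available) slots."""
    if requested_workers < 1 or not available_slots:
        raise ValueError("need at least one requested worker and one free slot")
    # Use at most as many slots as the topology asked for.
    slots = available_slots[:requested_workers]
    assignment = {slot: [] for slot in slots}
    # Round-robin is an assumption; the real scheduler may distribute differently.
    for index, executor in enumerate(executors):
        assignment[slots[index % len(slots)]].append(executor)
    return assignment


# A topology asking for 4 workers on a cluster with only 2 free slots:
# all 8 executors still get placed, just more densely than intended.
packed = pack_executors(
    list(range(8)),
    requested_workers=4,
    available_slots=["host1:6700", "host1:6701"],
)
for slot, assigned in packed.items():
    print(slot, assigned)
```

In this model the topology keeps running with every executor assigned, which matches the point that running short of slots is more likely to cause performance problems than outright failure.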



Re: Netty reconnect

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Thanks Erik.
I *think* the issue for me was that I had created more workers than what
was possible. I have 5 machines with 4 cores each and I should have had 20
workers, but I ended up defining 25 workers. One of the machines went out
of circulation for whatever reason (disk space filled up). This meant I had
16 workers available but had defined 25. I am not sure if this could be the
reason, but after correcting this and migrating to 0.9.5, I think I saw the
issue disappear. I am still validating, though.
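
The arithmetic above can be written out as a quick check. This is a minimal sketch that assumes one worker slot per core, as in the description; the helper name `slot_report` is made up for illustration.

```python
def slot_report(machines, slots_per_machine, requested_workers):
    """Return (available slots, shortfall) for a requested worker count."""
    available = machines * slots_per_machine
    shortfall = max(0, requested_workers - available)
    return available, shortfall


print(slot_report(5, 4, 25))  # all five machines up -> (20, 5)
print(slot_report(4, 4, 25))  # one machine out of circulation -> (16, 9)
```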

Thanks
Kashyap


Re: Netty reconnect

Posted by Erik Weathers <ew...@groupon.com>.

That exception is certainly a *result* of the original worker death you're
experiencing.  As noted in earlier responses on this thread, it seems like
you're experiencing a cascading set of connection exceptions which are
obscuring the original root cause / original worker death.  This is one of
the pain points with Storm: it can be hard to find the original exception /
reason for a spiraling set of worker deaths.  You should look at the
worker and supervisor logs to find out if
"myserver1.personal.com/10.2.72.176:6701" was already dead before you saw
this exception.

Notably, with storm-0.9.3 we needed to revert to ZeroMQ instead of Netty
to overcome a similar issue.  We haven't experienced the problems after
upgrading to 0.9.4 with Netty (0.9.5 has also worked for us).  When we were
experiencing problems with 0.9.3 and Netty, the original worker process
that was dying and triggering the cascading failures was "timing out", i.e.,
the supervisor wasn't receiving heartbeats from the worker within the
30-second window, and then the supervisor *killed* the worker.  We noted that
the workers were supposed to write to their heartbeat file once a second,
but the interval between writes consistently increased, going from 1 second,
to 2 seconds, to 5 seconds, ..., to eventually being longer than 30 seconds,
causing the supervisor to kill the worker.
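
The heartbeat pattern described above (gaps growing until one exceeds the 30-second window) can be checked with a small script. This is a sketch only: it assumes you have already extracted heartbeat write times as plain numbers of seconds, and the function name `first_timeout` and the sample data are hypothetical.

```python
def first_timeout(heartbeat_times, timeout_secs=30):
    """Return (previous, current) for the first gap over the timeout, else None."""
    for prev, cur in zip(heartbeat_times, heartbeat_times[1:]):
        if cur - prev > timeout_secs:
            return prev, cur
    return None


# Hypothetical heartbeat timestamps (seconds); the gaps grow 1, 2, 5, 12, 31.
beats = [0, 1, 3, 8, 20, 51]
print(first_timeout(beats))
```

A non-`None` result points at the moment the supervisor would have considered the worker dead, which is a useful place to start reading the worker log.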

So long story short:  if you're experiencing the same thing as we were,
just upgrading to 0.9.4 or 0.9.5 might solve it.

But before doing that you should find the initial worker death's cause (be
it a heartbeat timeout or an exception within the worker).
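
One way to hunt for the initial failure is to sort the connection errors by the time each remote endpoint first appears in them. The sketch below assumes log lines shaped like the two messages quoted in this thread ("failed to send requests to ..." and "Reconnect started for Netty-Client-..."), each on a single line; other Storm versions may log differently, and the names `PATTERN` and `earliest_failures` are made up for illustration.

```python
import re
from datetime import datetime

# Matches the two message shapes quoted in this thread (an assumption
# about the log format, not a documented contract).
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*?"
    r"(?:failed to send requests to |Reconnect started for Netty-Client-)"
    r"(\S+?/[\d.]+:\d+)"
)


def earliest_failures(lines):
    """Map each remote endpoint to the first time a failure was logged for it."""
    first_seen = {}
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        endpoint = match.group(2)
        if endpoint not in first_seen or when < first_seen[endpoint]:
            first_seen[endpoint] = when
    # Oldest first: the top entry is the best candidate for the original death.
    return sorted(first_seen.items(), key=lambda item: item[1])


sample = [
    "2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for "
    "Netty-Client-myserver7.personal.com/10.2.72.72:6704... [1]",
    "2015-09-11 15:19:51 b.s.m.n.Client [INFO] failed to send requests to "
    "myserver1.personal.com/10.2.72.176:6701:",
]
for endpoint, when in earliest_failures(sample):
    print(when, endpoint)
```

The endpoint at the top of the list is only a lead; the worker and supervisor logs on that host still need to be read to see whether it was a heartbeat timeout or an exception.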

- Erik

On Fri, Sep 11, 2015 at 1:26 PM, Kashyap Mhaisekar <ka...@gmail.com>
wrote:

> Ganesh, All
> Do you know if the answer to this is an upgrade to *0.9.4* or *0.9.5* or
> to version *0.10.0-beta1*? My topology runs fine for 15 mins and then
> gives up with this -
> 2015-09-11 15:19:51 b.s.m.n.Client [INFO] failed to send requests to myserver1.personal.com/10.2.72.176:6701:
> java.nio.channels.ClosedChannelException: null
>         at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:405) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]
>
> and then with   ...
> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver5.personal.com/10.2.72.176:6701... [1]
> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver7.personal.com/10.2.72.72:6704... [1]
> 2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver3.personal.com/10.2.72.77:6702... [1]
>
>
> It restarts again and the whole thing repeats.
>
> Thanks
> kashyap
>
> On Fri, Sep 4, 2015 at 11:33 AM, Ganesh Chandrasekaran <
> gchandrasekaran@wayfair.com> wrote:
>
>> Kashyap,
>>
>>
>>
>> Yes, you will need to upgrade the Storm version on the cluster as well.
>> Personally, I would run tests to see if it fixes the existing issue before
>> upgrading.
>>
>>
>>
>> Thanks,
>>
>> Ganesh
>>
>>
>>
>> *From:* Joseph Beard [mailto:joseph@josephbeard.net]
>> *Sent:* Friday, September 04, 2015 12:07 PM
>>
>> *To:* user@storm.apache.org
>> *Subject:* Re: Netty reconnect
>>
>>
>>
>> We also ran into the same issue with Storm 0.9.4.  We chose to upgrade to
>> 0.10.0-beta1 which solved the problem and has been otherwise stable for our
>> needs.
>>
>>
>>
>>
>>
>> Joe
>>
>> —
>>
>> Joseph Beard
>>
>> joseph@josephbeard.net
>>
>>
>>
>>
>>
>>
>>
>> On Sep 3, 2015, at 10:03 AM, Kashyap Mhaisekar <ka...@gmail.com>
>> wrote:
>>
>>
>>
>> Thanks for the advice. Will upgrade from 0.9.3 to 0.9.4. A lame question
>> - does it mean that the existing clusters need to be rebuilt with 0.9.4?
>>
>> Thanks
>> Kashyap
>>
>> On Sep 3, 2015 08:32, "Nick R. Katsipoulakis" <ni...@gmail.com>
>> wrote:
>>
>> Ganesh,
>>
>>
>>
>> No I am not.
>>
>>
>>
>> Cheers,
>>
>> Nick
>>
>>
>>
>> 2015-09-03 9:25 GMT-04:00 Ganesh Chandrasekaran <
>> gchandrasekaran@wayfair.com>:
>>
>> Are you using the multilang protocol? After upgrading to 0.9.4 it
>> seemed like I was being affected by this bug -
>> https://issues.apache.org/jira/browse/STORM-738 - and I rolled back to
>> the previous stable version, 0.8.2.
>>
>> I did not verify this thoroughly on my cluster though.
>>
>>
>>
>>
>>
>> *From:* Nick R. Katsipoulakis [mailto:nick.katsip@gmail.com]
>> *Sent:* Thursday, September 03, 2015 9:08 AM
>>
>>
>> *To:* user@storm.apache.org
>> *Subject:* Re: Netty reconnect
>>
>>
>>
>>
>>
>> Hello again,
>>
>>
>>
>> I read STORM-404 and saw that it is resolved in version 0.9.4. However, I
>> have version 0.9.4 installed on my cluster, and I have seen similar
>> behavior in my workers.
>>
>>
>>
>> In fact, at random times I would see that some workers were considered
>> dead (Netty was dropping messages) and they would be restarted by the
>> nimbus.
>>
>>
>>
>> Currently, I only see dropped messages but not restarted workers.
>>
>>
>>
>> FYI, my cluster has the following configuration:
>>
>>
>>
>>    - 3X AWS m4.xlarge instances for ZooKeeper and Nimbus
>>    - 4X AWS m4.xlarge instances for Supervisors (each one with 2 workers)
>>
>> Thanks,
>>
>> Nick
>>
>>
>>
>> 2015-09-03 8:38 GMT-04:00 Ganesh Chandrasekaran <
>> gchandrasekaran@wayfair.com>:
>>
>> Agreed with Jitendra. We were using version 0.9.3 and facing the same
>> issue of Netty reconnects, which was STORM-404. Upgrading to 0.9.4 fixed
>> the issue.
>>
>>
>>
>> Thanks,
>>
>> Ganesh
>>
>>
>>
>> *From:* Jitendra Yadav [mailto:jeetuyadav200890@gmail.com]
>> *Sent:* Thursday, September 03, 2015 8:20 AM
>> *To:* user@storm.apache.org
>> *Subject:* Re: Netty reconnect
>>
>>
>>
>> I don't know your Storm version, but it's worth checking these JIRA issues
>> to see if a similar scenario is occurring.
>>
>>
>>
>> https://issues.apache.org/jira/browse/STORM-404
>> https://issues.apache.org/jira/browse/STORM-450
>>
>>
>>
>> Thanks
>>
>> Jitendra
>>
>>
>>
>> On Thu, Sep 3, 2015 at 5:22 PM, John Yost <so...@gmail.com>
>> wrote:
>>
>> Hi Everyone,
>>
>> When I see this, it is evidence that one or more of the workers are not
>> starting up, which results in connections either not occurring or
>> reconnects occurring when supervisors kill workers that don't start up
>> properly. I recommend checking the supervisor and nimbus logs to see if
>> there are any root causes other than network issues causing the
>> connect/reconnect.
>>
>> --John
>>
>>
>>
>> On Thu, Sep 3, 2015 at 7:32 AM, Nick R. Katsipoulakis <
>> nick.katsip@gmail.com> wrote:
>>
>> Hello Kashyap,
>>
>> I have been having the same issue for some time now on my AWS cluster. To
>> be honest, I do not know how to resolve it.
>>
>> Regards,
>>
>> Nick
>>
>>
>>
>> 2015-09-03 0:07 GMT-04:00 Kashyap Mhaisekar <ka...@gmail.com>:
>>
>> Hi,
>> Has anyone experienced Netty reconnects repeatedly? My workers seem to be
>> eternally in reconnect state and topology doesn't serve messages at all. It
>> gets connected once in a while and then goes back to getting reconnecting.
>>
>> Any fixes for this?
>> "Reconnect started for Netty-Client"
>>
>> Thanks
>> Kashyap
>>
>>
>>
>> --
>>
>> Nikolaos Romanos Katsipoulakis,
>>
>> University of Pittsburgh, PhD candidate
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Nikolaos Romanos Katsipoulakis,
>>
>> University of Pittsburgh, PhD candidate
>>
>>
>>
>>
>>
>> --
>>
>> Nikolaos Romanos Katsipoulakis,
>>
>> University of Pittsburgh, PhD candidate
>>
>>
>>
>
>

Re: Netty reconnect

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Ganesh, All
Do you know if the answer to this is an upgrade to *0.9.4* or *0.9.5* or
to version *0.10.0-beta1*? My topology runs fine for 15 mins and then gives
up with this -
2015-09-11 15:19:51 b.s.m.n.Client [INFO] failed to send requests to myserver1.personal.com/10.2.72.176:6701:
java.nio.channels.ClosedChannelException: null
        at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.cleanUpWriteBuffer(AbstractNioWorker.java:405) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.close(AbstractNioWorker.java:373) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.channel.socket.nio.NioWorker.read(NioWorker.java:93) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at org.apache.storm.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42) [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_79]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_79]
        at java.lang.Thread.run(Thread.java:745) [na:1.7.0_79]

and then with ...
2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver5.personal.com/10.2.72.176:6701... [1]
2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver7.personal.com/10.2.72.72:6704... [1]
2015-09-11 15:20:12 b.s.m.n.Client [INFO] Reconnect started for Netty-Client-myserver3.personal.com/10.2.72.77:6702... [1]


It restarts again and the whole thing repeats.
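
For anyone triaging similar worker logs, here is a rough sketch (not part of
the original report) of tallying the "Reconnect started" lines per remote
endpoint, to see whether the reconnects cluster on one worker port. The
regular expression is inferred only from the log excerpt above, and the
function name count_reconnects is made up for illustration.

```python
import re
from collections import Counter

# Inferred from lines such as:
#   ... b.s.m.n.Client [INFO] Reconnect started for
#   Netty-Client-myserver5.personal.com/10.2.72.176:6701... [1]
# This is an assumption based on the excerpt, not on Storm's source code.
# \s+ after "for" tolerates records that were wrapped onto a second line.
RECONNECT = re.compile(
    r"Reconnect started for\s+"
    r"Netty-Client-(?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)


def count_reconnects(log_text):
    """Return a Counter mapping 'ip:port' to the number of reconnect attempts."""
    return Counter(
        f"{m['ip']}:{m['port']}" for m in RECONNECT.finditer(log_text)
    )
```

Calling count_reconnects(open(path).read()).most_common() on a worker log
would list the endpoints with the most reconnect attempts first. If a single
ip:port dominates, that worker's own log and the supervisor log on that host
are probably the first places to look.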

Thanks
kashyap

On Fri, Sep 4, 2015 at 11:33 AM, Ganesh Chandrasekaran <
gchandrasekaran@wayfair.com> wrote:

> Kashyap,
>
> Yes, you will need to upgrade the Storm version on the cluster as well.
> Personally, I would run tests to see if it fixes the existing issue before
> upgrading.
>
> Thanks,
> Ganesh

RE: Netty reconnect

Posted by Ganesh Chandrasekaran <gc...@wayfair.com>.

Kashyap,

Yes, you will need to upgrade the Storm version on the cluster as well. Personally, I would run tests to see if it fixes the existing issue before upgrading.

Thanks,
Ganesh

From: Joseph Beard [mailto:joseph@josephbeard.net]
Sent: Friday, September 04, 2015 12:07 PM
To: user@storm.apache.org
Subject: Re: Netty reconnect

We also ran into the same issue with Storm 0.9.4.  We chose to upgrade to 0.10.0-beta1 which solved the problem and has been otherwise stable for our needs.

Joe
—
Joseph Beard
joseph@josephbeard.net

Re: Netty reconnect

Posted by Joseph Beard <jo...@josephbeard.net>.

We also ran into the same issue with Storm 0.9.4.  We chose to upgrade to 0.10.0-beta1 which solved the problem and has been otherwise stable for our needs.


Joe
—
Joseph Beard
joseph@josephbeard.net




> On Sep 3, 2015, at 10:03 AM, Kashyap Mhaisekar <ka...@gmail.com> wrote:
> 
> Thanks for the advice. Will upgrade from 0.9.3 to 0.9.4. A lame question - does it mean that the existing clusters need to be rebuilt with 0.9.4?
> 
> Thanks
> Kashyap

Re: Netty reconnect

Posted by Kashyap Mhaisekar <ka...@gmail.com>.

Thanks for the advice. Will upgrade from 0.9.3 to 0.9.4. A lame question -
does it mean that the existing clusters need to be rebuilt with 0.9.4?

Thanks
Kashyap
On Sep 3, 2015 08:32, "Nick R. Katsipoulakis" <ni...@gmail.com> wrote:

> Ganesh,
>
> No I am not.
>
> Cheers,
> Nick

Re: Netty reconnect

Posted by "Nick R. Katsipoulakis" <ni...@gmail.com>.

Ganesh,

No I am not.

Cheers,
Nick

2015-09-03 9:25 GMT-04:00 Ganesh Chandrasekaran <gchandrasekaran@wayfair.com
>:

> Are you using the multilang protocol? After upgrading to 0.9.4 it seemed
> like I was being affected by this bug -
> https://issues.apache.org/jira/browse/STORM-738 - and I rolled back to the
> previous stable version, 0.8.2.
>
> I did not verify this thoroughly on my cluster though.



-- 
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

RE: Netty reconnect

Posted by Ganesh Chandrasekaran <gc...@wayfair.com>.

Are you using the multilang protocol? After upgrading to 0.9.4 it seemed like I was being affected by this bug - https://issues.apache.org/jira/browse/STORM-738 - and I rolled back to the previous stable version, 0.8.2.
I did not verify this thoroughly on my cluster though.

From: Nick R. Katsipoulakis [mailto:nick.katsip@gmail.com]
Sent: Thursday, September 03, 2015 9:08 AM
To: user@storm.apache.org
Subject: Re: Netty reconnect

Hello again,

I read STORM-404 and saw that it is resolved in version 0.9.4. However, I have version 0.9.4 installed in my cluster, and I have seen similar behavior in my workers.

In fact, at random times I would see that some workers were considered dead (Netty was dropping messages) and they would be restarted by the nimbus.

Currently, I only see dropped messages but not restarted workers.

FYI, my cluster has the following configuration:

  *   3X AWS m4.xlarge instances for ZooKeeper and Nimbus
  *   4X AWS m4.xlarge instances for Supervisors (each one with 2 workers)

Thanks,
Nick

Re: Netty reconnect

Posted by "Nick R. Katsipoulakis" <ni...@gmail.com>.

Hello again,

I read STORM-404 and saw that it is resolved in version 0.9.4. However, I
have version 0.9.4 installed in my cluster, and I have seen similar
behavior in my workers.

In fact, at random times I would see that some workers were considered dead
(Netty was dropping messages) and they would be restarted by the nimbus.

Currently, I only see dropped messages but not restarted workers.

FYI, my cluster has the following configuration:

   - 3X AWS m4.xlarge instances for ZooKeeper and Nimbus
   - 4X AWS m4.xlarge instances for Supervisors (each one with 2 workers)

Thanks,
Nick

2015-09-03 8:38 GMT-04:00 Ganesh Chandrasekaran <gchandrasekaran@wayfair.com
>:

> Agreed with Jitendra. We were using version 0.9.3 and facing the same
> Netty reconnect issue, which was STORM-404. Upgrading to 0.9.4 fixed
> the issue.
>
> Thanks,
> Ganesh



-- 
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate

RE: Netty reconnect

Posted by Ganesh Chandrasekaran <gc...@wayfair.com>.

Agreed with Jitendra. We were using version 0.9.3 and facing the same Netty reconnect issue, which was STORM-404. Upgrading to 0.9.4 fixed the issue.
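
As a rough aid for checking which storm-core build a worker is actually
running, the sketch below pulls the version out of stack-trace frames and
compares it with 0.9.4. It assumes frames carry a suffix of the form
[storm-core-<version>.jar:<version>], as in the trace posted elsewhere in
this thread; the helper names storm_core_versions and predates are
hypothetical, and pre-release tags such as -rc1 are ignored in the comparison.

```python
import re

# Frame suffix as it appears in the trace posted in this thread, e.g.
#   [storm-core-0.9.3-rc1.jar:0.9.3-rc1]
# Only the numeric major.minor.patch part is captured; "-rc1" is skipped.
JAR = re.compile(r"\[storm-core-(\d+)\.(\d+)\.(\d+)[^\]]*\.jar:")


def storm_core_versions(trace_text):
    """Return the set of (major, minor, patch) tuples seen in the frames."""
    return {tuple(int(p) for p in m.groups()) for m in JAR.finditer(trace_text)}


def predates(version, fixed_in=(0, 9, 4)):
    """True if the version tuple sorts before the release named as the fix."""
    return version < fixed_in
```

Seeing a pre-0.9.4 version here would suggest trying the upgrade first,
although others in this thread report similar symptoms on 0.9.4 as well, so
an upgrade is not guaranteed to resolve it.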

Thanks,
Ganesh

From: Jitendra Yadav [mailto:jeetuyadav200890@gmail.com]
Sent: Thursday, September 03, 2015 8:20 AM
To: user@storm.apache.org
Subject: Re: Netty reconnect

I don't know your Storm version, but it's worth checking these JIRAs to see if a similar scenario is occurring.

https://issues.apache.org/jira/browse/STORM-404
https://issues.apache.org/jira/browse/STORM-450

Thanks
Jitendra

Re: Netty reconnect

Posted by Jitendra Yadav <je...@gmail.com>.

I don't know your Storm version, but it's worth checking these JIRAs to
see if a similar scenario is occurring.

https://issues.apache.org/jira/browse/STORM-404
https://issues.apache.org/jira/browse/STORM-450

Thanks
Jitendra

On Thu, Sep 3, 2015 at 5:22 PM, John Yost <so...@gmail.com> wrote:

> Hi Everyone,
>
> When I see this, it is evidence that one or more of the workers are not
> starting up, which results in connections either not occurring or
> reconnects occurring when supervisors kill workers that don't start up
> properly. I recommend checking the supervisor and nimbus logs to see if
> there are any root causes other than network issues causing the
> connect/reconnect.
>
> --John

Re: Netty reconnect

Posted by John Yost <so...@gmail.com>.

Hi Everyone,

When I see this, it is evidence that one or more of the workers are not
starting up, which results in connections either not occurring or
reconnects occurring when supervisors kill workers that don't start up
properly. I recommend checking the supervisor and nimbus logs to see if
there are any root causes other than network issues causing the
connect/reconnect.

--John
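
A minimal sketch of the log check suggested above, assuming the worker logs are plain-text `*.log` files in a single directory and that each reconnect line contains the message quoted in the original post ("Reconnect started for Netty-Client"). The function names, the log directory, and the idea that the text after the marker identifies the remote peer are illustrative assumptions, not something confirmed in this thread.

```python
from collections import Counter
from pathlib import Path

# Message quoted in the original post. Whatever follows it on the line is
# assumed (hypothetically) to identify the remote peer.
RECONNECT_MARKER = "Reconnect started for Netty-Client"


def count_reconnects(lines):
    """Count reconnect messages, grouped by the text that follows the marker."""
    counts = Counter()
    for line in lines:
        idx = line.find(RECONNECT_MARKER)
        if idx == -1:
            continue  # not a reconnect line
        suffix = line[idx + len(RECONNECT_MARKER):].strip() or "(unknown)"
        counts[suffix] += 1
    return counts


def scan_log_dir(log_dir):
    """Aggregate reconnect counts over every *.log file in log_dir."""
    total = Counter()
    for path in sorted(Path(log_dir).glob("*.log")):
        with path.open(errors="replace") as handle:
            total.update(count_reconnects(handle))
    return total


if __name__ == "__main__":
    # Hypothetical location; point this at wherever your worker logs live.
    for peer, n in scan_log_dir("logs").most_common():
        print(n, peer)
```

If one peer dominates the counts, that worker's own log and the supervisor log on its host are probably the first places to look for a startup failure.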

On Thu, Sep 3, 2015 at 7:32 AM, Nick R. Katsipoulakis <nick.katsip@gmail.com> wrote:

> Hello Kashyap,
>
> I have been having the same issue for some time now on my AWS cluster. To
> be honest, I do not know how to resolve it.
>
> Regards,
> Nick
>
> 2015-09-03 0:07 GMT-04:00 Kashyap Mhaisekar <ka...@gmail.com>:
>
>> Hi,
>> Has anyone experienced repeated Netty reconnects? My workers seem to be
>> perpetually in a reconnect state, and the topology doesn't serve messages
>> at all. It connects once in a while and then goes back to reconnecting.
>>
>> Any fixes for this?
>> "Reconnect started for Netty-Client"
>>
>> Thanks
>> Kashyap
>>
>
>
>
> --
> Nikolaos Romanos Katsipoulakis,
> University of Pittsburgh, PhD candidate
>

Re: Netty reconnect

Posted by "Nick R. Katsipoulakis" <ni...@gmail.com>.

Hello Kashyap,

I have been having the same issue for some time now on my AWS cluster. To
be honest, I do not know how to resolve it.

Regards,
Nick

2015-09-03 0:07 GMT-04:00 Kashyap Mhaisekar <ka...@gmail.com>:

> Hi,
> Has anyone experienced repeated Netty reconnects? My workers seem to be
> perpetually in a reconnect state, and the topology doesn't serve messages
> at all. It connects once in a while and then goes back to reconnecting.
>
> Any fixes for this?
> "Reconnect started for Netty-Client"
>
> Thanks
> Kashyap
>



-- 
Nikolaos Romanos Katsipoulakis,
University of Pittsburgh, PhD candidate