Posted to user@storm.apache.org by Andrew Montalenti <an...@parsely.com> on 2014/06/18 22:32:58 UTC

v0.9.2-incubating and .ser files

I built the v0.9.2-incubating rc-3 locally and, after verifying that it
worked for our topology, pushed it into our cluster. So far, so good.

One thing for the community to be aware of. If you try to upgrade an
existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.

The issue is that the new cluster will try to re-submit the topologies that
were already running before the upgrade. These will fail because Storm's
Clojure version has been upgraded from 1.4 -> 1.5, thus the serialization
formats & IDs have changed. The same failure would occur if the serial
version ID of any class stored in these .ser files (stormconf.ser &
stormcode.ser, as defined in Storm's internal config
<https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>)
were to change.
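
To illustrate the failure mode, here's a minimal sketch that just reads one
of these files back with plain java.io -- the path below is a hypothetical
example, and the exact exception message will vary:

import java.io.FileInputStream;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;

public class SerCheck {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a topology's serialized config under
        // storm.local.dir; adjust to your own layout.
        String path = args.length > 0 ? args[0]
                : "/var/storm/nimbus/stormdist/mytopo-1-1/stormconf.ser";
        try (ObjectInputStream in =
                new ObjectInputStream(new FileInputStream(path))) {
            System.out.println("Deserialized OK: " + in.readObject().getClass());
        } catch (InvalidClassException e) {
            // This is what the upgraded cluster hits: the bytes on disk
            // reference a serialVersionUID that no longer matches the class
            // on the new classpath.
            System.err.println("Incompatible .ser file: " + e);
        }
    }
}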

The solution is to clear out the storm data directories on your worker
nodes/nimbus nodes and restart the cluster.

I have some open source tooling that submits topologies to the nimbus using
StormSubmitter. This upgrade also made me realize that, due to the use of
serialized Java files
<https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
it is very important that the StormSubmitter class used for submitting and
the running Storm cluster be precisely the same version / classpath. I
describe this more in the GH issue here:

https://github.com/Parsely/streamparse/issues/27
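
To make the constraint concrete, here's a rough sketch of a submitter
(topology wiring elided); the jar containing a class like this has to be
built against the exact storm-core version the cluster runs:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class Submit {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // ... builder.setSpout(...) / builder.setBolt(...) elided ...
        Config conf = new Config();
        conf.setNumWorkers(8);
        // The submission path round-trips Java-serialized payloads (see the
        // Utils.java link above), so this client and the cluster should be
        // precisely the same version / classpath.
        StormSubmitter.submitTopology("mytopo", conf, builder.createTopology());
    }
}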

I wonder if it's worth considering a less finicky serialization format
within Storm itself. Would that change be welcome as a pull request?

It would make it easier to script Storm clusters without consideration for
client/server Storm version mismatches, which I presume was the original
reasoning behind putting Storm functionality behind a Thrift API anyway.
And it would prevent crashed topologies during minor Storm version upgrades.
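
For reference, control-plane calls already go through that Thrift API -- a
rough sketch, assuming the usual storm.yaml is on the client's classpath:

import java.util.Map;

import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.Nimbus;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();
        // Talks to nimbus over Thrift; the wire format is versioned Thrift
        // structs rather than serialized Java objects.
        Nimbus.Client client = NimbusClient.getConfiguredClient(conf).getClient();
        ClusterSummary summary = client.getClusterInfo();
        System.out.println("Topologies running: " + summary.get_topologies_size());
    }
}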

Re: v0.9.2-incubating and .ser files

Posted by Andrew Montalenti <an...@parsely.com>.
Really good news here. I dug into this issue more -- including doing a
detailed analysis with stack traces that I put up on Github:

https://gist.github.com/amontalenti/8ff0c31a7b95a6dea3d2
(now updated with the text of the e-mail below, since the exploration was
obsoleted by the fix)

The issue had nothing to do with Storm and everything to do with Ubuntu
14.04 and its interaction with Xen network kernel drivers in EC2.

I was staring at the results of this research and thinking, "What could
possibly cause the network subsystem of Storm to just hang?"

My first impulse: firewalls. Maybe as the network ramped up, I was running
up against a firewall rule?

I checked our munin monitoring graphs and noticed a bunch of eth0 errors
correlated with our topologies running. I checked our production Storm
0.8.2 cluster -- no errors. Ah hah! It must be firewall rules or something!

That led me to run dmesg on the supervisor nodes. I found a bunch of
entries like this:

xen_netfront: xennet: skb rides the rocket: 20 slots
xen_netfront: xennet: skb rides the rocket: 19 slots

That's odd. I also saw some entries related to ufw (Ubuntu's firewall
service). So I tried running `ufw disable`. No change.

I then dug deeper into these error messages and came across this open bug
on Launchpad:

https://bugs.launchpad.net/ubuntu/+source/linux-lts-raring/+bug/1195474

Digging in there, I came across the current workaround: running

sudo ethtool -K eth0 sg off

on the server. I issued that command, restarted my topology, and VOILA, the
Storm topology is *now running at full performance*.

Back in my earliest days as a professional programmer, I had a friend named
Jimmy. I once spent 3 days debugging a JVM garbage collection issue with
him. We ran profilers, did detailed code traces, extensive logging, etc.
And in the end, the fix to the problem was a single line of code change --
a mistaken allocation of an expensive object that was happening in a tight
loop. At that moment, I coined "Jimmy's Law", which is:

"The amount of time it takes to discover a bug's fix is inversely
proportional to the lines of code changed by the fix, with infinite time
converging to one line."

After hours of investigating and debugging this issue, that's certainly how
I feel. Shame on me for upgrading my Storm cluster and Ubuntu version
simultaneously!

Now for the really good news: I'm 14 million tuples into my Storm
0.9.2-incubating cluster (running with Netty) and everything is humming
along, running fast. 92 tasks, 8 workers, 2 supervisors. My simplest Python
bolt has 1.10ms process latencies -- some of the fastest I've seen.

Thanks for the help investigating, and here's to an awesome 0.9.2 release!

p.s. I'm glad that *something* positive came out of this, at least -- my
contribution to sync up the storm-0mq driver for those who prefer it. Glad
to continue to help with that, if for no other reason than to have a
reliable second transport so Storm community members can debug *actual* Netty
issues they may come across.


On Thu, Jun 19, 2014 at 8:53 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> Okay. Keep me posted. I still plan on looking at and testing your patch to
> storm-0mq, but probably won't get to that until early next week.
>
> -Taylor
>
> On Jun 19, 2014, at 7:43 PM, Andrew Montalenti <an...@parsely.com> wrote:
>
> FYI, the issue happened with both zmq and netty transports. We will
> investigate more tomorrow. We think the issue only happens with more than
> one supervisor and multiple workers.
> On Jun 19, 2014 7:32 PM, "P. Taylor Goetz" <pt...@gmail.com> wrote:
>
>> Hi Andrew,
>>
>> Thanks for pointing this out. I agree with your point about bit rot.
>>
>> However, we had to remove the 0mq transport due to license
>> incompatibilities with Apache, so any kind of release test suite would have
>> to be maintained outside of Apache since it would likely pull in
>> LGPL-licensed dependencies. So if something like what you're suggesting could
>> be accomplished in the storm-0mq project, that would be the best option.
>>
>> I'm open to pull requests, help, contributions, etc. to storm-0mq. It
>> just can't be part of Apache.
>>
>> I'll test out your changes to storm-0mq to see if I can reproduce the
>> issue you're seeing. As Nathan mentioned, any additional information
>> (thread dumps, etc.) you could provide would help.
>>
>> Thanks (and sorry for the inconvenience),
>>
>> Taylor
>>
>>
>> On Jun 19, 2014, at 6:09 PM, Andrew Montalenti <an...@parsely.com>
>> wrote:
>>
>> Another interesting 0.9.2 issue I came across: the IConnection interface
>> has changed, meaning any pluggable transports no longer work without a code
>> change.
>>
>> I implemented changes to storm-0mq to get it to be compatible with this
>> interface change in my fork here.
>>
>> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
>>
>> I tested that and it nominally works in distributed mode with two
>> independent workers in my cluster. Don't know what the performance impact
>> is of the interface change.
>>
>> I get that zmq is no longer part of storm core, but maintaining a stable
>> interface for pluggable components like this transport is probably
>> something that should be in the release test suite. Otherwise bitrot will
>> take its toll. I am glad to volunteer help with this.
>>
>> My team is now debugging an issue where Storm stops asking our spout for
>> next tuples after a while of running the topology, causing it to basically
>> freeze with no errors in the logs. At first blush, it seems like a
>> regression from 0.9.1. But we'll have more detailed info once we isolate
>> some variables soon.
>> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:
>>
>>> I built the v0.9.2-incubating rc-3 locally and, after verifying that it
>>> worked for our topology, pushed it into our cluster. So far, so good.
>>>
>>> One thing for the community to be aware of. If you try to upgrade an
>>> existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
>>> exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>>>
>>> The issue is that the new cluster will try to re-submit the topologies
>>> that were already running before the upgrade. These will fail because
>>> Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the
>>> serialization formats & IDs have changed. The same failure would occur if
>>> the serial version ID of any class stored in these .ser files
>>> (stormconf.ser & stormcode.ser, as defined in Storm's internal config
>>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>)
>>> were to change.
>>>
>>> The solution is to clear out the storm data directories on your worker
>>> nodes/nimbus nodes and restart the cluster.
>>>
>>> I have some open source tooling that submits topologies to the nimbus
>>> using StormSubmitter. This upgrade also made me realize that, due to the
>>> use of serialized Java files
>>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
>>> it is very important that the StormSubmitter class used for submitting and
>>> the running Storm cluster be precisely the same version / classpath. I
>>> describe this more in the GH issue here:
>>>
>>> https://github.com/Parsely/streamparse/issues/27
>>>
>>> I wonder if it's worth considering a less finicky serialization format
>>> within Storm itself. Would that change be welcome as a pull request?
>>>
>>> It would make it easier to script Storm clusters without consideration
>>> for client/server Storm version mismatches, which I presume was the
>>> original reasoning behind putting Storm functionality behind a Thrift API
>>> anyway. And it would prevent crashed topologies during minor Storm version
>>> upgrades.
>>>
>>
>>

Re: v0.9.2-incubating and .ser files

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
Okay. Keep me posted. I still plan on looking at and testing your patch to storm-0mq, but probably won't get to that until early next week.

-Taylor

> On Jun 19, 2014, at 7:43 PM, Andrew Montalenti <an...@parsely.com> wrote:
> 
> FYI, the issue happened with both zmq and netty transports. We will investigate more tomorrow. We think the issue only happens with more than one supervisor and multiple workers.
> 
>> On Jun 19, 2014 7:32 PM, "P. Taylor Goetz" <pt...@gmail.com> wrote:
>> Hi Andrew,
>> 
>> Thanks for pointing this out. I agree with your point about bit rot.
>> 
>> However, we had to remove the 0mq transport due to license incompatibilities with Apache, so any kind of release test suite would have to be maintained outside of Apache since it would likely pull in LGPL-licensed dependencies. So if something like what you’re suggesting could be accomplished in the storm-0mq project, that would be the best option.
>> 
>> I’m open to pull requests, help, contributions, etc. to storm-0mq. It just can’t be part of Apache.
>> 
>> I’ll test out your changes to storm-0mq to see if I can reproduce the issue you’re seeing. As Nathan mentioned, any additional information (thread dumps, etc.) you could provide would help.
>> 
>> Thanks (and sorry for the inconvenience),
>> 
>> Taylor
>> 
>> 
>>> On Jun 19, 2014, at 6:09 PM, Andrew Montalenti <an...@parsely.com> wrote:
>>> 
>>> Another interesting 0.9.2 issue I came across: the IConnection interface has changed, meaning any pluggable transports no longer work without a code change.
>>> 
>>> I implemented changes to storm-0mq to get it to be compatible with this interface change in my fork here.
>>> 
>>> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
>>> 
>>> I tested that and it nominally works in distributed mode with two independent workers in my cluster. Don't know what the performance impact is of the interface change.
>>> 
>>> I get that zmq is no longer part of storm core, but maintaining a stable interface for pluggable components like this transport is probably something that should be in the release test suite. Otherwise bitrot will take its toll. I am glad to volunteer help with this.
>>> 
>>> My team is now debugging an issue where Storm stops asking our spout for next tuples after a while of running the topology, causing it to basically freeze with no errors in the logs. At first blush, it seems like a regression from 0.9.1. But we'll have more detailed info once we isolate some variables soon.
>>> 
>>>> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:
>>>> I built the v0.9.2-incubating rc-3 locally and, after verifying that it worked for our topology, pushed it into our cluster. So far, so good.
>>>> 
>>>> One thing for the community to be aware of. If you try to upgrade an existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>>>> 
>>>> The issue is that the new cluster will try to re-submit the topologies that were already running before the upgrade. These will fail because Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the serialization formats & IDs have changed. The same failure would occur if the serial version ID of any class stored in these .ser files (stormconf.ser & stormcode.ser, as defined in Storm's internal config) were to change.
>>>> 
>>>> The solution is to clear out the storm data directories on your worker nodes/nimbus nodes and restart the cluster.
>>>> 
>>>> I have some open source tooling that submits topologies to the nimbus using StormSubmitter. This upgrade also made me realize that, due to the use of serialized Java files, it is very important that the StormSubmitter class used for submitting and the running Storm cluster be precisely the same version / classpath. I describe this more in the GH issue here:
>>>> 
>>>> https://github.com/Parsely/streamparse/issues/27
>>>> 
>>>> I wonder if it's worth considering a less finicky serialization format within Storm itself. Would that change be welcome as a pull request?
>>>> 
>>>> It would make it easier to script Storm clusters without consideration for client/server Storm version mismatches, which I presume was the original reasoning behind putting Storm functionality behind a Thrift API anyway. And it would prevent crashed topologies during minor Storm version upgrades.
>> 

Re: v0.9.2-incubating and .ser files

Posted by Andrew Montalenti <an...@parsely.com>.
FYI, the issue happened with both zmq and netty transports. We will
investigate more tomorrow. We think the issue only happens with more than
one supervisor and multiple workers.
On Jun 19, 2014 7:32 PM, "P. Taylor Goetz" <pt...@gmail.com> wrote:

> Hi Andrew,
>
> Thanks for pointing this out. I agree with your point about bit rot.
>
> However, we had to remove the 0mq transport due to license
> incompatibilities with Apache, so any kind of release test suite would have
> to be maintained outside of Apache since it would likely pull in
> LGPL-licensed dependencies. So if something like what you're suggesting could
> be accomplished in the storm-0mq project, that would be the best option.
>
> I'm open to pull requests, help, contributions, etc. to storm-0mq. It just
> can't be part of Apache.
>
> I'll test out your changes to storm-0mq to see if I can reproduce the
> issue you're seeing. As Nathan mentioned, any additional information
> (thread dumps, etc.) you could provide would help.
>
> Thanks (and sorry for the inconvenience),
>
> Taylor
>
>
> On Jun 19, 2014, at 6:09 PM, Andrew Montalenti <an...@parsely.com> wrote:
>
> Another interesting 0.9.2 issue I came across: the IConnection interface
> has changed, meaning any pluggable transports no longer work without a code
> change.
>
> I implemented changes to storm-0mq to get it to be compatible with this
> interface change in my fork here.
>
> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
>
> I tested that and it nominally works in distributed mode with two
> independent workers in my cluster. Don't know what the performance impact
> is of the interface change.
>
> I get that zmq is no longer part of storm core, but maintaining a stable
> interface for pluggable components like this transport is probably
> something that should be in the release test suite. Otherwise bitrot will
> take its toll. I am glad to volunteer help with this.
>
> My team is now debugging an issue where Storm stops asking our spout for
> next tuples after a while of running the topology, causing it to basically
> freeze with no errors in the logs. At first blush, it seems like a
> regression from 0.9.1. But we'll have more detailed info once we isolate
> some variables soon.
> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:
>
>> I built the v0.9.2-incubating rc-3 locally and, after verifying that it
>> worked for our topology, pushed it into our cluster. So far, so good.
>>
>> One thing for the community to be aware of. If you try to upgrade an
>> existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
>> exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>>
>> The issue is that the new cluster will try to re-submit the topologies
>> that were already running before the upgrade. These will fail because
>> Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the
>> serialization formats & IDs have changed. The same failure would occur if
>> the serial version ID of any class stored in these .ser files
>> (stormconf.ser & stormcode.ser, as defined in Storm's internal config
>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>)
>> were to change.
>>
>> The solution is to clear out the storm data directories on your worker
>> nodes/nimbus nodes and restart the cluster.
>>
>> I have some open source tooling that submits topologies to the nimbus
>> using StormSubmitter. This upgrade also made me realize that, due to the
>> use of serialized Java files
>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
>> it is very important that the StormSubmitter class used for submitting and
>> the running Storm cluster be precisely the same version / classpath. I
>> describe this more in the GH issue here:
>>
>> https://github.com/Parsely/streamparse/issues/27
>>
>> I wonder if it's worth considering a less finicky serialization format
>> within Storm itself. Would that change be welcome as a pull request?
>>
>> It would make it easier to script Storm clusters without consideration
>> for client/server Storm version mismatches, which I presume was the
>> original reasoning behind putting Storm functionality behind a Thrift API
>> anyway. And it would prevent crashed topologies during minor Storm version
>> upgrades.
>>
>
>

Re: v0.9.2-incubating and .ser files

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
Hi Andrew,

Thanks for pointing this out. I agree with your point about bit rot.

However, we had to remove the 0mq transport due to license incompatibilities with Apache, so any kind of release test suite would have to be maintained outside of Apache since it would likely pull in LGPL-licensed dependencies. So if something like what you’re suggesting could be accomplished in the storm-0mq project, that would be the best option.

I’m open to pull requests, help, contributions, etc. to storm-0mq. It just can’t be part of Apache.

I’ll test out your changes to storm-0mq to see if I can reproduce the issue you’re seeing. As Nathan mentioned, any additional information (thread dumps, etc.) you could provide would help.

Thanks (and sorry for the inconvenience),

Taylor


On Jun 19, 2014, at 6:09 PM, Andrew Montalenti <an...@parsely.com> wrote:

> Another interesting 0.9.2 issue I came across: the IConnection interface has changed, meaning any pluggable transports no longer work without a code change.
> 
> I implemented changes to storm-0mq to get it to be compatible with this interface change in my fork here.
> 
> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
> 
> I tested that and it nominally works in distributed mode with two independent workers in my cluster. Don't know what the performance impact is of the interface change.
> 
> I get that zmq is no longer part of storm core, but maintaining a stable interface for pluggable components like this transport is probably something that should be in the release test suite. Otherwise bitrot will take its toll. I am glad to volunteer help with this.
> 
> My team is now debugging an issue where Storm stops asking our spout for next tuples after a while of running the topology, causing it to basically freeze with no errors in the logs. At first blush, it seems like a regression from 0.9.1. But we'll have more detailed info once we isolate some variables soon.
> 
> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:
> I built the v0.9.2-incubating rc-3 locally and, after verifying that it worked for our topology, pushed it into our cluster. So far, so good.
> 
> One thing for the community to be aware of. If you try to upgrade an existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
> 
> The issue is that the new cluster will try to re-submit the topologies that were already running before the upgrade. These will fail because Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the serialization formats & IDs have changed. The same failure would occur if the serial version ID of any class stored in these .ser files (stormconf.ser & stormcode.ser, as defined in Storm's internal config) were to change.
> 
> The solution is to clear out the storm data directories on your worker nodes/nimbus nodes and restart the cluster.
> 
> I have some open source tooling that submits topologies to the nimbus using StormSubmitter. This upgrade also made me realize that, due to the use of serialized Java files, it is very important that the StormSubmitter class used for submitting and the running Storm cluster be precisely the same version / classpath. I describe this more in the GH issue here:
> 
> https://github.com/Parsely/streamparse/issues/27
> 
> I wonder if it's worth considering a less finicky serialization format within Storm itself. Would that change be welcome as a pull request?
> 
> It would make it easier to script Storm clusters without consideration for client/server Storm version mismatches, which I presume was the original reasoning behind putting Storm functionality behind a Thrift API anyway. And it would prevent crashed topologies during minor Storm version upgrades.


Re: v0.9.2-incubating and .ser files

Posted by Nathan Marz <na...@nathanmarz.com>.
A stack dump of all workers would be useful in the case of a topology
freeze.
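
For example, `jstack <worker-pid>` against each worker JVM, or an
in-process dump along these lines if you can run code inside the worker (a
generic JMX sketch, nothing Storm-specific):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Dump every live thread with a stack trace, similar to what
        // `jstack <pid>` or kill -3 on the worker JVM produces.
        for (ThreadInfo ti : mx.dumpAllThreads(true, true)) {
            System.out.print(ti);
        }
    }
}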


On Thu, Jun 19, 2014 at 3:52 PM, Nathan Marz <na...@nathanmarz.com> wrote:

> There were a bunch of changes to the internals, so a regression is
> certainly possible. Let us know as many details as possible if you are able
> to reproduce it.
>
>
> On Thu, Jun 19, 2014 at 3:09 PM, Andrew Montalenti <an...@parsely.com>
> wrote:
>
>> Another interesting 0.9.2 issue I came across: the IConnection interface
>> has changed, meaning any pluggable transports no longer work without a code
>> change.
>>
>> I implemented changes to storm-0mq to get it to be compatible with this
>> interface change in my fork here.
>>
>> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
>>
>> I tested that and it nominally works in distributed mode with two
>> independent workers in my cluster. Don't know what the performance impact
>> is of the interface change.
>>
>> I get that zmq is no longer part of storm core, but maintaining a stable
>> interface for pluggable components like this transport is probably
>> something that should be in the release test suite. Otherwise bitrot will
>> take its toll. I am glad to volunteer help with this.
>>
>> My team is now debugging an issue where Storm stops asking our spout for
>> next tuples after a while of running the topology, causing it to basically
>> freeze with no errors in the logs. At first blush, it seems like a
>> regression from 0.9.1. But we'll have more detailed info once we isolate
>> some variables soon.
>> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:
>>
>>> I built the v0.9.2-incubating rc-3 locally and, after verifying that it
>>> worked for our topology, pushed it into our cluster. So far, so good.
>>>
>>> One thing for the community to be aware of. If you try to upgrade an
>>> existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
>>> exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>>>
>>> The issue is that the new cluster will try to re-submit the topologies
>>> that were already running before the upgrade. These will fail because
>>> Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the
>>> serialization formats & IDs have changed. The same failure would occur if
>>> the serial version ID of any class stored in these .ser files
>>> (stormconf.ser & stormcode.ser, as defined in Storm's internal config
>>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>)
>>> were to change.
>>>
>>> The solution is to clear out the storm data directories on your worker
>>> nodes/nimbus nodes and restart the cluster.
>>>
>>> I have some open source tooling that submits topologies to the nimbus
>>> using StormSubmitter. This upgrade also made me realize that, due to the
>>> use of serialized Java files
>>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
>>> it is very important that the StormSubmitter class used for submitting and
>>> the running Storm cluster be precisely the same version / classpath. I
>>> describe this more in the GH issue here:
>>>
>>> https://github.com/Parsely/streamparse/issues/27
>>>
>>> I wonder if it's worth considering a less finicky serialization format
>>> within Storm itself. Would that change be welcome as a pull request?
>>>
>>> It would make it easier to script Storm clusters without consideration
>>> for client/server Storm version mismatches, which I presume was the
>>> original reasoning behind putting Storm functionality behind a Thrift API
>>> anyway. And it would prevent crashed topologies during minor Storm version
>>> upgrades.
>>>
>>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>



-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: v0.9.2-incubating and .ser files

Posted by Nathan Marz <na...@nathanmarz.com>.
There were a bunch of changes to the internals, so a regression is
certainly possible. Let us know as many details as possible if you are able
to reproduce it.


On Thu, Jun 19, 2014 at 3:09 PM, Andrew Montalenti <an...@parsely.com>
wrote:

> Another interesting 0.9.2 issue I came across: the IConnection interface
> has changed, meaning any pluggable transports no longer work without a code
> change.
>
> I implemented changes to storm-0mq to get it to be compatible with this
> interface change in my fork here.
>
> https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master
>
> I tested that and it nominally works in distributed mode with two
> independent workers in my cluster. Don't know what the performance impact
> is of the interface change.
>
> I get that zmq is no longer part of storm core, but maintaining a stable
> interface for pluggable components like this transport is probably
> something that should be in the release test suite. Otherwise bitrot will
> take its toll. I am glad to volunteer help with this.
>
> My team is now debugging an issue where Storm stops asking our spout for
> next tuples after a while of running the topology, causing it to basically
> freeze with no errors in the logs. At first blush, it seems like a
> regression from 0.9.1. But we'll have more detailed info once we isolate
> some variables soon.
> On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:
>
>> I built the v0.9.2-incubating rc-3 locally and, after verifying that it
>> worked for our topology, pushed it into our cluster. So far, so good.
>>
>> One thing for the community to be aware of. If you try to upgrade an
>> existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
>> exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>>
>> The issue is that the new cluster will try to re-submit the topologies
>> that were already running before the upgrade. These will fail because
>> Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the
>> serialization formats & IDs have changed. The same failure would occur if
>> the serial version ID of any class stored in these .ser files
>> (stormconf.ser & stormcode.ser, as defined in Storm's internal config
>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>)
>> were to change.
>>
>> The solution is to clear out the storm data directories on your worker
>> nodes/nimbus nodes and restart the cluster.
>>
>> I have some open source tooling that submits topologies to the nimbus
>> using StormSubmitter. This upgrade also made me realize that, due to the
>> use of serialized Java files
>> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
>> it is very important that the StormSubmitter class used for submitting and
>> the running Storm cluster be precisely the same version / classpath. I
>> describe this more in the GH issue here:
>>
>> https://github.com/Parsely/streamparse/issues/27
>>
>> I wonder if it's worth considering a less finicky serialization format
>> within Storm itself. Would that change be welcome as a pull request?
>>
>> It would make it easier to script Storm clusters without consideration
>> for client/server Storm version mismatches, which I presume was the
>> original reasoning behind putting Storm functionality behind a Thrift API
>> anyway. And it would prevent crashed topologies during minor Storm version
>> upgrades.
>>
>


-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: v0.9.2-incubating and .ser files

Posted by Andrew Montalenti <an...@parsely.com>.
Another interesting 0.9.2 issue I came across: the IConnection interface
has changed, meaning any pluggable transports no longer work without a code
change.

I implemented changes to storm-0mq to get it to be compatible with this
interface change in my fork here.

https://github.com/Parsely/storm-0mq/compare/ptgoetz:master...master

I tested that and it nominally works in distributed mode with two
independent workers in my cluster. Don't know what the performance impact
is of the interface change.

I get that zmq is no longer part of storm core, but maintaining a stable
interface for pluggable components like this transport is probably
something that should be in the release test suite. Otherwise bitrot will
take its toll. I am glad to volunteer help with this.

My team is now debugging an issue where Storm stops asking our spout for
next tuples after a while of running the topology, causing it to basically
freeze with no errors in the logs. At first blush, it seems like a
regression from 0.9.1. But we'll have more detailed info once we isolate
some variables soon.
On Jun 18, 2014 4:32 PM, "Andrew Montalenti" <an...@parsely.com> wrote:

> I built the v0.9.2-incubating rc-3 locally and, after verifying that it
> worked for our topology, pushed it into our cluster. So far, so good.
>
> One thing for the community to be aware of. If you try to upgrade an
> existing v0.9.1-incubating or 0.8 cluster to v0.9.2-incubating, you may hit
> exceptions upon nimbus/supervisor startup about stormcode.ser/stormconf.ser.
>
> The issue is that the new cluster will try to re-submit the topologies
> that were already running before the upgrade. These will fail because
> Storm's Clojure version has been upgraded from 1.4 -> 1.5, thus the
> serialization formats & IDs have changed. The same failure would occur if
> the serial version ID of any class stored in these .ser files
> (stormconf.ser & stormcode.ser, as defined in Storm's internal config
> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/clj/backtype/storm/config.clj#L143-L153>)
> were to change.
>
> The solution is to clear out the storm data directories on your worker
> nodes/nimbus nodes and restart the cluster.
>
> I have some open source tooling that submits topologies to the nimbus
> using StormSubmitter. This upgrade also made me realize that, due to the
> use of serialized Java files
> <https://github.com/apache/incubator-storm/blob/master/storm-core/src/jvm/backtype/storm/utils/Utils.java#L73-L97>,
> it is very important that the StormSubmitter class used for submitting and
> the running Storm cluster be precisely the same version / classpath. I
> describe this more in the GH issue here:
>
> https://github.com/Parsely/streamparse/issues/27
>
> I wonder if it's worth considering a less finicky serialization format
> within Storm itself. Would that change be welcome as a pull request?
>
> It would make it easier to script Storm clusters without consideration for
> client/server Storm version mismatches, which I presume was the original
> reasoning behind putting Storm functionality behind a Thrift API anyway.
> And it would prevent crashed topologies during minor Storm version upgrades.
>