Posted to user@storm.apache.org by "P. Taylor Goetz" <pt...@gmail.com> on 2013/11/13 18:45:31 UTC

Re: [storm-user] [SOLVED] Heartbeat deadlocking?

This turned out to be a red herring (I work with Brian).

The root cause was that GC logging for workers had been turned on without specifying an output file. As a result, the GC logging went to standard output without being redirected to logback. Eventually the standard-output buffer filled, and the JVM would hang (and thus stop heartbeating) and be killed. How long a worker lasted depended on memory pressure and allocated heap size (obvious in hindsight: the more GC, the more GC logging, and the faster the buffer fills).
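
As a rough illustration of that failure mode (a sketch under assumptions, not taken from the Storm code base): a pipe has a fixed capacity, and once nothing drains it, a blocking writer stalls. The Python snippet below measures how many bytes fit into an undrained pipe before a write would block. The function name bytes_until_pipe_full and the chunk size are hypothetical, chosen only for this example, and it assumes a Unix-like system. Whether the worker's stdout was wired up as exactly this kind of pipe is also an assumption.

```python
import os


def bytes_until_pipe_full(chunk_size: int = 4096) -> int:
    """Return how many bytes an undrained pipe accepts before a write would block.

    Hypothetical helper for illustration only. The write end is switched to
    non-blocking mode so the "pipe is full" condition surfaces as an
    exception instead of hanging this process, which is what a blocking
    writer (such as a process logging to an unread stdout pipe) would do.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    chunk = b"x" * chunk_size
    written = 0
    try:
        while True:
            try:
                # Nobody ever reads from read_fd, so the buffer only fills up.
                written += os.write(write_fd, chunk)
            except BlockingIOError:
                # The pipe is full: a blocking writer would stall right here.
                return written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("pipe accepted", bytes_until_pipe_full(), "bytes before filling up")
```

On Linux the default pipe capacity is commonly 64 KiB, so a chatty GC log can reach it quickly under memory pressure. Pointing the GC log at a file (for example with the JVM's -Xloggc:<file> option) avoids the pipe altogether.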

The symptoms were workers timing out and being killed for no apparent reason.

There’s a relevant issue on GitHub here: https://github.com/nathanmarz/storm/issues/489

- Taylor

On Nov 11, 2013, at 4:55 PM, Brian O'Neill <bo...@alumni.brown.edu> wrote:

> 
> We are still trying to diagnose our heartbeat issue.
> With one of our topologies, workers consistently stop heartbeating after a variable amount of time.
> (On the worker, CPU is fine, memory is fine.)
>
> To help diagnose, we dropped some debug statements into timer.clj, and we see the timer loop seize up.
> 
> The last line of output we see is:
> “Doing loop for (timer_27)”
> 
> With the following code:
> (while @active
>   (try
>     (log-warn "Doing loop for (timer_27) ")
>     (let [[time-millis _ _ :as elem] (locking lock (.peek queue))]
>       (if (and elem (>= (current-time-millis) time-millis))
>         ;; imperative to not run the function inside the timer lock
>         ;; otherwise, it's possible to deadlock if function deals with other locks
>         ;; (like the submit lock)
>         (let [afn (locking lock (second (.poll queue)))]
>           (log-warn "Doing timer if stm (timer_35) " (pr-str afn))
> 
> 
> And then the output to the log halts.
> 
> Any ideas? (Are we maybe hitting the deadlock mentioned in the comments?)
> 
> -brian
> 
> 
> ---
> Brian O'Neill
> Chief Architect
> Health Market Science
> The Science of Better Results
> 2700 Horizon Drive • King of Prussia, PA • 19406
> M: 215.588.6024 • @boneill42 • healthmarketscience.com
> 


Re: [storm-user] [SOLVED] Heartbeat deadlocking?

Posted by Philip O'Toole <ph...@loggly.com>.
Ha!

We saw exactly the same behaviour a few months back -- our topologies would
just hang for no reason. It took us a weekend to track it down. We finally
noticed we had inadvertently added CONSOLE to our Storm log4j config. It
drove us nuts, but we couldn't believe that change was responsible -- until
we removed CONSOLE and all was good.

Philip

