Posted to dev@storm.apache.org by "Rick Kellogg (JIRA)" <ji...@apache.org> on 2015/10/09 02:29:27 UTC

[jira] [Updated] (STORM-144) Provide visibility into buffers between components

     [ https://issues.apache.org/jira/browse/STORM-144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rick Kellogg updated STORM-144:
-------------------------------
    Component/s: storm-core

> Provide visibility into buffers between components
> --------------------------------------------------
>
>                 Key: STORM-144
>                 URL: https://issues.apache.org/jira/browse/STORM-144
>             Project: Apache Storm
>          Issue Type: Improvement
>          Components: storm-core
>            Reporter: James Xu
>            Priority: Minor
>
> https://github.com/nathanmarz/storm/issues/222
> It would be nice to see how many tuples are in the input and output buffers of Storm components to understand where things are getting bottled up. 0mq doesn't currently provide this visibility so it's not clear how to implement this.
> ----------
> nahap: Maybe now that Storm 0.8 has internal message buffers, you could use them as an indicator. It is not perfect, but it is better than nothing.
> ----------
> dkincaid: Based on my understanding of how messages move between bolts, I think there are two places where they can get "stuck" in a queue. The first is the inbound and outbound worker message queues. Prior to 0.8.0 those were unbounded LinkedBlockingQueues. Version 0.8.0 switched to LMAX Disruptor queues, which are bounded, so at this point we should focus on the Disruptor queues.
> The second place messages can get "stuck" is in the ZeroMQ sockets that are used to send messages between machines.
> It seems to me that the first thing to do here would be to provide visibility into the size of the Disruptor queues in some manner.
> Next, we should look for a way to provide some visibility into the queuing of messages within ZeroMQ. I'm far from an expert on ZeroMQ, but from looking at the documentation for the zmq_getsockopt call it looks promising:
> ZMQ_BACKLOG: Retrieve maximum length of the queue of outstanding connections
> The ZMQ_BACKLOG option shall retrieve the maximum length of the queue of outstanding peer connections for the specified socket; this only applies to connection-oriented transports. For details refer to your operating system documentation for the listen function.
> Maybe that won't show the actual number of messages waiting in the queue, but it should still be an indication of a backup.
> Since the rest of the stats for workers, bolts, etc. are sent to ZooKeeper, does it make sense to send a snapshot count of these queues at the same time? Personally I'd like to see the average size over the time period as well as the max and min, but then we'd be putting more data into ZooKeeper, which Nathan has been trying to prune.
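
A minimal sketch of the snapshot idea in the comment above (min, max, and average queue depth per reporting period). This is a Python illustration, not Storm code: `queue.Queue` stands in for a bounded Disruptor queue, and `DepthSampler`, `DepthStats`, `sample`, and `snapshot` are hypothetical names chosen for this example.

```python
import queue
from dataclasses import dataclass


@dataclass
class DepthStats:
    """Summary of the queue-depth samples taken during one reporting window."""
    samples: int
    minimum: int
    maximum: int
    average: float


class DepthSampler:
    """Collects depth samples for one bounded queue (hypothetical helper)."""

    def __init__(self, q):
        self._q = q
        self._depths = []

    def sample(self):
        # qsize() is approximate under concurrency, which is acceptable for a metric.
        self._depths.append(self._q.qsize())

    def snapshot(self):
        """Return stats for the current window and start a new one."""
        depths, self._depths = self._depths, []
        if not depths:
            return None
        return DepthStats(len(depths), min(depths), max(depths),
                          sum(depths) / len(depths))


# Stand-in for a bounded inter-component buffer (assumption: capacity 8).
buffer = queue.Queue(maxsize=8)
sampler = DepthSampler(buffer)

sampler.sample()              # queue is empty
for i in range(3):
    buffer.put(i)
sampler.sample()              # three tuples waiting
for i in range(3):
    buffer.put(i)
sampler.sample()              # six tuples waiting

stats = sampler.snapshot()
print(stats)
```

Publishing only the per-window summary, rather than every sample, would keep the amount of extra data written alongside the existing stats small.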
> -----------
> sustrik: ZMQ_BACKLOG is the listen function's 'backlog' parameter and has nothing to do with queued messages.
> Btw, even without queueing at the ZeroMQ layer there is still queueing at the lower layers (TCP), which is hard to assess. The only reasonable solution, AFAICS, is to hard-limit the buffers (on all layers) and treat the maximum number of buffered messages as the error of the queue depth measurement.
> Say you are using raw TCP and it's possible to buffer 100 messages in TCP's tx and rx buffers: you can measure the number N of outstanding messages buffered in the application and report the interval (N, N+100) as the queue depth.
> The problem gets more complex when there are many TCP connections involved. If there are 1000 connections the (N, N+100) interval expands to (N, N+100,000).
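
The interval arithmetic described above can be written out as a small sketch. The function name `queue_depth_interval` and its parameters are hypothetical; the numbers reuse the 100-messages-per-connection example from the comment, and the measured application depth of 40 is made up for illustration.

```python
def queue_depth_interval(app_buffered, per_connection_slack, connections=1):
    """Return (low, high) bounds on the true queue depth.

    app_buffered: messages counted in application-level buffers (N).
    per_connection_slack: maximum messages that can hide in the lower-layer
        buffers (for example TCP tx and rx) of a single connection.
    connections: number of connections whose buffers cannot be inspected.
    """
    if app_buffered < 0 or per_connection_slack < 0 or connections < 1:
        raise ValueError("counts must be non-negative and connections >= 1")
    return (app_buffered, app_buffered + per_connection_slack * connections)


# One connection: (N, N+100) with N = 40.
single = queue_depth_interval(40, 100)
# 1000 connections: the upper bound grows to N + 100,000.
many = queue_depth_interval(40, 100, connections=1000)
print(single, many)
```

The width of the interval is the measurement error, so tightening the per-connection buffer limits is the only way to narrow it.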
> ----------
> mrflip: Fixed by #633 ?


