You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Robert Joseph Evans (JIRA)" <ji...@apache.org> on 2017/09/08 15:59:00 UTC

[jira] [Resolved] (STORM-2539) Locality aware grouping, which is a new grouping method considering locality

     [ https://issues.apache.org/jira/browse/STORM-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Joseph Evans resolved STORM-2539.
----------------------------------------
    Resolution: Duplicate

[~siwoon.son],

There is a pull request up for something like this on STORM-2686, so I am marking this as a duplicate.

> Locality aware grouping, which is a new grouping method considering locality
> ----------------------------------------------------------------------------
>
>                 Key: STORM-2539
>                 URL: https://issues.apache.org/jira/browse/STORM-2539
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>            Reporter: Siwoon Son
>         Attachments: receiver-imbalance.png, sender-imbalance.png
>
>
> I’d like to propose a new grouping method. This method solves problems that can occur with existing _Shuffle grouping_ method and _Local-or-shuffle_ method.
> I was motivated by the following a question.
> bq. "Would not it be more efficient to transfer tuples to nearby nodes?"
> Among the Storm's grouping methods, _Shuffle grouping_ is a method of randomly selecting a task of the next node. In this method, all nodes can receive data evenly, but the same amount of data is transferred to a relatively distant node, which can cause a high delay. To solve this problem, Storm provides Local-or-shuffle grouping considering locality.
> _Local-or-shuffle grouping_ can minimize the delay by internally processing data in the process if the task receiving the data is in the same process. Local-or-shuffle grouping, however, considers *only locality*, which may leads to two load balancing problems: +the sender imbalance+ and +the receiver imbalance+. First, the sender imbalance problem is a load balancing problem in which traffic is concentrated on a specific task because the number of senders’ tasks and the number of nodes are not equal. Next, the receiver imbalance problem is a load balancing problem in which traffic is concentrated on a specific object because the number of receivers' tasks and the number of nodes are not equal. If these problems occur, the tasks of receivers will perform different amounts of work, resulting in performance degradation and processing delays.
> |!sender-imbalance.png|width=400!|!receiver-imbalance.png|width=400!|
> |(a) Example of sender imbalance problem.|(b) Example of receiver imbalance problem.|
> So, I propose locality aware grouping which can solve the load balancing problem while considering the locality. Locality aware grouping is a method of periodically calculating the ping response time between nodes and transmitting more tuples probabilistically to nodes with low response time. I implemented the proposed Locality aware grouping at [https://github.com/dke-knu/i2am/tree/master/i2am-app/locality-aware-grouping]. LocalityAwareGrouping.java is a class that implements locality aware grouping. LocalityGroupingTestTopology.java, TupleGeneratorSpout.java, and PerformanceLoggingBolt.java are topology classes for testing this. LocalityAwareGrouping$prepare() method reads the network information of each node from the Zookeeper and activates the thread. This thread periodically calculates the ping response time of each node. LocalityAwareGrouping$chooseTasks() method selects a task by a higher probability for the nodes with lower network response times.
> But the implementation is straightforward. To calculate the ping between nodes, the network information of the nodes that tasks are performing is needed. I got this information from Zookeeper using the Zookeeper$getData() method. At this time, in order to create a Zookeeper object, I had no choice but to receive the connection information of the Zookeeper from the user.
> Please let me know, if there is a way to get the network information of each node without requiring additional parameters from the user and if you have any additional comments.
> Thank you for reading.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)