You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Siwoon Son (JIRA)" <ji...@apache.org> on 2017/01/15 12:02:26 UTC

[jira] [Created] (STORM-2291) A Hash Collision Problem of Fields Grouping in Windowing Method

Siwoon Son created STORM-2291:
---------------------------------

             Summary: A Hash Collision Problem of Fields Grouping in Windowing Method
                 Key: STORM-2291
                 URL: https://issues.apache.org/jira/browse/STORM-2291
             Project: Apache Storm
          Issue Type: New Feature
          Components: storm-core
    Affects Versions: 1.x
            Reporter: Siwoon Son
             Fix For: 1.x


I‘d like to discuss the hash collision issue that occurs when applying the _Grouping_ method [http://storm.apache.org/releases/current/Concepts.html] to _Windowing_ method [http://storm.apache.org/releases/current/Windowing.html] in Storm.

I first assume the following situation. Spout constantly emits tuples to Bolt. At this time, Bolt tries to perform operations while moving tuples of a certain interval. e.g. moving average, etc. To solve this situation, Storm provides _Windowing_ method.

However, consider the following complex situation. Two problems are added in the above situation.
# As a first problem, the tuple emitted by Spout is multidimensional with multiple pieces of information. For example, Alice, Bob, and Clark are mapped to random real numbers. That is, the tuples emitted from Spout are {[Alice, 0.18322], [Clark, 0.57833], [Bob, 0.27902], [Clark, 0.24553], [Alice, 0.50164], [Alice, 0.06463], ...}. While those tuples are transmitted to the next windowed bolt, they must be necessarily separated by keys such as Alice, Bob, and Clark. In other words, each tuples in which Alice, Bob, and Clark are mapped must belong to different windows.
# The second problem is Storm's parallelism. Spout and Bolt can be operated as multiple objects on multiple servers. This problem is that the tuples with the same key must be emit in the same window even if they are created by a different Spout object.

Storm can specify the Bolt objects which the tuples is to be input as a _Grouping_ method. Storm provides various _Grouping_ methods, but a fields grouping is best suited as a way to solve the above problems. The fields grouping is a way of partitioning an input stream by a specified field. With the fields grouping, tuples of the same field can only be passed to the same Bolt object. However, the fields grouping has been implemented as a hash method. ([http://storm.apache.org/releases/current/Tutorial.html]) Therefore, it can be cause *a hash collision* problem that can include the tuples in the same window although they have different fields. So I am interested in solving the hash collision.

The source code of [https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/as_is] is a situation where the hash collision occurs. First, Spout randomly emits the tuples mapping Alice, Bob, and Clark on random real numbers. Next, Bolt which extends BaseWindowedBolt prints the TupleWindow objects received from Spout. At this time, Bolt uses the fields grouping. Spout emits three fields: Alice, Bob, and Clark. If the parallelism of Bolt is set less than 3, it surely cause the hash collision. Conversely, the greater the parallelism of Bolt than 3, the lower the probability of the hash collision. But, it can not be guaranteed that the hash collision does not occur.

As an alternative to this hash collision, I used two-step IRichBolt instead of BaseWindowedBolt. The source code for this is [https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/to_be]. The first Bolt is important. This Bolt takes tuples from Spout and manages them through a hash map of list according to the field. If the list is as filled as a predefined window size, the oldest tuple is removed and a new tuple is added. And then, Bolt emits this list to the next Bolt.

Using this method, the above problems can be solved. That is, the tuples of the same fields is always managed in the same window regardless of the number of fields and the number of parallelism. 

I'm concerned that this problem will frequently happen to many Storm users who use the _Windowing_ method. If there is not a better way than the one I presented, I think there should be a new grouping method for _Windowing_ method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)