You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@storm.apache.org by "Jungtaek Lim (JIRA)" <ji...@apache.org> on 2017/01/16 09:02:26 UTC

[jira] [Commented] (STORM-2291) A Hash Collision Problem of Fields Grouping in Windowing Method

    [ https://issues.apache.org/jira/browse/STORM-2291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15823648#comment-15823648 ] 

Jungtaek Lim commented on STORM-2291:
-------------------------------------

Hi [~Siwoon]

1. As you may know, field grouping is not intended to guarantee different values on given field are routing to the different tasks. So field grouping is closely matched to your situation but not perfect.

2. What you suggest adds intermediate bolt which also adds overall latency. IMHO, it would be natural for window bolt itself to recognize the field and aggregate based on the value of the field, which means that users should take care of. Btw, it can be provided to the users via higher level API. (Stream API - STORM-1961)

For example, please see below:
https://github.com/arunmahadevan/storm/blob/97e7ef1103c8a629c95c43bb87bdbd619477b85b/examples/storm-starter/src/jvm/org/apache/storm/starter/streams/GroupByKeyAndWindowExample.java

3. If you know all the values of given field, you can make custom grouping to ensure all different values will go to the different tasks.

I recommend you to post your idea to dev@ mailing list to initiate discussion. Please refer http://storm.apache.org/getting-help.html to subscribe mailing lists (user / dev).

Thanks!

> A Hash Collision Problem of Fields Grouping in Windowing Method
> ---------------------------------------------------------------
>
>                 Key: STORM-2291
>                 URL: https://issues.apache.org/jira/browse/STORM-2291
>             Project: Apache Storm
>          Issue Type: New Feature
>          Components: storm-core
>    Affects Versions: 1.x
>            Reporter: Siwoon Son
>             Fix For: 1.x
>
>
> I‘d like to discuss the hash collision issue that occurs when applying the _Grouping_ method [http://storm.apache.org/releases/current/Concepts.html] to _Windowing_ method [http://storm.apache.org/releases/current/Windowing.html] in Storm.
> I first assume the following situation. Spout constantly emits tuples to Bolt. At this time, Bolt tries to perform operations while moving tuples of a certain interval. e.g. moving average, etc. To solve this situation, Storm provides _Windowing_ method.
> However, consider the following complex situation. Two problems are added in the above situation.
> # As a first problem, the tuple emitted by Spout is multidimensional with multiple pieces of information. For example, Alice, Bob, and Clark are mapped to random real numbers. That is, the tuples emitted from Spout are {[Alice, 0.18322], [Clark, 0.57833], [Bob, 0.27902], [Clark, 0.24553], [Alice, 0.50164], [Alice, 0.06463], ...}. While those tuples are transmitted to the next windowed bolt, they must be necessarily separated by keys such as Alice, Bob, and Clark. In other words, each tuples in which Alice, Bob, and Clark are mapped must belong to different windows.
> # The second problem is Storm's parallelism. Spout and Bolt can be operated as multiple objects on multiple servers. This problem is that the tuples with the same key must be emit in the same window even if they are created by a different Spout object.
> Storm can specify the Bolt objects which the tuples is to be input as a _Grouping_ method. Storm provides various _Grouping_ methods, but a fields grouping is best suited as a way to solve the above problems. The fields grouping is a way of partitioning an input stream by a specified field. With the fields grouping, tuples of the same field can only be passed to the same Bolt object. However, the fields grouping has been implemented as a hash method. ([http://storm.apache.org/releases/current/Tutorial.html]) Therefore, it can be cause *a hash collision problem* that can include the tuples in the same window although they have different fields. So I am interested in solving the hash collision.
> The source code of [https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/as_is] is a situation where the hash collision occurs. First, Spout randomly emits the tuples mapping Alice, Bob, and Clark on random real numbers. Next, Bolt which extends BaseWindowedBolt prints the TupleWindow objects received from Spout. At this time, Bolt uses the fields grouping. Spout emits three fields: Alice, Bob, and Clark. If the parallelism of Bolt is set less than 3, it surely cause the hash collision. Conversely, the greater the parallelism of Bolt than 3, the lower the probability of the hash collision. But, it can not be guaranteed that the hash collision does not occur.
> As an alternative to this hash collision, I used two-step IRichBolt instead of BaseWindowedBolt. The source code for this is [https://github.com/dke-knu/i2am/tree/master/i2am-app/fields-window-grouping/src/main/java/org/fields/window/grouping/to_be]. The first Bolt is important. This Bolt takes tuples from Spout and manages them through a hash map of list according to the field. If the list is as filled as a predefined window size, the oldest tuple is removed and a new tuple is added. And then, Bolt emits this list to the next Bolt.
> Using this method, the above problems can be solved. That is, the tuples of the same fields is always managed in the same window regardless of the number of fields and the number of parallelism. 
> I'm concerned that this problem will frequently happen to many Storm users who use the Windowing method. If there is not a better way than the one I presented, I think there should be a new grouping method for Windowing method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)