Posted to user@storm.apache.org by Abhishek Bhattacharjee <ab...@gmail.com> on 2014/01/15 10:06:11 UTC
Re:
That is a good question. I think it is related to stream groupings. You should
study the different types of groupings, and fields grouping in particular since
that is what your topology uses, to understand the problem fully.
Let me know if this helps. You can refer to this book:
http://shop.oreilly.com/product/0636920024835.do
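
To make the grouping point concrete, here is a minimal sketch of the idea behind
fields grouping: the target task is derived from a hash of the grouping field,
so the same word always lands on the same task as long as the number of tasks
does not change. This is a simplified model and not Storm's actual code; the
class name FieldsGroupingSketch and the helper taskFor are hypothetical, and the
exact hash function Storm uses internally is an assumption here.

```java
import java.util.Arrays;
import java.util.List;

public class FieldsGroupingSketch {

    // Hypothetical helper (not a Storm API): pick a task index for a word
    // by hashing the grouping field and reducing it modulo the task count.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int taskFor(String word, int numTasks) {
        return Math.floorMod(word.hashCode(), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 6; // matches the parallelism hint in setBolt("count", ..., 6)
        List<String> words = Arrays.asList(
                "one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten");

        // The mapping depends only on the word and the task count,
        // not on how many workers the tasks are spread across.
        for (String word : words) {
            System.out.println(word + " -> task " + taskFor(word, numTasks));
        }
    }
}
```

As far as I understand it, rebalancing changes how many workers and executors
run the tasks but not the number of tasks, so under this model the word-to-task
mapping stays the same. Whether the counts survive is a separate matter: the
HashMap lives in the memory of the bolt instance, so it would need to be kept in
external storage if it must outlive a task being moved.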
On Tue, Jan 14, 2014 at 5:41 AM, Spico Florin <sp...@gmail.com> wrote:
> Hello!
> I'm a newbie in Storm and I have some questions regarding scaling
> the number of workers/executors in a cluster and how data is correctly
> handled between them.
> In the case of the WordCountTopology, the WordCount bolt is used to
> count the words in a text. From my observations and understanding, each
> task uses an internal Map that keeps the counts for the words that have
> arrived at that task.
>
> public static class WordCount extends BaseBasicBolt {
>     Map<String, Integer> counts = new HashMap<String, Integer>();
>
>     public void execute(Tuple tuple, BasicOutputCollector collector) {
>         String word = tuple.getString(0);
>         Integer count = counts.get(word);
>         if (count == null)
>             count = 0;
>         count++;
>         counts.put(word, count);
>     }
> }
>
>
> From the configuration of the topology:
>
> builder.setBolt("count", new WordCount(), 6).fieldsGrouping("split", new
> Fields("word"));
>
> I understand that each task will receive either words it has already
> seen or new ones, thus keeping the counts consistent (no task will
> receive words that were processed by a different task).
>
> Now suppose that:
> - you have a huge text with a small number of different words (let's say
> that you have 10MB of text containing only the words one, two, three,
> four, five, six, seven, eight, nine, ten).
> - you start the topology with 2 workers
> - at some moment in time (after all the words have been distributed across
> the tasks and already have counts), we add one more worker.
> Here are my questions:
> 1. When we rebalance our topology, will the newly added worker get data?
> 2. If the new worker gets data, how are the word counts kept consistent,
> since it will receive some already processed words? Are the count values
> for the already processed words passed to the newly created worker?
> 3. Given that I don't have a remote cluster installed to test this, are
> my assumptions correct?
>
> I look forward to your answers and opinions.
>
> Thanks.
>
> Regards,
> Florin
>
> Example.
> Given:
> Time T1:
> worker 1:
> task1: counts{(one,10),(two,21)}
> task2: counts{(three,5),(four,2)}
> task3: counts{(five,8)}
>
> worker 2:
> task1: counts{(six,10),(seven,21)}
> task2: counts{(eight,5),(nine,2)}
> task3: counts{(ten,8)}
>
> Time T10: (after rebalancing, 3 workers in place)
>
> worker 1:
> task1: counts{(one,10),(two,21)}
> task2: counts{(three,5),(four,2)}
> worker 2:
> task1: counts{(six,10),(seven,21)}
> task2: counts{(eight,5),(nine,2)}
>
> worker 3:
> task1: counts{(five,15 )}
> task2: counts{(ten,25)}
>
> For worker 3, task 1: the value 15 should come from the previous value 8
> (computed by w1/t3) plus 7 newly counted.
> For worker 3, task 2: the value 25 should come from the previous value 8
> (computed by w2/t3) plus 17 newly counted.
>
--
*Abhishek Bhattacharjee*
*Pune Institute of Computer Technology*