
Posted to user@storm.apache.org by "Thomas Cooper (PGR)" <t....@newcastle.ac.uk> on 2017/04/03 15:27:08 UTC

Why Tasks?

Hi,


I was hoping that someone on here would be able to help me with a conceptual issue.


I understand how Storm implements parallelism. I am researching how to model the performance of Storm topologies so I have dug around in the source code quite a bit. However, I still can't quite wrap my head around tasks.


I know they are linked to fields groupings, so that tuples with the same field value always go to the same executor. If task state were preserved through a rebalance, this would make sense: the state would follow the task and tuples would continue to be routed correctly. But, as I understand it, task state is not preserved through a rebalance by default. In this stateless case having tasks doesn't seem to make sense. Couldn't you arbitrarily number the executors of each component and use those numbers for routing tuples? That would remove the upper scaling limit for each component of the topology.

Of course, if you have a state-saving system (statefulBolt etc.), tasks make sense, and having tasks also simplifies the hash functions that do the routing. So is this the reason they exist, and are they not strictly required in the stateless case (other than to make routing simpler)?

I am concerned that I am missing something fundamental.


Thanks in advance,


Thomas Cooper
PhD Student
Newcastle University, School of Computer Science
Twitter: @tomncooper

Re: Why Tasks?

Posted by Arun Mahadevan <ar...@apache.org>.

Fixing the number of tasks ensures that state is preserved during a rebalance and that tuples get routed to the same task id with fields grouping. Users could be storing some state in a bolt (such as an in-memory counter) without necessarily using a stateful bolt. If the number of tasks is changed during a rebalance, that state no longer lines up with the routing.
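To illustrate why the task count has to stay fixed, here is a minimal Python sketch. It is a simplified model and not Storm's actual routing code: it assumes a fields grouping picks the target task as hash(key) modulo the number of tasks, and the names choose_task and moved_keys are made up for this example. zlib.crc32 is used only because it gives a stable hash across runs.

```python
import zlib


def choose_task(key: str, num_tasks: int) -> int:
    """Pick a task index for a grouping key (assumed model: hash modulo task count)."""
    if num_tasks < 1:
        raise ValueError("num_tasks must be at least 1")
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def moved_keys(keys, old_num_tasks: int, new_num_tasks: int):
    """Return the keys whose target task changes when the task count changes."""
    return [
        k for k in keys
        if choose_task(k, old_num_tasks) != choose_task(k, new_num_tasks)
    ]


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(100)]
    # Same task count: every key keeps its task, so in-memory counters stay valid.
    print(len(moved_keys(keys, 10, 10)))
    # Different task count: some keys now land on a task that never saw them.
    print(len(moved_keys(keys, 10, 15)))
```

Under this model, any in-memory counter kept per key would be split across two tasks for every key that moves, which is why changing the task count needs state migration.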

 

If we want to increase the number of tasks during a rebalance, we should handle the state migration as well.

 

Right now, if you want bolts to execute with increased parallelism after a rebalance, you need to over-provision the number of tasks.

 

E.g. you start with parallelism = 2 and tasks = 10. There will be 2 threads executing 5 tasks each. Later, maybe you add more workers and rebalance with parallelism = 5; then there will be 5 threads executing 2 tasks each, so your code runs on 5 threads instead of 2.
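The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name assign_tasks is hypothetical, tasks are split into contiguous, near-equal chunks, and the executor count is capped at the task count (an executor with no task would have nothing to run). Storm's real scheduler may place tasks differently.

```python
def assign_tasks(num_tasks: int, num_executors: int) -> list[list[int]]:
    """Split task ids 0..num_tasks-1 across executors as evenly as possible."""
    if num_tasks < 1 or num_executors < 1:
        raise ValueError("num_tasks and num_executors must be at least 1")
    # Assumption: more executors than tasks is pointless, so cap at num_tasks.
    num_executors = min(num_executors, num_tasks)
    base, extra = divmod(num_tasks, num_executors)
    assignment = []
    start = 0
    for i in range(num_executors):
        # The first `extra` executors take one additional task each.
        size = base + (1 if i < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment


if __name__ == "__main__":
    # parallelism = 2, tasks = 10: two executors with five tasks each
    print(assign_tasks(10, 2))
    # after a rebalance to parallelism = 5: five executors with two tasks each
    print(assign_tasks(10, 5))
```

The task ids are the same before and after the rebalance; only their grouping into threads changes, so routing by task id is unaffected.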

 

Thanks,

Arun

 
