Posted to dev@storm.apache.org by Eric Frazer <ef...@microsoft.com> on 2015/10/31 16:27:20 UTC

How is state stored in spouts? How can I tell when I'm done?

I've seen a ton of examples for Storm so far (I'm a noob), but what I don't understand is how spouts do parallelism. Suppose I want to process a giant file in Storm, and each spout has to read and process 64 MB of the input file. I can't envision a topology like this yet (because I'm ignorant).

Q1: How does each spout know which part of the giant input file to read?
Q2: How does each spout get told which file to read?
Q3: How do I know when the input file is completely processed? In the final bolts' emit logic, can they all report to one final bolt which piece of the source they've processed, so that bolt checks off all the done messages and, when everything is done, does... what? How can it signal the topology owner that it's done?

Is there an online forum that is easier to use than this mailing list, where I can ask and browse questions? This mailing list is so early-1990s, it's shocking...

All the online examples I've read about Storm have spouts that produce essentially random information forever. To me, those examples are near-useless. Processing a giant file, or processing data from a live generator of real data, would be much better. I hope I find some decent ones this weekend.

Thanks!





Re: How is state stored in spouts? How can I tell when I'm done?

Posted by Bobby Evans <ev...@yahoo-inc.com.INVALID>.
Eric,
Storm is not a batch processing system. It is meant for continuous streams of data that are never done. You could use it for batch processing, like Flink does, but it is not really designed for that.
Q1: Each spout task knows how many tasks there are for its component and which one in that list it is. You get that information from the TopologyContext passed into the spout's open method, so the spout can split the input accordingly.

Q2: That is up to you. You could pass the file list into the spout when you create it; just be careful not to open a handle to the file until the open call, because the spout will be serialized and deserialized in another process. File handles don't like serialization.
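
To make Q1 and Q2 concrete, here is a rough, untested sketch of a spout that claims one slice of a big file per task. The class and field names (FileSliceSpout, sliceEnd, and so on) are made up for illustration, and the package names assume the 0.10-era backtype.storm namespace, which may differ in your version; the TopologyContext calls (getThisTaskIndex, getComponentTasks, getThisComponentId) are the real Storm API that gives each task its index and the total task count.

    import backtype.storm.spout.SpoutOutputCollector;
    import backtype.storm.task.TopologyContext;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseRichSpout;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;

    import java.io.RandomAccessFile;
    import java.util.Map;

    public class FileSliceSpout extends BaseRichSpout {
        private final String path;                 // plain String: safe to serialize
        private transient RandomAccessFile file;   // opened lazily in open(), never in the constructor
        private transient long sliceEnd;
        private SpoutOutputCollector collector;

        public FileSliceSpout(String path) {
            this.path = path;                      // do NOT open the file here
        }

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
            try {
                int myIndex = context.getThisTaskIndex();
                int numTasks = context.getComponentTasks(context.getThisComponentId()).size();

                file = new RandomAccessFile(path, "r");
                long sliceSize = file.length() / numTasks;
                long sliceStart = myIndex * sliceSize;
                sliceEnd = (myIndex == numTasks - 1) ? file.length() : sliceStart + sliceSize;
                file.seek(sliceStart);             // each task reads only its own slice
                // (readLine() from an arbitrary offset can start mid-line; a real
                // implementation would first skip ahead to the next line break)
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void nextTuple() {
            try {
                if (file.getFilePointer() >= sliceEnd) {
                    return;                        // this task's slice is exhausted
                }
                String line = file.readLine();
                if (line != null) {
                    // use the byte offset as the message id so ack()/fail() can identify the tuple
                    collector.emit(new Values(line), file.getFilePointer());
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("line"));
        }
    }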
Q3: You can use the acking process to know. When acking is enabled, each tuple is tracked as it is processed by your topology. When it is fully processed, the ack method in your spout is called.

- Bobby
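
For Q3, continuing the hypothetical FileSliceSpout sketch above, the spout could track which message ids are still outstanding and treat its slice as finished once it has emitted everything and every ack has come back. This is only a sketch under the assumption that acking is enabled (tuples must be emitted with a message id, as above); the pending set and sliceExhausted flag are made-up names, and it needs java.util.Set, java.util.Collections, and java.util.concurrent.ConcurrentHashMap imports.

    // Additions to the FileSliceSpout sketch above.
    private final Set<Object> pending =
            Collections.newSetFromMap(new ConcurrentHashMap<Object, Boolean>());
    private volatile boolean sliceExhausted = false;

    // In nextTuple(): add the message id to 'pending' before emitting,
    // and set sliceExhausted = true once the file pointer passes sliceEnd.

    @Override
    public void ack(Object msgId) {
        pending.remove(msgId);
        if (sliceExhausted && pending.isEmpty()) {
            // Every tuple this task emitted has been fully processed downstream.
            // Signal it however fits your setup: log it, write a marker file,
            // or record it in ZooKeeper so something outside the topology can
            // see when all spout tasks have finished their slices.
        }
    }

    @Override
    public void fail(Object msgId) {
        // A tuple timed out or was failed by a bolt; re-emit it so it is not lost
        // (a real implementation would keep the original line keyed by msgId).
    }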

