Posted to user@storm.apache.org by Eric Frazer <er...@dogstar.us> on 2015/11/01 18:12:52 UTC

storm noob question - understanding ingestion

I've seen a ton of examples for Storm so far (I'm a noob), but what I don't
understand is how the spouts do parallelism. Suppose I want to process a
giant file in Storm, and each spout has to read and process 64MB of the
input file. How can I get each spout to chew on only a portion of the input
file? Alternatively, suppose I have a Twitter fire hose. When the fire hose
produces a tweet, does each of the 10 spouts get the same tweet? The way I
envision spouts in my head now is that simply X threads are created and
each one runs the same code. Is there something special Storm does so that
each spout processes something different from its code?

 

Q1: How does each spout know which part of the giant input file to read? 

Q2: How would I initialize a spout to read a certain file? Input params from
the Storm command line?

Q3: How do I know when the input file is completely processed? In the final
bolts' emit logic, can they all report to one final bolt which piece of the
source they've processed, so that the final bolt checks off all the done
messages and, when everything is done, does what? How can it signal the
topology owner that it's done?

Q4: Is there an online forum that is easier to use than this email list
server thing, where I can ask and browse questions? This email list server
is so early-1990s, it's shocking.

 

All the online examples I've read about Storm have spouts that produce
essentially random information forever. To me, those examples are
near-useless. Processing a giant file, or processing data from a live
generator of real data, would be much better. I hope I find some decent
examples this weekend.

 

Thanks!

 


Re: storm noob question - understanding ingestion

Posted by Nathan Leung <nc...@gmail.com>.
A spout can get its own task ID, and also the number of tasks running for
the spout (see
https://nathanmarz.github.io/storm/doc/backtype/storm/task/TopologyContext.html
for details).  You can use this data to assign sections of the file, if
that's what you really want.
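As a rough sketch of what that looks like: in a real spout, `open()` receives a `TopologyContext`, from which the task index and task count can be read (via `getThisTaskIndex()` and `getComponentTasks(...)`). The partitioning itself is plain arithmetic, so it is shown below as a standalone helper; the class and method names are made up for illustration, not a Storm API.

```java
// Sketch: map (taskIndex, taskCount) to a byte range of a shared input file.
// In a real spout, taskIndex would come from context.getThisTaskIndex(), and
// taskCount from context.getComponentTasks(context.getThisComponentId()).size().
public class FilePartitioner {

    // Returns {startOffset, endOffset} (end exclusive) for this task.
    public static long[] byteRange(long fileSize, int taskIndex, int taskCount) {
        long chunk = fileSize / taskCount;
        long start = (long) taskIndex * chunk;
        // The last task absorbs the remainder so the whole file is covered.
        long end = (taskIndex == taskCount - 1) ? fileSize : start + chunk;
        return new long[] { start, end };
    }

    public static void main(String[] args) {
        // Example: a 200 MB file split across 3 spout tasks.
        long size = 200L * 1024 * 1024;
        for (int i = 0; i < 3; i++) {
            long[] r = byteRange(size, i, 3);
            System.out.println("task " + i + ": bytes " + r[0] + " to " + r[1]);
        }
    }
}
```

One caveat with byte-offset splitting: records will straddle the chunk boundaries, so each task typically skips forward to the first newline after its start offset and reads past its end offset to finish the last record it started.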

Other options include sending different files to different spout tasks
based on the spout index, or using Kafka as an intermediary (write file
names to Kafka and have each spout process a file, or have a separate
process that reads the file and writes its data to Kafka).

For end-of-file signalling, unless there's a file reader spout
implementation, you have to handle that yourself.  As for the list server...
I guess that's what Apache provides, though you can browse questions in the
mailing list archives: https://mail-archives.apache.org/mod_mbox/storm-user/.
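One common way to handle that yourself (sketched here with hypothetical names, not a Storm API) is to have each upstream task emit a final "done" marker tuple, and have a single terminal bolt check off markers until every upstream task has reported. The bookkeeping is just set membership, shown standalone below; in a topology it would live inside the bolt's `execute()`.

```java
// Sketch of the "done marker" bookkeeping a terminal bolt could keep.
import java.util.HashSet;
import java.util.Set;

public class DoneTracker {
    private final int expectedTasks;                      // number of upstream tasks
    private final Set<Integer> doneTasks = new HashSet<>(); // task IDs seen so far

    public DoneTracker(int expectedTasks) {
        this.expectedTasks = expectedTasks;
    }

    // Call when a "done" marker arrives from an upstream task.
    // Returns true once every expected task has reported; duplicate
    // markers from the same task are counted only once.
    public boolean markDone(int taskId) {
        doneTasks.add(taskId);
        return doneTasks.size() == expectedTasks;
    }
}
```

When `markDone` finally returns true, the bolt can record completion somewhere the topology owner can poll (Zookeeper, a database row, a flag file), since Storm itself won't stop a topology for you.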

In practice I think most users of storm have some queuing solution to feed
their topologies.

