Posted to user@storm.apache.org by Eugene Dzhurinsky <jd...@gmail.com> on 2014/03/19 16:21:54 UTC

Is Storm the right tool for processing thousands of small tasks?

Hello!

I'm evaluating Storm for a project that involves processing many
distinct small tasks in the following way:

- a user supplies some data source

- a spout is attached to the source and emits chunks of data into the topology

- bolts process the chunks of data and transform them somehow (in general
reducing the number of chunks, so the number of records reaching the sink is
much smaller than the number of records emitted by the spout)

- when all records are processed, the results are accumulated and sent back
to the user.

As far as I understand, a topology is supposed to keep running forever, so
I don't really see an easy way to distinguish the records of one task from
the records of another. Should a new topology be started for each new user
task?

Thank you in advance! Links to any relevant articles are very welcome :)

-- 
Eugene N Dzhurinsky

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Angelo Genovese <an...@genovese.ca>.
On Wed, Mar 19, 2014 at 1:28 PM, Eugene Dzhurinsky <jd...@gmail.com> wrote:
>
> On Wed, Mar 19, 2014 at 01:03:43PM -0700, Angelo Genovese wrote:
> >    "Any problem in computer science can be solved with another layer of
> >    indirection." (David Wheeler)
>
> ... and every new layer of indirection introduces new bugs, which combine
> with the existing ones and produce up to N*M new problems, where N is the
> number of bugs in the lower layer and M the number in the upper one. (Unknown Developer)

The quote was meant to be a bit tongue in cheek, sorry.

> >    You could set up the spout to read its list of sources from somewhere, add
> >    to that list when a request comes in, then remove completed sources from
> >    that list in your sink.  Or you could use a separate app to read from the
> >    sources on request and push that data into a queue like Kestrel for a
> >    Storm spout to read from.
>
> Yes, that will probably solve the task of adding new data sources, but it's
> still not clear how to mark a task as "complete" so the sink can send the
> results back to the user. For example, if the spout produced 5 messages and
> the 6th is "EOT", that EOT message could arrive at the sink while message 3
> is still in progress.
>

From the sounds of it you need some sort of coordination, which from
what I've seen so far is only really possible with a single starting
tuple.

If that's the case (I'm still pretty new to Storm myself; I've only
been working with it for a few months now), you could have a spout emit
the new external source and build a bolt that connects to it and emits the
relevant chunks. You would need to coordinate that and return your
response to the user in the finished callback of your sink.

This is a bit similar to something we're working on, where we have a
URL coming in on a Kestrel queue, bolts which fetch the content of
the page, break it up and process the pieces, and then a bolt which
accumulates the results of that processing and writes the aggregate
information to a data store in its finished callback.
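
A rough sketch of what such an accumulating bolt might look like with Storm's
batch/coordination API (BaseBatchBolt); the ResultStore helper and the "key"
field name are illustrative assumptions, not actual project code:

import java.util.HashMap;
import java.util.Map;

import backtype.storm.coordination.BatchOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBatchBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Aggregates processed chunks for one batch (one user request) and persists
// the result when Storm signals that the batch is complete.
public class AggregatingSinkBolt extends BaseBatchBolt {
    private BatchOutputCollector collector;
    private Object batchId;                       // identifies the request/batch
    private final Map<String, Long> counts = new HashMap<String, Long>();

    @Override
    public void prepare(Map conf, TopologyContext context,
                        BatchOutputCollector collector, Object id) {
        this.collector = collector;
        this.batchId = id;
    }

    @Override
    public void execute(Tuple tuple) {
        // "key" is an illustrative field name produced by the upstream bolts.
        String key = tuple.getStringByField("key");
        Long current = counts.get(key);
        counts.put(key, current == null ? 1L : current + 1L);
    }

    @Override
    public void finishBatch() {
        // Called once all tuples of this batch have been processed: this is
        // the "finished callback" where the aggregate is written out.
        ResultStore.save(batchId, counts);        // hypothetical data-store helper
        collector.emit(new Values(batchId, counts.size()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("id", "distinctKeys"));
    }
}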

Ciao,

-- 
Angelo Genovese

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Eugene Dzhurinsky <jd...@gmail.com>.
On Wed, Mar 19, 2014 at 01:03:43PM -0700, Angelo Genovese wrote:
>    "Any problem in computer science can be solved with another layer of
>    indirection." (David Wheeler)

... and every new layer of indirection introduces new bugs, which combine
with the existing ones and produce up to N*M new problems, where N is the
number of bugs in the lower layer and M the number in the upper one. (Unknown Developer)

>    You could set up the spout to read its list of sources from somewhere, add
>    to that list when a request comes in, then remove completed sources from
>    that list in your sink.  Or you could use a separate app to read from the
>    sources on request and push that data into a queue like Kestrel for a
>    Storm spout to read from.

Yes, that will probably solve the task of adding new data sources, but it's
still not clear how to mark a task as "complete" so the sink can send the
results back to the user. For example, if the spout produced 5 messages and
the 6th is "EOT", that EOT message could arrive at the sink while message 3
is still in progress.

-- 
Eugene N Dzhurinsky

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Angelo Genovese <an...@genovese.ca>.
"Any problem in computer science can be solved with another layer of
indirection." (David Wheeler)

You could set up the spout to read its list of sources from somewhere, add
to that list when a request comes in, then remove completed sources from
that list in your sink.  Or you could use a separate app to read from the
sources on request and push that data into a queue like Kestrel for a Storm
spout to read from.
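
A minimal sketch of what such a spout might look like, assuming the external
list is wrapped behind a hypothetical SourceQueue client (the queue client and
its methods are placeholders; only the spout callbacks are Storm's API):

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Emits one tuple per pending data source; new sources are appended to the
// external queue by the request handler, completed ones are removed by the sink.
public class SourceListSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;
    private SourceQueue queue;                    // hypothetical client for the external list/queue

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.queue = new SourceQueue((String) conf.get("source.queue.address"));
    }

    @Override
    public void nextTuple() {
        String source = queue.poll();             // next pending source, or null
        if (source == null) {
            Utils.sleep(100);                     // avoid busy-spinning when idle
            return;
        }
        // Use the source itself as the message id so ack/fail can be tied back to it.
        collector.emit(new Values(source), source);
    }

    @Override
    public void fail(Object msgId) {
        queue.requeue((String) msgId);            // hypothetical: retry failed sources
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("source"));
    }
}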




Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Eugene Dzhurinsky <jd...@gmail.com>.
On Wed, Mar 19, 2014 at 01:02:54PM -0400, Milinda Pathirage wrote:
> How about adding an extra field (user ID or job ID) to the data chunks and
> using that field to distinguish tasks?

Milinda, that might work if I could send a "message" that the task is done
and no more input will come from the spout. That seems non-trivial if the
topology is truly distributed (I could expect some messages to arrive after
this "end" marker has been sent).

Also, it seems that I cannot "attach" a new spout to an existing topology, so
it looks like I have to start a new topology for each task.

-- 
Eugene N Dzhurinsky

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Milinda Pathirage <mp...@umail.iu.edu>.
How about adding an extra field (user ID or job ID) to the data chunks and
using that field to distinguish tasks?
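
For example, a transforming bolt could simply carry the job id along with
every chunk it emits (a minimal sketch; the "jobId" and "chunk" field names
are illustrative):

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Passes the job id along unchanged so every downstream tuple stays tagged
// with the task it belongs to.
public class TransformBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        String jobId = input.getStringByField("jobId");
        String chunk = input.getStringByField("chunk");
        String transformed = chunk.trim();        // placeholder for the real transformation
        collector.emit(new Values(jobId, transformed));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("jobId", "chunk"));
    }
}

Downstream components can then use fieldsGrouping on "jobId" so all chunks
belonging to one task end up at the same aggregating bolt.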

Thanks
Milinda


-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Eugene <ed...@gmail.com>.
One approach would be to have an additional field in the tuple with a taskId,
then track all the tuples with the ack and fail methods on the spout (the job
is done when the number of acked plus the number of failed records equals the
total for the job). The total should be kept somewhere outside of Storm, in a
fast data store; then you don't have to create a new topology for each task.
This approach has disadvantages: Storm has to track the tuples (affecting
throughput) and you need a separate data store, but that may be no problem in
your use case.

This approach only works if you know how many records you expect to process
per job.
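
Roughly, the spout side of that idea might look like the sketch below. The
JobTracker client (backed by some fast external store) and its methods are
hypothetical; only the emit/ack/fail hooks are Storm's own API:

import java.io.Serializable;
import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

// Tags every tuple with its taskId and uses Storm's ack/fail callbacks on the
// spout to detect when a whole job has been processed.
public class TrackedJobSpout extends BaseRichSpout {

    // Message id carried through Storm's reliability API back to ack()/fail().
    static class MessageId implements Serializable {
        final String taskId;
        final long recordId;
        MessageId(String taskId, long recordId) {
            this.taskId = taskId;
            this.recordId = recordId;
        }
    }

    private SpoutOutputCollector collector;
    private JobTracker tracker;   // hypothetical client for the external fast data store

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
        this.tracker = new JobTracker((String) conf.get("job.tracker.address"));
    }

    @Override
    public void nextTuple() {
        JobTracker.Record rec = tracker.nextRecord();   // hypothetical: next pending record, or null
        if (rec == null) {
            Utils.sleep(50);
            return;
        }
        collector.emit(new Values(rec.taskId, rec.chunk),
                       new MessageId(rec.taskId, rec.recordId));
    }

    @Override
    public void ack(Object msgId) {
        recordDone((MessageId) msgId);
    }

    @Override
    public void fail(Object msgId) {
        recordDone((MessageId) msgId);
    }

    private void recordDone(MessageId id) {
        // acked + failed == expected total  =>  the job is complete.
        long processed = tracker.incrementProcessed(id.taskId);
        if (processed == tracker.expectedTotal(id.taskId)) {
            tracker.markJobComplete(id.taskId);   // results can now be returned to the user
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("taskId", "chunk"));
    }
}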





On Wed, Mar 19, 2014 at 5:38 PM, Eugene Dzhurinsky <jd...@gmail.com> wrote:

> On Wed, Mar 19, 2014 at 05:14:28PM -0400, P. Taylor Goetz wrote:
> > Have you considered using DRPC [1]?
> >
> > [1] http://storm.incubator.apache.org/documentation/Distributed-RPC.html
>
> From what I've read so far, it seems to create and deploy a topology, which
> I could also do here (and probably much more simply, since it's a concrete
> case, not a generic one).
>
> So if creating a new topology per "task" is not considered bad practice, the
> only thing I can't figure out is how to track that a task has actually
> finished and all the data from the spout has been received in the sink.
> Perhaps I could add some sort of "record number" (which I may know in
> advance) and trigger the completion event once the sink has received the
> expected number of records.
>
> --
> Eugene N Dzhurinsky
>



-- 
Eugene Dvorkin
Software Engineer
New York City Storm User Group - organizer
WebMD
email: edvorkin@gmail.com
phone: 646-250-9649
eugenedvorkin.com
Connect with me on:
LinkedIn <http://www.linkedin.com/in/eugenedvorkin>
Twitter <http://twitter.com/edvorkin>

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by Eugene Dzhurinsky <jd...@gmail.com>.
On Wed, Mar 19, 2014 at 05:14:28PM -0400, P. Taylor Goetz wrote:
> Have you considered using DRPC [1]?
> 
> [1] http://storm.incubator.apache.org/documentation/Distributed-RPC.html

From what I've read so far, it seems to create and deploy a topology, which I
could also do here (and probably much more simply, since it's a concrete case,
not a generic one).

So if creating a new topology per "task" is not considered bad practice, the
only thing I can't figure out is how to track that a task has actually
finished and all the data from the spout has been received in the sink.
Perhaps I could add some sort of "record number" (which I may know in advance)
and trigger the completion event once the sink has received the expected
number of records.

-- 
Eugene N Dzhurinsky

Re: Is Storm the right tool for processing thousands of small tasks?

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
Have you considered using DRPC [1]?

[1] http://storm.incubator.apache.org/documentation/Distributed-RPC.html
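
The basic shape of a DRPC topology, roughly following the linked
documentation; the "process-source" function name and the three bolts are
placeholders for this use case:

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.LocalDRPC;
import backtype.storm.drpc.LinearDRPCTopologyBuilder;

// Sketch of a DRPC topology: each DRPC call becomes one tracked request whose
// result is returned to the caller when all downstream processing has finished.
public class ProcessSourceDrpcTopology {
    public static void main(String[] args) {
        LinearDRPCTopologyBuilder builder = new LinearDRPCTopologyBuilder("process-source");
        builder.addBolt(new FetchChunksBolt(), 4);      // hypothetical: reads the user's source
        builder.addBolt(new TransformBolt(), 8);        // hypothetical: per-chunk transformation
        builder.addBolt(new AggregateBolt(), 1);        // hypothetical: reduces to the final result

        Config conf = new Config();
        LocalDRPC drpc = new LocalDRPC();               // in-process DRPC server for testing
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("process-source-demo", conf, builder.createLocalTopology(drpc));

        // One "small task" == one DRPC call; the blocking call returns the accumulated result.
        System.out.println(drpc.execute("process-source", "http://example.com/data"));

        cluster.shutdown();
        drpc.shutdown();
    }
}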
