Posted to user@storm.apache.org by Brian Candler <b....@pobox.com> on 2016/03/22 11:37:01 UTC

Multilang, C and binary data

Hello,

I have some questions about external workers and the multi-lang 
protocol. We have a bunch of existing C code for running processing 
steps over binary data and I'm looking to see how feasible it is to hook 
it into Storm.


(1) Is it possible to handle binary data with multi-lang? Or is there 
existing support for hooking C into Storm?

The multi-lang protocol is JSON, so that implies either base64-encoding 
everything or passing round a URL to where the binary data is stored.
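A minimal sketch of the base64 option, to make the cost concrete. It assumes a simplified message with only "id" and "tuple" fields, terminated by a line containing "end" as in the multi-lang protocol. The helper names are hypothetical and this is not Storm's own serializer.

```python
import base64
import json


def encode_tuple_message(tuple_id: str, payload: bytes) -> str:
    """Build a JSON message carrying a binary payload as base64 text."""
    message = {
        "id": tuple_id,
        # JSON cannot hold raw bytes, so the payload travels as ASCII text.
        "tuple": [base64.b64encode(payload).decode("ascii")],
    }
    # Each multi-lang message is JSON followed by a line containing "end".
    return json.dumps(message) + "\nend\n"


def decode_tuple_message(text: str) -> bytes:
    """Recover the binary payload from a message built above."""
    body = text.split("\nend\n", 1)[0]
    message = json.loads(body)
    return base64.b64decode(message["tuple"][0])


payload = bytes(range(256)) * 4  # 1024 bytes of arbitrary binary data
wire = encode_tuple_message("1", payload)
encoded_len = len(base64.b64encode(payload))

assert decode_tuple_message(wire) == payload
print(encoded_len, "bytes of base64 for", len(payload), "bytes of input")
```

Base64 emits 4 characters for every 3 input bytes, so the payload grows by about a third before any JSON framing is added.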

But looking at the source I see that topology.multilang.serializer is 
pluggable, so perhaps it's possible to make a version using (e.g.) 
MsgPack? Ah yes:
https://github.com/pystorm/pystorm/issues/5

So maybe there's a C library comparable to pystorm? Or I can use this 
serializer to talk msgpack to a spawned C process?


(2) Is there a practical maximum size to a tuple? In some cases we have 
chunks of around 50MB to pass from step to step. Is it reasonable to 
pass these directly? Or should they be written into some intermediate 
store like an NFS server?
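A minimal sketch of the intermediate-store option, in which the tuple carries only a small reference to the chunk. It assumes every worker can see the same directory (a temporary directory stands in for the NFS mount here). The function names and the reference layout are hypothetical.

```python
import hashlib
import os
import tempfile


def store_chunk(store_dir: str, data: bytes) -> dict:
    """Write a chunk to the shared store and return a small reference."""
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(store_dir, digest + ".bin")
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
    # Rename into place so a reader never opens a half-written file.
    os.replace(tmp_path, path)
    return {"path": path, "sha256": digest, "size": len(data)}


def load_chunk(ref: dict) -> bytes:
    """Read a chunk back and verify it against the checksum in the reference."""
    with open(ref["path"], "rb") as f:
        data = f.read()
    if hashlib.sha256(data).hexdigest() != ref["sha256"]:
        raise ValueError("chunk does not match its checksum")
    return data


store_dir = tempfile.mkdtemp()   # stand-in for a shared mount such as NFS
chunk = os.urandom(1024 * 1024)  # 1 MiB here; the real chunks are ~50MB
ref = store_chunk(store_dir, chunk)

assert load_chunk(ref) == chunk
```

Only `ref` would travel in the tuple, so the message size stays small regardless of the chunk size. Cleaning up chunks that are no longer needed is left out of this sketch.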


(3) http://storm.apache.org/documentation/Multilang-protocol.html
"The shell bolt protocol is asynchronous. You will receive tuples on 
STDIN as soon as they are available"

So just to be clear: it's fine for me to write a multi-threaded external 
process which handles multiple overlapping requests?

Furthermore: if all the threads are busy, can I simply stop reading from 
stdin and let the sender block until I'm ready to receive more tuples?
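A minimal sketch of the pattern being asked about: a reader feeds a bounded queue, and when all workers are busy and the queue is full, the reader stops consuming its input. It assumes messages are JSON terminated by a line containing "end". An in-memory stream stands in for stdin, and all names are hypothetical. Whether the Storm side tolerates a child that stops reading is exactly the open question, so this shows only the external process's half.

```python
import io
import json
import queue
import threading


def read_messages(stream):
    """Yield JSON messages delimited by a line containing only 'end'."""
    lines = []
    for line in stream:
        line = line.rstrip("\n")
        if line == "end":
            yield json.loads("\n".join(lines))
            lines = []
        else:
            lines.append(line)


def run(stream, handle, num_workers=4, max_pending=8):
    """Process messages on worker threads with a bounded backlog."""
    pending = queue.Queue(maxsize=max_pending)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            msg = pending.get()
            if msg is None:  # sentinel: no more work
                break
            out = handle(msg)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for msg in read_messages(stream):
        # put() blocks while the queue is full, so reading pauses here.
        pending.put(msg)
    for _ in threads:
        pending.put(None)
    for t in threads:
        t.join()
    return results


fake_stdin = io.StringIO(
    "".join(json.dumps({"id": str(i), "tuple": [i]}) + "\nend\n" for i in range(20))
)
ids = run(fake_stdin, lambda m: m["id"])
assert sorted(ids, key=int) == [str(i) for i in range(20)]
```

Results can complete out of order, which is why the check sorts the ids before comparing.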


I also have some general questions about the Storm architecture.

(4) http://storm.apache.org/documentation/Concepts.html

" Shuffle grouping: Tuples are randomly distributed across the bolt's 
tasks in a way such that each bolt is guaranteed to get an equal number 
of tuples."

Suppose the bolt's tasks are split across two servers, one of which is 
slower than the other. Does this mean that the slower server will be 
100% utilised while the faster servers will have idle periods? Or is 
there some flow-control mechanism which kicks into play and gives a 
larger share to the faster servers?

Specifically I am thinking of:
- A heterogeneous cluster, where some servers are older and slower than 
others
- A cluster where one server happens to be busier than another (e.g. it 
is also working on a different topology)

Through googling I found topology.max.spout.pending, so I see there is 
an overall control of the number of in-flight (unacked) tuples, except 
for unreliable spouts:
http://stackoverflow.com/questions/24413088/storm-max-spout-pending

But other than that, will the shuffle grouping deal them out as fast as 
possible into the downstream bolts?
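The concern can be put into numbers with a small worked example. It assumes, purely for illustration, that shuffle grouping deals strictly equal shares to every task and that no flow control shifts load to faster tasks. The rates are hypothetical, and this models the question, not Storm's actual behaviour.

```python
def equal_share_throughput(rates):
    """Sustainable total rate when every task must take an equal share."""
    # The slowest task sets the pace: each task gets min(rates) tuples/second.
    return len(rates) * min(rates)


def utilisation(rates):
    """Fraction of each task's capacity used at that sustainable rate."""
    slowest = min(rates)
    return [slowest / r for r in rates]


rates = [100.0, 50.0]  # hypothetical tuples/second: fast server, slow server

print(equal_share_throughput(rates))  # 100.0
print(utilisation(rates))             # [0.5, 1.0]
print(sum(rates))                     # 150.0 if load followed capacity
```

Under these assumptions the slow server is saturated while the fast one sits half idle, and the pair sustains 100 tuples/second instead of the 150 a capacity-aware split would allow.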


(5) 
http://storm.apache.org/documentation/Understanding-the-parallelism-of-a-Storm-topology.html

This says that a single thread (executor) can run multiple task 
instances of the same component.

How does that work? That is, if those multiple tasks are in the same 
thread, how do they run concurrently? Or if they can't run concurrently, 
what is the benefit of having multiple tasks in a thread instead of just 
one task?


(6) How does Storm distribute tasks over workers and servers? For 
example, suppose spout A connects to bolt B. I have two servers, and I 
run a topology with 2 workers, 4 tasks of A and 4 tasks of B. Will I get 
4A on one server and 4B on the other, or 2A+2B on both, or something else?


Many thanks,

Brian Candler.


Re: Multilang, C and binary data

Posted by Xin Wang <da...@gmail.com>.
Thanks Brian,

I will add these links to https://issues.apache.org/jira/browse/STORM-1658
and fix these issues as far as possible.

Xin

2016-03-29 0:57 GMT+08:00 Brian Candler <b....@pobox.com>:

> On 27/03/2016 05:03, Xin Wang wrote:
>
>> RAS:
>> http://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html
>> CGroup:
>> http://storm.apache.org/releases/2.0.0-SNAPSHOT/cgroups_in_storm.html
>
> Thanks for that. I hadn't twigged that these features were in
> new/unreleased code.
>
> BTW it seems that there are lots of broken documentation links now, but I
> can't see how to create an issue in JIRA so I'll report the ones I've found
> here.
>
> (1) http://storm.apache.org/
> at the bottom has a link to "Tutorial":
> http://storm.apache.org/releases/current/tutorial.html
> but this gives a 404
>
> Either the link needs to change to
> http://storm.apache.org/releases/current/Tutorial.html
> or the page itself needs to be renamed to "tutorial.html"
>
> (2) http://storm.apache.org/releases/0.10.0/index.html
> Under "Intermediate"
>
> * The "Direct Groupings" link points to
> http://storm.apache.org/releases/0.10.0/Direct-groupings.html
> which is non-existent.
>
> * The "Lifecycle of a trident tuple" link just points to the documentation
> index,
> http://storm.apache.org/releases/0.10.0/index.html
>
> (3) There are many pages with broken images. For example
>
> *
> http://storm.apache.org/releases/0.10.0/Understanding-the-parallelism-of-a-Storm-topology.html
> has broken images pointing to
>
> http://storm.apache.org/releases/0.10.0/images/relationships-worker-processes-executors-tasks.png
>
> http://storm.apache.org/releases/0.10.0/images/example-of-a-running-topology.png
> (these give 404 errors)
>
> * http://storm.apache.org/releases/0.10.0/Trident-tutorial.html
> has a broken image pointing to
> http://storm.apache.org/releases/0.10.0/images/batched-stream.png
>
> Regards,
>
> Brian.
>
>

Re: Multilang, C and binary data

Posted by Brian Candler <b....@pobox.com>.
On 27/03/2016 05:03, Xin Wang wrote:
> RAS: 
> http://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html
> CGroup: 
> http://storm.apache.org/releases/2.0.0-SNAPSHOT/cgroups_in_storm.html
>
Thanks for that. I hadn't twigged that these features were in 
new/unreleased code.

BTW it seems that there are lots of broken documentation links now, but 
I can't see how to create an issue in JIRA so I'll report the ones I've 
found here.

(1) http://storm.apache.org/
at the bottom has a link to "Tutorial":
http://storm.apache.org/releases/current/tutorial.html
but this gives a 404

Either the link needs to change to
http://storm.apache.org/releases/current/Tutorial.html
or the page itself needs to be renamed to "tutorial.html"

(2) http://storm.apache.org/releases/0.10.0/index.html
Under "Intermediate"

* The "Direct Groupings" link points to
http://storm.apache.org/releases/0.10.0/Direct-groupings.html
which is non-existent.

* The "Lifecycle of a trident tuple" link just points to the 
documentation index,
http://storm.apache.org/releases/0.10.0/index.html

(3) There are many pages with broken images. For example

* 
http://storm.apache.org/releases/0.10.0/Understanding-the-parallelism-of-a-Storm-topology.html
has broken images pointing to
http://storm.apache.org/releases/0.10.0/images/relationships-worker-processes-executors-tasks.png
http://storm.apache.org/releases/0.10.0/images/example-of-a-running-topology.png
(these give 404 errors)

* http://storm.apache.org/releases/0.10.0/Trident-tutorial.html
has a broken image pointing to
http://storm.apache.org/releases/0.10.0/images/batched-stream.png

Regards,

Brian.


Re: Multilang, C and binary data

Posted by Xin Wang <da...@gmail.com>.
RAS:
http://storm.apache.org/releases/2.0.0-SNAPSHOT/Resource_Aware_Scheduler_overview.html
CGroup:
http://storm.apache.org/releases/2.0.0-SNAPSHOT/cgroups_in_storm.html


2016-03-24 16:59 GMT+08:00 Brian Candler <b....@pobox.com>:

> On 23/03/2016 03:37, Xin Wang wrote:
>
> I have provided an implementation `MessagePackSerializer` for improving
> multi-lang performance. (PR: https://github.com/apache/storm/pull/1136).
> You can take a look at this. It's not merged yet.
>
> Thanks.
>
> Another question: the default config file talks about a "resource aware
> scheduler"
> https://github.com/apache/storm/blob/master/conf/defaults.yaml
> and I see cgroup support in STORM-1336.
>
> Is there any documentation for this, apart from reverse-engineering
> examples/storm-starter/src/jvm/org/apache/storm/starter/ResourceAwareExampleTopology.java
> ?
>
> Regards,
>
> Brian.
>
>

Re: Multilang, C and binary data

Posted by Brian Candler <b....@pobox.com>.
On 23/03/2016 03:37, Xin Wang wrote:
> I have provided an implementation `MessagePackSerializer` for 
> improving multi-lang performance. (PR: 
> https://github.com/apache/storm/pull/1136). You can take a look at 
> this. It's not merged yet.
>
Thanks.

Another question: the default config file talks about a "resource aware 
scheduler"
https://github.com/apache/storm/blob/master/conf/defaults.yaml
and I see cgroup support in STORM-1336.

Is there any documentation for this, apart from reverse-engineering 
examples/storm-starter/src/jvm/org/apache/storm/starter/ResourceAwareExampleTopology.java 
?

Regards,

Brian.


Re: Multilang, C and binary data

Posted by Xiang Wang <xi...@gmail.com>.
Hi,

You may have a look at: http://demeter.inf.ed.ac.uk/cross/stormcpp.html
It may help you run Storm with binary C code.

I am a Storm beginner, and cannot help you with the other questions...





-------------------------------
Xiang Wang PhD Candidate
Database Research Group
School of Computer Science and Engineering
The University of New South Wales
Sydney, Australia

On Wed, Mar 23, 2016 at 2:37 PM, Xin Wang <da...@gmail.com> wrote:

> I have provided an implementation `MessagePackSerializer` for improving
> multi-lang performance. (PR: https://github.com/apache/storm/pull/1136).
> You can take a look at this. It's not merged yet.
>
>
> Thanks,
> Xin
>

Re: Multilang, C and binary data

Posted by Xin Wang <da...@gmail.com>.
I have provided an implementation `MessagePackSerializer` for improving
multi-lang performance. (PR: https://github.com/apache/storm/pull/1136). You
can take a look at this. It's not merged yet.


Thanks,
Xin

