Posted to user@storm.apache.org by Adam Mitchell <ad...@salesforce.com> on 2015/04/28 21:27:45 UTC

Bolt declareOutputFields() and the contract between bolts

I've got a topology that starts with one single-field input tuple from a
spout, then chains a bunch of bolts together.

* The spout emits a piece of event data,
* The first bolt, in declareOutputFields(), declares that it will emit two
fields - the input to the bolt plus one new field,
* The next bolt does the same thing - takes two input fields and adds a
third.

Eventually I've got a bolt that declares it will emit 8 fields, though the
first 7 are just pass-throughs from the previous bolt.

Everything works fine, but since I'm declaring that I will emit 8 fields,
I'm adding checks in execute() to validate the first 7 fields before doing
the real work of the bolt: computing and appending the 8th field.
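Concretely, each bolt in the chain ends up looking something like this
(the field names and the computeEighthField() helper are made up for
illustration):

    import java.util.ArrayList;
    import java.util.List;

    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;

    public class AddEighthFieldBolt extends BaseBasicBolt {

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Re-declare all seven upstream fields plus the one this bolt adds.
            declarer.declare(new Fields("f1", "f2", "f3", "f4", "f5", "f6", "f7", "f8"));
        }

        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            // Pass the seven upstream values through unchanged, then append
            // the new one.
            List<Object> out = new ArrayList<Object>(input.getValues());
            out.add(computeEighthField(input)); // the actual work of this bolt
            collector.emit(out);
        }

        // Made-up helper standing in for this bolt's real logic.
        private Object computeEighthField(Tuple input) {
            return input.getValue(0).toString().length();
        }
    }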

It doesn't feel like a good pattern.

Is this chaining of one bolt to another the right way to go?  Or should
bolts only emit the new fields that they generate?

(trying to attach an image to this question to illustrate the chaining)

Re: Bolt declareOutputFields() and the contract between bolts

Posted by "P. Taylor Goetz" <pt...@gmail.com>.
Hi Adam,

The use case you are describing will result in a lot of network transfer, which could adversely affect performance/throughput.

I would suggest taking a look at Storm's micro-batching/transactional API (a.k.a. Trident).

With Trident, your topology gets optimized to minimize network transfer. Behind the scenes it will pipeline stream operations so that consecutive ones run on a single node, resorting to network transfer only when necessary. This is one reason Trident can deliver roughly 2x the throughput of an equivalent plain-bolt topology.
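To make that concrete, here's a minimal sketch of your enrichment chain in
Trident (0.9.x API; the class and field names are placeholders, and
FixedBatchSpout stands in for your real event spout). Note that each step
declares only the field it adds; the upstream fields are carried along for
you:

    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Values;
    import storm.trident.TridentTopology;
    import storm.trident.operation.BaseFunction;
    import storm.trident.operation.TridentCollector;
    import storm.trident.testing.FixedBatchSpout;
    import storm.trident.tuple.TridentTuple;

    public class TridentChainSketch {

        // Emits one new field; upstream fields are appended automatically.
        public static class EnrichA extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                collector.emit(new Values(tuple.getValue(0) + "-a")); // placeholder enrichment
            }
        }

        public static class EnrichB extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                collector.emit(new Values(tuple.getValue(1) + "-b")); // placeholder enrichment
            }
        }

        public static void main(String[] args) {
            // Test spout standing in for your real event spout.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("event"), 1,
                    new Values("e1"), new Values("e2"));

            TridentTopology topology = new TridentTopology();
            topology.newStream("events", spout)
                    .each(new Fields("event"), new EnrichA(), new Fields("a"))
                    .each(new Fields("event", "a"), new EnrichB(), new Fields("b"));
            // Consecutive each() calls like these get compiled into a single
            // bolt, so no network hop occurs between the steps.
        }
    }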

The cost is latency, but it's entirely tunable. Sub-second (< 250 ms) latency is easy with Trident. The higher your latency tolerance, the higher the throughput you can achieve (to a point).
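The main knob (in 0.9.x at least) is the batch emit interval; the values
below are just illustrative:

    import backtype.storm.Config;

    Config conf = new Config();
    // How often a new batch is started: smaller => lower latency,
    // larger => better throughput (more tuples amortized per batch).
    conf.put(Config.TOPOLOGY_TRIDENT_BATCH_EMIT_INTERVAL_MILLIS, 500);
    // Caps how many batches are in flight at once; this also affects
    // the throughput/latency tradeoff.
    conf.setMaxSpoutPending(10);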

This is true of most, if not all, streaming frameworks. The difference is that Storm, being a pure streaming framework (i.e. not limited to a batch paradigm like Spark Streaming), allows you to choose the balance between throughput and latency that best fits your use case.

As it stands, the current Trident documentation has proven difficult for many people to grok, but I hope to change that in the near future. And as always, contributions are more than welcome.

-Taylor

