Posted to user@storm.apache.org by Hemanth Yamijala <yh...@gmail.com> on 2015/01/07 14:14:18 UTC

Storm patterns vis-a-vis external data storage

Hi all,

I guess it is common to build topologies where message processing in storm
results in data that should be stored in external stores like NoSQL DBs or
message queues like Kafka.

There are two broad approaches to handle this storage:

1) Inline the storage functionality with the processing functionality -
i.e. the bolt generating the info to be stored also takes care of storing
it.
2) Separate out the two and make a downstream bolt responsible for the
storage.

I wanted to see whether people on the list think there are advantages to
favouring one approach over the other, or pitfalls to watch out for in
either case.

Thanks
Hemanth

Re: Storm patterns vis-a-vis external data storage

Posted by Hemanth Yamijala <yh...@gmail.com>.

Thanks for the tips, folks. I had not paid close attention to noneGrouping's
documentation. But if (when?) the proposed enhancement there is
implemented, it seems like it would certainly help reduce the number of
parallel processing units. Until then, localOrShuffleGrouping does seem
like a good optimisation.

Thanks
hemanth

On Thu, Jan 8, 2015 at 9:39 PM, Itai Frenkel <It...@forter.com> wrote:

>  Frankly, I'm not sure either. I was thinking it is more optimized due to
> this (pending) pull request
>
>
> https://github.com/apache/storm/pull/343/files#diff-869e78bed57392ae80320b1962434348R79

Re: Storm patterns vis-a-vis external data storage

Posted by Itai Frenkel <It...@forter.com>.

Frankly, I'm not sure either. I was thinking it is more optimized due to this (pending) pull request

https://github.com/apache/storm/pull/343/files#diff-869e78bed57392ae80320b1962434348R79

________________________________
From: Nathan Leung <nc...@gmail.com>
Sent: Thursday, January 8, 2015 5:46 PM
To: user
Subject: Re: Storm patterns vis-a-vis external data storage

I thought the storm documentation indicates that noneGrouping is currently equivalent to shuffleGrouping?  Has this changed?  If this is still the case, I would recommend using localOrShuffleGrouping which will keep the data in process at least, and avoid serialization and network transfer.

Re: Storm patterns vis-a-vis external data storage

Posted by Nathan Leung <nc...@gmail.com>.

I thought the storm documentation indicates that noneGrouping is currently
equivalent to shuffleGrouping?  Has this changed?  If this is still the
case, I would recommend using localOrShuffleGrouping which will keep the
data in process at least, and avoid serialization and network transfer.
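
The target-selection rule described above can be sketched in plain Java. This is only a model of the documented behaviour (prefer tasks of the target bolt that live in the same worker process, otherwise fall back to an ordinary shuffle), not Storm's implementation; the class, the Task type and the worker names are hypothetical. In a real topology the equivalent choice is made by declaring the storage bolt's input with localOrShuffleGrouping instead of shuffleGrouping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Plain-Java model of the "local or shuffle" rule; not Storm code.
final class LocalOrShuffleSketch {

    // Hypothetical downstream task: an id plus the worker process hosting it.
    static final class Task {
        final int id;
        final String worker;

        Task(int id, String worker) {
            this.id = id;
            this.worker = worker;
        }
    }

    // Tasks in the sender's worker if there are any, otherwise every task.
    static List<Task> candidates(String sourceWorker, List<Task> targets) {
        List<Task> local = new ArrayList<>();
        for (Task t : targets) {
            if (t.worker.equals(sourceWorker)) {
                local.add(t);
            }
        }
        return local.isEmpty() ? targets : local;
    }

    // Pick one candidate uniformly at random (the "shuffle" part).
    static Task choose(String sourceWorker, List<Task> targets, Random rng) {
        List<Task> pool = candidates(sourceWorker, targets);
        return pool.get(rng.nextInt(pool.size()));
    }

    public static void main(String[] args) {
        List<Task> storageTasks = Arrays.asList(
                new Task(0, "worker-a"), new Task(1, "worker-b"), new Task(2, "worker-a"));
        Random rng = new Random();
        // A sender in worker-a only ever picks task 0 or 2 (no network hop).
        System.out.println(choose("worker-a", storageTasks, rng).id);
        // A sender in worker-c has no local target, so any task may be picked.
        System.out.println(choose("worker-c", storageTasks, rng).id);
    }
}
```

The sketch also shows the trade-off: when local targets exist, load is balanced only among them, so an uneven task placement can skew the work.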

On Thu, Jan 8, 2015 at 10:34 AM, Itai Frenkel <It...@forter.com> wrote:

>  Use noneGrouping between the two bolts so the only overhead is a thread
> context switch. Storm+Linux manages these context switches pretty
> well. Unless you are already in the stage of CPU usage optimizations, I
> would not sweat about it.

Re: Storm patterns vis-a-vis external data storage

Posted by Itai Frenkel <It...@forter.com>.

Use noneGrouping between the two bolts so the only overhead is a thread context switch. Storm+Linux manages these context switches pretty well. Unless you are already in the stage of CPU usage optimizations, I would not sweat about it.

________________________________
From: Hemanth Yamijala <yh...@gmail.com>
Sent: Thursday, January 8, 2015 8:27 AM
To: user@storm.apache.org
Subject: Re: Storm patterns vis-a-vis external data storage

Itai & Jens,

Thank you for sharing your thoughts. My requirement is what Jens has referred to as "exporting" data out of my topology.

I can clearly see the benefits of moving this functionality into a separate bolt, e.g. to scale it independently of the processing bolts, or to accommodate changes.

The only negative (if it is one) seems to be the increase in the number of runtime bolt instances in the topology. I understand that this can be solved with more hardware resources and Storm's horizontal scalability. Also, it might be hard to quantify this precisely, given the different scaling requirements of processing bolts and I/O-bound bolts. Do you see this as a concern?

Thanks
hemanth

Re: Storm patterns vis-a-vis external data storage

Posted by Hemanth Yamijala <yh...@gmail.com>.

Itai & Jens,

Thank you for sharing your thoughts. My requirement is what Jens has
referred to as "exporting" data out of my topology.

I can clearly see the benefits of moving this functionality into a separate
bolt, e.g. to scale it independently of the processing bolts, or to
accommodate changes.

The only negative (if it is one) seems to be the increase in the number of
runtime bolt instances in the topology. I understand that this can be solved
with more hardware resources and Storm's horizontal scalability. Also,
it might be hard to quantify this precisely, given the different scaling
requirements of processing bolts and I/O-bound bolts. Do you see this as a
concern?

Thanks
hemanth

On Wed, Jan 7, 2015 at 9:39 PM, Jens-U. Mozdzen <jm...@nde.ag> wrote:

> Hi Hemanth,
>
> I'd say: it depends ;) In the case of aggregation bolts that persist their
> state, you may want to limit the memory footprint of each bolt instance.
> An in-memory cache for the persisted data is therefore pretty helpful, but
> it means building persistence access into each bolt.
>
> OTOH, if you plan to "export" data from your topology (which seems to be
> the main focus of your question), splitting calculation and "export" into
> separate bolts seems a natural choice to me, especially when you consider
> future changes (i.e. supporting a different or possibly *additional* export
> path: you can keep the "tuple interface" as it is and simply connect
> different and/or additional export bolts).
>
> Regards,
> Jens
>
>

Re: Storm patterns vis-a-vis external data storage

Posted by "Jens-U. Mozdzen" <jm...@nde.ag>.

Hi Hemanth,

Quoting Hemanth Yamijala <yh...@gmail.com>:
> Hi all,
>
> I guess it is common to build topologies where message processing in  
> storm results in data that should be stored in external stores like  
> NoSQL DBs or message queues like Kafka.
>
> There are two broad approaches to handle this storage:
>
> 1) Inline the storage functionality with the processing  
> functionality - i.e. the bolt generating the info to be stored also  
> takes care of storing it.
> 2) Separate out the two and make a downstream bolt responsible for  
> the storage.
>
> I wanted to see whether people on the list think there are
> advantages to favouring one approach over the other, or pitfalls to
> watch out for in either case.

I'd say: it depends ;) In the case of aggregation bolts that persist their
state, you may want to limit the memory footprint of each bolt
instance. An in-memory cache for the persisted data is therefore
pretty helpful, but it means building persistence access into each bolt.
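
One way to picture such a cache is a size-bounded, least-recently-used map that writes evicted entries through to the store and reloads them on a miss. The following is a minimal sketch under assumptions, not code from any Storm bolt: the class name is hypothetical, the state is a simple per-key counter, and a plain Map stands in for the external store.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Bounded in-memory state for an aggregating component (illustrative only).
final class BoundedStateCache {
    private final Map<String, Long> store;            // stand-in for the external store
    private final LinkedHashMap<String, Long> cache;  // access-ordered, so the eldest entry is the LRU one

    BoundedStateCache(final int maxEntries, final Map<String, Long> store) {
        this.store = store;
        this.cache = new LinkedHashMap<String, Long>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Long> eldest) {
                if (size() > maxEntries) {
                    // Persist the entry before it leaves memory.
                    store.put(eldest.getKey(), eldest.getValue());
                    return true;
                }
                return false;
            }
        };
    }

    // Increment a counter, loading its last persisted value on a cache miss.
    long increment(String key) {
        Long current = cache.get(key);
        if (current == null) {
            current = store.getOrDefault(key, 0L);
        }
        long updated = current + 1;
        cache.put(key, updated);
        return updated;
    }

    // Write everything still held in memory to the store.
    void flush() {
        store.putAll(cache);
    }

    int cachedEntries() {
        return cache.size();
    }

    public static void main(String[] args) {
        Map<String, Long> externalStore = new HashMap<>();
        BoundedStateCache counts = new BoundedStateCache(2, externalStore);
        for (String word : new String[] {"a", "a", "b", "c", "a"}) {
            counts.increment(word);
        }
        counts.flush();
        System.out.println(externalStore);
    }
}
```

The memory footprint is capped by maxEntries, at the cost of the component itself talking to the store, which is the coupling mentioned above.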

OTOH, if you plan to "export" data from your topology (which seems to
be the main focus of your question), splitting calculation and
"export" into separate bolts seems a natural choice to me, especially
when you consider future changes (i.e. supporting a different or
possibly *additional* export path: you can keep the "tuple
interface" as it is and simply connect different and/or additional
export bolts).

Regards,
Jens
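
The point about keeping the "tuple interface" stable can be illustrated with a small plain-Java sketch. It is not Storm code and every name in it is hypothetical: Result plays the role of the emitted tuple, and each subscribed Consumer plays the role of an export bolt. Adding a second export path only adds a subscriber; the calculation step is untouched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Calculation decoupled from export through a fixed result type (illustrative only).
final class ExportFanOutSketch {

    // The "tuple interface": the only contract between calculation and export.
    static final class Result {
        final String word;
        final int length;

        Result(String word, int length) {
            this.word = word;
            this.length = length;
        }
    }

    private final List<Consumer<Result>> exporters = new ArrayList<>();

    // Connect another export path; the calculation code does not change.
    void subscribe(Consumer<Result> exporter) {
        exporters.add(exporter);
    }

    // Calculation step: derive a result and hand it to every exporter.
    void process(String word) {
        Result result = new Result(word, word.length());
        for (Consumer<Result> exporter : exporters) {
            exporter.accept(result);
        }
    }

    public static void main(String[] args) {
        ExportFanOutSketch topology = new ExportFanOutSketch();
        List<Result> database = new ArrayList<>();      // stand-in for a NoSQL store
        List<Result> messageQueue = new ArrayList<>();  // stand-in for a message queue
        topology.subscribe(database::add);
        topology.subscribe(messageQueue::add);
        topology.process("storm");
        System.out.println(database.size() + " " + messageQueue.size());
    }
}
```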

Re: Storm patterns vis-a-vis external data storage

Posted by Itai Frenkel <It...@forter.com>.

I prefer (2) since it makes devops easier. The scaling and monitoring of a CPU-intensive bolt (processing) are different from those of an I/O-bound bolt (persistence).
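
As a rough illustration of why the split helps with scaling, here is a plain-Java sketch in which the two stages get separately sized thread pools. It is an analogy for per-bolt parallelism, not Storm code; the class name, the pool sizes and the upper-casing "work" are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Two stages with independent parallelism (illustrative only).
final class TwoStageScalingSketch {

    static Queue<String> run(List<String> inputs, int processingThreads, int storageThreads) {
        ExecutorService processing = Executors.newFixedThreadPool(processingThreads);
        ExecutorService storage = Executors.newFixedThreadPool(storageThreads);
        Queue<String> store = new ConcurrentLinkedQueue<>(); // stand-in for the external store

        for (String input : inputs) {
            processing.execute(() -> {
                String result = input.toUpperCase();        // stand-in for CPU-bound work
                storage.execute(() -> store.add(result));   // stand-in for an I/O-bound write
            });
        }

        try {
            // Drain the processing stage first so every write has been handed over.
            processing.shutdown();
            processing.awaitTermination(10, TimeUnit.SECONDS);
            storage.shutdown();
            storage.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while draining stages", e);
        }
        return store;
    }

    public static void main(String[] args) {
        // Few threads for the CPU-bound stage, more for the stage that mostly waits on I/O.
        Queue<String> stored = run(Arrays.asList("a", "b", "c"), 2, 8);
        System.out.println(stored.size());
    }
}
```

With the inline approach (1) there would be a single pool, so threads doing CPU-bound work and threads blocked on writes could not be sized or monitored separately.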

________________________________
From: Hemanth Yamijala <yh...@gmail.com>
Sent: Wednesday, January 7, 2015 3:14 PM
To: user@storm.apache.org
Subject: Storm patterns vis-a-vis external data storage

Hi all,

I guess it is common to build topologies where message processing in storm results in data that should be stored in external stores like NoSQL DBs or message queues like Kafka.

There are two broad approaches to handle this storage:

1) Inline the storage functionality with the processing functionality - i.e. the bolt generating the info to be stored also takes care of storing it.
2) Separate out the two and make a downstream bolt responsible for the storage.

I wanted to see whether people on the list think there are advantages to favouring one approach over the other, or pitfalls to watch out for in either case.

Thanks
Hemanth