
Posted to user@storm.apache.org by Laurent Thoulon <la...@ldmobile.net> on 2014/05/02 10:12:56 UTC

Best practice to persist data in multiple TridentState

Hi,

What would you say is the best way to persist data to multiple states?
Currently I have 3 options in mind:

1- Process data and use the stream to send data to both states
Stream stream = ...each...filter...bla....
stream.partitionPersist(state1, ...)
stream.partitionPersist(state2, ...)

2- Process data and chain the persists
Stream stream = ...each...filter...bla....
stream.partitionPersist(state1, ...).newValuesStream().partitionPersist(state2, ...)

3- Build a separate topology for each state; they would all do mostly the same thing except for the persist part.

My main concerns here are failure handling and efficiency.

In my use case I actually have 3 states. 2 of them can store data in a non-transactional way, and the third should be opaque transactional but actually can't be, as it's just an API call that doesn't recognize duplicates.
That's no big deal as long as we can make sure it isn't bound to the failures of the other states (meaning that if another state fails, we're sure this one hasn't processed the data yet).

This makes option n°1 a bit tricky, as I'm never sure of the order in which the states will be processed. Or is there a way to be sure?
Option 2 would do, I guess, but I have to pass along through the first state all the data needed for the second. Potentially I would like to filter the tuples that go to state 1 or state 2. I would then have to write my own updater that applies a filter for the first persist, so that it doesn't send everything to the state but still emits everything in the end.
Option 3 would also do, but it wouldn't be that efficient: reading my spout twice and processing the data the same way in both topologies up to the persist part.
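
To make the "filtering updater" idea from option 2 concrete, here is a minimal sketch in plain Java. It only mirrors the shape of Trident's StateUpdater.updateState(state, tuples, collector); SimpleState, SimpleCollector and FilteringUpdater are hypothetical stand-ins invented for illustration, not Storm classes, so this would still need to be adapted to the real Trident types.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for a Trident state (not the real Storm interface).
interface SimpleState<T> {
    void store(List<T> tuples);
}

// Hypothetical stand-in for the collector an updater emits to,
// i.e. what would feed newValuesStream() in a real topology.
interface SimpleCollector<T> {
    void emit(T tuple);
}

// Persists only the tuples accepted by the filter, but emits every tuple
// of the batch so that a chained persist can still see all of them.
final class FilteringUpdater<T> {
    private final Predicate<T> persistFilter;

    FilteringUpdater(Predicate<T> persistFilter) {
        this.persistFilter = persistFilter;
    }

    void updateState(SimpleState<T> state, List<T> batch, SimpleCollector<T> collector) {
        List<T> toPersist = new ArrayList<>();
        for (T tuple : batch) {
            if (persistFilter.test(tuple)) {
                toPersist.add(tuple);
            }
        }
        // In this sketch the write happens first: if store() throws,
        // nothing is emitted and the next stage never sees the batch.
        state.store(toPersist);
        for (T tuple : batch) {
            collector.emit(tuple);
        }
    }
}
```

With this shape, the second persist in the chain can apply its own filter to the emitted tuples, so each state only stores what it cares about.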

Any ideas on the best way to handle this?
Thanks

Regards
Laurent

Re: Best practice to persist data in multiple TridentState

Posted by "Cody A. Ray" <co...@gmail.com>.

I don't know what the "best practice" is... but I actually like a 4th
option: creating a composite state.

Instead of sending all data to every state, I needed to randomly shard data
between an arbitrary number of states. I've thrown this on a gist here:
https://gist.github.com/codyaray/d58c1aaf688f27b72fdd

You could probably take a similar approach with a CompositeState that would
send the data to all TridentStates instead of randomly choosing a state.
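
As a rough illustration of that fan-out variant (this is not the code from the gist), the sketch below forwards every call to all wrapped states in a fixed order. BatchState is a local stand-in: beginCommit(Long) and commit(Long) mirror the two methods of Trident's State interface, while write(...) and the class names FanOutCompositeState and RecordingState are hypothetical and exist only to keep the example self-contained.

```java
import java.util.Arrays;
import java.util.List;

// Local stand-in for a Trident state. beginCommit/commit mirror Trident's
// State interface; write(...) is a hypothetical method for this sketch.
interface BatchState {
    void beginCommit(Long txid);
    void commit(Long txid);
    void write(List<String> tuples);
}

// Forwards every call to all delegates, in the order they were passed in.
final class FanOutCompositeState implements BatchState {
    private final List<BatchState> delegates;

    FanOutCompositeState(BatchState... delegates) {
        this.delegates = Arrays.asList(delegates);
    }

    @Override
    public void beginCommit(Long txid) {
        for (BatchState s : delegates) {
            s.beginCommit(txid);
        }
    }

    @Override
    public void write(List<String> tuples) {
        // An exception from one delegate stops the loop, so delegates
        // listed after it are not called for this batch.
        for (BatchState s : delegates) {
            s.write(tuples);
        }
    }

    @Override
    public void commit(Long txid) {
        for (BatchState s : delegates) {
            s.commit(txid);
        }
    }
}

// Hypothetical state that only records the calls it receives.
final class RecordingState implements BatchState {
    private final String name;
    private final List<String> log;

    RecordingState(String name, List<String> log) {
        this.name = name;
        this.log = log;
    }

    @Override
    public void beginCommit(Long txid) {
        log.add(name + ":begin:" + txid);
    }

    @Override
    public void write(List<String> tuples) {
        log.add(name + ":write:" + tuples);
    }

    @Override
    public void commit(Long txid) {
        log.add(name + ":commit:" + txid);
    }
}
```

Because the delegates are called in a fixed order, listing a state that cannot tolerate duplicates last means, in this sketch, that it is only reached once the earlier writes have succeeded. A failure in that last state would presumably still cause the batch to be replayed into the earlier ones, so they would need to tolerate that.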

Good luck!

-Cody




