Posted to user@storm.apache.org by Javier Gonzalez <ja...@gmail.com> on 2015/03/03 18:37:25 UTC

Exactly once transactions and storm

Hi guys,



We're looking at Storm to solve a message processing scenario that needs to
be horizontally scalable for high projected volume. The use case goes like
this:



1. Receive messages from an external source.

2. Generate a set of messages from this external input, based on rules.

3. Persist these message sets into a DB. There is no update; the use case
is insert only.



Currently we have implemented this as a PoC using a spout for step 1, a
bolt for step 2 and a bolt for step 3. There is no aggregation or
partitioning between steps; we're just using shuffle grouping between bolts
and looking to scale by throwing nodes at it. We need to guarantee exactly
once processing, but bolt #3 writes to a database. How does one guard
against the scenario where the DB transaction succeeds but, for some
reason, the spout decides to retry and we get duplicate entries?
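
To make the failure mode concrete, here is a minimal Python sketch of the
three steps under at-least-once delivery. It does not use the Storm API; the
names (InsertOnlyStore, expand, process) and the rule that turns one input
into two output messages are hypothetical stand-ins for the spout and bolts
described above.

```python
from dataclasses import dataclass, field


@dataclass
class InsertOnlyStore:
    """Stand-in for the insert-only database behind bolt #3."""
    rows: list = field(default_factory=list)

    def insert(self, row):
        self.rows.append(row)


def expand(message_id, payload):
    """Stand-in for bolt #2: derive a set of messages from one input."""
    return [(message_id, f"{payload}-a"), (message_id, f"{payload}-b")]


def process(store, message_id, payload):
    """Steps 2 and 3 for a single tuple emitted by the spout."""
    for row in expand(message_id, payload):
        store.insert(row)


store = InsertOnlyStore()
process(store, 1, "order")  # first attempt: the inserts succeed
# Assume the ack is lost or times out, so the spout replays the same tuple.
process(store, 1, "order")

print(len(store.rows))  # 4 rows were written for a single input message
```

The replay is indistinguishable from a new input unless the sink can
recognize that it has already stored the rows for message 1.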



We don't see the need for batching, and we don't quite see how Trident
would help us in this case. If you could offer suggestions of alternatives
(or point out what we are missing about Trident), we would be very grateful.



Thanks in advance,

JG

Re: Exactly once transactions and storm

Posted by Javier Gonzalez <ja...@gmail.com>.
Hi,

I've checked with the data team and it is not possible to have the database
provide us with the "error on duplicate" check strategy (there is a link
between the incoming message id and the persisted row, but the actual key
is not unique based on it, so a retry wouldn't produce a key clash).

We're looking into using some kind of cache of already-processed incoming
ids to avoid persisting twice to the database. The coordination of this
between all nodes is where it gets tricky. Perhaps we could use ZooKeeper as
the cache? It covers replication and should be easily available to the nodes.
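
A rough Python sketch of the "cache of processed ids" idea follows. It is
only an illustration: ProcessedIdCache is a hypothetical in-process class
standing in for whatever shared store would be used across nodes, and
persist_once and db_rows are made-up names for the persisting bolt and the
database.

```python
import threading


class ProcessedIdCache:
    """In-process stand-in for a shared store of already-persisted ids."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, message_id):
        """Atomically record message_id; True only for the first caller."""
        with self._lock:
            if message_id in self._seen:
                return False
            self._seen.add(message_id)
            return True


def persist_once(cache, db_rows, message_id, rows):
    """Write rows only if this message id has not been claimed before."""
    if not cache.claim(message_id):
        return False  # replayed tuple: skip the insert, but still ack it
    db_rows.extend(rows)
    return True


cache = ProcessedIdCache()
db_rows = []
first = persist_once(cache, db_rows, "msg-1", ["r1", "r2"])
replay = persist_once(cache, db_rows, "msg-1", ["r1", "r2"])
```

Note that the claim and the database write are still two separate steps. If
a worker dies after claiming an id but before the insert completes, the
replay would be skipped and the rows lost; claiming only after the insert
reopens the duplicate window instead.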

Thanks,
Javier


On Tue, Mar 3, 2015 at 1:49 PM, Parth Brahmbhatt <
pbrahmbhatt@hortonworks.com> wrote:

>  I am not really familiar with Cassandra but I think it does support
> conditional insert/update. Something like: INSERT INTO my_table (col1)
> VALUES ('val1') IF NOT EXISTS;
>
>  See if it actually does support conditional insert/update and if you can
> use this feature.
>
>  Thanks
> Parth
>
>   From: Javier Gonzalez <ja...@gmail.com>
> Reply-To: "user@storm.apache.org" <us...@storm.apache.org>
> Date: Tuesday, March 3, 2015 at 10:43 AM
> To: "user@storm.apache.org" <us...@storm.apache.org>
> Subject: Re: Exactly once transactions and storm
>
>  Thanks for your reply. This could work, as the problem domain has a
> unique Id in the incoming stream, but I believe the db will be Cassandra,
> which updates instead of throwing errors when inserting a duplicate key. So
> I can't rely on that.
>
>


-- 
Javier González Nicolini

Re: Exactly once transactions and storm

Posted by Parth Brahmbhatt <pb...@hortonworks.com>.
I am not really familiar with Cassandra but I think it does support conditional insert/update. Something like: INSERT INTO my_table (col1) VALUES ('val1') IF NOT EXISTS;

See if it actually does support conditional insert/update and if you can use this feature.
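
As a local illustration of the conditional-insert idea, the Python sketch
below uses SQLite's INSERT OR IGNORE as a stand-in. This is not Cassandra
and not CQL; the msg_id column, the primary key on it, and the
insert_if_absent helper are assumptions made for the example. The point is
only that the caller learns whether the row was actually written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (msg_id TEXT PRIMARY KEY, col1 TEXT)")


def insert_if_absent(conn, msg_id, value):
    """Return True if the row was inserted, False if msg_id already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO my_table (msg_id, col1) VALUES (?, ?)",
        (msg_id, value),
    )
    conn.commit()
    # rowcount is 1 when a row was written and 0 when the insert was ignored.
    return cur.rowcount == 1


first = insert_if_absent(conn, "msg-1", "val1")
replay = insert_if_absent(conn, "msg-1", "val1")
count = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]
```

Either way the bolt can ack the tuple: a False result just means an earlier
attempt already stored the row.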

Thanks
Parth


Re: Exactly once transactions and storm

Posted by Javier Gonzalez <ja...@gmail.com>.
Hi Parth,

Thanks for your reply. This could work, as the problem domain has a unique
Id in the incoming stream, but I believe the db will be Cassandra, which
updates instead of throwing errors when inserting a duplicate key. So I
can't rely on that.

Best regards,
JG
On Mar 3, 2015 12:45 PM, "Parth Brahmbhatt" <pb...@hortonworks.com>
wrote:

>  Do you have some uniqueness in the messages on which you can base a DB
> constraint? If so, define a unique constraint in the DB. If the spout
> retries, the bolt writing to the DB will fail with a constraint violation,
> and the exception should also tell you which constraint was violated. You
> can ignore the unique constraint violations and ack back so the spout will
> stop retrying.
>
>  It's not clean but should work.
>
>  Thanks
> Parth
>
>   From: Javier Gonzalez <ja...@gmail.com>
> Reply-To: "user@storm.apache.org" <us...@storm.apache.org>
> Date: Tuesday, March 3, 2015 at 9:37 AM
> To: "user@storm.apache.org" <us...@storm.apache.org>
> Subject: Exactly once transactions and storm
>
>  We don't see the need for batching, and we don't quite see how Trident
> would help us in this case. If you could offer suggestions of alternatives
> (or point out what we are missing about Trident), we would be very grateful.
>
>

Re: Exactly once transactions and storm

Posted by Parth Brahmbhatt <pb...@hortonworks.com>.
Do you have some uniqueness in the messages on which you can base a DB constraint? If so, define a unique constraint in the DB. If the spout retries, the bolt writing to the DB will fail with a constraint violation, and the exception should also tell you which constraint was violated. You can ignore the unique constraint violations and ack back so the spout will stop retrying.

It's not clean but should work.
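
A minimal Python sketch of the unique-constraint approach, using an
in-memory SQLite database as a stand-in for the real DB. The table layout,
the (source_id, seq) key, and the persist_and_ack helper are hypothetical;
the string "ack" only marks where a bolt would ack the tuple.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (source_id TEXT, seq INTEGER, body TEXT, "
    "UNIQUE (source_id, seq))"
)


def persist_and_ack(conn, source_id, rows):
    """Insert all rows derived from one input in a single transaction.

    Returns "ack" on success and on a duplicate, so the spout stops
    retrying. Any other database error propagates to the caller.
    """
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO messages (source_id, seq, body) VALUES (?, ?, ?)",
                [(source_id, i, body) for i, body in enumerate(rows)],
            )
    except sqlite3.IntegrityError as exc:
        # Only swallow unique-constraint violations; re-raise anything else.
        if "UNIQUE" not in str(exc):
            raise
    return "ack"


first = persist_and_ack(conn, "msg-1", ["a", "b"])
replay = persist_and_ack(conn, "msg-1", ["a", "b"])
count = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```

Writing all rows for one input in one transaction means a replay either
finds the whole set already stored or none of it.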

Thanks
Parth
