Posted to users@kafka.apache.org by David Garcia <da...@spiceworks.com> on 2016/07/26 16:50:28 UTC

Streaming Application Design with Database Lookups

Hello, we are working on designs for several streaming applications, and a common consideration is the need for occasional external database updates/lookups.  For example, we might be processing a stream of events carrying some kind of local-id, and we occasionally need to resolve the local-id to a global-id using an external service (e.g., a database).

A simple approach is the following:


1.)     Quantify the expected throughput of the topic(s)

2.)     Partition the topic(s) so that no single task is “overwhelmed”

3.)     Combine blocking database calls with an in-memory cache for external-store lookups/updates (see the sketch below)

4.)     If the system isn’t performing fast enough, simply add more partitions and tasks to the application.

Obviously we are assuming that the external database can handle the rate of transactions.
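
To make step 3 concrete, here is a minimal sketch of the blocking approach as a Kafka Streams topology.  Everything database-specific is a stand-in: IdResolver and resolveGlobalId are hypothetical names for whatever client you actually use, the topic names are made up, and the unbounded ConcurrentHashMap would be a bounded/expiring cache in a real application.

import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class IdResolutionApp {

    // Hypothetical blocking client for the external lookup service.
    interface IdResolver {
        String resolveGlobalId(String localId); // one database round trip
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "id-resolution-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        IdResolver resolver = localId -> "global-" + localId; // stub for illustration

        // Unbounded for brevity; a real application would bound and expire entries.
        ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> events = builder.stream("events-with-local-ids");

        // Cache hit: pure in-memory lookup.  Cache miss: the stream thread
        // blocks for one database round trip and remembers the answer.
        events.mapValues(localId -> cache.computeIfAbsent(localId, resolver::resolveGlobalId))
              .to("events-with-global-ids");

        new KafkaStreams(builder.build(), props).start();
    }
}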

Another approach is to process the messages asynchronously.  That is, database callbacks are attached to something like Futures, so the streaming threads are never blocked.  Assuming no data is shared between threads, this behaves essentially like the first approach.  With a thread pool of ‘N’ threads, the amortized per-message latency can never drop below c/N (equivalently, throughput can never exceed N/c lookups per unit time), where ‘c’ is the average latency of a (database + cache) lookup.  This is equivalent to creating N partitions and starting N tasks.  However, this approach does have the advantage of “decoupling” the threads from the partitions (i.e., nothing precludes us from having more database threads than partitions).  The application, however, becomes much more complicated.
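
A bare-bones sketch of that decoupling, independent of any particular streaming framework (fetchGlobalId is again a hypothetical stand-in for a blocking database client): the stream/consumer thread only enqueues work on a fixed pool and attaches a callback, so the pool size, not the partition count, bounds database concurrency.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncLookup {

    // Hypothetical blocking database call.
    static String fetchGlobalId(String localId) {
        return "global-" + localId; // stub for illustration
    }

    public static void main(String[] args) {
        // N database threads, chosen independently of the partition count.
        ExecutorService dbPool = Executors.newFixedThreadPool(16);
        List<CompletableFuture<Void>> pending = new ArrayList<>();

        // For each record, the stream thread only enqueues work and attaches
        // a callback; it never blocks on the database itself.
        for (String localId : List.of("local-1", "local-2", "local-3")) {
            pending.add(CompletableFuture
                .supplyAsync(() -> fetchGlobalId(localId), dbPool)
                .thenAccept(globalId ->
                    System.out.printf("%s -> %s%n", localId, globalId)));
        }

        // The catch: offsets must not be committed until the futures for all
        // earlier records on that partition have completed.
        pending.forEach(CompletableFuture::join);
        dbPool.shutdown();
    }
}

The offset bookkeeping hinted at in the last comment is exactly what makes this approach much more complicated in practice.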


Any other approaches?

I would appreciate any design suggestions anyone has.  Thx!
-David

Re: Streaming Application Design with Database Lookups

Posted by Guozhang Wang <wa...@gmail.com>.
Hello David,

There is another remote store IO saving idea from a recent paper:

https://scontent.xx.fbcdn.net/t39.2365-6/13331599_975087972607457_1796386216_n/Realtime_Data_Processing_at_Facebook.pdf

"A monoid is an algebraic structure that has an identity element and is
associative. When a monoid processor’s application needs to access state
that is not in memory, mutations are applied to an empty state (the
identity element). Periodically, the existing database state is loaded in
memory, merged with the in-memory partial state, and then written out to
the database asynchronously. This read-merge-write pattern can be done less
often than the read-modify-write. When the remote database supports a
custom merge operator, then the merge operation can happen in the database.
The read-modify-write pattern is optimized to an append-only pattern,
resulting in performance gains."

Basically, the main idea is that if you can delay the writeback to the
remote store, or transform it into an append instead of an overwrite (for
merging aggregates, for example), then you can perform such writes
asynchronously. This is similar to your second approach but may be a bit
simpler.
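
As a toy illustration of the read-merge-write idea, take per-key counts, which form a monoid (identity 0, merge = addition).  Mutations are applied to in-memory partial state that starts from the identity element, and a periodic flush performs the read-merge-write against the remote store.  This is only a sketch of the pattern described in the paper; loadCount and storeCount are hypothetical stand-ins for your store's client.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class MonoidCounter {

    // Hypothetical remote-store client: one read and one write per flushed key.
    static long loadCount(String key) { return 0L; }   // stub
    static void storeCount(String key, long value) { } // stub

    // Partial state; every key implicitly starts at the identity element (0).
    private final ConcurrentHashMap<String, Long> partial = new ConcurrentHashMap<>();

    // Hot path: a pure in-memory merge, no store IO at all.
    void increment(String key, long delta) {
        partial.merge(key, delta, Long::sum);
    }

    // Cold path: read-merge-write, run far less often than a per-message
    // read-modify-write would be.
    void flush() {
        for (String key : partial.keySet()) {
            Long delta = partial.remove(key); // increments arriving after the
            if (delta != null) {              // remove land in the next flush
                storeCount(key, loadCount(key) + delta);
            }
        }
    }

    public static void main(String[] args) {
        MonoidCounter counter = new MonoidCounter();
        ScheduledExecutorService flusher = Executors.newSingleThreadScheduledExecutor();
        flusher.scheduleAtFixedRate(counter::flush, 10, 10, TimeUnit.SECONDS);
        counter.increment("page-1", 1); // hot path, never touches the store
    }
}

If the store supports a custom merge operator (or even plain atomic increments), the load step disappears and the flush becomes append-only, which is where the performance gains quoted above come from.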


Guozhang



-- 
-- Guozhang