Posted to users@kafka.apache.org by Grant Henke <gh...@cloudera.com> on 2015/11/02 17:44:03 UTC

Re: Topic per entity

Hi Alex & Andrew,

There was a discussion on this mailing list a while back, titled "mapping
events to topics", with some pointers on this. I suggest taking a look at that thread:
http://search-hadoop.com/m/uyzND1vJsUuYtGD91/mapping+events+to+topics&subj=mapping+events+to+topics

If you still have questions, don't hesitate to ask.
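
The alternative usually suggested in those threads is a single shared topic with the entity ID as the message key, so that all of an entity's records land in the same partition, in order, without needing a topic per entity. Here is a rough sketch of that keying idea (the topic name, partition count, and hash function are illustrative only; Kafka's Java client actually uses murmur2 for its default partitioner):

```python
import hashlib

# Hypothetical partition count for a single shared "entities" topic.
NUM_PARTITIONS = 12

def partition_for(entity_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map an entity ID to a partition.

    This mimics keyed partitioning: every record for a given entity
    hashes to the same partition, so per-entity ordering is preserved
    and consumers can filter on the key rather than subscribing to a
    per-entity topic. (md5 here is just for illustration, not Kafka's
    actual partitioner.)
    """
    digest = hashlib.md5(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Records for the same entity always land in the same partition:
assert partition_for("dataset-42") == partition_for("dataset-42")
assert 0 <= partition_for("dataset-7") < NUM_PARTITIONS
```

This sidesteps the per-topic filesystem and ZooKeeper overhead while keeping a stable per-entity ordering guarantee.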

Thanks,
Grant



On Sat, Oct 31, 2015 at 3:19 AM, Andrew Stevenson <as...@outlook.com>
wrote:

> I too would be interested in any responses to this question.
>
> I'm using Kafka for event notification and, once it's secured, will put the
> real payload in it and take advantage of the durable commit log. I want to
> let users describe a DAG in OrientDB and have the Kafka client processor
> load and execute it. Each processor would then attach its lineage and
> provenance back to OrientDB's graph store.
>
> This way I can let users replay stress scenarios, calculate VaR, etc., with
> one source of replayable truth. Compliance and regulatory authorities like
> this.
>
> Regards
>
> Andrew
> ________________________________
> From: Alex Buchanan<ma...@gmail.com>
> Sent: 31/10/2015 05:30
> To: users@kafka.apache.org<ma...@kafka.apache.org>
> Subject: Topic per entity
>
> Hey Kafka community.
>
> I'm researching a possible architecture for a distributed data processing
> system. In this system, there's a close relationship between a specific
> dataset and the processing code. A user might upload a few datasets and
> write code to run analysis on that data. In other words, the analysis code
> frequently pulls data from a specific entity.
>
> Kafka is attractive for lots of reasons:
> - I'll need messaging anyway
> - I want a model for immutability of data (mutable state and potential job
> failure don't mix)
> - cross-language clients
> - the change stream concept could have some nice uses (such as updating
> visualizations without rebuilding)
> - Samza's model of state management is a simple way to think about external
> data without worrying too much about network-based RPC
> - as a source of truth data store, it's really simple. No mutability,
> complex queries, etc. Just a log. To me, that helps prevent abuse and
> mistakes.
> - it fits well with the concept of pipes, frequently found in data analysis
>
> But most of the Kafka examples are about processing a large stream of a
> specific _type_, not so much about processing specific entities. And I
> understand there are limits on the number of topics (file/node limits on
> the filesystem and in ZooKeeper) and it's discouraged to model topics based
> on characteristics of the data. In this system, it feels more natural to
> have a topic per entity so the processing code can connect directly to the
> data it wants.
>
> So I need a little guidance from smart people. Am I lost in the rabbit
> hole? Maybe I'm trying to force Kafka into this territory it's not suited
> for. Have I been reading too many (awesome) articles about the role of the
> log and streaming in distributed computing? Or am I on the right track and
> I just need to put in some work to jump the hurdles (such as topic storage
> and coordination)?
>
> It sounds like Cassandra might be another good option, but I don't know
> much about it yet.
>
> Thanks guys!
>



-- 
Grant Henke
Software Engineer | Cloudera
grant@cloudera.com | twitter.com/gchenke | linkedin.com/in/granthenke