Posted to users@kafka.apache.org by Alex Buchanan <bu...@gmail.com> on 2015/10/31 05:24:30 UTC

Topic per entity

Hey Kafka community.

I'm researching possible architectures for a distributed data processing
system. In this system, there's a close relationship between a specific
dataset and the processing code. The user might upload a few datasets and
write code to run analysis on that data. In other words, the analysis code
frequently pulls data from a specific entity.

Kafka is attractive for lots of reasons:
- I'll need messaging anyway
- I want a model for immutability of data (mutable state and potential job
failure don't mix)
- cross-language clients
- the change stream concept could have some nice uses (such as updating
visualizations without rebuilding)
- Samza's model of state management is a simple way to think of external
data without worrying too much about network-based RPC
- as a source-of-truth data store, it's really simple. No mutability, no
complex queries, etc. Just a log. To me, that helps prevent abuse and
mistakes.
- it fits well with the concept of pipes, frequently found in data analysis

But most of the Kafka examples are about processing a large stream of a
specific _type_, not so much about processing specific entities. And I
understand there are limits to topics (file/node limits on the filesystem
and in zookeeper) and it's discouraged to model topics based on
characteristics of data. In this system, it feels more natural to have a
topic per entity so the processing code can connect directly to the data it
wants.
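
To make the two layouts I'm comparing concrete, here is a rough sketch in
plain Python. It uses no Kafka client at all. The topic naming scheme, the
partition count, and the CRC32 hash are all assumptions made up for
illustration; CRC32 only stands in for whatever partitioner a real client
would use.

```python
import zlib

NUM_PARTITIONS = 8  # assumed partition count for the single shared topic


def topic_per_entity(entity_id: str) -> str:
    """Layout A: one topic per entity (hypothetical naming scheme)."""
    return f"dataset-{entity_id}"


def partition_for(entity_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Layout B: one shared topic, with the entity ID as the message key.

    CRC32 is a stand-in hash that gives a stable key -> partition mapping;
    it is not claimed to be what any Kafka client actually uses.
    """
    return zlib.crc32(entity_id.encode("utf-8")) % num_partitions


def route(entity_ids):
    """Return (topics needed by layout A, partitions touched by layout B)."""
    topics = {topic_per_entity(e) for e in entity_ids}
    partitions = {partition_for(e) for e in entity_ids}
    return topics, partitions


if __name__ == "__main__":
    # 100 hypothetical users with 5 datasets each
    ids = [f"user{u}-set{s}" for u in range(100) for s in range(5)]
    topics, partitions = route(ids)
    print(len(topics), "topics vs", len(partitions), "partitions")
```

In layout A the number of topics grows with the number of entities, which
is where the limits I mentioned come in. In layout B the count stays fixed,
but code that wants one entity has to skip over other entities' messages
that land in the same partition.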

So I need a little guidance from smart people. Am I lost in the rabbit
hole? Maybe I'm trying to force Kafka into territory it's not suited
for. Have I been reading too many (awesome) articles about the role of the
log and streaming in distributed computing? Or am I on the right track and
I just need to put in some work to jump the hurdles (such as topic storage
and coordination)?

It sounds like Cassandra might be another good option, but I don't know
much about it yet.

Thanks guys!