Posted to users@kafka.apache.org by Jay Kreps <ja...@gmail.com> on 2013/01/08 00:01:33 UTC

LinkedIn's Kafka->Hadoop ETL pipeline is open source

Hey All,

There has been interest in getting something a little more sophisticated
than the Input- and OutputFormat we include in contrib for reading Kafka
data into HDFS.

Internally at LinkedIn we have had a pretty sophisticated system that we
use for Kafka ETL. It automatically discovers topics, does date
partitioning, balances load for many topics, etc. We have wanted to open
source this for a while but haven't really had time to spend on it. This
code is now open source:
  https://github.com/linkedin/camus
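For the curious, Camus runs as a Hadoop MapReduce job driven by a properties
file. A minimal sketch of what a run might look like follows; the property
names and entry-point class here are based on the current repo layout and may
well change between versions, so treat them as illustrative rather than
authoritative:

```properties
# Sketch of a camus.properties file -- names may vary by version.
# Kafka brokers to pull data from.
kafka.brokers=broker1:9092,broker2:9092
# Final, date-partitioned output location in HDFS.
etl.destination.path=/data/kafka/topics
# Working and offset-history paths used to make runs incremental.
etl.execution.base.path=/camus/exec
etl.execution.history.path=/camus/exec/history
```

The job would then be launched as a normal Hadoop jar, something like
"hadoop jar camus-example.jar com.linkedin.camus.etl.kafka.CamusJob -P
camus.properties", typically from cron so each run picks up where the last
one's stored offsets left off.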

Ken Goodhope is the lead for this system. If you have any questions there
is a mailing list here:
  camus_etl@googlegroups.com

We haven't done much packaging work on this yet, so documentation is sparse
and it takes a bit of work to get set up. It is probably most
appropriate for people who would be taking a "white box" approach to the
code. We have had interest from a few groups in contributing and we are
definitely interested in recruiting this kind of help. All our own
development going forward will be done off the public github repo, as usual
with LinkedIn open source projects.

Until we get better docs up, you can get a pretty good high-level overview
of our setup from this paper:
  http://sites.computer.org/debull/A12june/pipeline.pdf

-Jay