Posted to mapreduce-user@hadoop.apache.org by Casey Green <cg...@conductor.com> on 2014/10/30 15:32:25 UTC

A more scalable Kafka to Hadoop InputFormat

Hi Folks,

I'm open sourcing a scalable Kafka InputFormat.  As far as I know, my version is unique among the Kafka InputFormats out there, in that input splits are mapped to individual Kafka log files rather than to entire Kafka partitions.  Mapping input splits to log files scales your Map/Reduce job with the amount of data left to consume in the queue, whereas mapping input splits to entire partitions always gives you a constant number of input splits.
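To give a feel for the difference, here is a minimal, self-contained Java sketch (not the actual kangaroo API -- the Segment class and splitsForPartition method are hypothetical names for illustration).  Each Kafka partition's on-disk log is a sequence of segment files, so you can emit one split per segment that still holds unconsumed messages, rather than one split per partition:

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentSplits {

    // A Kafka partition's log is a series of segment files; each covers
    // a contiguous offset range. (Hypothetical type for this sketch.)
    static class Segment {
        final long startOffset;
        final long endOffset; // exclusive
        Segment(long startOffset, long endOffset) {
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    // One split per segment containing messages past the committed offset.
    // A partition-per-split scheme would always return exactly one split here.
    static List<Segment> splitsForPartition(List<Segment> segments,
                                            long committedOffset) {
        List<Segment> splits = new ArrayList<>();
        for (Segment seg : segments) {
            if (seg.endOffset > committedOffset) {
                splits.add(seg);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // A partition whose log is four 1000-message segments.
        List<Segment> segments = List.of(
            new Segment(0, 1000), new Segment(1000, 2000),
            new Segment(2000, 3000), new Segment(3000, 4000));

        // Nearly caught up: only the last segment is unconsumed -> 1 split.
        System.out.println("splits at offset 3500: "
            + splitsForPartition(segments, 3500).size());

        // Large backlog: all four segments are unconsumed -> 4 splits.
        System.out.println("splits at offset 500:  "
            + splitsForPartition(segments, 500).size());
    }
}
```

With a partition-per-split InputFormat both cases above would produce a single split, so a consumer that has fallen far behind gets no extra parallelism; the per-segment scheme lets the number of mappers grow with the backlog.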

I wrote up a blog post about it here: http://www.conductor.com/nightlight/data-stream-processing-bulk-kafka-hadoop/ and the source code for my KafkaInputFormat is on GitHub: https://github.com/Conductor/kangaroo.  Your questions, comments, and feedback are welcome and much appreciated!

Thanks,
Casey Green