You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@crunch.apache.org by "Micah Whitacre (JIRA)" <ji...@apache.org> on 2016/04/22 23:33:12 UTC

[jira] [Updated] (CRUNCH-606) Create a KafkaSource

     [ https://issues.apache.org/jira/browse/CRUNCH-606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Micah Whitacre updated CRUNCH-606:
----------------------------------
    Attachment: CRUNCH-606.patch

So the approach I'm going with the source is that callers are going to be responsible for managing start/stop offsets.  They are also responsible for persisting that information somewhere.  I think in a later request I'll work on adding some convenience methods for that.

Posting progress here for comments or feedback.  There is some work to be done, specifically:
* More tests around some of the utilities in KafkaUtils.
* Finish flushing out the KafkaSourceIT (hitting a ClassCastException from String to Text)
* Finish flushing out the KafkaSource.read(...) method because it stops us from being able to call materialize on a read PCollection.
* Reogranize the tests because right now the "long running" (but not really) tests are in surefire vs our convention of putting them in src/it/tests

One specific feedback is if I should keep with the path of making the KafkaSource extend FileInputFormat or if I'm trying to put a square peg in a round hole.  

> Create a KafkaSource
> --------------------
>
>                 Key: CRUNCH-606
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-606
>             Project: Crunch
>          Issue Type: New Feature
>          Components: IO
>            Reporter: Micah Whitacre
>            Assignee: Micah Whitacre
>         Attachments: CRUNCH-606.patch
>
>
> Pulling data out of Kafka is a common use case and some of the ways to do it Kafka Connect, Camus, Gobblin do not integrate nicely with existing processing pipelines like Crunch.  With Kafka 0.9, the consuming API is a lot easier so we should build a Source implementation that can read from Kafka.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)