You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@spark.apache.org by "Apache Spark (JIRA)" <ji...@apache.org> on 2015/04/05 12:40:33 UTC

[jira] [Commented] (SPARK-6714) additionally overload KafkaUtils.createDirectStream for using a messageHandler without having to specify the offsets

    [ https://issues.apache.org/jira/browse/SPARK-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14396190#comment-14396190 ] 

Apache Spark commented on SPARK-6714:
-------------------------------------

User 'juanrh' has created a pull request for this issue:
https://github.com/apache/spark/pull/5367

> additionally overload KafkaUtils.createDirectStream for using a messageHandler without having to specify the offsets
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-6714
>                 URL: https://issues.apache.org/jira/browse/SPARK-6714
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.3.0
>            Reporter: Juan Rodríguez Hortalá
>            Priority: Trivial
>              Labels: kafka, streaming
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Currently, in the Scala API, KafkaUtils.createDirectStream has two overload methods, one for an "easy mode" where only the Kafka parameters and topics are specified, and other "hard mode" where we can specify the offsets and a messageHandler for manipulating the MessageAndMetadata values obtained from Kafka. I think an intermediate method that automatically handles the offsets, but that allows you to specify the messageHandler would be very useful. For example the triple (topic, partition, offset) uniquely identifies each message, so that could be useful as primary key for idempotent insertions in an external database. Also, in an implementation of a lambda architecture, offsets could be used to trace which part of a kafka topic has been covered by the batch view, and which part by the real-time / live view. For both cases I think that having access to the meta information through a messageHandler, while maintaining managed offsets, would be interesting
> A proposal for the implementation is available at https://github.com/juanrh/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala. A new overload of KafkaUtils.createDirectStream for the Scala API, and another for the Java API, have been added



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org