Posted to issues@spark.apache.org by "Mario Briggs (JIRA)" <ji...@apache.org> on 2016/01/20 19:16:39 UTC

[jira] [Comment Edited] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

    [ https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15109056#comment-15109056 ] 

Mario Briggs edited comment on SPARK-12177 at 1/20/16 6:16 PM:
---------------------------------------------------------------

bq. I totally understand what you mean. However, kafka has its own assembly in Spark and the way the code is structured right now, both the new API and old API would go in the same assembly so it's important to have a different package name. Also, I think for our end users transitioning from old to new API, I foresee them having 2 versions of their spark-kafka app. One that works with the old API and one with the new API. And, I think it would be an easier transition if they could include both the kafka API versions in the spark classpath and pick and choose which app to run without mucking with maven dependencies and re-compiling when they want to switch. Let me know if you disagree.

Hey Mark,
Since this is WIP, I myself have at least one more option that differs from what I suggested above; I put just one out there to get the conversation rolling, so thanks for chipping in. Thanks for bringing up the kafka-assembly part. The assembly jar includes all dependencies too, so we would be pulling in Kafka's v0.9 jars, I guess? I remember Cody mentioning that a v0.8 to v0.9 upgrade on the Kafka side involves upgrading the brokers; I think that is not required when the client uses a v0.9 jar but consumes only the older high-level/low-level API and talks to a v0.8 Kafka cluster. So if we go with a single kafka-assembly, my suggestion above is itself broken, since both code paths would collide on the same package and class names.
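For readers following along, here is a minimal sketch (as it would look in spark-shell or the Scala REPL) of the new 0.9 consumer API this whole discussion is about; the broker address, group id and topic name are placeholders of mine, not values from this JIRA:

{code}
import java.util.{Arrays, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

// The new org.apache.kafka.clients.consumer API (Kafka 0.9), which replaces the
// old kafka.consumer high-level/low-level consumers.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")   // placeholder broker
props.put("group.id", "example-group")             // placeholder group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("test"))          // placeholder topic
val records = consumer.poll(100)                   // ConsumerRecords[String, String]
records.asScala.foreach(r => println(s"${r.topic()}-${r.partition()}@${r.offset()}: ${r.value()}"))
consumer.close()
{code}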

One thought behind not putting the version in the package name or class name (I see that Flink does it in the class name) was to avoid forcing us to create v0.10/v0.11 packages (and forcing customers to change code and recompile) even if those Kafka releases have no client-API or other changes that warrant a new version. (Also, Spark has gotten away without a version number until now, which means less work in Spark, so I'm not sure we want to start forcing this work going forward.) Once we introduce the version number, we need to keep it in sync with Kafka.
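To make the cost to users concrete, this is the kind of churn versioned package names cause in application code; the v09 package name comes from this JIRA, while the v010 line is a purely hypothetical future package shown only for illustration:

{code}
// Unversioned package: user code keeps compiling across Kafka client upgrades.
import org.apache.spark.streaming.kafka.KafkaUtils

// Versioned packages: every bump means an import change and a recompile, even if
// the client API itself did not change.
// import org.apache.spark.streaming.kafka.v09.KafkaUtils    // proposed in this JIRA
// import org.apache.spark.streaming.kafka.v010.KafkaUtils   // hypothetical future package
{code}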

That's why one earlier idea I mentioned in this JIRA was: 'The public API signatures (of KafkaUtils in the v0.9 subproject) are different and do not clash (with KafkaUtils in the original kafka subproject) and hence can be added to the existing (original kafka subproject) KafkaUtils class.' This also addresses the issues you mention above. Cody mentioned that we need to get others on the same page for this idea, so I guess we really need the committers to chime in here. Of course, I forgot to answer Nikita's follow-up question - 'do you mean that we would change the original KafkaUtils by adding new functions for the new DirectIS/KafkaRDD but using them from a separate module with kafka09 classes?' To be clear, these new public methods added to the original kafka subproject's KafkaUtils will make use of the DirectKafkaInputDStream, KafkaRDD, KafkaRDDPartition and OffsetRange classes that live in a new v09 package (internal, of course). In short, we don't have a new subproject. (I skipped the KafkaCluster class from that list, because I am thinking it makes more sense to call this class something like 'KafkaClient' going forward.)
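As a rough sketch of what 'different signatures that do not clash' could look like in one KafkaUtils class: the first method below is a simplified form of the existing 0.8 direct-stream API (implicit ClassTags omitted), while the 0.9-flavoured overload is my own illustration of the idea, not the actual patch:

{code}
import kafka.serializer.Decoder
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream

object KafkaUtilsSketch {
  // Simplified shape of the existing 0.8 API in the original kafka subproject.
  def createDirectStream[K, V, KD <: Decoder[K], VD <: Decoder[V]](
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]): InputDStream[(K, V)] = ???

  // Hypothetical 0.9-based overload: a different type-parameter list and different
  // parameter types, so it can live in the same class without clashing. Internally
  // it would delegate to the DirectKafkaInputDStream/KafkaRDD classes in the new
  // (internal) v09 package.
  def createDirectStream[K, V](
      ssc: StreamingContext,
      kafkaParams: java.util.Map[String, Object],
      topics: java.util.Collection[String]): InputDStream[ConsumerRecord[K, V]] = ???
}
{code}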


> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --------------------------------------------------
>
>                 Key: SPARK-12177
>                 URL: https://issues.apache.org/jira/browse/SPARK-12177
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.6.0
>            Reporter: Nikita Tarasenko
>              Labels: consumer, kafka
>
> Kafka 0.9 has already been released, and it introduces a new consumer API that is not compatible with the old one. So I added the new consumer API, creating separate classes in the package org.apache.spark.streaming.kafka.v09 with the changed API. I did not remove the old classes, for better backward compatibility: users will not need to change their old Spark applications when they upgrade to a new Spark version.
> Please review my changes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org