Posted to user@spark.apache.org by Mariano Semelman <ma...@despegar.com> on 2016/09/06 20:51:13 UTC

Q: Multiple spark streaming app, one kafka topic, same consumer group

Hello everybody,

I am trying to understand how the Kafka direct stream works. I'm interested
in having a production-ready Spark Streaming application that consumes a
Kafka topic, but I need to guarantee (almost) no downtime, especially
during deploys (and submits) of new versions. What seems to be the best
solution is to deploy and submit the new version without shutting down the
previous one, wait for the new application to start consuming events, and
then shut down the previous one.

What I would expect is that the events get distributed between the two
applications in a balanced fashion, using the consumer group id and split
by the partition key that I've previously set in my Kafka producer.
However, the Kafka direct stream doesn't seem to support this
functionality.

I've achieved this with the receiver-based approach (by the way, I've used
"kafka" for the "offsets.storage" Kafka property[2]). However, this
approach comes with the technical difficulties described in the
documentation[1] (i.e. around exactly-once semantics).
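
For reference, storing offsets in Kafka with the old (0.8) high-level consumer comes down to the `offsets.storage` property from [2]. A minimal sketch of the relevant consumer properties (the group id is a placeholder):

```
# Old (0.8) high-level consumer settings -- property names per [2]
group.id=my-streaming-app    # placeholder consumer group
offsets.storage=kafka        # commit offsets to Kafka instead of ZooKeeper
dual.commit.enabled=false    # stop dual-committing to ZooKeeper once migrated
```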

Anyway, not even this approach seems very failsafe. Does anyone know a way
to safely deploy new versions of a streaming application of this kind
without downtime?

Thanks in advance

Mariano


[1] http://spark.apache.org/docs/latest/streaming-kafka-integration.html
[2] http://kafka.apache.org/documentation.html#oldconsumerconfigs

Re: Q: Multiple spark streaming app, one kafka topic, same consumer group

Posted by Cody Koeninger <co...@koeninger.org>.
In general, see the material linked from
https://github.com/koeninger/kafka-exactly-once  if you want a better
understanding of the direct stream.

For spark-streaming-kafka-0-8, the direct stream doesn't really care
about consumer group, since it uses the simple consumer.  For the 0.10
version, it uses the new kafka consumer, so consumer group does
matter.  In either case, splitting events across old and new versions
of the job is not what I would want.
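
To see why sharing a group id splits rather than duplicates the work: with the new (0.10) consumer, the partitions of a topic are divided among the live members of a group, so the old and the new job would each receive only part of the stream. A toy, Kafka-free Python model of range-style assignment (the app names are illustrative, not from the thread):

```python
# Toy model of Kafka's range-style partition assignment: within one consumer
# group, partitions are divided among live consumers, so two deployments of
# the "same" app sharing a group id would each see only part of the stream.
def assign_partitions(partitions, consumers):
    """Split partition ids across consumers, range-assignor style."""
    consumers = sorted(consumers)
    n, r = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, c in enumerate(consumers):
        count = n + (1 if i < r else 0)
        assignment[c] = partitions[start:start + count]
        start += count
    return assignment

# Old and new deployments sharing a group id: each gets half the partitions.
print(assign_partitions(list(range(6)), ["app-v1", "app-v2"]))
# With separate group ids, each app would consume every partition instead.
print(assign_partitions(list(range(6)), ["app-v2"]))
```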

I'd suggest making sure that your outputs are idempotent or
transactional, and that the new app has a different consumer group
(for versions for which it matters). Start up the new app, make sure
it is running (even if it errors due to transactional safeguards),
then shut down the old app.
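
One common way to get the idempotence mentioned above is to key each output record by (topic, partition, offset), so a record replayed by either the old or the new job is a no-op. A small sketch in plain Python with SQLite (no Spark involved; table and column names are illustrative, not from the thread):

```python
import sqlite3

# Sketch of an idempotent sink: the primary key (topic, partition, offset)
# means re-delivering the same record, e.g. from an overlapping old/new job,
# leaves the table unchanged.
def init_sink(conn):
    conn.execute("""
        CREATE TABLE IF NOT EXISTS events (
            topic TEXT, partition INTEGER, "offset" INTEGER,
            payload TEXT,
            PRIMARY KEY (topic, partition, "offset")
        )""")

def write_record(conn, topic, partition, offset, payload):
    # INSERT OR IGNORE makes the write idempotent under replays.
    conn.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?)",
        (topic, partition, offset, payload))

conn = sqlite3.connect(":memory:")
init_sink(conn)
write_record(conn, "clicks", 0, 42, "a")
write_record(conn, "clicks", 0, 42, "a")  # replayed delivery: ignored
count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # one row despite the duplicate delivery
```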



---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org