Posted to dev@kafka.apache.org by Paul Whalen <pg...@gmail.com> on 2019/03/24 17:31:24 UTC

MirrorMaker 2.0 and Streams interplay (topic naming control)

Hi all,

With MirrorMaker 2.0 (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-382%3A+MirrorMaker+2.0)
accepted and coming along very nicely in development, it has got me
wondering whether a certain use case is supported and, if not, whether
changes can be made to Streams or MM2 to support it.  I'll explain the use
case, but the TL;DR here is "do we need more control over topic naming in
MM2 or Streams?"

My team foresees using MM2 as a way to mirror data from our prod
environment to a pre-prod environment.  The data is supplied by external
vendors, introduced into our system through a Kafka Streams ETL pipeline,
and consumed by our end-applications.  Generally we would only like to run
the ETL pipeline in prod since there is an operational cost to running it
in both prod and pre-prod (the data sometimes needs manual attention).
This seems to fit MM2 well: pre-prod end-applications consume from the
pre-prod Kafka cluster, which is entirely "remote" topics being mirrored
from the prod cluster.  We only have to keep one instance of the ETL
pipeline running, but end-applications can be separate, connecting to their
respective prod and pre-prod Kafka clusters.

However, when we want to test changes to the ETL pipeline itself, we would
like to turn off the mirroring from prod to pre-prod, and run the ETL
pipeline also in pre-prod, picking up the most recent state of the prod
pipeline from when mirroring was turned off (FWIW, downtime is not an issue
for our use case).

My question/concern is basically: can Streams apps work when they're
running against topics prepended with a cluster alias, like
"pre-prod.App-statestore-changelog", as is the plan with MM2?  From what I
can tell the answer is no, and my proposal would be to give the Streams
user more specific control over how Streams names its internal topics
(repartition and changelogs) by defining an "InternalTopicNamingStrategy"
or similar.  Perhaps there is a solution on the MM2 side as well, but it
seems much less desirable to budge on that convention.
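
To make the mismatch concrete, here is a minimal sketch in Java. It assumes
that Streams names a store's changelog topic as applicationId + "-" +
storeName + "-changelog" and that MM2 names a mirrored topic as the cluster
alias, a ".", and the original topic name. The class and helper names
(TopicNames, changelogTopic, mirroredTopic) are hypothetical and exist only
for illustration; they are not Streams or MM2 APIs.

```java
public class TopicNames {
    // Assumed Streams convention for a state store's changelog topic.
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // Assumed MM2 convention: mirrored topics carry a cluster alias prefix.
    static String mirroredTopic(String clusterAlias, String topic) {
        return clusterAlias + "." + topic;
    }

    public static void main(String[] args) {
        // The name a Streams app with application.id "App" would look for.
        String expected = changelogTopic("App", "statestore");
        // The name the mirrored copy would have, using the alias from the example above.
        String mirrored = mirroredTopic("pre-prod", expected);

        System.out.println("Streams looks for: " + expected);
        System.out.println("Mirror provides:   " + mirrored);
        System.out.println("Names match: " + expected.equals(mirrored)); // prints "Names match: false"
    }
}
```

Under those assumptions the two names never coincide for an unchanged
application id, which is the gap this question is about.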

I phrased the question in terms of my team's problem, but it's worth noting
that this use case is passably similar to a potential DR use case, where
there is a DR cluster that is normally just being mirrored to by MM2, but
in a DR scenario would become the active cluster that Streams applications
are connected to.

Thanks for considering this issue, and great job to those working on MM2 so
far!

Paul

Re: MirrorMaker 2.0 and Streams interplay (topic naming control)

Posted by John Roesler <jo...@confluent.io>.
Hi Paul,

No problem! And please let us know how it goes.

Thanks,
-John


Re: MirrorMaker 2.0 and Streams interplay (topic naming control)

Posted by Paul Whalen <pg...@gmail.com>.
John,

You make a good case for it already being a public API, so my nerves are
definitely eased on that front. I do think we have a path to move forward
with the user space solution, and if I get a chance, I'm going to try
proving it out with a trivial demo using an early MM2 build - but it's nice
to hear your support of the use case regardless.  The ACL concern makes a
lot of sense, and while I don't think it would be a deal breaker because of
what you say about advanced control naturally requiring extra care, I'm
generally against the added complexity of custom topic naming unless we
really need it.  It looks like MM2 will also support optional ACL
mirroring, so that should only make things easier.

Regarding the management burden of doing these switchovers: fortunately our
case is something like running in pre-prod maybe 3 consecutive days out of
the month, and just prod for the rest of the month.  So if it wasn't the
most effortless or fast process we could tolerate it.  Though if it was
easy I wouldn't be surprised if others wanted a similar workflow with much
faster iteration - spinning up a new environment with the same data as prod
is always a boon.

Thanks again!
Paul


Re: MirrorMaker 2.0 and Streams interplay (topic naming control)

Posted by John Roesler <jo...@confluent.io>.
Hi Paul,

Sorry for overlooking the "offset translation" MM2 feature. I'm glad
Ryanne was able to confirm this would work.

I'm just one voice, but FWIW, I think that the internal topic naming
scheme is a public API. We document the structure of the naming
scheme in several places. We also recommend making use of the fact
that the applicationId is a prefix of the topic name in conjunction with
Kafka Broker ACLs to grant access to the internal topics to the
applications that own them.

Actually, for this latter reason, I'm concerned that giving more control
over the names of internal topics might make topic security and
access control more difficult. Or maybe this concern is off-base, and
folks who take advanced control over the topic name would also take
on the responsibility to make sure their naming scheme works in
conjunction with their broker configs.

For whatever reason, I hadn't considered prefixing the application's
id with "pre-prod.". Offhand, I think this would achieve the desired
outcome. There may be some devil in the details, of course.
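
A minimal sketch of that idea in Java, assuming the environment prefix comes
from deployment config and that "application.id" both prefixes internal
topic names and serves as the consumer group id. The class and method names
(AppIdConfig, streamsProperties) are hypothetical; only the "application.id"
key is a real Streams setting.

```java
import java.util.Properties;

public class AppIdConfig {
    // envPrefix is "" in prod, or e.g. "pre-prod." when running against mirrored topics.
    static Properties streamsProperties(String envPrefix, String baseApplicationId) {
        Properties props = new Properties();
        // The application id prefixes internal topic names and is used as the group id.
        props.setProperty("application.id", envPrefix + baseApplicationId);
        return props;
    }

    public static void main(String[] args) {
        Properties preProd = streamsProperties("pre-prod.", "App");
        String appId = preProd.getProperty("application.id");
        // Assuming the changelog convention applicationId + "-" + storeName + "-changelog",
        // the app would now look for the prefixed changelog topic.
        System.out.println(appId + "-statestore-changelog");
    }
}
```

The remaining properties (bootstrap servers, serdes and so on) would be set
as usual; whether the translated consumer offsets line up with the renamed
group is one of the details that would still need to be verified.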


Glad to hear, by the way, that you've already considered the problem
of concurrent modifications to the changelogs (etc.). It sounds like
your plan should work, although it might become a management burden
if you start wanting to run a lot of these stream-app tests. In that case,
you could consider mirroring the relevant topics *again* into a
test-specific prefix (like "pre-prod.test-1."), up to some point. Then, you
could stop the mirror, run the test, verify the results, and then just
delete the whole test dataset.


Does it seem like you have a good path forward? From what I'm
hearing, the "user-space" approach is at least worth exploring before
considering a new API. Of course, if it doesn't pan out for whatever reason,
I'd (personally) support adding whatever features are necessary to support
your use case.

Thanks,
-John



On Mon, Mar 25, 2019 at 9:40 PM Paul Whalen <pg...@gmail.com> wrote:

> John and Ryanne,
>
> Thanks for the responses! I think Ryanne's way of describing the question
> is actually a much better summary than my long-winded description: "a
> Streams app can switch between topics with and without a cluster alias
> prefix when you migrate between prod and pre-prod, while preserving state."
>
> To address a few of John's points...
>
> > But, the prod app will still be running, and its changelog will still be
> > mirrored into pre-prod when you start the pre-prod app.
> >
> The idea is actually to turn off the mirroring from prod to pre-prod during
> this period, so the environments can operate completely independently and
> their state can comfortably diverge during the testing period.  After the
> testing period we'd be happy to throw away everything in pre-prod and start
> mirroring again from prod with a blank slate.
>
> > Also, the pre-prod app won't be in the same consumer group as the prod app,
> > so it won't know from what offset to start processing input.
> >
> This is where I'm hoping the magic of MM2 will come in - at the time we
> shut off mirroring from prod to pre-prod in order to spin up the pre-prod
> environment, we will do an "offset translation" with RemoteClusterUtils
> like Ryanne mentioned, so new Streams apps in pre-prod will see consumer
> offsets that make sense for reading from pre-prod topics.
>
> I like both of your ideas around the "user space" solution: subscribing to
> multiple topics, or choosing a topic based on config.  However, in order to
> populate their internal state properly, when the pre-prod apps come up they
> will need to look for repartition and changelog topics with the right
> prefix.  This seems problematic to me since the user doesn't have direct
> control over those topic names, though it did just occur to me now that the
> user *sort of* does.  Since the naming scheme is currently just
> applicationId + "-" + storeName + "-changelog", we could translate the
> consumer group offsets to a consumer group with a new name that has the
> same prefix as the mirrored topics do.  That seems a bit clumsy/lucky to
> me (is the internal topic naming convention really a "public API"?), but I
> think it would work.
>
> I'd be curious to hear if folks think that solution would work and be an
> acceptable pattern, since my original proposal of more user control of
> internal topic naming did seem a bit heavy handed.
>
> Thanks very much for your help!
> Paul
>
> On Mon, Mar 25, 2019 at 3:14 PM Ryanne Dolan <ry...@gmail.com>
> wrote:
>
> > Hey Paul, thanks for the kind words re MM2.
> >
> > I'm not a Streams expert first off, but I think I understand your question:
> > if a Streams app can switch between topics with and without a cluster alias
> > prefix when you migrate between prod and pre-prod, while preserving state.
> > Streams supports regexes and lists of topics as input, so you can use e.g.
> > builder.stream(List.of("topic1", "prod.topic1")), which is a good place to
> > start. In this case, the combined subscription is still a single stream,
> > conceptually, but comprises partitions from both topics, i.e. partitions
> > from topic1 plus partitions from prod.topic1. At a high level, this is no
> > different than adding more partitions to a single topic. I think any
> > intermediate or downstream topics/tables would remain unchanged, since they
> > are still the result of this single stream.
> >
> > The trick is to correctly translate offsets for the input topics when
> > migrating the app between prod and pre-prod, which RemoteClusterUtils can
> > help with. You could do this with external tooling, e.g. a script
> > leveraging RemoteClusterUtils and kafka-streams-application-reset.sh. I
> > haven't tried this with a Streams app myself, but I suspect it would work.
> >
> > Ryanne
> >
> >

Re: MirrorMaker 2.0 and Streams interplay (topic naming control)

Posted by Paul Whalen <pg...@gmail.com>.
John and Ryanne,

Thanks for the responses! I think Ryanne's way of describing the question
is actually a much better summary than my long-winded description: "a
Streams app can switch between topics with and without a cluster alias
prefix when you migrate between prod and pre-prod, while preserving state."

To address a few of John's points...

> But, the prod app will still be running, and its changelog will still be
> mirrored into pre-prod when you start the pre-prod app.
>
The idea is actually to turn off the mirroring from prod to pre-prod during
this period, so the environments can operate completely independently and
their state can comfortably diverge during the testing period.  After the
testing period we'd be happy to throw away everything in pre-prod and start
mirroring again from prod with a blank slate.

> Also, the pre-prod app won't be in the same consumer group as the prod app,
> so it won't know from what offset to start processing input.
>
This is where I'm hoping the magic of MM2 will come in - at the time we
shut off mirroring from prod to pre-prod in order to spin up the pre-prod
environment, we will do an "offset translation" with RemoteClusterUtils
like Ryanne mentioned, so new Streams apps in pre-prod will see consumer
offsets that make sense for reading from pre-prod topics.

I like both of your ideas around the "user space" solution: subscribing to
multiple topics, or choosing a topic based on config.  However, in order to
populate their internal state properly, when the pre-prod apps come up they
will need to look for repartition and changelog topics with the right
prefix.  This seems problematic to me since the user doesn't have direct
control over those topic names, though it did just occur to me now that the
user *sort of* does.  Since the naming scheme is currently just
applicationId + "-" + storeName + "-changelog", we could translate the
consumer group offsets to a consumer group with a new name that has the
same prefix as the mirrored topics do.  That seems a bit clumsy/lucky to
me (is the internal topic naming convention really a "public API"?), but I
think it would work.
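
As a rough illustration of the naming convention described above, the changelog topic name can be derived from the application ID and the store name. This is only a sketch: the class and method names are hypothetical, the source cluster alias is assumed to be "prod", and the convention itself is an internal detail of Streams rather than a documented contract. It shows why an application ID carrying the mirrored prefix would line up with the mirrored changelog topic:

```java
public class InternalTopicNames {
    // Convention as described in this thread:
    // applicationId + "-" + storeName + "-changelog".
    // Assumption: this mirrors Streams' internal naming; it is not a public API.
    static String changelogTopic(String applicationId, String storeName) {
        return applicationId + "-" + storeName + "-changelog";
    }

    // Hypothetical helper: build an application ID that carries the
    // source cluster alias as a prefix, in the form "<alias>.<applicationId>".
    static String mirroredApplicationId(String clusterAlias, String applicationId) {
        return clusterAlias + "." + applicationId;
    }

    public static void main(String[] args) {
        // Changelog topic written by the prod app.
        System.out.println(changelogTopic("App", "statestore"));

        // Changelog topic a pre-prod app would look for if its application ID
        // (and hence its consumer group) were given the "prod." prefix.
        String preProdAppId = mirroredApplicationId("prod", "App");
        System.out.println(changelogTopic(preProdAppId, "statestore"));
    }
}
```

The second name matches what a mirrored copy of the first topic would be called under an alias-prefix scheme, which is the coincidence the paragraph above relies on.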

I'd be curious to hear if folks think that solution would work and be an
acceptable pattern, since my original proposal of more user control of
internal topic naming did seem a bit heavy-handed.

Thanks very much for your help!
Paul

On Mon, Mar 25, 2019 at 3:14 PM Ryanne Dolan <ry...@gmail.com> wrote:

> Hey Paul, thanks for the kind words re MM2.
>
> I'm not a Streams expert first off, but I think I understand your question:
> if a Streams app can switch between topics with and without a cluster alias
> prefix when you migrate between prod and pre-prod, while preserving state.
> Streams supports regexes and lists of topics as input, so you can use e.g.
> builder.stream(List.of("topic1", "prod.topic1")), which is a good place to
> start. In this case, the combined subscription is still a single stream,
> conceptually, but comprises partitions from both topics, i.e. partitions
> from topic1 plus partitions from prod.topic1. At a high level, this is no
> different than adding more partitions to a single topic. I think any
> intermediate or downstream topics/tables would remain unchanged, since they
> are still the result of this single stream.
>
> The trick is to correctly translate offsets for the input topics when
> migrating the app between prod and pre-prod, which RemoteClusterUtils can
> help with. You could do this with external tooling, e.g. a script
> leveraging RemoteClusterUtils and kafka-streams-application-reset.sh. I
> haven't tried this with a Streams app myself, but I suspect it would work.
>
> Ryanne

Re: MirrorMaker 2.0 and Streams interplay (topic naming control)

Posted by Ryanne Dolan <ry...@gmail.com>.
Hey Paul, thanks for the kind words re MM2.

First off, I'm not a Streams expert, but I think I understand your question:
whether a Streams app can switch between topics with and without a cluster
alias prefix when you migrate between prod and pre-prod, while preserving state.
Streams supports regexes and lists of topics as input, so you can use e.g.
builder.stream(List.of("topic1", "prod.topic1")), which is a good place to
start. In this case, the combined subscription is still a single stream,
conceptually, but comprises partitions from both topics, i.e. partitions
from topic1 plus partitions from prod.topic1. At a high level, this is no
different than adding more partitions to a single topic. I think any
intermediate or downstream topics/tables would remain unchanged, since they
are still the result of this single stream.
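
The list-or-regex subscription mentioned above could be prepared along the following lines. This is a sketch using only the JDK: the class and helper names are hypothetical, and "prod" is assumed to be the source cluster alias. The resulting list or pattern would then be handed to builder.stream(...), which accepts either a collection of topic names or a java.util.regex.Pattern.

```java
import java.util.List;
import java.util.regex.Pattern;

public class InputTopics {
    // Both spellings of the topic: the local name and the alias-prefixed name.
    static List<String> bothNames(String clusterAlias, String topic) {
        return List.of(topic, clusterAlias + "." + topic);
    }

    // A pattern matching the topic with or without the "<alias>." prefix.
    // Pattern.quote keeps the dot (and any other metacharacters) literal.
    static Pattern withOptionalAlias(String clusterAlias, String topic) {
        return Pattern.compile(
            "(" + Pattern.quote(clusterAlias + ".") + ")?" + Pattern.quote(topic));
    }

    public static void main(String[] args) {
        System.out.println(bothNames("prod", "topic1"));

        Pattern p = withOptionalAlias("prod", "topic1");
        for (String t : List.of("topic1", "prod.topic1", "prodXtopic1", "dev.topic1")) {
            System.out.println(t + " -> " + p.matcher(t).matches());
        }
    }
}
```

Quoting the alias matters because an unescaped "." in a regex would also match names such as "prodXtopic1".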

The trick is to correctly translate offsets for the input topics when
migrating the app between prod and pre-prod, which RemoteClusterUtils can
help with. You could do this with external tooling, e.g. a script
leveraging RemoteClusterUtils and kafka-streams-application-reset.sh. I
haven't tried this with a Streams app myself, but I suspect it would work.

Ryanne



Re: MirrorMaker 2.0 and Streams interplay (topic naming control)

Posted by John Roesler <jo...@confluent.io>.
Hi Paul,

Thanks for the email. This does seem like a good setup to support.

This might seem a little low-fi, but do you think it would work to handle
this use case entirely in "user space"? I may be missing something because
this is off the cuff... In the code for your Streams app, I'm wondering
if you can prepend your input/output topics with a config-driven string
like:

    builder.stream(config.getEnvPrefix() + "my-input-topic")
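
One way the config-driven prefix might be wired up is sketched below, using only the JDK. The property name "env.prefix" and the class and method names are assumptions made for illustration; they are not part of Streams or MM2. The returned name is what would be passed to builder.stream(...).

```java
import java.util.Properties;

public class EnvTopics {
    // Hypothetical application-level property holding the topic prefix,
    // e.g. "" in prod and "prod." in pre-prod (topics mirrored from prod).
    static final String ENV_PREFIX_KEY = "env.prefix";

    // Resolve the topic name for the current environment.
    // A missing property means "no prefix".
    static String topic(Properties config, String baseName) {
        return config.getProperty(ENV_PREFIX_KEY, "") + baseName;
    }

    public static void main(String[] args) {
        Properties prod = new Properties();

        Properties preProd = new Properties();
        preProd.setProperty(ENV_PREFIX_KEY, "prod.");

        System.out.println(topic(prod, "my-input-topic"));
        System.out.println(topic(preProd, "my-input-topic"));
    }
}
```

This covers only topics the application names explicitly; internal repartition and changelog topics are named by Streams itself.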

Regarding internal topics, I think the issue might be more complicated than
just naming. I'm assuming you wish to load the changelog into the pre-prod
app so that it can just "restore" the prod app's state and continue
processing from there. But, the prod app will still be running, and its
changelog will still be mirrored into pre-prod when you start the pre-prod
app. Then, you'd basically have both prod and pre-prod writing into the
pre-prod changelog at the same time. This seems likely to produce
undesirable behavior. Also, the pre-prod app won't be in the same consumer
group as the prod app, so it won't know from what offset to start
processing input. It will load newer changelog state from prod and then
start processing older events, probably producing different results from
production anyway.

If you can constrain the testing effort to be limited to only the mirrored
"external" topics, I think you'll get more predictable results. But as I
noted, this is off the cuff. Please let me know if I've overlooked something.

Thanks,
-John

