You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kafka.apache.org by Andy Coates <an...@confluent.io> on 2020/04/14 15:09:43 UTC

[DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Hey all,
I would like to start off the discussion for KIP-594:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver

This KIP proposes to expose the names of the topics a topology produces
records during a test run from the TopologyTestDriver class.

Let me know your thoughts!
Andy

Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Posted by "Matthias J. Sax" <mj...@apache.org>.
Thanks Andi!

I buy your argument. For exposing topics names via `TopologyDescription`
I create a JIRA for tracking:
https://issues.apache.org/jira/browse/KAFKA-9913


-Matthias

On 4/20/20 8:07 AM, Andy Coates wrote:
> Hey all,
> 
>> One other point, I’m actually mildly concerned about what you say
> regarding the outputs to the repartition and changelog topics being
> captured and verified later. I would consider the data in these topics to
> be a private interface, and would strongly discourage user code to depend
> on it in any way, including in tests. The reason is that we need to
> maintain the flexibility to change how those topics are used at any time,
> and frequently do to implement features, fix bugs, etc. if we needed a KIP
> ad deprecation period for that data, we would be in for a lot of trouble.
> The only reason I wouldn’t consider actually removing those topics from TTD
> is that they’re exposed in the brokers anyway. But seeing this in the
> motivation gives me a little heartburn.
> 
> I'd disagree strongly here. If Streams was smart enough, (and this is not a
> criticism of Streams!), to be able to migrate from one set of internal
> topics to another after an upgrade or a change in the topology, then I'd be
> more inclined to agree that they are an internal implementation
> detail. However, that is not the case. Changing your topology can change
> the name or contents of these 'internal' topics, and so it's crucial that
> users can test which are written to and what they contain. Otherwise, how
> can users make changes to their topologies or upgrade Streams with any
> confidence the whole thing won't go up in smoke when they release to
> Production?
> 
> For each change we make to KsqlDB we need to ensure we've not
> inadvertently changed the names of internal topics or changed the schema or
> format of the data being produced to them. If we had and didn't catch it,
> then anyone upgrading to this new version of ksqlDB would likely see issues
> with any queries updating materialized views. This may 'just' be data loss,
> due to a topic name change, with data left orphaned in the old internal
> topic, but could equally be data corruption due to a change in the schema
> of the data being persisted.
> 
> *Function naming*
> I honestly don't care what it's called. Don't get me wrong - I know naming
> is important. What I mean is that you know better than I. Let me know the
> name and I'll update.
> 
> However, I did add a justification for the naming last week, which it looks
> like you haven't read.  Namely that the set of names returned is the exact
> same set of names you can use to call `createOutputTopic`. So it seems
> symmetrical to me that the method would be called `getOutputTopicNames` or
> `outputTopicNames`.  But, like I said... whatever you want.
> 
> *TopologyDescription vs TTD*
> 
> Personally, as a user who's working with TTD, I'd prefer this method on the
> TDD. This makes the API of TDD more complete IMHO.
> 
> That said, TDD could internally delegate to TopologyDescription, except it
> doesn't handle the case where there are dynamic bindings, right?  So, it
> seems the more complete solution is also the TDD approach, if I understand
> correctly.
> 
> Personally, I'm happy with purely the first phase mentioned in the KIP,
> i.e. just exposing from TDD, and I'm happy to remove the second.  It feels
> to me like the enhancement to TopologyDescription is complimentary by
> separation to this KIP.
> 
> Thoughts?
> 
> Andy
> 
> 
> On Fri, 17 Apr 2020 at 02:17, John Roesler <jo...@vvcephei.org> wrote:
> 
>> Thanks Matthias,
>>
>> It sounds like your proposal would be instead to add methods to
>> TopologyDescription: getOuputTopicNames(), getInternalTopicNames(), and
>> getInputTopicNames(). This sounds good to me. For one thing, it nicely
>> resolved the ambiguity with the method name.
>>
>> What do you think about this idea, Andy?
>>
>> Thanks,
>> John
>>
>> On Thu, Apr 16, 2020, at 19:22, Matthias J. Sax wrote:
>>> John,
>>>
>>> the "dynamic routing" case is certainly interesting. However, your
>>> argument about "simulating brokers" and verify which topics are created
>>> does not really hold up IMHO, because the names of all internal topics
>>> are contained in the TopologyDescription, too.
>>>
>>> To be fair, the `TopologyDescription` does not make is easy to get all
>>> those topic names, as one would need to iterator over all
>>> sub-topologies, all nodes (filter sources for
>>> output/internal-repartition topics) and all nodes to check if they have
>>> a state store etc.
>>>
>>> However, if this is the main use case, we could simply add methods to
>>> `TopologyDescription` like `inputTopics(), `outputTopics()`, and
>>> `internalTopics()` (or similar). Details could be discussed further.
>>>
>>>
>>> I agree that the motivation of the KIP is unclear: which topics should
>>> actually be returned? The proposed method name is
>>> `getOutputTopicNames()` while the current PR exposes all topics the
>>> application did write into. Adding the input topic names is even more
>>> confusing. Maybe Andy can clarify.
>>>
>>> The method name "allProducedTopics()" might work, but it would (by its
>>> name) exclude input topic (with the exception that one has a loop in the
>>> topology and for "through()" topics). It would also not contain "unused"
>>> topics (eg, a filter() prevents writing into an output topic). Not sure
>>> if there is a use case to get only the topic names of "used topics"?
>>>
>>> Overall, besides the "dynamic routing" case, I still no 100% sure if we
>>> should expose topic names at the TTD level but would rather extend the
>>> TopologyDescription. For the dynamic routing case, we could add a method
>>> to TTD that _only_ exposes those dynamic topic names, like
>>> `dynamicOutputTopic()`?
>>>
>>>
>>> But I guess it's best to wait for Andy to clarify. The discussion is
>>> getting somewhat speculative :)
>>>
>>>
>>>
>>> -Matthias
>>>
>>>
>>>
>>>
>>> On 4/15/20 9:13 PM, John Roesler wrote:
>>>> Hi Andy,
>>>>
>>>> Thanks for the KIP!
>>>>
>>>> To Matthias’s concern, I have to agree that the motivation doesn’t
>> build a particularly strong case for the KIP, and people often look back to
>> these documents to understand why something was done, and done a particular
>> way. So, it would be nice to flesh it out a bit more. I have some thoughts
>> below on how to do this.
>>>>
>>>> Matthias also raised a good point that many applications can get the
>> same information just by looking in the TopologyDescription. At the least,
>> this alternative deserves to be called out in the “Rejected Alternatives”
>> section, with an explanation of why the chosen approach is preferable.
>>>>
>>>> For my part, I can think of a couple of reasons to add the method to
>> TopologyTestDriver instead of TopologyDescription.
>>>>
>>>> First, although it’s rare, some output topics are determined
>> dynamically by the data. In particular, we offer sink nodes with a
>> TopicNameExtractor. Such topics could only be verified by looking at the
>> actual topics used, rather than the ones specified in the topology.
>>>>
>>>> Second is a more philosophical justification, that the
>> TopologyTestDriver performs some functions of KafkaStreams (processing,
>> serving IQ), but also some functions of the brokers (simulating input and
>> output topics). A common integration testing strategy is to run an
>> application and then query the brokers for the list of topics that the app
>> creates/uses, to make sure the app won’t run afoul of ACLs in production,
>> or just to make sure it doesn’t contain a bug that produces to unexpected
>> topics, or any number of other reasons. Your proposal allows the TTD to
>> partly fill this role, which seems natural considering its present
>> broker-esque capabilities.
>>>>
>>>> As for my own feedback, I’m mostly concerned about the method name and
>> contract. The proposed name mentions “output” topics, but I think most
>> people would associate this with sink topics, probably not with repartition
>> topics, and certainly not with changelog topics. I’m sure this also crossed
>> your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe
>> getProducedTopicNames? Or, we could include the input topics as well and go
>> with getTopicNames() as a listing of all the topics that the app “touches”?
>>>>
>>>> One other point, I’m actually mildly concerned about what you say
>> regarding the outputs to the repartition and changelog topics being
>> captured and verified later. I would consider the data in these topics to
>> be a private interface, and would strongly discourage user code to depend
>> on it in any way, including in tests. The reason is that we need to
>> maintain the flexibility to change how those topics are used at any time,
>> and frequently do to implement features, fix bugs, etc. if we needed a KIP
>> ad deprecation period for that data, we would be in for a lot of trouble.
>> The only reason I wouldn’t consider actually removing those topics from TTD
>> is that they’re exposed in the brokers anyway. But seeing this in the
>> motivation gives me a little heartburn.
>>>>
>>>> Finally, a minor point: the Javadoc you proposed contradicts your
>> proposal for Phase 2, in that the doc says it prints all the topics to
>> which records have been output, but Phase 2 says we’ll include all the
>> topics from the description up front, before any outputs happen.
>>>>
>>>> Anyway, thanks again for the KIP. It does seem useful to me, and I
>> hope my feedback helps speed you to a successful vote!
>>>>
>>>> Thanks,
>>>> John
>>>>
>>>> On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote:
>>>>> Andy,
>>>>>
>>>>> thanks for the KIP. The motivation is a little unclear to me:
>>>>>
>>>>>> This information allows all the outputs of a test run to be captured
>> without prior knowledge of the output topics.
>>>>>
>>>>> Given that the TTD users writes the `Topology` themselves, they should
>>>>> always have prior knowledge what output topics they use. So why would
>>>>> this be useful?
>>>>>
>>>>> Also, there is `Topology#describe()` to get all topic names (even the
>>>>> name of internal topics -- to be fair, changelog topic names are not
>>>>> exposed, only store names, but those can we used to infer changelog
>>>>> topic names, too).
>>>>>
>>>>> Can you elaborate about the motivation? So far, it's not convincing
>> to me.
>>>>>
>>>>>
>>>>>
>>>>> -Matthias
>>>>>
>>>>>
>>>>> On 4/14/20 8:09 AM, Andy Coates wrote:
>>>>>> Hey all,
>>>>>> I would like to start off the discussion for KIP-594:
>>>>>>
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
>>>>>>
>>>>>> This KIP proposes to expose the names of the topics a topology
>> produces
>>>>>> records during a test run from the TopologyTestDriver class.
>>>>>>
>>>>>> Let me know your thoughts!
>>>>>> Andy
>>>>>>
>>>>>
>>>>>
>>>>> Attachments:
>>>>> * signature.asc
>>>
>>>
>>> Attachments:
>>> * signature.asc
>>
> 


Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Posted by Andy Coates <an...@confluent.io>.
Hey all,

> One other point, I’m actually mildly concerned about what you say
regarding the outputs to the repartition and changelog topics being
captured and verified later. I would consider the data in these topics to
be a private interface, and would strongly discourage user code to depend
on it in any way, including in tests. The reason is that we need to
maintain the flexibility to change how those topics are used at any time,
and frequently do to implement features, fix bugs, etc. if we needed a KIP
ad deprecation period for that data, we would be in for a lot of trouble.
The only reason I wouldn’t consider actually removing those topics from TTD
is that they’re exposed in the brokers anyway. But seeing this in the
motivation gives me a little heartburn.

I'd disagree strongly here. If Streams was smart enough, (and this is not a
criticism of Streams!), to be able to migrate from one set of internal
topics to another after an upgrade or a change in the topology, then I'd be
more inclined to agree that they are an internal implementation
detail. However, that is not the case. Changing your topology can change
the name or contents of these 'internal' topics, and so it's crucial that
users can test which are written to and what they contain. Otherwise, how
can users make changes to their topologies or upgrade Streams with any
confidence the whole thing won't go up in smoke when they release to
Production?

For each change we make to KsqlDB we need to ensure we've not
inadvertently changed the names of internal topics or changed the schema or
format of the data being produced to them. If we had and didn't catch it,
then anyone upgrading to this new version of ksqlDB would likely see issues
with any queries updating materialized views. This may 'just' be data loss,
due to a topic name change, with data left orphaned in the old internal
topic, but could equally be data corruption due to a change in the schema
of the data being persisted.

*Function naming*
I honestly don't care what it's called. Don't get me wrong - I know naming
is important. What I mean is that you know better than I. Let me know the
name and I'll update.

However, I did add a justification for the naming last week, which it looks
like you haven't read.  Namely that the set of names returned is the exact
same set of names you can use to call `createOutputTopic`. So it seems
symmetrical to me that the method would be called `getOutputTopicNames` or
`outputTopicNames`.  But, like I said... whatever you want.

*TopologyDescription vs TTD*

Personally, as a user who's working with TTD, I'd prefer this method on the
TDD. This makes the API of TDD more complete IMHO.

That said, TDD could internally delegate to TopologyDescription, except it
doesn't handle the case where there are dynamic bindings, right?  So, it
seems the more complete solution is also the TDD approach, if I understand
correctly.

Personally, I'm happy with purely the first phase mentioned in the KIP,
i.e. just exposing from TDD, and I'm happy to remove the second.  It feels
to me like the enhancement to TopologyDescription is complimentary by
separation to this KIP.

Thoughts?

Andy


On Fri, 17 Apr 2020 at 02:17, John Roesler <jo...@vvcephei.org> wrote:

> Thanks Matthias,
>
> It sounds like your proposal would be instead to add methods to
> TopologyDescription: getOuputTopicNames(), getInternalTopicNames(), and
> getInputTopicNames(). This sounds good to me. For one thing, it nicely
> resolved the ambiguity with the method name.
>
> What do you think about this idea, Andy?
>
> Thanks,
> John
>
> On Thu, Apr 16, 2020, at 19:22, Matthias J. Sax wrote:
> > John,
> >
> > the "dynamic routing" case is certainly interesting. However, your
> > argument about "simulating brokers" and verify which topics are created
> > does not really hold up IMHO, because the names of all internal topics
> > are contained in the TopologyDescription, too.
> >
> > To be fair, the `TopologyDescription` does not make is easy to get all
> > those topic names, as one would need to iterator over all
> > sub-topologies, all nodes (filter sources for
> > output/internal-repartition topics) and all nodes to check if they have
> > a state store etc.
> >
> > However, if this is the main use case, we could simply add methods to
> > `TopologyDescription` like `inputTopics(), `outputTopics()`, and
> > `internalTopics()` (or similar). Details could be discussed further.
> >
> >
> > I agree that the motivation of the KIP is unclear: which topics should
> > actually be returned? The proposed method name is
> > `getOutputTopicNames()` while the current PR exposes all topics the
> > application did write into. Adding the input topic names is even more
> > confusing. Maybe Andy can clarify.
> >
> > The method name "allProducedTopics()" might work, but it would (by its
> > name) exclude input topic (with the exception that one has a loop in the
> > topology and for "through()" topics). It would also not contain "unused"
> > topics (eg, a filter() prevents writing into an output topic). Not sure
> > if there is a use case to get only the topic names of "used topics"?
> >
> > Overall, besides the "dynamic routing" case, I still no 100% sure if we
> > should expose topic names at the TTD level but would rather extend the
> > TopologyDescription. For the dynamic routing case, we could add a method
> > to TTD that _only_ exposes those dynamic topic names, like
> > `dynamicOutputTopic()`?
> >
> >
> > But I guess it's best to wait for Andy to clarify. The discussion is
> > getting somewhat speculative :)
> >
> >
> >
> > -Matthias
> >
> >
> >
> >
> > On 4/15/20 9:13 PM, John Roesler wrote:
> > > Hi Andy,
> > >
> > > Thanks for the KIP!
> > >
> > > To Matthias’s concern, I have to agree that the motivation doesn’t
> build a particularly strong case for the KIP, and people often look back to
> these documents to understand why something was done, and done a particular
> way. So, it would be nice to flesh it out a bit more. I have some thoughts
> below on how to do this.
> > >
> > > Matthias also raised a good point that many applications can get the
> same information just by looking in the TopologyDescription. At the least,
> this alternative deserves to be called out in the “Rejected Alternatives”
> section, with an explanation of why the chosen approach is preferable.
> > >
> > > For my part, I can think of a couple of reasons to add the method to
> TopologyTestDriver instead of TopologyDescription.
> > >
> > > First, although it’s rare, some output topics are determined
> dynamically by the data. In particular, we offer sink nodes with a
> TopicNameExtractor. Such topics could only be verified by looking at the
> actual topics used, rather than the ones specified in the topology.
> > >
> > > Second is a more philosophical justification, that the
> TopologyTestDriver performs some functions of KafkaStreams (processing,
> serving IQ), but also some functions of the brokers (simulating input and
> output topics). A common integration testing strategy is to run an
> application and then query the brokers for the list of topics that the app
> creates/uses, to make sure the app won’t run afoul of ACLs in production,
> or just to make sure it doesn’t contain a bug that produces to unexpected
> topics, or any number of other reasons. Your proposal allows the TTD to
> partly fill this role, which seems natural considering its present
> broker-esque capabilities.
> > >
> > > As for my own feedback, I’m mostly concerned about the method name and
> contract. The proposed name mentions “output” topics, but I think most
> people would associate this with sink topics, probably not with repartition
> topics, and certainly not with changelog topics. I’m sure this also crossed
> your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe
> getProducedTopicNames? Or, we could include the input topics as well and go
> with getTopicNames() as a listing of all the topics that the app “touches”?
> > >
> > > One other point, I’m actually mildly concerned about what you say
> regarding the outputs to the repartition and changelog topics being
> captured and verified later. I would consider the data in these topics to
> be a private interface, and would strongly discourage user code to depend
> on it in any way, including in tests. The reason is that we need to
> maintain the flexibility to change how those topics are used at any time,
> and frequently do to implement features, fix bugs, etc. if we needed a KIP
> ad deprecation period for that data, we would be in for a lot of trouble.
> The only reason I wouldn’t consider actually removing those topics from TTD
> is that they’re exposed in the brokers anyway. But seeing this in the
> motivation gives me a little heartburn.
> > >
> > > Finally, a minor point: the Javadoc you proposed contradicts your
> proposal for Phase 2, in that the doc says it prints all the topics to
> which records have been output, but Phase 2 says we’ll include all the
> topics from the description up front, before any outputs happen.
> > >
> > > Anyway, thanks again for the KIP. It does seem useful to me, and I
> hope my feedback helps speed you to a successful vote!
> > >
> > > Thanks,
> > > John
> > >
> > > On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote:
> > >> Andy,
> > >>
> > >> thanks for the KIP. The motivation is a little unclear to me:
> > >>
> > >>> This information allows all the outputs of a test run to be captured
> without prior knowledge of the output topics.
> > >>
> > >> Given that the TTD users writes the `Topology` themselves, they should
> > >> always have prior knowledge what output topics they use. So why would
> > >> this be useful?
> > >>
> > >> Also, there is `Topology#describe()` to get all topic names (even the
> > >> name of internal topics -- to be fair, changelog topic names are not
> > >> exposed, only store names, but those can we used to infer changelog
> > >> topic names, too).
> > >>
> > >> Can you elaborate about the motivation? So far, it's not convincing
> to me.
> > >>
> > >>
> > >>
> > >> -Matthias
> > >>
> > >>
> > >> On 4/14/20 8:09 AM, Andy Coates wrote:
> > >>> Hey all,
> > >>> I would like to start off the discussion for KIP-594:
> > >>>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
> > >>>
> > >>> This KIP proposes to expose the names of the topics a topology
> produces
> > >>> records during a test run from the TopologyTestDriver class.
> > >>>
> > >>> Let me know your thoughts!
> > >>> Andy
> > >>>
> > >>
> > >>
> > >> Attachments:
> > >> * signature.asc
> >
> >
> > Attachments:
> > * signature.asc
>

Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Posted by John Roesler <jo...@vvcephei.org>.
Thanks Matthias,

It sounds like your proposal would be instead to add methods to TopologyDescription: getOuputTopicNames(), getInternalTopicNames(), and getInputTopicNames(). This sounds good to me. For one thing, it nicely resolved the ambiguity with the method name. 

What do you think about this idea, Andy?

Thanks,
John

On Thu, Apr 16, 2020, at 19:22, Matthias J. Sax wrote:
> John,
> 
> the "dynamic routing" case is certainly interesting. However, your
> argument about "simulating brokers" and verify which topics are created
> does not really hold up IMHO, because the names of all internal topics
> are contained in the TopologyDescription, too.
> 
> To be fair, the `TopologyDescription` does not make is easy to get all
> those topic names, as one would need to iterator over all
> sub-topologies, all nodes (filter sources for
> output/internal-repartition topics) and all nodes to check if they have
> a state store etc.
> 
> However, if this is the main use case, we could simply add methods to
> `TopologyDescription` like `inputTopics(), `outputTopics()`, and
> `internalTopics()` (or similar). Details could be discussed further.
> 
> 
> I agree that the motivation of the KIP is unclear: which topics should
> actually be returned? The proposed method name is
> `getOutputTopicNames()` while the current PR exposes all topics the
> application did write into. Adding the input topic names is even more
> confusing. Maybe Andy can clarify.
> 
> The method name "allProducedTopics()" might work, but it would (by its
> name) exclude input topic (with the exception that one has a loop in the
> topology and for "through()" topics). It would also not contain "unused"
> topics (eg, a filter() prevents writing into an output topic). Not sure
> if there is a use case to get only the topic names of "used topics"?
> 
> Overall, besides the "dynamic routing" case, I still no 100% sure if we
> should expose topic names at the TTD level but would rather extend the
> TopologyDescription. For the dynamic routing case, we could add a method
> to TTD that _only_ exposes those dynamic topic names, like
> `dynamicOutputTopic()`?
> 
> 
> But I guess it's best to wait for Andy to clarify. The discussion is
> getting somewhat speculative :)
> 
> 
> 
> -Matthias
> 
> 
> 
> 
> On 4/15/20 9:13 PM, John Roesler wrote:
> > Hi Andy,
> > 
> > Thanks for the KIP!
> > 
> > To Matthias’s concern, I have to agree that the motivation doesn’t build a particularly strong case for the KIP, and people often look back to these documents to understand why something was done, and done a particular way. So, it would be nice to flesh it out a bit more. I have some thoughts below on how to do this. 
> > 
> > Matthias also raised a good point that many applications can get the same information just by looking in the TopologyDescription. At the least, this alternative deserves to be called out in the “Rejected Alternatives” section, with an explanation of why the chosen approach is preferable.
> > 
> > For my part, I can think of a couple of reasons to add the method to TopologyTestDriver instead of TopologyDescription.
> > 
> > First, although it’s rare, some output topics are determined dynamically by the data. In particular, we offer sink nodes with a TopicNameExtractor. Such topics could only be verified by looking at the actual topics used, rather than the ones specified in the topology.
> > 
> > Second is a more philosophical justification, that the TopologyTestDriver performs some functions of KafkaStreams (processing, serving IQ), but also some functions of the brokers (simulating input and output topics). A common integration testing strategy is to run an application and then query the brokers for the list of topics that the app creates/uses, to make sure the app won’t run afoul of ACLs in production, or just to make sure it doesn’t contain a bug that produces to unexpected topics, or any number of other reasons. Your proposal allows the TTD to partly fill this role, which seems natural considering its present broker-esque capabilities.
> > 
> > As for my own feedback, I’m mostly concerned about the method name and contract. The proposed name mentions “output” topics, but I think most people would associate this with sink topics, probably not with repartition topics, and certainly not with changelog topics. I’m sure this also crossed your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe getProducedTopicNames? Or, we could include the input topics as well and go with getTopicNames() as a listing of all the topics that the app “touches”?
> > 
> > One other point, I’m actually mildly concerned about what you say regarding the outputs to the repartition and changelog topics being captured and verified later. I would consider the data in these topics to be a private interface, and would strongly discourage user code to depend on it in any way, including in tests. The reason is that we need to maintain the flexibility to change how those topics are used at any time, and frequently do to implement features, fix bugs, etc. if we needed a KIP ad deprecation period for that data, we would be in for a lot of trouble. The only reason I wouldn’t consider actually removing those topics from TTD is that they’re exposed in the brokers anyway. But seeing this in the motivation gives me a little heartburn. 
> > 
> > Finally, a minor point: the Javadoc you proposed contradicts your proposal for Phase 2, in that the doc says it prints all the topics to which records have been output, but Phase 2 says we’ll include all the topics from the description up front, before any outputs happen. 
> > 
> > Anyway, thanks again for the KIP. It does seem useful to me, and I hope my feedback helps speed you to a successful vote!
> > 
> > Thanks,
> > John
> > 
> > On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote:
> >> Andy,
> >>
> >> thanks for the KIP. The motivation is a little unclear to me:
> >>
> >>> This information allows all the outputs of a test run to be captured without prior knowledge of the output topics.
> >>
> >> Given that the TTD users writes the `Topology` themselves, they should
> >> always have prior knowledge what output topics they use. So why would
> >> this be useful?
> >>
> >> Also, there is `Topology#describe()` to get all topic names (even the
> >> name of internal topics -- to be fair, changelog topic names are not
> >> exposed, only store names, but those can we used to infer changelog
> >> topic names, too).
> >>
> >> Can you elaborate about the motivation? So far, it's not convincing to me.
> >>
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 4/14/20 8:09 AM, Andy Coates wrote:
> >>> Hey all,
> >>> I would like to start off the discussion for KIP-594:
> >>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
> >>>
> >>> This KIP proposes to expose the names of the topics a topology produces
> >>> records during a test run from the TopologyTestDriver class.
> >>>
> >>> Let me know your thoughts!
> >>> Andy
> >>>
> >>
> >>
> >> Attachments:
> >> * signature.asc
> 
> 
> Attachments:
> * signature.asc

Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Posted by "Matthias J. Sax" <mj...@apache.org>.
John,

the "dynamic routing" case is certainly interesting. However, your
argument about "simulating brokers" and verify which topics are created
does not really hold up IMHO, because the names of all internal topics
are contained in the TopologyDescription, too.

To be fair, the `TopologyDescription` does not make is easy to get all
those topic names, as one would need to iterator over all
sub-topologies, all nodes (filter sources for
output/internal-repartition topics) and all nodes to check if they have
a state store etc.

However, if this is the main use case, we could simply add methods to
`TopologyDescription` like `inputTopics(), `outputTopics()`, and
`internalTopics()` (or similar). Details could be discussed further.


I agree that the motivation of the KIP is unclear: which topics should
actually be returned? The proposed method name is
`getOutputTopicNames()` while the current PR exposes all topics the
application did write into. Adding the input topic names is even more
confusing. Maybe Andy can clarify.

The method name "allProducedTopics()" might work, but it would (by its
name) exclude input topic (with the exception that one has a loop in the
topology and for "through()" topics). It would also not contain "unused"
topics (eg, a filter() prevents writing into an output topic). Not sure
if there is a use case to get only the topic names of "used topics"?

Overall, besides the "dynamic routing" case, I still no 100% sure if we
should expose topic names at the TTD level but would rather extend the
TopologyDescription. For the dynamic routing case, we could add a method
to TTD that _only_ exposes those dynamic topic names, like
`dynamicOutputTopic()`?


But I guess it's best to wait for Andy to clarify. The discussion is
getting somewhat speculative :)



-Matthias




On 4/15/20 9:13 PM, John Roesler wrote:
> Hi Andy,
> 
> Thanks for the KIP!
> 
> To Matthias’s concern, I have to agree that the motivation doesn’t build a particularly strong case for the KIP, and people often look back to these documents to understand why something was done, and done a particular way. So, it would be nice to flesh it out a bit more. I have some thoughts below on how to do this. 
> 
> Matthias also raised a good point that many applications can get the same information just by looking in the TopologyDescription. At the least, this alternative deserves to be called out in the “Rejected Alternatives” section, with an explanation of why the chosen approach is preferable.
> 
> For my part, I can think of a couple of reasons to add the method to TopologyTestDriver instead of TopologyDescription.
> 
> First, although it’s rare, some output topics are determined dynamically by the data. In particular, we offer sink nodes with a TopicNameExtractor. Such topics could only be verified by looking at the actual topics used, rather than the ones specified in the topology.
> 
> Second is a more philosophical justification, that the TopologyTestDriver performs some functions of KafkaStreams (processing, serving IQ), but also some functions of the brokers (simulating input and output topics). A common integration testing strategy is to run an application and then query the brokers for the list of topics that the app creates/uses, to make sure the app won’t run afoul of ACLs in production, or just to make sure it doesn’t contain a bug that produces to unexpected topics, or any number of other reasons. Your proposal allows the TTD to partly fill this role, which seems natural considering its present broker-esque capabilities.
> 
> As for my own feedback, I’m mostly concerned about the method name and contract. The proposed name mentions “output” topics, but I think most people would associate this with sink topics, probably not with repartition topics, and certainly not with changelog topics. I’m sure this also crossed your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe getProducedTopicNames? Or, we could include the input topics as well and go with getTopicNames() as a listing of all the topics that the app “touches”?
> 
> One other point, I’m actually mildly concerned about what you say regarding the outputs to the repartition and changelog topics being captured and verified later. I would consider the data in these topics to be a private interface, and would strongly discourage user code to depend on it in any way, including in tests. The reason is that we need to maintain the flexibility to change how those topics are used at any time, and frequently do to implement features, fix bugs, etc. if we needed a KIP ad deprecation period for that data, we would be in for a lot of trouble. The only reason I wouldn’t consider actually removing those topics from TTD is that they’re exposed in the brokers anyway. But seeing this in the motivation gives me a little heartburn. 
> 
> Finally, a minor point: the Javadoc you proposed contradicts your proposal for Phase 2, in that the doc says it prints all the topics to which records have been output, but Phase 2 says we’ll include all the topics from the description up front, before any outputs happen. 
> 
> Anyway, thanks again for the KIP. It does seem useful to me, and I hope my feedback helps speed you to a successful vote!
> 
> Thanks,
> John
> 
> On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote:
>> Andy,
>>
>> thanks for the KIP. The motivation is a little unclear to me:
>>
>>> This information allows all the outputs of a test run to be captured without prior knowledge of the output topics.
>>
>> Given that the TTD users writes the `Topology` themselves, they should
>> always have prior knowledge what output topics they use. So why would
>> this be useful?
>>
>> Also, there is `Topology#describe()` to get all topic names (even the
>> name of internal topics -- to be fair, changelog topic names are not
>> exposed, only store names, but those can we used to infer changelog
>> topic names, too).
>>
>> Can you elaborate about the motivation? So far, it's not convincing to me.
>>
>>
>>
>> -Matthias
>>
>>
>> On 4/14/20 8:09 AM, Andy Coates wrote:
>>> Hey all,
>>> I would like to start off the discussion for KIP-594:
>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
>>>
>>> This KIP proposes to expose the names of the topics a topology produces
>>> records during a test run from the TopologyTestDriver class.
>>>
>>> Let me know your thoughts!
>>> Andy
>>>
>>
>>
>> Attachments:
>> * signature.asc


Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Posted by John Roesler <vv...@apache.org>.
Hi Andy,

Thanks for the KIP!

To Matthias’s concern, I have to agree that the motivation doesn’t build a particularly strong case for the KIP, and people often look back to these documents to understand why something was done, and done a particular way. So, it would be nice to flesh it out a bit more. I have some thoughts below on how to do this. 

Matthias also raised a good point that many applications can get the same information just by looking in the TopologyDescription. At the least, this alternative deserves to be called out in the “Rejected Alternatives” section, with an explanation of why the chosen approach is preferable.

For my part, I can think of a couple of reasons to add the method to TopologyTestDriver instead of TopologyDescription.

First, although it’s rare, some output topics are determined dynamically by the data. In particular, we offer sink nodes with a TopicNameExtractor. Such topics could only be verified by looking at the actual topics used, rather than the ones specified in the topology.

Second is a more philosophical justification, that the TopologyTestDriver performs some functions of KafkaStreams (processing, serving IQ), but also some functions of the brokers (simulating input and output topics). A common integration testing strategy is to run an application and then query the brokers for the list of topics that the app creates/uses, to make sure the app won’t run afoul of ACLs in production, or just to make sure it doesn’t contain a bug that produces to unexpected topics, or any number of other reasons. Your proposal allows the TTD to partly fill this role, which seems natural considering its present broker-esque capabilities.

As for my own feedback, I’m mostly concerned about the method name and contract. The proposed name mentions “output” topics, but I think most people would associate this with sink topics, probably not with repartition topics, and certainly not with changelog topics. I’m sure this also crossed your mind, and I’m afraid I don’t have a brilliant suggestion. Maybe getProducedTopicNames? Or, we could include the input topics as well and go with getTopicNames() as a listing of all the topics that the app “touches”?

One other point, I’m actually mildly concerned about what you say regarding the outputs to the repartition and changelog topics being captured and verified later. I would consider the data in these topics to be a private interface, and would strongly discourage user code to depend on it in any way, including in tests. The reason is that we need to maintain the flexibility to change how those topics are used at any time, and frequently do to implement features, fix bugs, etc. if we needed a KIP ad deprecation period for that data, we would be in for a lot of trouble. The only reason I wouldn’t consider actually removing those topics from TTD is that they’re exposed in the brokers anyway. But seeing this in the motivation gives me a little heartburn. 

Finally, a minor point: the Javadoc you proposed contradicts your proposal for Phase 2, in that the doc says it prints all the topics to which records have been output, but Phase 2 says we’ll include all the topics from the description up front, before any outputs happen. 

Anyway, thanks again for the KIP. It does seem useful to me, and I hope my feedback helps speed you to a successful vote!

Thanks,
John

On Tue, Apr 14, 2020, at 19:49, Matthias J. Sax wrote:
> Andy,
> 
> thanks for the KIP. The motivation is a little unclear to me:
> 
> > This information allows all the outputs of a test run to be captured without prior knowledge of the output topics.
> 
> Given that the TTD users writes the `Topology` themselves, they should
> always have prior knowledge what output topics they use. So why would
> this be useful?
> 
> Also, there is `Topology#describe()` to get all topic names (even the
> name of internal topics -- to be fair, changelog topic names are not
> exposed, only store names, but those can we used to infer changelog
> topic names, too).
> 
> Can you elaborate about the motivation? So far, it's not convincing to me.
> 
> 
> 
> -Matthias
> 
> 
> On 4/14/20 8:09 AM, Andy Coates wrote:
> > Hey all,
> > I would like to start off the discussion for KIP-594:
> > https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
> > 
> > This KIP proposes to expose the names of the topics a topology produces
> > records during a test run from the TopologyTestDriver class.
> > 
> > Let me know your thoughts!
> > Andy
> > 
> 
> 
> Attachments:
> * signature.asc

Re: [DISCUSS] KIP-594: Expose output topic names from TopologyTestDriver

Posted by "Matthias J. Sax" <mj...@apache.org>.
Andy,

thanks for the KIP. The motivation is a little unclear to me:

> This information allows all the outputs of a test run to be captured without prior knowledge of the output topics.

Given that the TTD users writes the `Topology` themselves, they should
always have prior knowledge what output topics they use. So why would
this be useful?

Also, there is `Topology#describe()` to get all topic names (even the
name of internal topics -- to be fair, changelog topic names are not
exposed, only store names, but those can we used to infer changelog
topic names, too).

Can you elaborate about the motivation? So far, it's not convincing to me.



-Matthias


On 4/14/20 8:09 AM, Andy Coates wrote:
> Hey all,
> I would like to start off the discussion for KIP-594:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-594%3A+Expose+output+topic+names+from+TopologyTestDriver
> 
> This KIP proposes to expose the names of the topics a topology produces
> records during a test run from the TopologyTestDriver class.
> 
> Let me know your thoughts!
> Andy
>