You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Bill Bejeck <bb...@gmail.com> on 2018/09/12 23:52:41 UTC

[DISCUSS] KIP-372: Naming Joins and Grouping

All I'd like to start a discussion on KIP-372 for the naming of joins and
grouping operations in Kafka Streams.

The KIP page can be found here:
https://cwiki.apache.org/confluence/display/KAFKA/KIP-372%3A+Naming+Joins+and+Grouping

I look forward to feedback and comments.

Thanks,
Bill

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by "Matthias J. Sax" <ma...@confluent.io>.

Thant for the update Bill. LGTM.


On 9/14/18 5:33 PM, Guozhang Wang wrote:
> Hello Bill,
> 
> I've made another pass over the wiki page and it lgtm now. Thanks!
> 
> 
> Guozhang
> 
> 
> On Fri, Sep 14, 2018 at 3:05 PM, John Roesler <jo...@confluent.io> wrote:
> 
>> Hey Bill (and Guozhang),
>>
>> Apologies. I misunderstood what Guozhang was getting at in his earlier
>> remark.
>>
>> I'm a +1 on the current proposal.
>>
>> Thanks,
>> -John
>>
>> On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck <bb...@gmail.com> wrote:
>>
>>> All,
>>>
>>> Thanks for the comments, I'll respond to all of your comments in the
>>> recieved order.
>>>
>>>
>>> Guozhang,
>>>
>>>>  And if [join-name] not specified, stay the same, which is:
>>>> * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>> join
>>> and S-T join
>>>
>>>   I believe the current approach is
>>> [appId]-[previous-processor-name]-repartition, but the main point is if
>> no
>>> name is provided, the current naming scheme will remain in place
>>>
>>>>  I think it is more natural to rename it to
>>>>  * [app-id]-[join-name]-left-repartition
>>>>  * [app-id]-[join-name]-right-repartition
>>>
>>> I agree and I'll update the KIP accordingly
>>>
>>>> 2 I'd suggest to use the name to also define the corresponding
>> processor
>>>
>>> I'll address this in conjunction with comments from Matthias and your
>>> second round of comments.
>>>
>>>
>>>> 3. Could you also specify how this would affect the optimization for
>>>
>>>                  merging multiple repartition topics?
>>>
>>>           With an optimization where we reduce the number of repartition
>>> topics, if the user has not named the repartition topics, we'll generate
>> a
>>> name for
>>>           the optimized repartition topic. In the case where users have
>>> provided names for the repartition topics,
>>>           we'll reuse the name from the first of the named repartition
>>> topics.
>>>           However, optimizing a topology by reducing the number of
>>> repartition topics may require an application reset, but that is beyond
>> the
>>> scope of this KIP.
>>>
>>>           I'll update the KIP with this case
>>>
>>>           I've also addressed 4.a and 4.b in the "Compatibility,
>>> Deprecation and Migration plan" section
>>>
>>>
>>> Matthias
>>>
>>>> 1) We should either use `[app-id]-this|other-[join-name]-repartition`
>> or
>>>> `app-id]-[join-name]-left|right-repartition` but we should not change
>>>> the pattern depending if the user specifies a name of not.
>>>
>>> I've updated the KIP to say in the case of Joins and a name is provided
>>> we'll go with `app-id-name-left|right-repartition` in all cases if the
>> name
>>> is not provided
>>> we will continue to use the current naming process.
>>>
>>>> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
>>>> to be orthogonal, and thus KIP-372 should not change any processor
>>>> names, but KIP-307 should define a holistic strategy for all processor.
>>>> Otherwise, we might up with different strategies or revert what we
>>>> decide in this KIP if it's not compatible with KIP-307.
>>>
>>> I agree and have updated the name of this KIP to be more precise.
>>>
>>> Eno
>>>
>>>> I know we don't normally have a "Related work" section in KIPs, but
>>>> sometimes I find it useful to see what others have done in similar
>> cases.
>>>> Since this will be important for rolling re-deployments, I wonder what
>>>> other frameworks like Flink (or Samza) have done in these cases.
>> Perhaps
>>>> they have done nothing, in which case it's fine to do this from first
>>>> principles, but IMO it would be good to know just to make sure we're
>>>> heading in the right direction.
>>>
>>>> Also I don't get a good feel for how much work this will be for an end
>>> user
>>>> who is doing the rolling deployment, perhaps an end-to-end example
>> would
>>>> help.
>>>
>>> I can't speak for the other projects, so I'll have to take a look and
>> see.
>>>
>>> As for getting a feel for how much work this will be for an end user,
>> could
>>> you look at the
>>> updated KIP and see if it clears things up?
>>>
>>>
>>> Matthias
>>>
>>>> (1) For `Grouped` should we add `with(String name, Serde<K> key,
>>>> Serde<V> value)` to allow specifying all parameters at once?
>>>
>>>> Produced/Consumed/Serialized etc follow a similar pattern. There is one
>>>> static method for each config parameter, plus one static method that
>>>> accepts all parameters. Might be more consistent if we follow this
>>> pattern.
>>>
>>>> (2) It seems that `Serialized` is only used in `groupBy` and
>>>> `groupByKey` -- because both methods accepting `Serialized` parameter
>>>> are deprecated and replaced with methods accepting `Grouped`, it seems
>>>> that we also want to deprecate `Serialized`?
>>>
>>>> (3) About naming repartition topics: thinking about this once more, I
>>>> actually prefer to use `left|right` instead of `this|other` :)
>>>
>>> For 1, I agree and I'll update the KIP.
>>>
>>> For 2, Yes that seems to make sense and I'll update the KIP
>>>
>>> Part 3 addressed in earlier comments and the KIP is clarified.
>>>
>>> Guozhang
>>>
>>>> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
>>>> `groupBy/groupByKey` naming schemes, and for joins its current proposal
>>> is
>>>> to extend ValueJoiner with Named and hence this part is what I meant to
>>>> have "overlaps".
>>>
>>>> Thinking about it a bit more, since Joined is only used for S-S and S-T
>>>> joins but not T-T joins, having the naming schemes on Joined would not
>> be
>>>> sufficient, and extending ValueJoiner would indeed be a good choice.
>>>
>>>> As for groupBy, since it is using KeyValueMapper which is supposed to
>> be
>>>> extended with Named in KIP-307, it does not require extending to
>>> processor
>>>> nodes as well.
>>>
>>>> Given this, I'm fine with limiting the scope to only repartition
>> topics.
>>>
>>> I agree with limiting the scope of this KIP to only repartition topics
>> and
>>> have updated the KIP
>>>
>>> John
>>>
>>>> I think it's slightly out of scope for this KIP, but I'm not sure it's
>>>> right to add a name to ValueJoiner or KeyValueMapper.
>>>> Both of those are "functional interfaces", that is, they are basically
>>>> named functions.
>>>>
>>>> It seems like we should preserve this property both to provide a clean
>>>> separation between the operation *logic* and the operator *config*.
>>>>
>>>> It's a good point that `Joined` doesn't appear in the KTable join.
>>> However,
>>>> KTable join does have the variant that accepts "Materialized".
>>>> This makes it similar to the grouping operators that currently accept
>>>> "Serialized", namely the operator is already configurable, but only in
>>>> a narrow way (we can only configure the materialization of the table or
>>> the
>>>> serialization of the grouping).
>>>>
>>>> Consider as an alternative the KStream.join operation that takes a
>> config
>>>> object called "Joined". This implies no more than that we are
>>>> configuring the join operation. If we are deciding to allow adding a
>> name
>>>> to the operation, it's natural to add it to the operation's config.
>>>>
>>>> Conversely, we also want to name the KTable.join operation, not the
>>>> materialization itself (which already has a name with separate
>>> semantics).
>>>>
>>>> To me, this suggests that, just like we propose to subsume the
>> Serialized
>>>> config for grouping into a more general config called Grouped, we
>> should
>>>> also want to do something similar and replace `KTable#join(KTable<>,
>>>> ValueJoiner<>, Materialized<>)` with one taking a more general
>>>> configuration, maybe:
>>>> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>>>
>>> I'm not sure I completely follow what you are suggesting.  But it seems
>>> like you are in favor of combining the naming of join
>>>
>>> and grouping operation and processors with this KIP.
>>>
>>> I'll say that I don't agree we should do the renaming in this KIP and
>>> restrict it to naming repartition topics
>>>
>>>
>>>    1. It gives us a good head start for users to be able to do rolling
>>>    updates once repartition topics have been named
>>>    2. Undertaking any KIP changes almost always introduces some
>> uncertainty
>>>    with respect to what changes will introduce, so a smaller scope helps
>>>    ensure the success of the KIP
>>>    3. We have an existing proposal KIP-307 for processor naming and I
>> think
>>>    it's best to do it all in one singular KIP so we don't have to undo
>> any
>>>    inconsistent previous work.
>>>
>>> So I'll have to agree with Matthias and Guozhang in limiting the scope of
>>> this KIP to naming repartition topics.
>>>
>>> To that end, I'll also say we defer adding another general configuration
>>> object `TableJoined` to the work of naming operations and processors in
>>> KIP-307
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Sep 13, 2018 at 6:49 PM John Roesler <jo...@confluent.io> wrote:
>>>
>>>> Hey all,
>>>>
>>>> I think it's slightly out of scope for this KIP, but I'm not sure it's
>>>> right to add a name to ValueJoiner or KeyValueMapper.
>>>> Both of those are "functional interfaces", that is, they are basically
>>>> named functions.
>>>>
>>>> It seems like we should preserve this property both to provide a clean
>>>> separation between the operation *logic* and the operator *config*.
>>>>
>>>> It's a good point that `Joined` doesn't appear in the KTable join.
>>> However,
>>>> KTable join does have the variant that accepts "Materialized".
>>>> This makes it similar to the grouping operators that currently accept
>>>> "Serialized", namely the operator is already configurable, but only in
>>>> a narrow way (we can only configure the materialization of the table or
>>> the
>>>> serialization of the grouping).
>>>>
>>>> Consider as an alternative the KStream.join operation that takes a
>> config
>>>> object called "Joined". This implies no more than that we are
>>>> configuring the join operation. If we are deciding to allow adding a
>> name
>>>> to the operation, it's natural to add it to the operation's config.
>>>>
>>>> Conversely, we also want to name the KTable.join operation, not the
>>>> materialization itself (which already has a name with separate
>>> semantics).
>>>>
>>>> To me, this suggests that, just like we propose to subsume the
>> Serialized
>>>> config for grouping into a more general config called Grouped, we
>> should
>>>> also want to do something similar and replace `KTable#join(KTable<>,
>>>> ValueJoiner<>, Materialized<>)` with one taking a more general
>>>> configuration, maybe:
>>>> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>>>>
>>>> Does this make sense?
>>>>
>>>> Thanks,
>>>> -John
>>>>
>>>> On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wa...@gmail.com>
>>> wrote:
>>>>
>>>>> Just to clarify on 2): currently KIP-307 do not have proposed APIs
>> for
>>>>> `groupBy/groupByKey` naming schemes, and for joins its current
>> proposal
>>>> is
>>>>> to extend ValueJoiner with Named and hence this part is what I meant
>> to
>>>>> have "overlaps".
>>>>>
>>>>> Thinking about it a bit more, since Joined is only used for S-S and
>> S-T
>>>>> joins but not T-T joins, having the naming schemes on Joined would
>> not
>>> be
>>>>> sufficient, and extending ValueJoiner would indeed be a good choice.
>>>>>
>>>>> As for groupBy, since it is using KeyValueMapper which is supposed to
>>> be
>>>>> extended with Named in KIP-307, it does not require extending to
>>>> processor
>>>>> nodes as well.
>>>>>
>>>>>
>>>>> Given this, I'm fine with limiting the scope to only repartition
>>> topics.
>>>>>
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <
>>> matthias@confluent.io
>>>>>
>>>>> wrote:
>>>>>
>>>>>> Follow up comments:
>>>>>>
>>>>>> 1) We should either use `[app-id]-this|other-[join-
>> name]-repartition`
>>>> or
>>>>>> `app-id]-[join-name]-left|right-repartition` but we should not
>> change
>>>>>> the pattern depending if the user specifies a name of not. I am
>> fine
>>>>>> with both patterns---just want to make sure with stick with one.
>>>>>>
>>>>>> 2) I didn't see why we would need to do this in this KIP. KIP-307
>>> seems
>>>>>> to be orthogonal, and thus KIP-372 should not change any processor
>>>>>> names, but KIP-307 should define a holistic strategy for all
>>> processor.
>>>>>> Otherwise, we might up with different strategies or revert what we
>>>>>> decide in this KIP if it's not compatible with KIP-307.
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 9/12/18 6:28 PM, Guozhang Wang wrote:
>>>>>>> Hello Bill,
>>>>>>>
>>>>>>> I made a pass over your proposal and here are some questions:
>>>>>>>
>>>>>>> 1. For Joined names, the current proposal is to define the
>>>> repartition
>>>>>>> topic names as
>>>>>>>
>>>>>>> * [app-id]-this-[join-name]-repartition
>>>>>>>
>>>>>>> * [app-id]-other-[join-name]-repartition
>>>>>>>
>>>>>>>
>>>>>>> And if [join-name] not specified, stay the same, which is:
>>>>>>>
>>>>>>> * [previous-processor-name]-repartition for both Stream-Stream
>>> (S-S)
>>>>>> join
>>>>>>> and S-T join
>>>>>>>
>>>>>>> I think it is more natural to rename it to
>>>>>>>
>>>>>>> * [app-id]-[join-name]-left-repartition
>>>>>>>
>>>>>>> * [app-id]-[join-name]-right-repartition
>>>>>>>
>>>>>>>
>>>>>>> 2. I'd suggest to use the name to also define the corresponding
>>>>> processor
>>>>>>> names accordingly, in addition to the repartition topic names.
>> Note
>>>>> that
>>>>>>> for joins, this may be overlapping with KIP-307
>>>>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
>>>>>>> as
>>>>>>> it also have proposals for defining processor names for join
>>>> operators
>>>>> as
>>>>>>> well.
>>>>>>>
>>>>>>> 3. Could you also specify how this would affect the optimization
>>> for
>>>>>>> merging multiple repartition topics?
>>>>>>>
>>>>>>> 4. In the "Compatibility, Deprecation, and Migration Plan"
>> section,
>>>>> could
>>>>>>> you also mention the following scenarios, if any of the upgrade
>>> path
>>>>>> would
>>>>>>> be changed:
>>>>>>>
>>>>>>>  a) changing user DSL code: under which scenarios users can now
>> do
>>> a
>>>>>>> rolling bounce instead of resetting applications.
>>>>>>>
>>>>>>>  b) upgrading from older version to new version, with all the
>> names
>>>>>>> specified, and with optimization turned on. E.g. say we have the
>>> code
>>>>>>> written in 2.1 with all names specified, and now upgrading to 2.2
>>>> with
>>>>>> new
>>>>>>> optimizations that may potentially change the repartition topics.
>>> Is
>>>>> that
>>>>>>> always safe to do?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com>
>>>>> wrote:
>>>>>>>
>>>>>>>> All I'd like to start a discussion on KIP-372 for the naming of
>>>> joins
>>>>>> and
>>>>>>>> grouping operations in Kafka Streams.
>>>>>>>>
>>>>>>>> The KIP page can be found here:
>>>>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>>>>> 372%3A+Naming+Joins+and+Grouping
>>>>>>>>
>>>>>>>> I look forward to feedback and comments.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Bill
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> -- Guozhang
>>>>>
>>>>
>>>
>>
> 
> 
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by Guozhang Wang <wa...@gmail.com>.

Hello Bill,

I've made another pass over the wiki page and it lgtm now. Thanks!


Guozhang


On Fri, Sep 14, 2018 at 3:05 PM, John Roesler <jo...@confluent.io> wrote:

> Hey Bill (and Guozhang),
>
> Apologies. I misunderstood what Guozhang was getting at in his earlier
> remark.
>
> I'm a +1 on the current proposal.
>
> Thanks,
> -John
>
> On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck <bb...@gmail.com> wrote:
>
> > All,
> >
> > Thanks for the comments, I'll respond to all of your comments in the
> > recieved order.
> >
> >
> > Guozhang,
> >
> > >  And if [join-name] not specified, stay the same, which is:
> > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> join
> > and S-T join
> >
> >   I believe the current approach is
> > [appId]-[previous-processor-name]-repartition, but the main point is if
> no
> > name is provided, the current naming scheme will remain in place
> >
> > >  I think it is more natural to rename it to
> > >  * [app-id]-[join-name]-left-repartition
> > >  * [app-id]-[join-name]-right-repartition
> >
> > I agree and I'll update the KIP accordingly
> >
> > > 2 I'd suggest to use the name to also define the corresponding
> processor
> >
> > I'll address this in conjunction with comments from Matthias and your
> > second round of comments.
> >
> >
> > >3. Could you also specify how this would affect the optimization for
> >
> >                  merging multiple repartition topics?
> >
> >           With an optimization where we reduce the number of repartition
> > topics, if the user has not named the repartition topics, we'll generate
> a
> > name for
> >           the optimized repartition topic. In the case where users have
> > provided names for the repartition topics,
> >           we'll reuse the name from the first of the named repartition
> > topics.
> >           However, optimizing a topology by reducing the number of
> > repartition topics may require an application reset, but that is beyond
> the
> > scope of this KIP.
> >
> >           I'll update the KIP with this case
> >
> >           I've also addressed 4.a and 4.b in the "Compatibility,
> > Deprecation and Migration plan" section
> >
> >
> > Matthias
> >
> > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> or
> > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > the pattern depending if the user specifies a name of not.
> >
> > I've updated the KIP to say in the case of Joins and a name is provided
> > we'll go with `app-id-name-left|right-repartition` in all cases if the
> name
> > is not provided
> > we will continue to use the current naming process.
> >
> > > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > > to be orthogonal, and thus KIP-372 should not change any processor
> > > names, but KIP-307 should define a holistic strategy for all processor.
> > > Otherwise, we might up with different strategies or revert what we
> > > decide in this KIP if it's not compatible with KIP-307.
> >
> > I agree and have updated the name of this KIP to be more precise.
> >
> > Eno
> >
> > > I know we don't normally have a "Related work" section in KIPs, but
> > > sometimes I find it useful to see what others have done in similar
> cases.
> > > Since this will be important for rolling re-deployments, I wonder what
> > > other frameworks like Flink (or Samza) have done in these cases.
> Perhaps
> > > they have done nothing, in which case it's fine to do this from first
> > > principles, but IMO it would be good to know just to make sure we're
> > > heading in the right direction.
> >
> > > Also I don't get a good feel for how much work this will be for an end
> > user
> > > who is doing the rolling deployment, perhaps an end-to-end example
> would
> > > help.
> >
> > I can't speak for the other projects, so I'll have to take a look and
> see.
> >
> > As for getting a feel for how much work this will be for an end user,
> could
> > you look at the
> > updated KIP and see if it clears things up?
> >
> >
> > Matthias
> >
> > > (1) For `Grouped` should we add `with(String name, Serde<K> key,
> > > Serde<V> value)` to allow specifying all parameters at once?
> >
> > > Produced/Consumed/Serialized etc follow a similar pattern. There is one
> > > static method for each config parameter, plus one static method that
> > > accepts all parameters. Might be more consistent if we follow this
> > pattern.
> >
> > > (2) It seems that `Serialized` is only used in `groupBy` and
> > > `groupByKey` -- because both methods accepting `Serialized` parameter
> > > are deprecated and replaced with methods accepting `Grouped`, it seems
> > > that we also want to deprecate `Serialized`?
> >
> > > (3) About naming repartition topics: thinking about this once more, I
> > > actually prefer to use `left|right` instead of `this|other` :)
> >
> > For 1, I agree and I'll update the KIP.
> >
> > For 2, Yes that seems to make sense and I'll update the KIP
> >
> > Part 3 addressed in earlier comments and the KIP is clarified.
> >
> > Guozhang
> >
> > > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> > is
> > > to extend ValueJoiner with Named and hence this part is what I meant to
> > > have "overlaps".
> >
> > > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > > joins but not T-T joins, having the naming schemes on Joined would not
> be
> > > sufficient, and extending ValueJoiner would indeed be a good choice.
> >
> > > As for groupBy, since it is using KeyValueMapper which is supposed to
> be
> > > extended with Named in KIP-307, it does not require extending to
> > processor
> > > nodes as well.
> >
> > > Given this, I'm fine with limiting the scope to only repartition
> topics.
> >
> > I agree with limiting the scope of this KIP to only repartition topics
> and
> > have updated the KIP
> >
> > John
> >
> > > I think it's slightly out of scope for this KIP, but I'm not sure it's
> > > right to add a name to ValueJoiner or KeyValueMapper.
> > > Both of those are "functional interfaces", that is, they are basically
> > > named functions.
> > >
> > > It seems like we should preserve this property both to provide a clean
> > > separation between the operation *logic* and the operator *config*.
> > >
> > > It's a good point that `Joined` doesn't appear in the KTable join.
> > However,
> > > KTable join does have the variant that accepts "Materialized".
> > > This makes it similar to the grouping operators that currently accept
> > > "Serialized", namely the operator is already configurable, but only in
> > > a narrow way (we can only configure the materialization of the table or
> > the
> > > serialization of the grouping).
> > >
> > > Consider as an alternative the KStream.join operation that takes a
> config
> > > object called "Joined". This implies no more than that we are
> > > configuring the join operation. If we are deciding to allow adding a
> name
> > > to the operation, it's natural to add it to the operation's config.
> > >
> > > Conversely, we also want to name the KTable.join operation, not the
> > > materialization itself (which already has a name with separate
> > semantics).
> > >
> > > To me, this suggests that, just like we propose to subsume the
> Serialized
> > > config for grouping into a more general config called Grouped, we
> should
> > > also want to do something similar and replace `KTable#join(KTable<>,
> > > ValueJoiner<>, Materialized<>)` with one taking a more general
> > > configuration, maybe:
> > > `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
> >
> > I'm not sure I completely follow what you are suggesting.  But it seems
> > like you are in favor of combining the naming of join
> >
> > and grouping operation and processors with this KIP.
> >
> > I'll say that I don't agree we should do the renaming in this KIP and
> > restrict it to naming repartition topics
> >
> >
> >    1. It gives us a good head start for users to be able to do rolling
> >    updates once repartition topics have been named
> >    2. Undertaking any KIP changes almost always introduces some
> uncertainty
> >    with respect to what changes will introduce, so a smaller scope helps
> >    ensure the success of the KIP
> >    3. We have an existing proposal KIP-307 for processor naming and I
> think
> >    it's best to do it all in one singular KIP so we don't have to undo
> any
> >    inconsistent previous work.
> >
> > So I'll have to agree with Matthias and Guozhang in limiting the scope of
> > this KIP to naming repartition topics.
> >
> > To that end, I'll also say we defer adding another general configuration
> > object `TableJoined` to the work of naming operations and processors in
> > KIP-307
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Sep 13, 2018 at 6:49 PM John Roesler <jo...@confluent.io> wrote:
> >
> > > Hey all,
> > >
> > > I think it's slightly out of scope for this KIP, but I'm not sure it's
> > > right to add a name to ValueJoiner or KeyValueMapper.
> > > Both of those are "functional interfaces", that is, they are basically
> > > named functions.
> > >
> > > It seems like we should preserve this property both to provide a clean
> > > separation between the operation *logic* and the operator *config*.
> > >
> > > It's a good point that `Joined` doesn't appear in the KTable join.
> > However,
> > > KTable join does have the variant that accepts "Materialized".
> > > This makes it similar to the grouping operators that currently accept
> > > "Serialized", namely the operator is already configurable, but only in
> > > a narrow way (we can only configure the materialization of the table or
> > the
> > > serialization of the grouping).
> > >
> > > Consider as an alternative the KStream.join operation that takes a
> config
> > > object called "Joined". This implies no more than that we are
> > > configuring the join operation. If we are deciding to allow adding a
> name
> > > to the operation, it's natural to add it to the operation's config.
> > >
> > > Conversely, we also want to name the KTable.join operation, not the
> > > materialization itself (which already has a name with separate
> > semantics).
> > >
> > > To me, this suggests that, just like we propose to subsume the
> Serialized
> > > config for grouping into a more general config called Grouped, we
> should
> > > also want to do something similar and replace `KTable#join(KTable<>,
> > > ValueJoiner<>, Materialized<>)` with one taking a more general
> > > configuration, maybe:
> > > `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
> > >
> > > Does this make sense?
> > >
> > > Thanks,
> > > -John
> > >
> > > On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wa...@gmail.com>
> > wrote:
> > >
> > > > Just to clarify on 2): currently KIP-307 do not have proposed APIs
> for
> > > > `groupBy/groupByKey` naming schemes, and for joins its current
> proposal
> > > is
> > > > to extend ValueJoiner with Named and hence this part is what I meant
> to
> > > > have "overlaps".
> > > >
> > > > Thinking about it a bit more, since Joined is only used for S-S and
> S-T
> > > > joins but not T-T joins, having the naming schemes on Joined would
> not
> > be
> > > > sufficient, and extending ValueJoiner would indeed be a good choice.
> > > >
> > > > As for groupBy, since it is using KeyValueMapper which is supposed to
> > be
> > > > extended with Named in KIP-307, it does not require extending to
> > > processor
> > > > nodes as well.
> > > >
> > > >
> > > > Given this, I'm fine with limiting the scope to only repartition
> > topics.
> > > >
> > > >
> > > > Guozhang
> > > >
> > > > On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <
> > matthias@confluent.io
> > > >
> > > > wrote:
> > > >
> > > > > Follow up comments:
> > > > >
> > > > > 1) We should either use `[app-id]-this|other-[join-
> name]-repartition`
> > > or
> > > > > `app-id]-[join-name]-left|right-repartition` but we should not
> change
> > > > > the pattern depending if the user specifies a name of not. I am
> fine
> > > > > with both patterns---just want to make sure with stick with one.
> > > > >
> > > > > 2) I didn't see why we would need to do this in this KIP. KIP-307
> > seems
> > > > > to be orthogonal, and thus KIP-372 should not change any processor
> > > > > names, but KIP-307 should define a holistic strategy for all
> > processor.
> > > > > Otherwise, we might up with different strategies or revert what we
> > > > > decide in this KIP if it's not compatible with KIP-307.
> > > > >
> > > > >
> > > > > -Matthias
> > > > >
> > > > >
> > > > > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > > > > Hello Bill,
> > > > > >
> > > > > > I made a pass over your proposal and here are some questions:
> > > > > >
> > > > > > 1. For Joined names, the current proposal is to define the
> > > repartition
> > > > > > topic names as
> > > > > >
> > > > > > * [app-id]-this-[join-name]-repartition
> > > > > >
> > > > > > * [app-id]-other-[join-name]-repartition
> > > > > >
> > > > > >
> > > > > > And if [join-name] not specified, stay the same, which is:
> > > > > >
> > > > > > * [previous-processor-name]-repartition for both Stream-Stream
> > (S-S)
> > > > > join
> > > > > > and S-T join
> > > > > >
> > > > > > I think it is more natural to rename it to
> > > > > >
> > > > > > * [app-id]-[join-name]-left-repartition
> > > > > >
> > > > > > * [app-id]-[join-name]-right-repartition
> > > > > >
> > > > > >
> > > > > > 2. I'd suggest to use the name to also define the corresponding
> > > > processor
> > > > > > names accordingly, in addition to the repartition topic names.
> Note
> > > > that
> > > > > > for joins, this may be overlapping with KIP-307
> > > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > > > > as
> > > > > > it also have proposals for defining processor names for join
> > > operators
> > > > as
> > > > > > well.
> > > > > >
> > > > > > 3. Could you also specify how this would affect the optimization
> > for
> > > > > > merging multiple repartition topics?
> > > > > >
> > > > > > 4. In the "Compatibility, Deprecation, and Migration Plan"
> section,
> > > > could
> > > > > > you also mention the following scenarios, if any of the upgrade
> > path
> > > > > would
> > > > > > be changed:
> > > > > >
> > > > > >  a) changing user DSL code: under which scenarios users can now
> do
> > a
> > > > > > rolling bounce instead of resetting applications.
> > > > > >
> > > > > >  b) upgrading from older version to new version, with all the
> names
> > > > > > specified, and with optimization turned on. E.g. say we have the
> > code
> > > > > > written in 2.1 with all names specified, and now upgrading to 2.2
> > > with
> > > > > new
> > > > > > optimizations that may potentially change the repartition topics.
> > Is
> > > > that
> > > > > > always safe to do?
> > > > > >
> > > > > >
> > > > > >
> > > > > > Guozhang
> > > > > >
> > > > > >
> > > > > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > >> All I'd like to start a discussion on KIP-372 for the naming of
> > > joins
> > > > > and
> > > > > >> grouping operations in Kafka Streams.
> > > > > >>
> > > > > >> The KIP page can be found here:
> > > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > > >> 372%3A+Naming+Joins+and+Grouping
> > > > > >>
> > > > > >> I look forward to feedback and comments.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Bill
> > > > > >>
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > -- Guozhang
> > > >
> > >
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by John Roesler <jo...@confluent.io>.

Hey Bill (and Guozhang),

Apologies. I misunderstood what Guozhang was getting at in his earlier
remark.

I'm a +1 on the current proposal.

Thanks,
-John

On Fri, Sep 14, 2018 at 3:10 PM Bill Bejeck <bb...@gmail.com> wrote:

> All,
>
> Thanks for the comments, I'll respond to all of your comments in the
> recieved order.
>
>
> Guozhang,
>
> >  And if [join-name] not specified, stay the same, which is:
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
> and S-T join
>
>   I believe the current approach is
> [appId]-[previous-processor-name]-repartition, but the main point is if no
> name is provided, the current naming scheme will remain in place
>
> >  I think it is more natural to rename it to
> >  * [app-id]-[join-name]-left-repartition
> >  * [app-id]-[join-name]-right-repartition
>
> I agree and I'll update the KIP accordingly
>
> > 2 I'd suggest to use the name to also define the corresponding processor
>
> I'll address this in conjunction with comments from Matthias and your
> second round of comments.
>
>
> >3. Could you also specify how this would affect the optimization for
>
>                  merging multiple repartition topics?
>
>           With an optimization where we reduce the number of repartition
> topics, if the user has not named the repartition topics, we'll generate a
> name for
>           the optimized repartition topic. In the case where users have
> provided names for the repartition topics,
>           we'll reuse the name from the first of the named repartition
> topics.
>           However, optimizing a topology by reducing the number of
> repartition topics may require an application reset, but that is beyond the
> scope of this KIP.
>
>           I'll update the KIP with this case
>
>           I've also addressed 4.a and 4.b in the "Compatibility,
> Deprecation and Migration plan" section
>
>
> Matthias
>
> > 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> > `app-id]-[join-name]-left|right-repartition` but we should not change
> > the pattern depending if the user specifies a name of not.
>
> I've updated the KIP to say in the case of Joins and a name is provided
> we'll go with `app-id-name-left|right-repartition` in all cases if the name
> is not provided
> we will continue to use the current naming process.
>
> > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > to be orthogonal, and thus KIP-372 should not change any processor
> > names, but KIP-307 should define a holistic strategy for all processor.
> > Otherwise, we might up with different strategies or revert what we
> > decide in this KIP if it's not compatible with KIP-307.
>
> I agree and have updated the name of this KIP to be more precise.
>
> Eno
>
> > I know we don't normally have a "Related work" section in KIPs, but
> > sometimes I find it useful to see what others have done in similar cases.
> > Since this will be important for rolling re-deployments, I wonder what
> > other frameworks like Flink (or Samza) have done in these cases. Perhaps
> > they have done nothing, in which case it's fine to do this from first
> > principles, but IMO it would be good to know just to make sure we're
> > heading in the right direction.
>
> > Also I don't get a good feel for how much work this will be for an end
> user
> > who is doing the rolling deployment, perhaps an end-to-end example would
> > help.
>
> I can't speak for the other projects, so I'll have to take a look and see.
>
> As for getting a feel for how much work this will be for an end user, could
> you look at the
> updated KIP and see if it clears things up?
>
>
> Matthias
>
> > (1) For `Grouped` should we add `with(String name, Serde<K> key,
> > Serde<V> value)` to allow specifying all parameters at once?
>
> > Produced/Consumed/Serialized etc follow a similar pattern. There is one
> > static method for each config parameter, plus one static method that
> > accepts all parameters. Might be more consistent if we follow this
> pattern.
>
> > (2) It seems that `Serialized` is only used in `groupBy` and
> > `groupByKey` -- because both methods accepting `Serialized` parameter
> > are deprecated and replaced with methods accepting `Grouped`, it seems
> > that we also want to deprecate `Serialized`?
>
> > (3) About naming repartition topics: thinking about this once more, I
> > actually prefer to use `left|right` instead of `this|other` :)
>
> For 1, I agree and I'll update the KIP.
>
> For 2, Yes that seems to make sense and I'll update the KIP
>
> Part 3 addressed in earlier comments and the KIP is clarified.
>
> Guozhang
>
> > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> is
> > to extend ValueJoiner with Named and hence this part is what I meant to
> > have "overlaps".
>
> > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > joins but not T-T joins, having the naming schemes on Joined would not be
> > sufficient, and extending ValueJoiner would indeed be a good choice.
>
> > As for groupBy, since it is using KeyValueMapper which is supposed to be
> > extended with Named in KIP-307, it does not require extending to
> processor
> > nodes as well.
>
> > Given this, I'm fine with limiting the scope to only repartition topics.
>
> I agree with limiting the scope of this KIP to only repartition topics and
> have updated the KIP
>
> John
>
> > I think it's slightly out of scope for this KIP, but I'm not sure it's
> > right to add a name to ValueJoiner or KeyValueMapper.
> > Both of those are "functional interfaces", that is, they are basically
> > named functions.
> >
> > It seems like we should preserve this property both to provide a clean
> > separation between the operation *logic* and the operator *config*.
> >
> > It's a good point that `Joined` doesn't appear in the KTable join.
> However,
> > KTable join does have the variant that accepts "Materialized".
> > This makes it similar to the grouping operators that currently accept
> > "Serialized", namely the operator is already configurable, but only in
> > a narrow way (we can only configure the materialization of the table or
> the
> > serialization of the grouping).
> >
> > Consider as an alternative the KStream.join operation that takes a config
> > object called "Joined". This implies no more than that we are
> > configuring the join operation. If we are deciding to allow adding a name
> > to the operation, it's natural to add it to the operation's config.
> >
> > Conversely, we also want to name the KTable.join operation, not the
> > materialization itself (which already has a name with separate
> semantics).
> >
> > To me, this suggests that, just like we propose to subsume the Serialized
> > config for grouping into a more general config called Grouped, we should
> > also want to do something similar and replace `KTable#join(KTable<>,
> > ValueJoiner<>, Materialized<>)` with one taking a more general
> > configuration, maybe:
> > `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>
> I'm not sure I completely follow what you are suggesting.  But it seems
> like you are in favor of combining the naming of join
>
> and grouping operation and processors with this KIP.
>
> I'll say that I don't agree we should do the renaming in this KIP and
> restrict it to naming repartition topics
>
>
>    1. It gives us a good head start for users to be able to do rolling
>    updates once repartition topics have been named
>    2. Undertaking any KIP changes almost always introduces some uncertainty
>    with respect to what changes will introduce, so a smaller scope helps
>    ensure the success of the KIP
>    3. We have an existing proposal KIP-307 for processor naming and I think
>    it's best to do it all in one singular KIP so we don't have to undo any
>    inconsistent previous work.
>
> So I'll have to agree with Matthias and Guozhang in limiting the scope of
> this KIP to naming repartition topics.
>
> To that end, I'll also say we defer adding another general configuration
> object `TableJoined` to the work of naming operations and processors in
> KIP-307
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> On Thu, Sep 13, 2018 at 6:49 PM John Roesler <jo...@confluent.io> wrote:
>
> > Hey all,
> >
> > I think it's slightly out of scope for this KIP, but I'm not sure it's
> > right to add a name to ValueJoiner or KeyValueMapper.
> > Both of those are "functional interfaces", that is, they are basically
> > named functions.
> >
> > It seems like we should preserve this property both to provide a clean
> > separation between the operation *logic* and the operator *config*.
> >
> > It's a good point that `Joined` doesn't appear in the KTable join.
> However,
> > KTable join does have the variant that accepts "Materialized".
> > This makes it similar to the grouping operators that currently accept
> > "Serialized", namely the operator is already configurable, but only in
> > a narrow way (we can only configure the materialization of the table or
> the
> > serialization of the grouping).
> >
> > Consider as an alternative the KStream.join operation that takes a config
> > object called "Joined". This implies no more than that we are
> > configuring the join operation. If we are deciding to allow adding a name
> > to the operation, it's natural to add it to the operation's config.
> >
> > Conversely, we also want to name the KTable.join operation, not the
> > materialization itself (which already has a name with separate
> semantics).
> >
> > To me, this suggests that, just like we propose to subsume the Serialized
> > config for grouping into a more general config called Grouped, we should
> > also want to do something similar and replace `KTable#join(KTable<>,
> > ValueJoiner<>, Materialized<>)` with one taking a more general
> > configuration, maybe:
> > `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
> >
> > Does this make sense?
> >
> > Thanks,
> > -John
> >
> > On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wa...@gmail.com>
> wrote:
> >
> > > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> > is
> > > to extend ValueJoiner with Named and hence this part is what I meant to
> > > have "overlaps".
> > >
> > > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > > joins but not T-T joins, having the naming schemes on Joined would not
> be
> > > sufficient, and extending ValueJoiner would indeed be a good choice.
> > >
> > > As for groupBy, since it is using KeyValueMapper which is supposed to
> be
> > > extended with Named in KIP-307, it does not require extending to
> > processor
> > > nodes as well.
> > >
> > >
> > > Given this, I'm fine with limiting the scope to only repartition
> topics.
> > >
> > >
> > > Guozhang
> > >
> > > On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <
> matthias@confluent.io
> > >
> > > wrote:
> > >
> > > > Follow up comments:
> > > >
> > > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> > or
> > > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > > the pattern depending if the user specifies a name of not. I am fine
> > > > with both patterns---just want to make sure with stick with one.
> > > >
> > > > 2) I didn't see why we would need to do this in this KIP. KIP-307
> seems
> > > > to be orthogonal, and thus KIP-372 should not change any processor
> > > > names, but KIP-307 should define a holistic strategy for all
> processor.
> > > > Otherwise, we might up with different strategies or revert what we
> > > > decide in this KIP if it's not compatible with KIP-307.
> > > >
> > > >
> > > > -Matthias
> > > >
> > > >
> > > > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > > > Hello Bill,
> > > > >
> > > > > I made a pass over your proposal and here are some questions:
> > > > >
> > > > > 1. For Joined names, the current proposal is to define the
> > repartition
> > > > > topic names as
> > > > >
> > > > > * [app-id]-this-[join-name]-repartition
> > > > >
> > > > > * [app-id]-other-[join-name]-repartition
> > > > >
> > > > >
> > > > > And if [join-name] not specified, stay the same, which is:
> > > > >
> > > > > * [previous-processor-name]-repartition for both Stream-Stream
> (S-S)
> > > > join
> > > > > and S-T join
> > > > >
> > > > > I think it is more natural to rename it to
> > > > >
> > > > > * [app-id]-[join-name]-left-repartition
> > > > >
> > > > > * [app-id]-[join-name]-right-repartition
> > > > >
> > > > >
> > > > > 2. I'd suggest to use the name to also define the corresponding
> > > processor
> > > > > names accordingly, in addition to the repartition topic names. Note
> > > that
> > > > > for joins, this may be overlapping with KIP-307
> > > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > > > as
> > > > > it also have proposals for defining processor names for join
> > operators
> > > as
> > > > > well.
> > > > >
> > > > > 3. Could you also specify how this would affect the optimization
> for
> > > > > merging multiple repartition topics?
> > > > >
> > > > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> > > could
> > > > > you also mention the following scenarios, if any of the upgrade
> path
> > > > would
> > > > > be changed:
> > > > >
> > > > >  a) changing user DSL code: under which scenarios users can now do
> a
> > > > > rolling bounce instead of resetting applications.
> > > > >
> > > > >  b) upgrading from older version to new version, with all the names
> > > > > specified, and with optimization turned on. E.g. say we have the
> code
> > > > > written in 2.1 with all names specified, and now upgrading to 2.2
> > with
> > > > new
> > > > > optimizations that may potentially change the repartition topics.
> Is
> > > that
> > > > > always safe to do?
> > > > >
> > > > >
> > > > >
> > > > > Guozhang
> > > > >
> > > > >
> > > > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com>
> > > wrote:
> > > > >
> > > > >> All I'd like to start a discussion on KIP-372 for the naming of
> > joins
> > > > and
> > > > >> grouping operations in Kafka Streams.
> > > > >>
> > > > >> The KIP page can be found here:
> > > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > > >> 372%3A+Naming+Joins+and+Grouping
> > > > >>
> > > > >> I look forward to feedback and comments.
> > > > >>
> > > > >> Thanks,
> > > > >> Bill
> > > > >>
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > -- Guozhang
> > >
> >
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by Bill Bejeck <bb...@gmail.com>.

All,

Thanks for the comments, I'll respond to all of your comments in the
recieved order.


Guozhang,

>  And if [join-name] not specified, stay the same, which is:
> * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
and S-T join

  I believe the current approach is
[appId]-[previous-processor-name]-repartition, but the main point is if no
name is provided, the current naming scheme will remain in place

>  I think it is more natural to rename it to
>  * [app-id]-[join-name]-left-repartition
>  * [app-id]-[join-name]-right-repartition

I agree and I'll update the KIP accordingly

> 2 I'd suggest to use the name to also define the corresponding processor

I'll address this in conjunction with comments from Matthias and your
second round of comments.


>3. Could you also specify how this would affect the optimization for

                 merging multiple repartition topics?

          With an optimization where we reduce the number of repartition
topics, if the user has not named the repartition topics, we'll generate a
name for
          the optimized repartition topic. In the case where users have
provided names for the repartition topics,
          we'll reuse the name from the first of the named repartition
topics.
          However, optimizing a topology by reducing the number of
repartition topics may require an application reset, but that is beyond the
scope of this KIP.

          I'll update the KIP with this case

          I've also addressed 4.a and 4.b in the "Compatibility,
Deprecation and Migration plan" section


Matthias

> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> `app-id]-[join-name]-left|right-repartition` but we should not change
> the pattern depending if the user specifies a name of not.

I've updated the KIP to say in the case of Joins and a name is provided
we'll go with `app-id-name-left|right-repartition` in all cases if the name
is not provided
we will continue to use the current naming process.

> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> to be orthogonal, and thus KIP-372 should not change any processor
> names, but KIP-307 should define a holistic strategy for all processor.
> Otherwise, we might up with different strategies or revert what we
> decide in this KIP if it's not compatible with KIP-307.

I agree and have updated the name of this KIP to be more precise.

Eno

> I know we don't normally have a "Related work" section in KIPs, but
> sometimes I find it useful to see what others have done in similar cases.
> Since this will be important for rolling re-deployments, I wonder what
> other frameworks like Flink (or Samza) have done in these cases. Perhaps
> they have done nothing, in which case it's fine to do this from first
> principles, but IMO it would be good to know just to make sure we're
> heading in the right direction.

> Also I don't get a good feel for how much work this will be for an end
user
> who is doing the rolling deployment, perhaps an end-to-end example would
> help.

I can't speak for the other projects, so I'll have to take a look and see.

As for getting a feel for how much work this will be for an end user, could
you look at the
updated KIP and see if it clears things up?


Matthias

> (1) For `Grouped` should we add `with(String name, Serde<K> key,
> Serde<V> value)` to allow specifying all parameters at once?

> Produced/Consumed/Serialized etc follow a similar pattern. There is one
> static method for each config parameter, plus one static method that
> accepts all parameters. Might be more consistent if we follow this
pattern.

> (2) It seems that `Serialized` is only used in `groupBy` and
> `groupByKey` -- because both methods accepting `Serialized` parameter
> are deprecated and replaced with methods accepting `Grouped`, it seems
> that we also want to deprecate `Serialized`?

> (3) About naming repartition topics: thinking about this once more, I
> actually prefer to use `left|right` instead of `this|other` :)

For 1, I agree and I'll update the KIP.

For 2, Yes that seems to make sense and I'll update the KIP

Part 3 addressed in earlier comments and the KIP is clarified.

Guozhang

> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> `groupBy/groupByKey` naming schemes, and for joins its current proposal is
> to extend ValueJoiner with Named and hence this part is what I meant to
> have "overlaps".

> Thinking about it a bit more, since Joined is only used for S-S and S-T
> joins but not T-T joins, having the naming schemes on Joined would not be
> sufficient, and extending ValueJoiner would indeed be a good choice.

> As for groupBy, since it is using KeyValueMapper which is supposed to be
> extended with Named in KIP-307, it does not require extending to processor
> nodes as well.

> Given this, I'm fine with limiting the scope to only repartition topics.

I agree with limiting the scope of this KIP to only repartition topics and
have updated the KIP

John

> I think it's slightly out of scope for this KIP, but I'm not sure it's
> right to add a name to ValueJoiner or KeyValueMapper.
> Both of those are "functional interfaces", that is, they are basically
> named functions.
>
> It seems like we should preserve this property both to provide a clean
> separation between the operation *logic* and the operator *config*.
>
> It's a good point that `Joined` doesn't appear in the KTable join.
However,
> KTable join does have the variant that accepts "Materialized".
> This makes it similar to the grouping operators that currently accept
> "Serialized", namely the operator is already configurable, but only in
> a narrow way (we can only configure the materialization of the table or
the
> serialization of the grouping).
>
> Consider as an alternative the KStream.join operation that takes a config
> object called "Joined". This implies no more than that we are
> configuring the join operation. If we are deciding to allow adding a name
> to the operation, it's natural to add it to the operation's config.
>
> Conversely, we also want to name the KTable.join operation, not the
> materialization itself (which already has a name with separate semantics).
>
> To me, this suggests that, just like we propose to subsume the Serialized
> config for grouping into a more general config called Grouped, we should
> also want to do something similar and replace `KTable#join(KTable<>,
> ValueJoiner<>, Materialized<>)` with one taking a more general
> configuration, maybe:
> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?

I'm not sure I completely follow what you are suggesting.  But it seems
like you are in favor of combining the naming of join

and grouping operation and processors with this KIP.

I'll say that I don't agree we should do the renaming in this KIP and
restrict it to naming repartition topics


   1. It gives us a good head start for users to be able to do rolling
   updates once repartition topics have been named
   2. Undertaking any KIP changes almost always introduces some uncertainty
   with respect to what changes will introduce, so a smaller scope helps
   ensure the success of the KIP
   3. We have an existing proposal KIP-307 for processor naming and I think
   it's best to do it all in one singular KIP so we don't have to undo any
   inconsistent previous work.

So I'll have to agree with Matthias and Guozhang in limiting the scope of
this KIP to naming repartition topics.

To that end, I'll also say we defer adding another general configuration
object `TableJoined` to the work of naming operations and processors in
KIP-307

























On Thu, Sep 13, 2018 at 6:49 PM John Roesler <jo...@confluent.io> wrote:

> Hey all,
>
> I think it's slightly out of scope for this KIP, but I'm not sure it's
> right to add a name to ValueJoiner or KeyValueMapper.
> Both of those are "functional interfaces", that is, they are basically
> named functions.
>
> It seems like we should preserve this property both to provide a clean
> separation between the operation *logic* and the operator *config*.
>
> It's a good point that `Joined` doesn't appear in the KTable join. However,
> KTable join does have the variant that accepts "Materialized".
> This makes it similar to the grouping operators that currently accept
> "Serialized", namely the operator is already configurable, but only in
> a narrow way (we can only configure the materialization of the table or the
> serialization of the grouping).
>
> Consider as an alternative the KStream.join operation that takes a config
> object called "Joined". This implies no more than that we are
> configuring the join operation. If we are deciding to allow adding a name
> to the operation, it's natural to add it to the operation's config.
>
> Conversely, we also want to name the KTable.join operation, not the
> materialization itself (which already has a name with separate semantics).
>
> To me, this suggests that, just like we propose to subsume the Serialized
> config for grouping into a more general config called Grouped, we should
> also want to do something similar and replace `KTable#join(KTable<>,
> ValueJoiner<>, Materialized<>)` with one taking a more general
> configuration, maybe:
> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>
> Does this make sense?
>
> Thanks,
> -John
>
> On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> is
> > to extend ValueJoiner with Named and hence this part is what I meant to
> > have "overlaps".
> >
> > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > joins but not T-T joins, having the naming schemes on Joined would not be
> > sufficient, and extending ValueJoiner would indeed be a good choice.
> >
> > As for groupBy, since it is using KeyValueMapper which is supposed to be
> > extended with Named in KIP-307, it does not require extending to
> processor
> > nodes as well.
> >
> >
> > Given this, I'm fine with limiting the scope to only repartition topics.
> >
> >
> > Guozhang
> >
> > On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <matthias@confluent.io
> >
> > wrote:
> >
> > > Follow up comments:
> > >
> > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> or
> > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > the pattern depending if the user specifies a name of not. I am fine
> > > with both patterns---just want to make sure with stick with one.
> > >
> > > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > > to be orthogonal, and thus KIP-372 should not change any processor
> > > names, but KIP-307 should define a holistic strategy for all processor.
> > > Otherwise, we might up with different strategies or revert what we
> > > decide in this KIP if it's not compatible with KIP-307.
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > > Hello Bill,
> > > >
> > > > I made a pass over your proposal and here are some questions:
> > > >
> > > > 1. For Joined names, the current proposal is to define the
> repartition
> > > > topic names as
> > > >
> > > > * [app-id]-this-[join-name]-repartition
> > > >
> > > > * [app-id]-other-[join-name]-repartition
> > > >
> > > >
> > > > And if [join-name] not specified, stay the same, which is:
> > > >
> > > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> > > join
> > > > and S-T join
> > > >
> > > > I think it is more natural to rename it to
> > > >
> > > > * [app-id]-[join-name]-left-repartition
> > > >
> > > > * [app-id]-[join-name]-right-repartition
> > > >
> > > >
> > > > 2. I'd suggest to use the name to also define the corresponding
> > processor
> > > > names accordingly, in addition to the repartition topic names. Note
> > that
> > > > for joins, this may be overlapping with KIP-307
> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > > as
> > > > it also have proposals for defining processor names for join
> operators
> > as
> > > > well.
> > > >
> > > > 3. Could you also specify how this would affect the optimization for
> > > > merging multiple repartition topics?
> > > >
> > > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> > could
> > > > you also mention the following scenarios, if any of the upgrade path
> > > would
> > > > be changed:
> > > >
> > > >  a) changing user DSL code: under which scenarios users can now do a
> > > > rolling bounce instead of resetting applications.
> > > >
> > > >  b) upgrading from older version to new version, with all the names
> > > > specified, and with optimization turned on. E.g. say we have the code
> > > > written in 2.1 with all names specified, and now upgrading to 2.2
> with
> > > new
> > > > optimizations that may potentially change the repartition topics. Is
> > that
> > > > always safe to do?
> > > >
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com>
> > wrote:
> > > >
> > > >> All I'd like to start a discussion on KIP-372 for the naming of
> joins
> > > and
> > > >> grouping operations in Kafka Streams.
> > > >>
> > > >> The KIP page can be found here:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > >> 372%3A+Naming+Joins+and+Grouping
> > > >>
> > > >> I look forward to feedback and comments.
> > > >>
> > > >> Thanks,
> > > >> Bill
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by Guozhang Wang <wa...@gmail.com>.

Hello John,

Not sure if I completely understand your email above. Are you suggesting to
still use the proposed Joined / Grouped object to indicate the underlying
processor names *in addition to* the repartition topic names?

My reasoning is that, if we do not want to use the proposed names to
indicate underlying processor names, only the repartition topic names, then
the current proposal looks good since KTable#join would not introduce
repartition topics at all. AND regarding underlying processor names, it is
suggested that we should only consider it in KIP-307 separately. Are you
suggesting that we should consider them together?


Guozhang

On Thu, Sep 13, 2018 at 3:49 PM, John Roesler <jo...@confluent.io> wrote:

> Hey all,
>
> I think it's slightly out of scope for this KIP, but I'm not sure it's
> right to add a name to ValueJoiner or KeyValueMapper.
> Both of those are "functional interfaces", that is, they are basically
> named functions.
>
> It seems like we should preserve this property both to provide a clean
> separation between the operation *logic* and the operator *config*.
>
> It's a good point that `Joined` doesn't appear in the KTable join. However,
> KTable join does have the variant that accepts "Materialized".
> This makes it similar to the grouping operators that currently accept
> "Serialized", namely the operator is already configurable, but only in
> a narrow way (we can only configure the materialization of the table or the
> serialization of the grouping).
>
> Consider as an alternative the KStream.join operation that takes a config
> object called "Joined". This implies no more than that we are
> configuring the join operation. If we are deciding to allow adding a name
> to the operation, it's natural to add it to the operation's config.
>
> Conversely, we also want to name the KTable.join operation, not the
> materialization itself (which already has a name with separate semantics).
>
> To me, this suggests that, just like we propose to subsume the Serialized
> config for grouping into a more general config called Grouped, we should
> also want to do something similar and replace `KTable#join(KTable<>,
> ValueJoiner<>, Materialized<>)` with one taking a more general
> configuration, maybe:
> `KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?
>
> Does this make sense?
>
> Thanks,
> -John
>
> On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wa...@gmail.com> wrote:
>
> > Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> > `groupBy/groupByKey` naming schemes, and for joins its current proposal
> is
> > to extend ValueJoiner with Named and hence this part is what I meant to
> > have "overlaps".
> >
> > Thinking about it a bit more, since Joined is only used for S-S and S-T
> > joins but not T-T joins, having the naming schemes on Joined would not be
> > sufficient, and extending ValueJoiner would indeed be a good choice.
> >
> > As for groupBy, since it is using KeyValueMapper which is supposed to be
> > extended with Named in KIP-307, it does not require extending to
> processor
> > nodes as well.
> >
> >
> > Given this, I'm fine with limiting the scope to only repartition topics.
> >
> >
> > Guozhang
> >
> > On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <matthias@confluent.io
> >
> > wrote:
> >
> > > Follow up comments:
> > >
> > > 1) We should either use `[app-id]-this|other-[join-name]-repartition`
> or
> > > `app-id]-[join-name]-left|right-repartition` but we should not change
> > > the pattern depending if the user specifies a name of not. I am fine
> > > with both patterns---just want to make sure with stick with one.
> > >
> > > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > > to be orthogonal, and thus KIP-372 should not change any processor
> > > names, but KIP-307 should define a holistic strategy for all processor.
> > > Otherwise, we might up with different strategies or revert what we
> > > decide in this KIP if it's not compatible with KIP-307.
> > >
> > >
> > > -Matthias
> > >
> > >
> > > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > > Hello Bill,
> > > >
> > > > I made a pass over your proposal and here are some questions:
> > > >
> > > > 1. For Joined names, the current proposal is to define the
> repartition
> > > > topic names as
> > > >
> > > > * [app-id]-this-[join-name]-repartition
> > > >
> > > > * [app-id]-other-[join-name]-repartition
> > > >
> > > >
> > > > And if [join-name] not specified, stay the same, which is:
> > > >
> > > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> > > join
> > > > and S-T join
> > > >
> > > > I think it is more natural to rename it to
> > > >
> > > > * [app-id]-[join-name]-left-repartition
> > > >
> > > > * [app-id]-[join-name]-right-repartition
> > > >
> > > >
> > > > 2. I'd suggest to use the name to also define the corresponding
> > processor
> > > > names accordingly, in addition to the repartition topic names. Note
> > that
> > > > for joins, this may be overlapping with KIP-307
> > > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > > as
> > > > it also have proposals for defining processor names for join
> operators
> > as
> > > > well.
> > > >
> > > > 3. Could you also specify how this would affect the optimization for
> > > > merging multiple repartition topics?
> > > >
> > > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> > could
> > > > you also mention the following scenarios, if any of the upgrade path
> > > would
> > > > be changed:
> > > >
> > > >  a) changing user DSL code: under which scenarios users can now do a
> > > > rolling bounce instead of resetting applications.
> > > >
> > > >  b) upgrading from older version to new version, with all the names
> > > > specified, and with optimization turned on. E.g. say we have the code
> > > > written in 2.1 with all names specified, and now upgrading to 2.2
> with
> > > new
> > > > optimizations that may potentially change the repartition topics. Is
> > that
> > > > always safe to do?
> > > >
> > > >
> > > >
> > > > Guozhang
> > > >
> > > >
> > > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com>
> > wrote:
> > > >
> > > >> All I'd like to start a discussion on KIP-372 for the naming of
> joins
> > > and
> > > >> grouping operations in Kafka Streams.
> > > >>
> > > >> The KIP page can be found here:
> > > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > > >> 372%3A+Naming+Joins+and+Grouping
> > > >>
> > > >> I look forward to feedback and comments.
> > > >>
> > > >> Thanks,
> > > >> Bill
> > > >>
> > > >
> > > >
> > > >
> > >
> > >
> >
> >
> > --
> > -- Guozhang
> >
>



-- 
-- Guozhang

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by John Roesler <jo...@confluent.io>.

Hey all,

I think it's slightly out of scope for this KIP, but I'm not sure it's
right to add a name to ValueJoiner or KeyValueMapper.
Both of those are "functional interfaces", that is, they are basically
named functions.

It seems like we should preserve this property both to provide a clean
separation between the operation *logic* and the operator *config*.

It's a good point that `Joined` doesn't appear in the KTable join. However,
KTable join does have the variant that accepts "Materialized".
This makes it similar to the grouping operators that currently accept
"Serialized", namely the operator is already configurable, but only in
a narrow way (we can only configure the materialization of the table or the
serialization of the grouping).

Consider as an alternative the KStream.join operation that takes a config
object called "Joined". This implies no more than that we are
configuring the join operation. If we are deciding to allow adding a name
to the operation, it's natural to add it to the operation's config.

Conversely, we also want to name the KTable.join operation, not the
materialization itself (which already has a name with separate semantics).

To me, this suggests that, just like we propose to subsume the Serialized
config for grouping into a more general config called Grouped, we should
also want to do something similar and replace `KTable#join(KTable<>,
ValueJoiner<>, Materialized<>)` with one taking a more general
configuration, maybe:
`KTable#join(KTable<>, ValueJoiner<>, TableJoined<>)`?

Does this make sense?

Thanks,
-John

On Thu, Sep 13, 2018 at 3:45 PM Guozhang Wang <wa...@gmail.com> wrote:

> Just to clarify on 2): currently KIP-307 do not have proposed APIs for
> `groupBy/groupByKey` naming schemes, and for joins its current proposal is
> to extend ValueJoiner with Named and hence this part is what I meant to
> have "overlaps".
>
> Thinking about it a bit more, since Joined is only used for S-S and S-T
> joins but not T-T joins, having the naming schemes on Joined would not be
> sufficient, and extending ValueJoiner would indeed be a good choice.
>
> As for groupBy, since it is using KeyValueMapper which is supposed to be
> extended with Named in KIP-307, it does not require extending to processor
> nodes as well.
>
>
> Given this, I'm fine with limiting the scope to only repartition topics.
>
>
> Guozhang
>
> On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <ma...@confluent.io>
> wrote:
>
> > Follow up comments:
> >
> > 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> > `app-id]-[join-name]-left|right-repartition` but we should not change
> > the pattern depending if the user specifies a name of not. I am fine
> > with both patterns---just want to make sure with stick with one.
> >
> > 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> > to be orthogonal, and thus KIP-372 should not change any processor
> > names, but KIP-307 should define a holistic strategy for all processor.
> > Otherwise, we might up with different strategies or revert what we
> > decide in this KIP if it's not compatible with KIP-307.
> >
> >
> > -Matthias
> >
> >
> > On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > > Hello Bill,
> > >
> > > I made a pass over your proposal and here are some questions:
> > >
> > > 1. For Joined names, the current proposal is to define the repartition
> > > topic names as
> > >
> > > * [app-id]-this-[join-name]-repartition
> > >
> > > * [app-id]-other-[join-name]-repartition
> > >
> > >
> > > And if [join-name] not specified, stay the same, which is:
> > >
> > > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> > join
> > > and S-T join
> > >
> > > I think it is more natural to rename it to
> > >
> > > * [app-id]-[join-name]-left-repartition
> > >
> > > * [app-id]-[join-name]-right-repartition
> > >
> > >
> > > 2. I'd suggest to use the name to also define the corresponding
> processor
> > > names accordingly, in addition to the repartition topic names. Note
> that
> > > for joins, this may be overlapping with KIP-307
> > > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > > as
> > > it also have proposals for defining processor names for join operators
> as
> > > well.
> > >
> > > 3. Could you also specify how this would affect the optimization for
> > > merging multiple repartition topics?
> > >
> > > 4. In the "Compatibility, Deprecation, and Migration Plan" section,
> could
> > > you also mention the following scenarios, if any of the upgrade path
> > would
> > > be changed:
> > >
> > >  a) changing user DSL code: under which scenarios users can now do a
> > > rolling bounce instead of resetting applications.
> > >
> > >  b) upgrading from older version to new version, with all the names
> > > specified, and with optimization turned on. E.g. say we have the code
> > > written in 2.1 with all names specified, and now upgrading to 2.2 with
> > new
> > > optimizations that may potentially change the repartition topics. Is
> that
> > > always safe to do?
> > >
> > >
> > >
> > > Guozhang
> > >
> > >
> > > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com>
> wrote:
> > >
> > >> All I'd like to start a discussion on KIP-372 for the naming of joins
> > and
> > >> grouping operations in Kafka Streams.
> > >>
> > >> The KIP page can be found here:
> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> > >> 372%3A+Naming+Joins+and+Grouping
> > >>
> > >> I look forward to feedback and comments.
> > >>
> > >> Thanks,
> > >> Bill
> > >>
> > >
> > >
> > >
> >
> >
>
>
> --
> -- Guozhang
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by Guozhang Wang <wa...@gmail.com>.

Just to clarify on 2): currently KIP-307 do not have proposed APIs for
`groupBy/groupByKey` naming schemes, and for joins its current proposal is
to extend ValueJoiner with Named and hence this part is what I meant to
have "overlaps".

Thinking about it a bit more, since Joined is only used for S-S and S-T
joins but not T-T joins, having the naming schemes on Joined would not be
sufficient, and extending ValueJoiner would indeed be a good choice.

As for groupBy, since it is using KeyValueMapper which is supposed to be
extended with Named in KIP-307, it does not require extending to processor
nodes as well.


Given this, I'm fine with limiting the scope to only repartition topics.


Guozhang

On Wed, Sep 12, 2018 at 10:22 PM, Matthias J. Sax <ma...@confluent.io>
wrote:

> Follow up comments:
>
> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> `app-id]-[join-name]-left|right-repartition` but we should not change
> the pattern depending if the user specifies a name of not. I am fine
> with both patterns---just want to make sure with stick with one.
>
> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> to be orthogonal, and thus KIP-372 should not change any processor
> names, but KIP-307 should define a holistic strategy for all processor.
> Otherwise, we might up with different strategies or revert what we
> decide in this KIP if it's not compatible with KIP-307.
>
>
> -Matthias
>
>
> On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > Hello Bill,
> >
> > I made a pass over your proposal and here are some questions:
> >
> > 1. For Joined names, the current proposal is to define the repartition
> > topic names as
> >
> > * [app-id]-this-[join-name]-repartition
> >
> > * [app-id]-other-[join-name]-repartition
> >
> >
> > And if [join-name] not specified, stay the same, which is:
> >
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> join
> > and S-T join
> >
> > I think it is more natural to rename it to
> >
> > * [app-id]-[join-name]-left-repartition
> >
> > * [app-id]-[join-name]-right-repartition
> >
> >
> > 2. I'd suggest to use the name to also define the corresponding processor
> > names accordingly, in addition to the repartition topic names. Note that
> > for joins, this may be overlapping with KIP-307
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > as
> > it also have proposals for defining processor names for join operators as
> > well.
> >
> > 3. Could you also specify how this would affect the optimization for
> > merging multiple repartition topics?
> >
> > 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
> > you also mention the following scenarios, if any of the upgrade path
> would
> > be changed:
> >
> >  a) changing user DSL code: under which scenarios users can now do a
> > rolling bounce instead of resetting applications.
> >
> >  b) upgrading from older version to new version, with all the names
> > specified, and with optimization turned on. E.g. say we have the code
> > written in 2.1 with all names specified, and now upgrading to 2.2 with
> new
> > optimizations that may potentially change the repartition topics. Is that
> > always safe to do?
> >
> >
> >
> > Guozhang
> >
> >
> > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com> wrote:
> >
> >> All I'd like to start a discussion on KIP-372 for the naming of joins
> and
> >> grouping operations in Kafka Streams.
> >>
> >> The KIP page can be found here:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> 372%3A+Naming+Joins+and+Grouping
> >>
> >> I look forward to feedback and comments.
> >>
> >> Thanks,
> >> Bill
> >>
> >
> >
> >
>
>


-- 
-- Guozhang

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by "Matthias J. Sax" <ma...@confluent.io>.

Three more comments:

(1) For `Grouped` should we add `with(String name, Serde<K> key,
Serde<V> value)` to allow specifying all parameters at once?

Produced/Consumed/Serialized etc follow a similar pattern. There is one
static method for each config parameter, plus one static method that
accepts all parameters. Might be more consistent if we follow this pattern.

(2) It seems that `Serialized` is only used in `groupBy` and
`groupByKey` -- because both methods accepting `Serialized` parameter
are deprecated and replaced with methods accepting `Grouped`, it seems
that we also want to deprecate `Serialized`?

(3) About naming repartition topics: thinking about this once more, I
actually prefer to use `left|right` instead of `this|other` :)


-Matthias


On 9/13/18 6:45 AM, Matthias J. Sax wrote:
> I don't know what Samza does, however, Flink requires users to specify
> names similar to this proposal to be able to re-identify state in case
> the topology gets altered between deployments.
> 
> Flink only has state they need to worry about. For Kafka Streams, it's
> state plus repartition topics.
> 
> 
> -Matthias
> 
> On 9/13/18 1:48 AM, Eno Thereska wrote:
>> Hi folks,
>>
>> I know we don't normally have a "Related work" section in KIPs, but
>> sometimes I find it useful to see what others have done in similar cases.
>> Since this will be important for rolling re-deployments, I wonder what
>> other frameworks like Flink (or Samza) have done in these cases. Perhaps
>> they have done nothing, in which case it's fine to do this from first
>> principles, but IMO it would be good to know just to make sure we're
>> heading in the right direction.
>>
>> Also I don't get a good feel for how much work this will be for an end user
>> who is doing the rolling deployment, perhaps an end-to-end example would
>> help.
>>
>> Thanks
>> Eno
>>
>> On Thu, Sep 13, 2018 at 6:22 AM, Matthias J. Sax <ma...@confluent.io>
>> wrote:
>>
>>> Follow up comments:
>>>
>>> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
>>> `app-id]-[join-name]-left|right-repartition` but we should not change
>>> the pattern depending if the user specifies a name of not. I am fine
>>> with both patterns---just want to make sure with stick with one.
>>>
>>> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
>>> to be orthogonal, and thus KIP-372 should not change any processor
>>> names, but KIP-307 should define a holistic strategy for all processor.
>>> Otherwise, we might up with different strategies or revert what we
>>> decide in this KIP if it's not compatible with KIP-307.
>>>
>>>
>>> -Matthias
>>>
>>>
>>> On 9/12/18 6:28 PM, Guozhang Wang wrote:
>>>> Hello Bill,
>>>>
>>>> I made a pass over your proposal and here are some questions:
>>>>
>>>> 1. For Joined names, the current proposal is to define the repartition
>>>> topic names as
>>>>
>>>> * [app-id]-this-[join-name]-repartition
>>>>
>>>> * [app-id]-other-[join-name]-repartition
>>>>
>>>>
>>>> And if [join-name] not specified, stay the same, which is:
>>>>
>>>> * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>>> join
>>>> and S-T join
>>>>
>>>> I think it is more natural to rename it to
>>>>
>>>> * [app-id]-[join-name]-left-repartition
>>>>
>>>> * [app-id]-[join-name]-right-repartition
>>>>
>>>>
>>>> 2. I'd suggest to use the name to also define the corresponding processor
>>>> names accordingly, in addition to the repartition topic names. Note that
>>>> for joins, this may be overlapping with KIP-307
>>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
>>>> as
>>>> it also have proposals for defining processor names for join operators as
>>>> well.
>>>>
>>>> 3. Could you also specify how this would affect the optimization for
>>>> merging multiple repartition topics?
>>>>
>>>> 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
>>>> you also mention the following scenarios, if any of the upgrade path
>>> would
>>>> be changed:
>>>>
>>>>  a) changing user DSL code: under which scenarios users can now do a
>>>> rolling bounce instead of resetting applications.
>>>>
>>>>  b) upgrading from older version to new version, with all the names
>>>> specified, and with optimization turned on. E.g. say we have the code
>>>> written in 2.1 with all names specified, and now upgrading to 2.2 with
>>> new
>>>> optimizations that may potentially change the repartition topics. Is that
>>>> always safe to do?
>>>>
>>>>
>>>>
>>>> Guozhang
>>>>
>>>>
>>>> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com> wrote:
>>>>
>>>>> All I'd like to start a discussion on KIP-372 for the naming of joins
>>> and
>>>>> grouping operations in Kafka Streams.
>>>>>
>>>>> The KIP page can be found here:
>>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>>> 372%3A+Naming+Joins+and+Grouping
>>>>>
>>>>> I look forward to feedback and comments.
>>>>>
>>>>> Thanks,
>>>>> Bill
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by "Matthias J. Sax" <ma...@confluent.io>.

I don't know what Samza does, however, Flink requires users to specify
names similar to this proposal to be able to re-identify state in case
the topology gets altered between deployments.

Flink only has state they need to worry about. For Kafka Streams, it's
state plus repartition topics.


-Matthias

On 9/13/18 1:48 AM, Eno Thereska wrote:
> Hi folks,
> 
> I know we don't normally have a "Related work" section in KIPs, but
> sometimes I find it useful to see what others have done in similar cases.
> Since this will be important for rolling re-deployments, I wonder what
> other frameworks like Flink (or Samza) have done in these cases. Perhaps
> they have done nothing, in which case it's fine to do this from first
> principles, but IMO it would be good to know just to make sure we're
> heading in the right direction.
> 
> Also I don't get a good feel for how much work this will be for an end user
> who is doing the rolling deployment, perhaps an end-to-end example would
> help.
> 
> Thanks
> Eno
> 
> On Thu, Sep 13, 2018 at 6:22 AM, Matthias J. Sax <ma...@confluent.io>
> wrote:
> 
>> Follow up comments:
>>
>> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
>> `app-id]-[join-name]-left|right-repartition` but we should not change
>> the pattern depending if the user specifies a name of not. I am fine
>> with both patterns---just want to make sure with stick with one.
>>
>> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
>> to be orthogonal, and thus KIP-372 should not change any processor
>> names, but KIP-307 should define a holistic strategy for all processor.
>> Otherwise, we might up with different strategies or revert what we
>> decide in this KIP if it's not compatible with KIP-307.
>>
>>
>> -Matthias
>>
>>
>> On 9/12/18 6:28 PM, Guozhang Wang wrote:
>>> Hello Bill,
>>>
>>> I made a pass over your proposal and here are some questions:
>>>
>>> 1. For Joined names, the current proposal is to define the repartition
>>> topic names as
>>>
>>> * [app-id]-this-[join-name]-repartition
>>>
>>> * [app-id]-other-[join-name]-repartition
>>>
>>>
>>> And if [join-name] not specified, stay the same, which is:
>>>
>>> * [previous-processor-name]-repartition for both Stream-Stream (S-S)
>> join
>>> and S-T join
>>>
>>> I think it is more natural to rename it to
>>>
>>> * [app-id]-[join-name]-left-repartition
>>>
>>> * [app-id]-[join-name]-right-repartition
>>>
>>>
>>> 2. I'd suggest to use the name to also define the corresponding processor
>>> names accordingly, in addition to the repartition topic names. Note that
>>> for joins, this may be overlapping with KIP-307
>>> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
>>> as
>>> it also have proposals for defining processor names for join operators as
>>> well.
>>>
>>> 3. Could you also specify how this would affect the optimization for
>>> merging multiple repartition topics?
>>>
>>> 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
>>> you also mention the following scenarios, if any of the upgrade path
>> would
>>> be changed:
>>>
>>>  a) changing user DSL code: under which scenarios users can now do a
>>> rolling bounce instead of resetting applications.
>>>
>>>  b) upgrading from older version to new version, with all the names
>>> specified, and with optimization turned on. E.g. say we have the code
>>> written in 2.1 with all names specified, and now upgrading to 2.2 with
>> new
>>> optimizations that may potentially change the repartition topics. Is that
>>> always safe to do?
>>>
>>>
>>>
>>> Guozhang
>>>
>>>
>>> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com> wrote:
>>>
>>>> All I'd like to start a discussion on KIP-372 for the naming of joins
>> and
>>>> grouping operations in Kafka Streams.
>>>>
>>>> The KIP page can be found here:
>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>>>> 372%3A+Naming+Joins+and+Grouping
>>>>
>>>> I look forward to feedback and comments.
>>>>
>>>> Thanks,
>>>> Bill
>>>>
>>>
>>>
>>>
>>
>>
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by Eno Thereska <en...@gmail.com>.

Hi folks,

I know we don't normally have a "Related work" section in KIPs, but
sometimes I find it useful to see what others have done in similar cases.
Since this will be important for rolling re-deployments, I wonder what
other frameworks like Flink (or Samza) have done in these cases. Perhaps
they have done nothing, in which case it's fine to do this from first
principles, but IMO it would be good to know just to make sure we're
heading in the right direction.

Also I don't get a good feel for how much work this will be for an end user
who is doing the rolling deployment, perhaps an end-to-end example would
help.

Thanks
Eno

On Thu, Sep 13, 2018 at 6:22 AM, Matthias J. Sax <ma...@confluent.io>
wrote:

> Follow up comments:
>
> 1) We should either use `[app-id]-this|other-[join-name]-repartition` or
> `app-id]-[join-name]-left|right-repartition` but we should not change
> the pattern depending if the user specifies a name of not. I am fine
> with both patterns---just want to make sure with stick with one.
>
> 2) I didn't see why we would need to do this in this KIP. KIP-307 seems
> to be orthogonal, and thus KIP-372 should not change any processor
> names, but KIP-307 should define a holistic strategy for all processor.
> Otherwise, we might up with different strategies or revert what we
> decide in this KIP if it's not compatible with KIP-307.
>
>
> -Matthias
>
>
> On 9/12/18 6:28 PM, Guozhang Wang wrote:
> > Hello Bill,
> >
> > I made a pass over your proposal and here are some questions:
> >
> > 1. For Joined names, the current proposal is to define the repartition
> > topic names as
> >
> > * [app-id]-this-[join-name]-repartition
> >
> > * [app-id]-other-[join-name]-repartition
> >
> >
> > And if [join-name] not specified, stay the same, which is:
> >
> > * [previous-processor-name]-repartition for both Stream-Stream (S-S)
> join
> > and S-T join
> >
> > I think it is more natural to rename it to
> >
> > * [app-id]-[join-name]-left-repartition
> >
> > * [app-id]-[join-name]-right-repartition
> >
> >
> > 2. I'd suggest to use the name to also define the corresponding processor
> > names accordingly, in addition to the repartition topic names. Note that
> > for joins, this may be overlapping with KIP-307
> > <https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> > as
> > it also have proposals for defining processor names for join operators as
> > well.
> >
> > 3. Could you also specify how this would affect the optimization for
> > merging multiple repartition topics?
> >
> > 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
> > you also mention the following scenarios, if any of the upgrade path
> would
> > be changed:
> >
> >  a) changing user DSL code: under which scenarios users can now do a
> > rolling bounce instead of resetting applications.
> >
> >  b) upgrading from older version to new version, with all the names
> > specified, and with optimization turned on. E.g. say we have the code
> > written in 2.1 with all names specified, and now upgrading to 2.2 with
> new
> > optimizations that may potentially change the repartition topics. Is that
> > always safe to do?
> >
> >
> >
> > Guozhang
> >
> >
> > On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com> wrote:
> >
> >> All I'd like to start a discussion on KIP-372 for the naming of joins
> and
> >> grouping operations in Kafka Streams.
> >>
> >> The KIP page can be found here:
> >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> >> 372%3A+Naming+Joins+and+Grouping
> >>
> >> I look forward to feedback and comments.
> >>
> >> Thanks,
> >> Bill
> >>
> >
> >
> >
>
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by "Matthias J. Sax" <ma...@confluent.io>.

Follow up comments:

1) We should either use `[app-id]-this|other-[join-name]-repartition` or
`app-id]-[join-name]-left|right-repartition` but we should not change
the pattern depending if the user specifies a name of not. I am fine
with both patterns---just want to make sure with stick with one.

2) I didn't see why we would need to do this in this KIP. KIP-307 seems
to be orthogonal, and thus KIP-372 should not change any processor
names, but KIP-307 should define a holistic strategy for all processor.
Otherwise, we might up with different strategies or revert what we
decide in this KIP if it's not compatible with KIP-307.


-Matthias


On 9/12/18 6:28 PM, Guozhang Wang wrote:
> Hello Bill,
> 
> I made a pass over your proposal and here are some questions:
> 
> 1. For Joined names, the current proposal is to define the repartition
> topic names as
> 
> * [app-id]-this-[join-name]-repartition
> 
> * [app-id]-other-[join-name]-repartition
> 
> 
> And if [join-name] not specified, stay the same, which is:
> 
> * [previous-processor-name]-repartition for both Stream-Stream (S-S) join
> and S-T join
> 
> I think it is more natural to rename it to
> 
> * [app-id]-[join-name]-left-repartition
> 
> * [app-id]-[join-name]-right-repartition
> 
> 
> 2. I'd suggest to use the name to also define the corresponding processor
> names accordingly, in addition to the repartition topic names. Note that
> for joins, this may be overlapping with KIP-307
> <https://cwiki.apache.org/confluence/display/KAFKA/KIP-307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
> as
> it also have proposals for defining processor names for join operators as
> well.
> 
> 3. Could you also specify how this would affect the optimization for
> merging multiple repartition topics?
> 
> 4. In the "Compatibility, Deprecation, and Migration Plan" section, could
> you also mention the following scenarios, if any of the upgrade path would
> be changed:
> 
>  a) changing user DSL code: under which scenarios users can now do a
> rolling bounce instead of resetting applications.
> 
>  b) upgrading from older version to new version, with all the names
> specified, and with optimization turned on. E.g. say we have the code
> written in 2.1 with all names specified, and now upgrading to 2.2 with new
> optimizations that may potentially change the repartition topics. Is that
> always safe to do?
> 
> 
> 
> Guozhang
> 
> 
> On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com> wrote:
> 
>> All I'd like to start a discussion on KIP-372 for the naming of joins and
>> grouping operations in Kafka Streams.
>>
>> The KIP page can be found here:
>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
>> 372%3A+Naming+Joins+and+Grouping
>>
>> I look forward to feedback and comments.
>>
>> Thanks,
>> Bill
>>
> 
> 
>

Re: [DISCUSS] KIP-372: Naming Joins and Grouping

Posted by Guozhang Wang <wa...@gmail.com>.

Hello Bill,

I made a pass over your proposal and here are some questions:

1. For Joined names, the current proposal is to define the repartition
topic names as

* [app-id]-this-[join-name]-repartition

* [app-id]-other-[join-name]-repartition

And if [join-name] not specified, stay the same, which is:

* [previous-processor-name]-repartition for both Stream-Stream (S-S) join
and S-T join

I think it is more natural to rename it to

* [app-id]-[join-name]-left-repartition

* [app-id]-[join-name]-right-repartition

2. I'd suggest to use the name to also define the corresponding processor
names accordingly, in addition to the repartition topic names. Note that
for joins, this may be overlapping with KIP-307
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-307%3A+Allow+to+define+custom+processor+names+with+KStreams+DSL>
as
it also have proposals for defining processor names for join operators as
well.

3. Could you also specify how this would affect the optimization for
merging multiple repartition topics?

4. In the "Compatibility, Deprecation, and Migration Plan" section, could
you also mention the following scenarios, if any of the upgrade path would
be changed:

 a) changing user DSL code: under which scenarios users can now do a
rolling bounce instead of resetting applications.

 b) upgrading from older version to new version, with all the names
specified, and with optimization turned on. E.g. say we have the code
written in 2.1 with all names specified, and now upgrading to 2.2 with new
optimizations that may potentially change the repartition topics. Is that
always safe to do?

Guozhang

On Wed, Sep 12, 2018 at 4:52 PM, Bill Bejeck <bb...@gmail.com> wrote:

> All I'd like to start a discussion on KIP-372 for the naming of joins and
> grouping operations in Kafka Streams.
>
> The KIP page can be found here:
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-
> 372%3A+Naming+Joins+and+Grouping
>
> I look forward to feedback and comments.
>
> Thanks,
> Bill
>

-- 
-- Guozhang